To help you prepare for machine learning interviews, we've compiled a list of real deep learning interview questions.
Each question is accompanied by a sample answer and an explanation of the core deep learning concepts involved.
This guide was written and compiled by Derrick Mwiti, a senior data scientist and course instructor. Satyajit Gupte, an interview coach and senior machine learning engineer, reviewed it.
Neural networks learn by iteratively adjusting their internal parameters to minimize errors, allowing them to make increasingly accurate predictions over time.
Neural networks learn by comparing their predictions to the actual values and adjusting the weights accordingly. The goal is to find the weights that minimize the error, known as the loss or cost function.
Selecting a loss function depends on the specific task: classification problems use classification loss functions, while regression problems use regression loss functions.
As the network continues to learn, the loss function should decrease.
An epoch refers to one complete pass through the entire dataset. A batch is a subset of the dataset created because processing the dataset as a whole at once is impractical.
An iteration is a single update of the model’s weights.
For example, if you have 10,000 images and a batch size of 200, each epoch will consist of 50 iterations as the model processes the entire dataset in portions.
Gradient descent is a method used to minimize errors during neural network training.
It works by adjusting the model’s parameters to reduce the cost function, which measures the error.
The goal is to find the set of parameters where the error is minimized, ideally the global minimum, although in practice the algorithm may settle in a local minimum. This can be visualized as descending a hill to reach its lowest point. The process involves calculating the slope (the gradient) at the current point on the "hill" through differentiation and stepping in the opposite direction.
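As a minimal sketch of this update rule (the function, learning rate, and starting point below are purely illustrative), each step moves the parameter against the gradient:

# Minimize f(w) = (w - 3)^2 with plain gradient descent
def gradient(w):
    return 2 * (w - 3)  # derivative of (w - 3)^2

w = 0.0              # illustrative starting point
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)  # step against the slope
print(w)  # converges toward the minimum at w = 3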
Batch normalization is a technique for improving neural network training by normalizing the inputs. It ensures that the mean output is close to 0 and the standard deviation is close to 1.
During training, batch normalization normalizes the input using the mean and standard deviation of the current batch.
When making predictions, it uses the moving average of the mean and standard deviation computed during training. This method is especially effective in deep neural networks, as it helps accelerate the training process.
The batch normalization layer also has learnable parameters to scale and shift the input after normalization.
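A minimal Keras sketch (the layer sizes are illustrative) that places batch normalization after a dense layer:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),  # normalizes activations; learns a scale (gamma) and shift (beta)
    tf.keras.layers.Dense(10, activation='softmax'),
])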
An activation function transforms a neuron's weighted input into its output; the activation applied to the final layer determines the form of the model's predictions.
For instance, in a regression problem, the output should be a continuous value that predicts the quantity in question.
In contrast, for classification problems, the output is typically a probability that indicates the likelihood of each category.
For this reason, sigmoid or softmax activation functions are commonly used in classification tasks, as they constrain the output to a range between 0 and 1.
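For example, in Keras the output layer's activation can be chosen to match the task (the layer sizes here are illustrative):

from tensorflow.keras import layers

regression_output = layers.Dense(1)                         # linear output for a continuous value
binary_output = layers.Dense(1, activation='sigmoid')       # probability between 0 and 1
multiclass_output = layers.Dense(10, activation='softmax')  # probabilities over 10 classes that sum to 1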
The neural network learns by comparing its predictions to the true values and adjusting the weights accordingly. The goal is to find the weights that minimize the error, measured by the loss or cost function.
The choice of a loss function depends on the specific problem. For example, classification tasks often use binary cross-entropy, while regression tasks typically use mean squared error.
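For instance (the tiny model below is only a placeholder), the loss is chosen when compiling the model:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])

# Binary classification: binary cross-entropy
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())

# Regression: mean squared error (with a linear output layer instead)
# model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())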
Weights and biases in neural networks are typically initialized randomly before training begins. The initialization method can significantly affect the model's ability to converge.
Common initialization methods include:
For example, He initialization draws the weights from a normal distribution with zero mean and a variance of 2/n, where n is the number of inputs (the fan-in) to the layer.
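A minimal Keras sketch (the layer size is illustrative) requesting this initialization explicitly:

import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,
    activation='relu',
    kernel_initializer='he_normal',  # zero-mean normal with variance 2 / fan_in
)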
Overfitting occurs when a model performs exceptionally well on training data but fails to generalize to unseen data. This means the model achieves high accuracy on the training set but poor accuracy on new data, indicating its inability to generalize.
To prevent overfitting, several approaches can be used, including:
To set the batch size:
Yes, batch size affects other hyperparameters:
A good rule of thumb is to use the largest possible batch size that fits on the GPU memory. Increasing the batch size speeds up training and usually means more stable weight updates. Batch size affects other hyper-parameters like learning rate and momentum. In practice, we first set the batch size before doing any hyperparameter tuning because the optimal hyperparameters depend on the chosen batch size.
Hyperparameter tuning is critical for optimizing model performance. The most important parameters to prioritize in deep learning are:
The tuning budget depends on your performance goals and available resources.
For example, if your goal is to achieve 90% accuracy, you can continue tuning until this threshold is reached. However, tuning must also be balanced with the project’s financial budget and deadlines.
Training throughput can be diagnosed using tools like TensorBoard or PyTorch Profiler to monitor metrics such as epoch duration, inference time, and GPU utilization.
To improve throughput:
Unstable training is characterized by erratic fluctuations in loss and accuracy. Here are some steps to stabilize training:
The learning rate in a neural network is a critical hyperparameter that determines how much the model's weights are adjusted during each update in response to the gradient. It essentially controls the step size to minimize the loss function.
Autoencoders are neural networks designed to learn efficient data representations, typically for dimensionality reduction.
An autoencoder takes an input, such as an image, and compresses it into a low-dimensional representation, known as a latent vector.
This latent vector is then used to reconstruct the original input. Regular autoencoders are trained to output the same image they receive as input.
The convolution layer condenses the input image into smaller feature maps, making processing more efficient and less computationally intensive.
It applies a feature detector (also known as a kernel or filter) over the input image. Typically, the kernel is a 3x3 matrix, though other sizes can be used. The convolution operation involves element-wise multiplication of the kernel with the input image and summing the results to produce a feature map.
This layer is fundamental to extracting features such as edges, textures, and patterns from the input image.
Given the input image and filter, the convolution operation is calculated as follows:
3 × 1 + 5 × 0 + 2 × 0 + 9 × 1 + 7 × 1 + 5 × 0 + 2 × 0 + 0 × 0 + 6 × 1 = 3 + 0 + 0 + 9 + 7 + 0 + 0 + 0 + 6 = 25
The kernel is then slid across the entire input image to compute all the values as shown above.
The kernel moves over the input image in steps known as strides. The stride length is defined when designing the network. The size of the resulting feature map depends on the stride and kernel size.
Padding can be used to keep the feature map the same size as the input image. It involves adding zeros around the border of the input image so that, after the kernel is applied, the output retains the same dimensions as the input. The padding type is specified when the network is created.
The two common options are "valid" (no padding, so the feature map shrinks) and "same" (zero-padding so the output matches the input size).
The uncolored area in a padded image represents the padded (zero-filled) region.
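As a small worked sketch (the dimensions below are illustrative), the feature map size can be computed from the input size, kernel size, stride, and padding:

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard formula: floor((W - K + 2P) / S) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3, stride=1, padding=0))  # 3  ("valid" padding)
print(conv_output_size(5, 3, stride=1, padding=1))  # 5  ("same" padding)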
The Rectified Linear Unit (ReLU) layer is applied during the convolution operation to introduce non-linearity. ReLU forces all negative values to zero, while positive values are retained as they are.
The Pooling Layer further reduces the size of the feature map. This is done by applying a pooling filter, such as max pooling, which slides over the feature map and selects the maximum value in each region. For example, a 2x2 pooling filter applied to a 4x4 feature map reduces it to a 2x2 pooled feature map.
Dropout is often added to the network as a regularization technique to prevent overfitting. It randomly sets a fraction of the input units to zero during training.
The Flattening Layer converts the pooled feature map into a single column (a one-dimensional array), which is then passed to the fully connected (Dense) layer. This Dense layer is responsible for generating the final output. Depending on the problem type, an activation function is applied: sigmoid for binary classification, softmax for multi-class classification, or a linear output for regression.
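Putting these layers together, a minimal Keras sketch of such a network might look like this (the filter counts, dropout rate, and 10-class output are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),             # pooling layer reduces the feature map
    tf.keras.layers.Dropout(0.25),                    # dropout for regularization
    tf.keras.layers.Flatten(),                        # flatten to a 1D vector
    tf.keras.layers.Dense(10, activation='softmax'),  # fully connected output layer
])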
The Transformer architecture was introduced in the influential paper Attention Is All You Need.
Unlike traditional models, the Transformer doesn’t rely on recurrence or convolutions. Instead, it uses an attention mechanism, allowing faster training through parallelization.
The architecture consists of stacked self-attention and fully connected layers in the encoder and decoder, eliminating the need for recurrent networks.
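A minimal sketch of the scaled dot-product attention at the core of these self-attention layers (the tensor shapes are illustrative):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5  # similarity between tokens
    weights = F.softmax(scores, dim=-1)                         # attention weights sum to 1
    return torch.matmul(weights, v)                             # weighted sum of the values

q = k = v = torch.randn(1, 4, 8)  # batch of 1, 4 tokens, dimension 8
out = scaled_dot_product_attention(q, k, v)  # shape (1, 4, 8)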
The Transformer is widely used in natural language processing (NLP) tasks because:
Gradient clipping is a technique for preventing exploding gradients by limiting them to a maximum value. When the gradient's norm exceeds a specified threshold, it is clipped, which helps maintain stable training.
Gradient clipping in Keras using clipvalue:
# Gradients less than -0.5 will be capped to -0.5, and gradients above 0.5 will be capped to 0.5
model.compile(
    optimizer=tf.keras.optimizers.Adam(clipvalue=0.5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
Gradient clipping in Keras using clipnorm:
# If the vector norm of a gradient exceeds 1, the vector's values will be rescaled so that the norm equals 1.
model.compile(
    optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
Here’s how to implement a 2D convolutional filter in PyTorch:
import torch
import torch.nn as nn
def conv2d(input, weight, bias=None, stride=1, padding=0):
    """
    Computes the 2D convolution of an input tensor with a weight tensor.

    Args:
        input: A 4D tensor of shape (batch_size, in_channels, in_height, in_width).
        weight: A 4D tensor of shape (out_channels, in_channels, kernel_height, kernel_width).
        bias: A 1D tensor of shape (out_channels,) for bias values (optional).
        stride: The stride of the convolution (default=1).
        padding: The padding for the convolution (default=0).

    Returns:
        A 4D tensor of shape (batch_size, out_channels, out_height, out_width).
    """
    # Get input and weight dimensions
    batch_size, in_channels, in_height, in_width = input.shape
    out_channels, _, kernel_height, kernel_width = weight.shape

    # Create the 2D convolution layer (nn.Conv2d expects a boolean bias flag; the values are set below)
    conv_layer = nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size=(kernel_height, kernel_width),
        stride=stride,
        padding=padding,
        bias=bias is not None,
    )

    # Copy the provided weights (and bias, if given) into the layer
    conv_layer.weight = nn.Parameter(weight)
    if bias is not None:
        conv_layer.bias = nn.Parameter(bias)

    # Perform the convolution
    return conv_layer(input)
A transformer-based language model predicts the next word based on the context of previous words. The process starts with a prompt, which is converted into numerical tokens representing words or subwords.
These tokens are passed through an embedding layer that captures their semantic meaning, and position encoding is added to retain word order since transformers don't inherently understand sequence.
The tokens then go through self-attention layers, allowing the model to weigh each token's importance relative to others. This captures long-range dependencies in the text. The model outputs the probability of each word being the next word, and the next word is selected using strategies like greedy decoding, beam search, or top-k sampling.
The generated token is added to the sequence and fed back to the model to generate the next one. This process continues until the desired length is reached or an end-of-sequence token is generated. The final step is detokenization, where the tokens are converted back into text.
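A rough sketch of that last step, picking the next token from the model's output distribution (the vocabulary size and logits here are made up):

import torch

logits = torch.tensor([2.0, 0.5, 1.2, -1.0])         # illustrative scores over a 4-word vocabulary
probs = torch.softmax(logits, dim=-1)                 # convert scores to probabilities

greedy_token = torch.argmax(probs)                    # greedy decoding: always pick the most likely token

top_probs, top_ids = torch.topk(probs, k=2)           # top-k sampling: keep the k most likely tokens...
top_probs = top_probs / top_probs.sum()               # ...renormalize their probabilities...
sampled_token = top_ids[torch.multinomial(top_probs, 1)]  # ...and sample among them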
What is zero_grad() in PyTorch, and when is it used?
The zero_grad() function in PyTorch resets all gradients to zero. It is called before each new backward pass so that gradients accumulated from previous iterations (PyTorch accumulates gradients by default) do not interfere with the current update.
Example of zero_grad() in PyTorch:
import torch
import torch.optim as optim
# Define a simple model and optimizer
model = torch.nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Initialize input data and target
inputs = torch.randn(1, 1, requires_grad=True)
target = torch.randn(1, 1)
# Training loop
for _ in range(5):  # run for 5 iterations
    # Forward pass
    output = model(inputs)
    loss = torch.nn.functional.mse_loss(output, target)

    # Backward pass and parameter update
    optimizer.zero_grad()  # Zero the gradients from the previous iteration
    loss.backward()        # Compute gradients
    optimizer.step()       # Update weights
Class weights can be introduced and passed to the model during training to handle imbalanced datasets.
# Map each class index to its weight (higher weight = more emphasis on that class)
class_weight = {0: 1.013, 1: 0.889, 2: 1.007, 3: 0.978, 4: 1.027,
                5: 1.106, 6: 1.013, 7: 0.957, 8: 1.025, 9: 1.008}

model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
# Class weights are passed to fit(), not compile(); x_train and y_train are assumed to exist
model.fit(x_train, y_train, class_weight=class_weight)
In addition to class weights, we can also undersample the majority class.
A custom loss function is defined by creating a function that accepts true and predicted values, returning the computed loss.
import tensorflow as tf
def custom_loss_function(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)
model.compile(optimizer='adam', loss=custom_loss_function)
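Beyond custom losses, you can also write a custom training loop that iterates over batches, computes gradients, applies them with the optimizer, and tracks the loss and accuracy per epoch, as in the example below: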
from tqdm.notebook import trange
import tensorflow as tf
# Track results for plotting
train_loss_results = []
train_accuracy_results = []
num_epochs = 10
for epoch in trange(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

    # Training loop - using batches of 32
    for x, y in training_data:
        # grad() is assumed to be defined elsewhere and to return the batch loss and gradients
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)                  # Update current batch loss
        epoch_accuracy.update_state(y, model(x, training=True))  # Update accuracy

    # End of epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())
    print(f"Epoch {epoch + 1}: Loss: {epoch_loss_avg.result():.3f}, Accuracy: {epoch_accuracy.result():.3%}")
Transfer learning is a technique where a model developed for one task is repurposed for a different but related task.
This approach is prevalent in computer vision and natural language processing, where pre-trained models serve as a foundation for new tasks, significantly reducing training time.
Pre-trained models are usually trained on large benchmark datasets, and their learned weights are applied to a new task with a smaller dataset. This method not only speeds up training but also helps to reduce the generalization error.
Steps to implement transfer learning:
Example implementation in TensorFlow:
import tensorflow as tf
from tensorflow import keras

# Define image size
IMAGE_SIZE = 224

# Load a pre-trained model without its classification head
base_model = tf.keras.applications.MobileNetV3Small(
    input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
    alpha=1.0,
    include_top=False,  # drop the ImageNet classifier so we can add our own head
    weights="imagenet",
)

# Freeze the base model's layers
base_model.trainable = False

# Data augmentation for the new task (assumed; adjust to your dataset)
data_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
])

# Build the new model on top of the frozen base
inputs = keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
x = data_augmentation(inputs)          # Apply random data augmentation
x = base_model(x, training=False)      # Run the base model in inference mode
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.2)(x)       # Regularize with dropout
outputs = keras.layers.Dense(1)(x)     # Single-unit head, e.g. for binary classification
model = keras.Model(inputs, outputs)

# Fine-tune: unfreeze only the last 5 layers of the base model
base_model.trainable = True
for layer in base_model.layers[:-5]:
    layer.trainable = False
RLHF is a method for aligning a large language model (LLM) with human preferences to make it safer and more valuable.
The steps include:
The reward model is created by:
Challenges include:
Retrieval-augmented generation (RAG) enhances large language models by incorporating external data sources to reduce hallucinations and improve factual accuracy.
RAG retrieves relevant documents to inform the model’s output, ensuring it remains grounded in real data.
The RAG process involves:
Problems solved by RAG:
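As a rough, framework-agnostic sketch of the retrieve-then-generate flow described above (the embedding model, vector store, and LLM client are placeholders, not specific libraries):

def rag_answer(question, embed, vector_store, llm, k=3):
    # 1. Embed the user's question
    query_vector = embed(question)

    # 2. Retrieve the k most relevant documents from the vector store
    documents = vector_store.search(query_vector, top_k=k)

    # 3. Build a prompt that grounds the model in the retrieved context
    context = "\n".join(documents)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 4. Generate the answer with the language model
    return llm(prompt)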
CNNs are generally preferred over ANNs for image classification because:
Data augmentation is a method used to increase the diversity of images without collecting new data, making models more robust and capable of generalizing to unseen datasets. Common techniques include:
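A small Keras sketch of typical image augmentations (the specific transforms and parameters here are illustrative):

import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # mirror images left-right
    tf.keras.layers.RandomRotation(0.1),       # rotate by up to ±10% of a full turn
    tf.keras.layers.RandomZoom(0.2),           # zoom in or out by up to 20%
])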
You can implement learning rate scheduling with torch.optim.lr_scheduler. Common schedulers include StepLR, MultiStepLR, ExponentialLR, and ReduceLROnPlateau.
Example:
import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
import torch.nn as nn
# Model and optimizer
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Define LR scheduler
scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
# Training loop
for epoch in range(20):
    # ... forward pass, loss computation, and loss.backward() omitted here ...
    optimizer.step()   # update the weights
    scheduler.step()   # then advance the schedule (here: halve the LR every 5 epochs)
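The following example sketches a variational autoencoder (VAE) in Keras: the encoder compresses an image into the mean and log-variance of a latent distribution, a sampling layer applies the reparameterization trick, and the decoder reconstructs the image from the sampled latent vector.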
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Lambda, Reshape, Conv2DTranspose
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import numpy as np
# Define input shape and latent dimension (the image size here is an assumed example)
img_size = 64
latent_dim = 2
input_shape = (img_size, img_size, 3)
# Encoder network
inputs = Input(shape=input_shape)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(inputs)
x = Conv2D(32, (3, 3), activation='relu', strides=(2, 2), padding='same')(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
shape_before_flattening = K.int_shape(x)
x = Flatten()(x)
z_mean = Dense(latent_dim)(x)
z_log_var = Dense(latent_dim)(x)
# Sampling function (reparameterization trick)
def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim))
    return z_mean + K.exp(z_log_var / 2) * epsilon

z = Lambda(sampling)([z_mean, z_log_var])
# Decoder network
decoder_input = Input(K.int_shape(z)[1:])
x = Dense(np.prod(shape_before_flattening[1:]), activation='relu')(decoder_input)
x = Reshape(shape_before_flattening[1:])(x)
x = Conv2DTranspose(128, (2, 2), activation='relu', padding='same')(x)
x = Conv2DTranspose(64, (2, 2), activation='relu', padding='same', strides=(2, 2))(x)
x = Conv2DTranspose(32, (2, 2), activation='relu', padding='same')(x)
x = Conv2DTranspose(16, (2, 2), activation='relu', padding='same')(x)
x = Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)
# Define the VAE model
encoder = Model(inputs, [z_mean, z_log_var, z], name='encoder')
decoder = Model(decoder_input, x, name='decoder')
outputs = decoder(encoder(inputs)[2])
vae = Model(inputs, outputs, name='vae')

# Note: before training, the VAE loss (reconstruction loss + KL divergence) still needs to be
# attached, e.g. with vae.add_loss(), and the model compiled.
Federated Learning allows devices to collaboratively improve a shared model without exchanging raw data.
Initially, the model is trained on proxy data, downloaded to individual devices, and refined using local data.
Key benefits include:
Model quantization is a technique for reducing a large model's memory and computational requirements by lowering the precision of its weights, such as converting from float32 to float16.
This enables the deployment of large models on resource-constrained environments like edge devices and mobile platforms.
Quantized models offer several benefits:
There are two main types of quantization:
Fundamental quantization techniques include:
Challenges of quantization include potential accuracy degradation and limited support for certain quantization methods on specific hardware.
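As a small illustration of one such technique, post-training dynamic quantization in PyTorch converts a model's linear layers to 8-bit integers (the toy model below is just a placeholder):

import torch
import torch.nn as nn

# A toy model standing in for a larger network
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Quantize the Linear layers' weights to int8 after training
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)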
It is impossible to cover all the possible questions since deep learning is a broad and varied topic!
Hopefully, these questions have given you a glimpse into what to expect in deep-learning interviews.
Good luck with your upcoming machine learning interview!