Below is a list of the most common questions asked in data science interviews, covering topics in statistics, machine learning (ML) coding, and SQL.
The list was compiled with real questions from data science candidates and input from senior data scientists and machine learning engineers at Amazon, Google, Tinder, BCG, and Instacart.
Haziqa Sajid, an advocate for ML and data science developers, wrote this post. Raj Singh, a senior machine learning engineer at Amazon and interview coach, reviewed it.
Statistical experimentation is a key component of data science.
This interview section assesses your understanding of statistical concepts and ability to analyze data using statistical methods.
The statistics portion of the interview typically includes three types of questions:
Here are some examples of common interview questions:
Measures of central tendency describe the center of a data distribution.
These descriptive statistics summarize data based on its numerical values. The three most common measures are the mean, the median, and the mode.
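As a quick, hypothetical illustration (the sample values are made up), Python's built-in statistics module computes all three measures:

import statistics

data = [2, 3, 3, 5, 7, 9, 30]
print(statistics.mean(data))    # arithmetic average
print(statistics.median(data))  # middle value when sorted
print(statistics.mode(data))    # most frequent value -> 3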
A positively skewed distribution has a longer tail towards the right.
The longer tail means the distribution contains extreme values (outliers) that stretch it to the right. The median is largely unaffected by these extreme values, whereas the mean is sensitive to them.
Hence, the mean is typically larger than the median in positively skewed data.
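A minimal sketch with hypothetical, right-skewed values (assuming NumPy is available) makes the effect visible:

import numpy as np

# Hypothetical right-skewed sample: a few large values stretch the right tail.
data = np.array([1, 2, 2, 3, 3, 4, 50, 80])
print(np.median(data))  # 3.0 -- largely unaffected by the extreme values
print(np.mean(data))    # ~18.1 -- pulled toward the long right tail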
Correlation describes a linear relationship between two variables, X and Y.
It is a standardized value that captures both the direction and the strength of that relationship, bounded between -1 and +1. A correlation of -1 means a perfect negative linear relationship: as one variable increases, the other decreases along an exact straight line. A correlation of +1 means a perfect positive linear relationship.
A correlation of 0 means there is no linear relationship between the two variables (they may still be related in a nonlinear way).
A correlation of +0.8 indicates a strong positive linear relationship: as X increases, Y tends to increase as well. Note that correlation describes the strength of the relationship, not its slope, so it does not say by how much Y changes for a given change in X.
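As an illustrative sketch (the data and the use of NumPy's corrcoef are assumptions, not part of the question), the correlation coefficient can be computed as follows:

import numpy as np

# Hypothetical data: y is a noisy increasing function of x.
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 2 * x + rng.normal(scale=20, size=100)

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is corr(x, y).
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # close to +1 for a strong positive linear relationship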
Outliers are extreme values that deviate from the overall pattern in the data. In data science, they are often flagged as anomalies and removed before further analysis.
A common method of detecting outliers is the 1.5 × interquartile range (IQR) rule. This involves calculating the data's 1st quartile (Q1, the 25th percentile) and 3rd quartile (Q3, the 75th percentile); the IQR is Q3 − Q1.
Outliers are defined as any values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
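A minimal sketch of the 1.5 × IQR rule, assuming NumPy and hypothetical sample values:

import numpy as np

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([10, 12, 12, 13, 14, 15, 16, 90]))  # [90]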
A hypothesis test follows a standard 7-step process.
A p-value represents the probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.
A p-value below the chosen significance level (commonly 0.05) suggests the observed result is unlikely to be due to chance alone, leading to rejection of the null hypothesis. A higher p-value indicates that the observed data is consistent with the null hypothesis.
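For a concrete, hypothetical example (the data, the 0.05 threshold, and the use of SciPy are assumptions), a one-sample t-test returns a p-value that can be compared against the significance level:

import numpy as np
from scipy import stats

# Hypothetical sample; H0: the population mean is 100.
rng = np.random.default_rng(42)
sample = rng.normal(loc=103, scale=10, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")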
Type I and Type II errors occur in hypothesis testing: a Type I error (false positive) is rejecting the null hypothesis when it is actually true, while a Type II error (false negative) is failing to reject the null hypothesis when it is actually false.
Reducing one error increases the likelihood of the other, requiring a balance between the two.
The central limit theorem (CLT) states that as the sample size increases, the distribution of the sample means approaches a normal distribution centered on the population mean, regardless of the shape of the underlying population.
This is crucial in hypothesis testing because it allows data scientists to confidently generalize results from a sample to the entire population, assuming a normal distribution for many statistical tests.
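A small simulation (hypothetical data, assuming NumPy) illustrates the idea: even when the population is heavily skewed, the means of repeated samples cluster around the population mean in a roughly normal shape.

import numpy as np

# Draw from a heavily skewed (exponential) population; the raw data is not normal.
rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)

# Means of many samples of size 50 cluster around the population mean (~2.0),
# and their distribution is approximately normal, as the CLT predicts.
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print(round(np.mean(sample_means), 2))  # close to 2.0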
A confidence interval defines the range within which we expect a population parameter to fall when repeating an experiment using random samples.
The confidence level represents how often intervals constructed this way would contain the true value. For example, a 95% confidence level means that if the experiment were repeated 100 times, roughly 95 of the resulting intervals would contain the true value.
There is a trade-off between interval width and confidence level: narrowing the interval increases precision but requires lowering the confidence level, for example from 95% to 90%.
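As a hedged sketch (assuming SciPy is available and using made-up sample data), the trade-off is easy to see by computing a 95% and a 90% interval for the same sample:

import numpy as np
from scipy import stats

# Hypothetical sample; compute confidence intervals for the population mean.
rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=8, size=40)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_95 = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
ci_90 = stats.t.interval(0.90, df=len(sample) - 1, loc=mean, scale=sem)
print(ci_95)  # wider interval, higher confidence
print(ci_90)  # narrower interval, lower confidence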
During your interview, expect questions about your past projects and situational scenarios.
These questions are designed to evaluate your problem-solving approach, assess your experience, and gauge your communication and presentation skills. Common questions include:
This question assesses your experience based on your chosen project and how effectively you explain it.
Interviewers will focus on:
Choose a project that aligns with the role you're applying for, highlighting the skills that will benefit the potential employer.
Show how your experience is a good fit for their ongoing projects.
This situational question tests your ability to break down a problem and solve it step by step.
Here’s an approach:
Following this framework will help you diagnose and address the issue.
This scenario challenges you to identify how data can improve a widely used application.
Here’s a suggested approach:
This approach emphasizes the use of data-driven insights to improve user experience.
Python is a core programming language for data scientists and is featured in nearly every data science interview.
Python-based coding questions assess your ability to manipulate data and familiarity with popular frameworks like Pandas.
Some common questions include:
This entry-level question tests your ability to work with DataFrames in Python.
id | first_name | last_name | salary | department_id |
---|---|---|---|---|
int | varchar | varchar | int | int |
Here’s a pseudo-code solution using Pandas:
import pandas as pd
def lowest_earning_employees(employees: pd.DataFrame) -> pd.DataFrame:
    # Select relevant columns
    selected_columns = employees[['id', 'first_name', 'last_name', 'salary']]
    # Sort by salary in ascending order
    sorted_employees = selected_columns.sort_values(by='salary', ascending=True)
    # Limit to the first 3 entries (the lowest 3 salaries)
    lowest_earning_employees = sorted_employees.head(3)
    return lowest_earning_employees
This medium-level question assesses your ability to manage data across multiple DataFrames.
The task requires joining the correct tables and extracting the desired result.
id | first_name | last_name | user_city |
---|---|---|---|
int | varchar | varchar | int |
id | customer_id | product_id | amount | currency_code | date |
---|---|---|---|---|---|
int | int | int | int | varchar | date |
id | name | product_line_id | stock |
---|---|---|---|
int | varchar | int | int |
id | name |
---|---|
int | varchar |
id | source_currency_code | target_currency_code | rate |
---|---|---|---|
int | varchar | varchar | numeric |
Here’s a pseudo-code solution using Pandas:
import pandas as pd
def find_revenue_by_city(transactions: pd.DataFrame,
                         users: pd.DataFrame,
                         exchange_rate: pd.DataFrame) -> pd.DataFrame:
    # Merge transactions with users
    transactions_with_users = pd.merge(transactions, users, left_on='customer_id', right_on='id')
    # Merge non-USD transactions with exchange rates
    transactions_with_exchange = pd.merge(
        transactions_with_users,
        exchange_rate[exchange_rate['target_currency_code'] == 'USD'],
        how='left',
        left_on='currency_code',
        right_on='source_currency_code'
    )
    # Assign an exchange rate of 1.0 for USD transactions
    transactions_with_exchange['rate'] = transactions_with_exchange['rate'].fillna(1.0)
    # Calculate the amount in USD
    transactions_with_exchange['amount_usd'] = transactions_with_exchange['amount'] * transactions_with_exchange['rate']
    # Aggregate revenue by city
    revenue_by_city = transactions_with_exchange.groupby('user_city')['amount_usd'].sum().reset_index()
    # Sort by revenue in descending order
    revenue_by_city = revenue_by_city.sort_values(by='amount_usd', ascending=False).reset_index(drop=True)
    # Rename columns for clarity
    revenue_by_city.columns = ['user_city', 'total_revenue']
    return revenue_by_city
The result should contain two columns: city_pair and average_distance. The city_pair column should list the cities in alphabetical order (e.g., "CityA-CityB"), and the average_distance should be rounded to two decimal places. Sort the results by average_distance in ascending order.
This medium-difficulty question tests your ability to aggregate and manipulate data.
Here’s a pseudo-code solution using Pandas:
import pandas as pd
def find_average_distance(gps_data: pd.DataFrame) -> pd.DataFrame:
    # Create a new column with city pairs in alphabetical order
    gps_data['city_pair'] = gps_data.apply(
        lambda row: '-'.join(sorted([row['origin'], row['destination']])), axis=1
    )
    # Group by the city pair and calculate the average distance
    result = gps_data.groupby('city_pair')['distance'].mean().reset_index()
    # Round the average distances to two decimal places
    result['distance'] = result['distance'].round(2)
    # Rename columns
    result.columns = ['city_pair', 'average_distance']
    # Sort by average_distance in ascending order
    result = result.sort_values(by='average_distance').reset_index(drop=True)
    return result
This difficult question assesses your advanced data manipulation skills, requiring you to work with timestamps and pivot tables.
user_id | event | timestamp |
---|---|---|
1 | login | 2024-07-25 10:00:00 |
1 | like | 2024-07-25 10:01:10 |
Here’s a pseudo-code solution using Pandas:
import pandas as pd
def find_fastest_like(log: pd.DataFrame) -> pd.DataFrame:
    # Ensure the timestamp column is a datetime type (no-op if it already is)
    log['timestamp'] = pd.to_datetime(log['timestamp'])
    # Create a pivot table with the earliest login and like timestamps for each user_id
    pivot_df = pd.pivot_table(
        data=log,
        index='user_id',
        columns='event',
        values='timestamp',
        aggfunc='min'
    ).reset_index()
    # Flatten the pivoted columns and keep only the ones we need
    pivot_df.columns.name = None
    pivot_df = pivot_df[['user_id', 'login', 'like']]
    # Calculate the time difference in minutes
    pivot_df['time_between'] = ((pivot_df['like'] - pivot_df['login']).dt.total_seconds() / 60).round().astype(int)
    # Sort by time_between in ascending order
    pivot_df = pivot_df.sort_values(by=['time_between'])
    # Return the user with the shortest time between login and like
    result = pivot_df.head(1)
    return result
This medium-level question tests your ability to clean and preprocess data.
Here’s a pseudo-code solution using Pandas:
import pandas as pd
def interpolate_data(log: pd.DataFrame) -> pd.DataFrame:
    # Remove duplicate rows
    log = log.drop_duplicates()
    # Interpolate missing values linearly, filling forward
    log = log.interpolate(method='linear', limit_direction='forward')
    return log
Data scientists frequently use SQL to interact with relational databases (RDBMS), and a basic understanding of core SQL functionalities is essential.
SQL interview questions typically focus on functions like JOIN, GROUP BY, and aggregations.
Common questions include:
The GROUP BY clause groups rows that have the same values in specified columns. It is commonly used with aggregate functions to summarize data. Typical aggregate functions used with GROUP BY include SUM, AVG, COUNT, MIN, and MAX.
Example Syntax:
SELECT column1, SUM(column2)
FROM table_name
GROUP BY column1;
A JOIN clause combines rows from two or more tables based on a related column (key). The syntax for a JOIN is:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t1.column = t2.column;
Common types of JOIN include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
The post and post_user tables have the following schemas. The output should include the user type, the number of successful posts, the number of post attempts, and the post success rate. Order the results by descending success rate.
This is a medium-level SQL question that tests your ability to use aggregation, GROUP BY, JOIN, and sorting functions. Here's a pseudo-code solution:
SELECT
pu.user_type,
SUM(p.is_successful_post) AS post_success,
COUNT(p.is_successful_post) AS post_attempt,
ROUND(SUM(p.is_successful_post) * 1.0 / COUNT(p.is_successful_post), 2) AS post_success_rate
FROM post AS p
JOIN post_user AS pu ON p.user_id = pu.user_id
WHERE p.post_date BETWEEN '2023-11-01' AND '2023-11-30'
GROUP BY pu.user_type
ORDER BY post_success_rate DESC;
This medium-difficulty question requires knowledge of basic SQL concepts, including joins and aggregations.
Here's a pseudo-code solution using SQL:
SELECT
e.id AS employee_id,
e.name AS employee_name,
SUM(max_score) AS total_score
FROM employees e
INNER JOIN (
SELECT employee_id, MAX(score) AS max_score
FROM test_results
GROUP BY employee_id, test_id
) r
ON r.employee_id = e.id
GROUP BY e.id, e.name
ORDER BY total_score DESC, e.id ASC;
Window functions are advanced SQL tools that perform calculations across a set of rows related to the current row.
They allow operations over partitions of data without collapsing rows into groups.
Common window functions include RANK(), DENSE_RANK(), and ROW_NUMBER().
Basic syntax:
SELECT column1, column2,
RANK() OVER (PARTITION BY column1 ORDER BY column2 DESC) AS window_column
FROM table_name;
In this example, the RANK() function assigns a rank to each row within the partitions defined by column1. The rank resets for each new partition.
The output should include two columns: department_name and total_revenue. This medium-difficulty question tests your ability to join tables and aggregate revenue. The solution uses a join to connect customer orders with departments and aggregates the order amounts.
SELECT
d.department_name,
SUM(o.order_amount) AS total_revenue
FROM departments d
JOIN orders o ON d.department_id = o.department_id
WHERE o.order_date >= date('now', '-12 months')
GROUP BY d.department_name
ORDER BY total_revenue DESC;
The output should be a single column, total_amount_in_dollars. You will need to convert non-USD amounts using the exchange rate and ensure proper scaling of the values. This is a difficult question that involves complex joins and calculations. Here's a pseudo-code solution:
WITH base AS (
SELECT
t.product_id,
CASE
WHEN t.currency_code = 'USD' THEN t.amount / 100.0
ELSE t.amount / 100.0 * er.rate
END AS amount_in_usd
FROM transactions t
LEFT JOIN exchange_rate er
ON t.currency_code = er.source_currency_code AND er.target_currency_code = 'USD'
INNER JOIN products p
ON t.product_id = p.id
INNER JOIN product_lines pl
ON p.product_line_id = pl.id AND pl.name = 'Telephones'
)
SELECT
ROUND(SUM(amount_in_usd), 2) AS total_amount_in_dollars
FROM base;
Given the orders table, write an SQL query that returns the order_id, status, start_date, and end_date for each status period of an order. If a status period is the most recent for that order, end_date should be NULL. This hard-difficulty question requires date-wise calculations. A common table expression (CTE) helps calculate the relevant fields. Here's a pseudo-code solution:
WITH StatusChanges AS (
SELECT
order_id,
order_date,
status,
LEAD(order_date) OVER(PARTITION BY order_id ORDER BY order_date) AS next_date,
LAG(status) OVER(PARTITION BY order_id ORDER BY order_date) AS prev_status
FROM orders
)
SELECT
order_id,
status,
order_date AS start_date,
next_date AS end_date
FROM StatusChanges
WHERE status != prev_status OR prev_status IS NULL;
The machine learning coding interview will judge your theoretical concepts and how well you can demonstrate them via a practical application.
You can expect questions that will:
Some common questions to expect are:
A coin flip can be simulated in many ways in Python.
However, the most straightforward approach would be to use a binomial distribution.
A binomial distribution models the likelihood of achieving a particular outcome in an experiment, such as getting heads when flipping a coin.
Next, we need to plot the results as a histogram. Python libraries like Matplotlib or Plotly can be used to visualize this.
This is a pseudo-code solution using Python.
import numpy as np
import matplotlib.pyplot as plt
def simulate_coin_flips(n_simulations, n_flips, p_head):
    """Simulate flipping a coin n_flips times in each of n_simulations simulations."""
    results = np.random.binomial(n_flips, p_head, n_simulations)
    return results
sim_results = simulate_coin_flips(1000, 20, 0.2)
# Create histogram
plt.hist(sim_results, bins=range(20 + 2), edgecolor='black', color='skyblue')
plt.title('Histogram of Number of Heads in 1,000 Simulations')
plt.xlabel('Number of Heads')
plt.ylabel('Frequency')
plt.xticks(range(20 + 1))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
K-means is a popular clustering algorithm that uses an iterative approach to identify k centroids in the data.
The centroids are the central points of each of the k clusters.
In k-means, the centroids represent the average value of the entire cluster and are used to classify a new data point into one of the clusters.
This is a pseudo-code solution using Python:
import numpy as np
class Centroid:
    def __init__(self, location, vectors):
        self.location = location  # (D,)
        self.vectors = vectors    # (N_i, D)


class KMeans:
    def __init__(self, n_features, k):
        self.n_features = n_features
        self.centroids = [
            Centroid(
                location=np.random.randn(n_features),
                vectors=np.empty((0, n_features))
            )
            for _ in range(k)
        ]

    def distance(self, x, y):
        return np.sqrt(np.dot(x - y, x - y))

    def fit(self, X, n_iterations):
        for _ in range(n_iterations):
            # Reset each centroid's assigned vectors before the next pass
            for centroid in self.centroids:
                centroid.vectors = np.empty((0, self.n_features))
            # Assign each point to its nearest centroid
            for x_i in X:
                distances = [
                    self.distance(x_i, centroid.location) for centroid in self.centroids
                ]
                min_idx = distances.index(min(distances))
                cur_vectors = self.centroids[min_idx].vectors
                self.centroids[min_idx].vectors = np.vstack((cur_vectors, x_i))
            # Move each centroid to the mean of its assigned vectors
            for centroid in self.centroids:
                if centroid.vectors.size > 0:
                    centroid.location = np.mean(centroid.vectors, axis=0)

    def predict(self, x):
        distances = [self.distance(x, centroid.location) for centroid in self.centroids]
        return distances.index(min(distances))
A 2D convolution filter is popularly used in convolutional neural networks (CNN) for processing image data.
The filter can analyze images' spatial information (pixels) and condense information to smaller dimensions. The convolution operation multiplies the corresponding elements of the filter and the data and adds up the resulting values, creating an aggregated form of the larger representation.
The operation can be implemented in Python using loops and mathematical operators.
This is a pseudo-code solution using Python.
def conv2d(data, kernel):
    m, n = len(data), len(data[0])
    k = len(kernel)
    # Assume the input is valid (the kernel fits inside the data); add assertions otherwise
    res = []
    for i in range(m - k + 1):
        row = []
        for j in range(n - k + 1):
            val = 0
            # Element-wise multiply the kernel with the current window and sum
            for p in range(k):
                for q in range(k):
                    val += data[i + p][j + q] * kernel[p][q]
            row.append(val)
        res.append(row[::])
    return res
The train-eval-test split is a basic machine learning concept essential for model development. Before training, the dataset is divided into separate sections.
The split between training and testing is dependent on several factors:
The split also needs to be reproducible so that results can be compared across runs. This is often handled by setting the random seed of the split function.
After the split, the largest portion is reserved for training the model. This is known as the training set, and it is passed to the model during training to teach it about the underlying patterns. The remaining portion is further split into the evaluation and test set.
The evaluation set is used during training to tune the model parameters against unseen data. It evaluates the model training in an unbiased manner.
The test set is usually separated from the training and evaluation sets and is used once the model has completed training. It evaluates the model performance on unseen data, ensuring it is ready for real-world scenarios.
The test set is purely for evaluation and does not influence the model parameters.
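A minimal sketch of a 70/15/15 split (the proportions, the synthetic data, and the use of scikit-learn's train_test_split are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 samples with 5 features.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# 70% train, then split the remaining 30% evenly into validation and test.
# Setting random_state makes the split reproducible.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150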
A data scientist must have a deep understanding of machine learning concepts.
These concepts are essential for training robust models, validating them effectively, and deploying them for real-world use. The machine learning section of the interview will typically touch on most key concepts, with some areas explored in greater depth.
Interview questions will evaluate:
Common questions include:
Since computers cannot process non-numerical data like text, categorical features must be encoded.
Common encoding methods include one-hot encoding, label (ordinal) encoding, and target encoding; two of these are sketched below.
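A brief sketch of two of these encodings with pandas, using a hypothetical color feature:

import pandas as pd

# Hypothetical feature column with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code.
df["color_code"] = df["color"].astype("category").cat.codes

print(one_hot)
print(df)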
An imbalanced dataset occurs when the classes have unequal numbers of samples.
For example, in a cat vs. dog classifier, if 80% of the images are dogs and 20% are cats, the model may perform well on dogs but poorly on cats. Common ways to address this imbalance include oversampling the minority class, undersampling the majority class, applying class weights during training, and using evaluation metrics that are robust to imbalance (such as precision, recall, or F1). A simple oversampling sketch follows.
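A minimal sketch of naive random oversampling with pandas (the labels and counts are hypothetical):

import pandas as pd

# Hypothetical imbalanced labels: 80 "dog" rows vs. 20 "cat" rows.
df = pd.DataFrame({"label": ["dog"] * 80 + ["cat"] * 20})

# Naive random oversampling: resample the minority class with replacement
# until both classes have the same number of rows.
majority = df[df["label"] == "dog"]
minority = df[df["label"] == "cat"]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["label"].value_counts())  # dog 80, cat 80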
In a decision tree, splits occur by dividing data based on feature values to create branches, aiming to make subsets of data more homogeneous. The best feature and value for splitting are chosen based on metrics like Gini Impurity or Entropy (for classification) and Variance Reduction (for regression).
Splitting continues until a stopping criterion is reached, such as maximum tree depth, a minimum number of samples per node, or pure subsets.
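As an illustration of the classification case, here is a small Gini impurity sketch (the labels and the candidate split are hypothetical):

from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

parent = ["cat"] * 5 + ["dog"] * 5
left, right = ["cat"] * 4 + ["dog"], ["cat"] + ["dog"] * 4

# A split is good if the weighted impurity of the children is lower than the parent's.
print(gini_impurity(parent))                      # 0.5
print(gini_impurity(left), gini_impurity(right))  # roughly 0.32 each -- lower than the parent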
Accuracy measures the percentage of correct predictions and is suitable when all classes are equally important, such as in a basic cat vs. dog classifier.
However, accuracy is a poor metric when class importance varies. For example, in cancer prediction, a false negative (predicting a patient with cancer as healthy) is far more serious than a false positive. In such cases, metrics like recall provide a better evaluation.
Mean Squared Error (MSE) is used to evaluate regression models. It is calculated by squaring the errors (difference between actual and predicted values) and taking the average.
MSE is highly sensitive to outliers since squaring amplifies larger errors. This makes it less suitable when the data contains outliers or when the target variable spans a large range, as the squared errors may not be intuitive.
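A minimal NumPy sketch of MSE with made-up values, showing how a single large error dominates the metric:

import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of squared differences."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3, 5, 7, 9], [3, 5, 7, 9]))   # 0.0 -- perfect predictions
print(mse([3, 5, 7, 9], [3, 5, 7, 19]))  # 25.0 -- one error of 10 dominates the average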
Logistic regression is a classification algorithm that predicts binary outcomes (0 or 1).
It uses the same linear equation concept as linear regression, where the equation is:
Y = m_1x_1 + m_2x_2 + m_3x_3 + ...
In this equation, the model learns the coefficients (m) based on the features (x) to predict the target variable. Unlike linear regression, which fits a straight line across the data, logistic regression applies the logistic function (also called the sigmoid function) to map the linear output to a probability between 0 and 1.
This is more suitable for classification tasks, where the target variables consist of binary values (1 or 0).
If the predicted probability is greater than 0.5, the model classifies the output as 1; otherwise, it classifies it as 0. The threshold can be adjusted depending on the use case.
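A minimal sketch of the idea (the coefficients and features are made-up values, not a trained model):

import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical learned coefficients m and features x.
m = np.array([0.8, -0.4, 1.2])
x = np.array([1.5, 2.0, 0.5])
linear_output = np.dot(m, x)          # same linear form as linear regression
probability = sigmoid(linear_output)  # squashed into (0, 1)
prediction = int(probability > 0.5)   # default 0.5 threshold

print(round(probability, 3), prediction)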
Gradient descent is an optimization technique that updates model parameters (weights) iteratively in the direction that reduces the loss. The update can be performed in several ways: batch gradient descent uses the entire dataset for each update, stochastic gradient descent uses a single sample per update, and mini-batch gradient descent uses a small batch of samples per update (see the sketch below).
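A hedged mini-batch gradient descent sketch for linear regression with an MSE loss (the data, learning rate, and batch size are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(100):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient of the MSE loss w.r.t. w
        w -= lr * grad                                # parameter update step

print(np.round(w, 2))  # close to [2.0, -1.0, 0.5]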
The bias-variance tradeoff describes the challenge of balancing underfitting and overfitting.
A well-tuned model minimizes both bias and variance to achieve good performance on both the training and test datasets.
Overfitting occurs when a model learns the noise and details of the training data so well that it negatively affects its performance on new data. An overfitted model performs well on the training set but poorly on test data or unseen examples.
How to diagnose overfitting: the model performs much better on the training set than on the validation or test set (for example, low training error but high validation error).
How to prevent overfitting: use more training data, simplify the model, apply regularization, use dropout (for neural networks), or stop training early based on validation performance.
The Area Under the Curve (AUC) is a metric derived from the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at various classification thresholds. The AUC measures the entire two-dimensional area under the ROC curve.
In general, AUC helps evaluate how well the model ranks positive instances over negative instances across different threshold levels.
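A minimal sketch using scikit-learn's roc_auc_score (the labels and predicted scores are hypothetical):

from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# AUC is the probability that a random positive is ranked above a random negative.
print(roc_auc_score(y_true, y_scores))  # ~0.89 for this example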
It is impossible to cover every possible question, since data science interviews are broad and varied!
Good luck with your upcoming DS interview!