SQL is a fundamental topic in data science interviews at all levels. Querying, manipulating, and interpreting data are key skills for any data scientist.
For junior roles, you will likely be tested on basic querying and aggregation concepts. For more senior interviews, you should be familiar with advanced concepts like window functions, CTEs, and complex data transformations.
Below, we've compiled a list of example SQL interview questions indicative of the types of questions you can expect in real data science interviews. We created this list with input from data analysts, data scientists, and machine learning engineers at Meta, Google, Dropbox, Amazon, and top consulting firms.
Each question includes a sample answer and explanation of the underlying SQL concepts.
This guide was written and compiled by Derrick Mwiti, a senior data scientist and course instructor.
At a glance, junior-level interviews test concepts like WHERE, INNER JOIN, ORDER BY, GROUP BY, and aggregations like COUNT(), AVG(), and MAX(). More advanced interviews cover CASE WHEN, date and time manipulation, WITH clauses, UNION, INTERSECT, EXCEPT, and string functions like CONCAT() and REPLACE().
Early in your data career, your SQL interviews will likely focus on basic querying, syntax, and data retrieval, such as:

- WHERE clauses to filter rows based on specific conditions.
- Joins, including INNER JOIN and LEFT JOIN.
- Aggregate functions like COUNT(), SUM(), AVG(), MIN(), and MAX().
- GROUP BY to aggregate data into summary rows.
- ORDER BY to sort query results.

Here are some examples of common SQL questions you might hear:
You are given an employee table with the following columns: employee_id, name, salary, department_id, and manager_id.

Identify the employees who earn more than their direct managers. The expected output is: employee_id, employee_name.
Here is a sample solution:
SELECT
e.employee_id,
e.name AS employee_name
FROM
employee e
JOIN
employee m ON e.manager_id = m.employee_id
WHERE
e.salary > m.salary;
- The employee table is self-joined, aliased as e for employees and m for managers.
- The join condition e.manager_id = m.employee_id matches each employee to their direct manager.
- The WHERE clause e.salary > m.salary keeps only employees who earn more than their managers.
- The SELECT returns employee_id and employee_name.

You work for Lyft, a ride-sharing company, and are tasked with finding the average number of unmatched bookings per user. The expected output is: user_id, user_name, email, and avg_unmatched_bookings (rounded to two decimal places).
You have a users table with user_id, user_name, and email columns. You also have a bookings table with booking_id, user_id, driver_id, booking_time, and status.
Here's a sample solution:
SELECT
u.user_id,
u.user_name,
u.email,
COALESCE(ROUND(AVG(CASE WHEN b.status = 'Unmatched' THEN 1 ELSE 0 END), 2), 0) AS avg_unmatched_bookings
FROM users u
LEFT JOIN bookings b ON u.user_id = b.user_id
GROUP BY u.user_id, u.user_name, u.email;
- We use a LEFT JOIN to combine the users table with the bookings table on the user_id column. This ensures that even users with no bookings are included in the result.
- A CASE expression inside the AVG() function computes the average number of unmatched bookings for each user: unmatched bookings count as 1, and everything else counts as 0.
- The COALESCE function, together with ROUND, guards against null values, and the average is rounded to two decimal places.

Show the total number of successful posts each user_type made in November 2023. Then, order your output by descending success rate.
Your output should include: user_type, post_success (number of successful posts), post_attempt (number of total posts), and post_success_rate (ranging from 0.00 to 1.00).

You've been given a post table with post_id, post_date, user_id, interface, and is_successful_post. You also have a post_user table with user_id, user_type, and age.
A sample solution looks like:
SELECT
pu.user_type,
COUNT(CASE WHEN p.is_successful_post = 1 THEN 1 END) AS post_success,
COUNT(p.post_id) AS post_attempt,
ROUND(COUNT(CASE WHEN p.is_successful_post = 1 THEN 1 END) * 1.0 / COUNT(p.post_id), 2) AS post_success_rate
FROM
post p
JOIN
post_user pu ON p.user_id = pu.user_id
WHERE
p.post_date >= '2023-11-01' AND p.post_date < '2023-12-01'
GROUP BY
pu.user_type
ORDER BY
post_success_rate DESC;
- The user_id column joins the post table with the post_user table to get the user type for each post.
- The WHERE clause filters posts to November 2023.
- Successful posts are counted with COUNT(CASE WHEN p.is_successful_post = 1 THEN 1 END).
- The success rate is rounded with ROUND; multiplying by 1.0 guarantees the division is performed in floating point rather than truncated as integer division.
- Results are grouped with GROUP BY pu.user_type and sorted by post_success_rate in descending order with DESC.

Calculate the average duration of marketing campaigns, given three tables.
First, you have a campaign table with upsell_campaign_id, date_start, and date_end. Then, you have a user table with user_id, name, and is_eligible_for_upsell_campaign. Finally, there's a transaction table with transaction_id, user_id, product_id, transaction_date, and quantity.
Here is a sample solution:
SELECT AVG(DATEDIFF(date_end, date_start)) AS avg_campaign_length
FROM campaign;
- DATEDIFF computes the length of each campaign in days.
- AVG averages those durations across all campaigns.

Imagine you work at Netflix. Calculate the total viewership from laptops compared to mobile devices.

There are three types of devices that viewers use: phone, tablet, and laptop. Mobile viewership is computed by summing the tablet and phone viewership. You also have a viewership table with user_id, device_type, and view_time.
Here's a sample solution:
SELECT
SUM(CASE WHEN device_type = 'laptop' THEN 1 ELSE 0 END) AS laptop_views,
SUM(CASE WHEN device_type IN ('phone', 'tablet') THEN 1 ELSE 0 END) AS mobile_views
FROM
viewership;
- SUM with CASE expressions conditionally counts the viewership records for each device_type: laptop views in one column, and phone and tablet views combined as mobile views in the other.

Compute the average star rating for each product, grouped by month, given a reviews table with these columns: review_id, user_id, submit_date, product_id, and stars.
Display the month as a numerical value, the product ID, and the average star rating rounded to two decimal places. Sort the results by month first, then by product ID.
For example:
SELECT
EXTRACT(MONTH FROM submit_date) AS mth,
product_id,
ROUND(AVG(stars), 2) AS avg_stars
FROM
reviews
GROUP BY
EXTRACT(MONTH FROM submit_date), product_id
ORDER BY
mth, product_id;
- EXTRACT pulls the month from the submit_date column.
- ROUND computes the average to two decimal places.
- GROUP BY and ORDER BY group and sort the data by month and product ID.

You work for Amazon. The company likes to provide seasonal discounts. This year, electronics are discounted at 10%, clothing at 20%, groceries at 5%, and books at 15%.

You are given a products table with name, type, and price. Generate a discounted product catalog that displays the name, type, and discounted_price for each item. Round the discounted prices to the nearest cent.
SELECT
name,
type,
ROUND(price *
CASE
WHEN type = 'Electronic' THEN 0.90
WHEN type = 'Clothing' THEN 0.80
WHEN type = 'Grocery' THEN 0.95
WHEN type = 'Book' THEN 0.85
ELSE 1.00
END, 2) AS discounted_price
FROM
products;
- CASE applies the appropriate discount multiplier based on the product type.
- ROUND(..., 2) rounds the price to the nearest cent.
- SELECT picks the name, type, and discounted price.

You work for a bank like Capital One that wants to launch a new credit card. You are given a monthly_cards_issued table with the number of credit cards issued each month; it includes card_name, issued_amount, issue_month, and issue_year.
Output the name of each credit card and the difference in the total number of credit cards between the highest and lowest issuance month. Arrange the output based on the highest disparity.
SELECT
card_name,
MAX(issued_amount) - MIN(issued_amount) AS difference
FROM
monthly_cards_issued
GROUP BY
card_name
ORDER BY
difference DESC;
- MAX(issued_amount) - MIN(issued_amount) computes the difference between the highest and lowest issuance amounts for each card.
- GROUP BY card_name groups the results so the difference is calculated per card.
- ORDER BY difference DESC sorts the results by the difference in descending order.

Fetch the number of transactions made by users in specific cities. Then, order by descending number of transactions.

You're given two tables. First, a users table with id, first_name, last_name, user_city, and email. You also have a transactions table with id, customer_id, product_id, currency_code, date, and amount.
Here's a sample solution:
SELECT
u.user_city,
COUNT(t.id) AS number_of_transactions
FROM
users u
JOIN
transactions t ON u.id = t.customer_id
GROUP BY
u.user_city
ORDER BY
number_of_transactions DESC;
- The users table is joined with the transactions table on customer_id to get the transactions for each user.
- COUNT(t.id) counts the number of transactions for each user city.
- Results are grouped by user_city to aggregate the transaction counts per city, then sorted by number_of_transactions in descending order.

Data scientists use SQL to clean, transform, and preprocess data. As your data science career advances, so too will your SQL interviews.
Your roles will begin to require you to be able to process, manipulate, and summarize larger datasets.
For this reason, mid-level data scientists should expect to get asked SQL questions about:
- Window functions like ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), and LAG().
- CASE WHEN for conditional logic in queries.
- WITH clauses to define temporary result sets.
- UNION, INTERSECT, and EXCEPT to combine results from multiple queries.
- String functions like CONCAT(), SUBSTRING(), and REPLACE() (see the sketches after this list).
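String functions and set operators don't appear in the worked questions below, so here are two minimal sketches. The first assumes a hypothetical users table with first_name, last_name, email, and phone columns:

SELECT
 CONCAT(first_name, ' ', last_name) AS full_name, -- join strings together
 SUBSTRING(email, 1, 5) AS email_prefix,          -- first five characters of the email
 REPLACE(phone, '-', '') AS digits_only           -- strip dashes from the phone number
FROM users;

The second assumes two hypothetical tables, customers_2022 and customers_2023, each with an email column:

SELECT email FROM customers_2022
INTERSECT -- keep only emails that appear in both years
SELECT email FROM customers_2023;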
Here are some questions you can practice:
Given a table called purchase_orders, with quantity, price, and country columns, compute the total_revenue per country. Order the results by total revenue.
Here's a sample solution:
SELECT
 country,
 SUM(quantity * price) AS total_revenue
FROM purchase_orders
GROUP BY country
ORDER BY total_revenue DESC;
- quantity is multiplied by price for each row.
- SUM aggregates those row totals into the total revenue for each country.

Given a table of orders with a quantity column, categorize each order as "small" or "large" in a new size column using a CASE WHEN expression. Large orders have quantity >= 2.
SELECT
*,
CASE
WHEN quantity >= 2 THEN 'large'
ELSE 'small'
END AS size
FROM
orders;
- CASE WHEN works like an if-then-else statement: it evaluates one or more conditions and returns a value based on the first condition that matches.

Given a table of orders with timestamps in the order_date column, aggregate and return the number of orders per month.
SELECT
DATE(order_date, 'start of month') AS order_month,
COUNT(*) AS no_orders
FROM orders
GROUP BY order_month;
- In SQLite, DATE(column, 'start of month') truncates a date to the first day of its month.
- In PostgreSQL, the equivalent is DATE_TRUNC('month', column), as sketched below.
- EXTRACT(MONTH FROM column) is another option that returns the month as a number.
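For instance, here's a sketch of the same orders-per-month query written for PostgreSQL instead of SQLite:

SELECT
 DATE_TRUNC('month', order_date) AS order_month, -- truncate each timestamp to the first of its month
 COUNT(*) AS no_orders
FROM orders
GROUP BY order_month;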
Imagine you work at a company with monthly subscriptions. Given a table of users, compute the average retention based on the month of the signup_date and churn_date. Group the results by signup_month.
SELECT
 date_trunc('month', signup_date) AS signup_month,
 -- datediff with a unit argument is Snowflake/Databricks syntax;
 -- in PostgreSQL you could write AVG(churn_date - signup_date) instead
 AVG(datediff('day', signup_date, churn_date)) AS avg_retention
FROM users
GROUP BY signup_month;
- datediff computes the number of days between signup_date and churn_date, while date_trunc finds the signup month.

You're given tables for customers and orders. There's a foreign key customer_id in orders, which references customers.id. Return all orders with their associated customer data from March 2022.
Here's a sample solution:
SELECT orders.id AS order_id, customers.*
FROM orders
INNER JOIN customers
 ON orders.customer_id = customers.id
WHERE orders.order_date >= '2022-03-01'
 AND orders.order_date < '2022-04-01';
- orders.id AS order_id renames the id column from the orders table to order_id for clarity.
- customers.* selects all columns from the customers table.
- INNER JOIN joins the orders and customers tables, matching customer_id in the orders table to id in the customers table.
- The WHERE clause keeps orders placed on or after March 1 and before April 1, i.e., during March 2022.

Given customers and orders tables with id and customer_id columns, use subqueries to return all customers who have at least one order.
Here's a sample solution:
SELECT *
FROM customers
WHERE (SELECT COUNT(*)
FROM orders
WHERE customers.id = orders.customer_id) > 0;
- The correlated subquery SELECT COUNT(*) counts the number of orders for each customer.
- The outer WHERE clause keeps only customers with at least one order.

Alternatively:
SELECT *
FROM customers
WHERE id IN (SELECT customer_id FROM orders);
- SELECT customer_id FROM orders retrieves all customer_id values from the orders table.
- The IN condition keeps customers whose id matches any of the customer_id values returned by the subquery.
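A third variant, sketched here, uses EXISTS, which many engines can satisfy without counting every matching order:

SELECT *
FROM customers
WHERE EXISTS (
 SELECT 1
 FROM orders
 WHERE orders.customer_id = customers.id
);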
Solve the same problem as above. This time, use Common Table Expressions (CTEs) instead of subqueries.

CTEs break complex queries into more manageable parts. They are similar to subqueries, but a CTE is defined before the main query, while a subquery is defined within it. A CTE can also be referenced multiple times in a single query, whereas a subquery must be repeated in full each place it's used. Neither is universally more efficient: how well each is optimized depends on the database engine.
Here's a sample solution:
WITH customer_orders AS (
SELECT customer_id, COUNT(*) as num_orders
FROM orders
GROUP BY customer_id
)
SELECT customers.*
FROM customers
JOIN customer_orders
ON customers.id = customer_orders.customer_id
WHERE customer_orders.num_orders > 0;
- The WITH keyword defines the CTE, named customer_orders, and the AS keyword introduces the query that populates it.
- The CTE is then used in the main query like any other table: joining customers to customer_orders and filtering on num_orders > 0 finds customers with at least one order.

Given an orders table with price and quantity columns, write a query to return a list of orders ranked by total price.
A sample answer might look like:
SELECT
 id,
 customer_id,
 SUM(quantity * price) AS total_price,
 RANK() OVER (ORDER BY SUM(quantity * price) DESC) AS price_rank
FROM
 orders
GROUP BY
 id, customer_id;
- price and quantity are multiplied and summed to get the total price for each order.
- The ranking comes from the RANK window function. Its general syntax is: RANK() OVER ( [PARTITION BY partition_expression] [ORDER BY order_expression] )
- The PARTITION BY and ORDER BY clauses determine how the rows are partitioned and ordered before the RANK function processes them.
- RANK assigns the same rank when there is a tie and then skips the following rank values. DENSE_RANK doesn't skip rank values for ties: rows with the same value receive the same rank, and the next row gets the next consecutive rank.
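To make the difference concrete, here's a minimal sketch with made-up scores and a tie at 90:

SELECT
 score,
 RANK() OVER (ORDER BY score DESC) AS rnk,            -- 1, 1, 3: the rank after a tie is skipped
 DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk -- 1, 1, 2: no gap after the tie
FROM (VALUES (90), (90), (85)) AS t(score);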
Senior data scientists are trusted to handle extremely large datasets.
Large companies like Meta, Microsoft, TikTok, and Google need data scientists who can clean, synthesize, and interpret large amounts of information.
In combination with machine learning, senior scientists help to make sense of trends and make predictions for the future.
Before stepping into a senior SQL interview in your data science loops, consider brushing up on concepts like:
- Advanced window functions like RANK, DENSE_RANK, ROW_NUMBER, and LAG for complex calculations and analytics.
- Transaction control with COMMIT, ROLLBACK, and SAVEPOINT (see the sketch after this list).
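Transaction control doesn't come up in the questions below, so here's a minimal, hypothetical sketch (the accounts table and amounts are made up) of how the three statements fit together:

BEGIN;                                   -- start a transaction
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT transfer_midpoint;             -- mark a point we can roll back to
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
ROLLBACK TO SAVEPOINT transfer_midpoint; -- undo only the second update
COMMIT;                                  -- persist the first update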
Here are some questions you can practice:
Write a SQL query to fetch the top-earning employee by department, ordered by department name.
You've been given the following schema.
employees projects
+---------------+---------+ +---------------+---------+
| id | int |<----+ +->| id | int |
| first_name | varchar | | | | title | varchar |
| last_name | varchar | | | | start_date | date |
| salary | int | | | | end_date | date |
| department_id | int |--+ | | | budget | int |
+---------------+---------+ | | | +---------------+---------+
| | |
departments | | | employees_projects
+---------------+---------+ | | | +---------------+---------+
| id | int |<-+ | +--| project_id | int |
| name | varchar | +-----| employee_id | int |
+---------------+---------+ +---------------+---------+
Your answer should be in the following format:
department_name | employee_id | first_name | last_name | salary
----------------+-------------+------------+-----------+--------
varchar | int | varchar | varchar | int
Here's a sample solution:
WITH ranked_employees AS (
SELECT
e.id AS employee_id,
e.first_name,
e.last_name,
e.salary,
d.name AS department_name,
ROW_NUMBER() OVER (PARTITION BY e.department_id ORDER BY e.salary DESC) AS salary_rank
FROM
employees e
JOIN
departments d ON e.department_id = d.id
)
SELECT
department_name,
employee_id,
first_name,
last_name,
salary
FROM
ranked_employees
WHERE
salary_rank = 1
ORDER BY
department_name;
- The ranked_employees CTE ranks employees within each department by salary in descending order; the ROW_NUMBER() function assigns the rank.
- The employees table is joined with the departments table to get department_name, and the final result is ordered by department name.

Given a tweets table with tweet_id, user_id, msg, and tweet_date, group the users by the number of tweets they posted in 2022. Count the number of users in each group.
WITH tweet_cte AS (
SELECT user_id, COUNT(*) as tweet_bucket
FROM tweets
WHERE EXTRACT(year from tweet_date) = 2022
GROUP BY user_id
)
SELECT tweet_bucket, COUNT(*) as users_num
FROM tweet_cte
GROUP BY tweet_bucket;
- tweet_cte counts tweets per user for 2022, producing user_id and tweet_bucket.
- The outer query then counts the number of users in each tweet_bucket.

You have a table of Facebook user_actions. Obtain the number of monthly active users in July 2022, including the month in numerical format "1, 2, 3."

The user_actions table has user_id, event_id, event_type, and event_date.
Here's a sample solution:
WITH june_actions AS (
SELECT DISTINCT user_id
FROM user_actions
WHERE event_date BETWEEN '2022-06-01' AND '2022-06-30'
AND event_type IN ('sign-in', 'like', 'comment')
),
july_actions AS (
SELECT DISTINCT user_id
FROM user_actions
WHERE event_date BETWEEN '2022-07-01' AND '2022-07-31'
AND event_type IN ('sign-in', 'like', 'comment')
)
SELECT 7 AS month, COUNT(*) AS monthly_active_users
FROM july_actions
WHERE user_id IN (SELECT user_id FROM june_actions);
- The june_actions and july_actions CTEs select the distinct user_ids who performed relevant actions in June and July 2022, respectively.
- The final query keeps only the July users who also appear in june_actions, identifying users active in both months.

Given post and post_user tables, find the success rate of posts (%) when the user's previous post had failed.
The post table contains post_id, post_date, user_id, interface, and is_successful_post. The post_user table contains user_id, user_type, and age.

Your output should have the following columns: user_id and next_post_sc_rate (the success rate of a post when the user's previous post had failed). Order results by increasing next_post_sc_rate.
Here's a sample solution:
WITH post_seq AS (
SELECT
p.user_id,
p.post_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY post_date) AS post_seq_id,
is_successful_post
FROM post as p
),
post_pairings AS (
SELECT
ps.user_id,
ps.post_seq_id AS fail_post_id,
ps.post_seq_id + 1 AS next_post_id
FROM post_seq AS ps
WHERE ps.is_successful_post = 0
)
SELECT
 pp.user_id,
 ROUND(SUM(ps2.is_successful_post) * 1.0 / COUNT(ps2.is_successful_post), 2) AS next_post_sc_rate
FROM post_pairings AS pp
JOIN post_seq AS ps2
 ON pp.user_id = ps2.user_id
 AND pp.next_post_id = ps2.post_seq_id
GROUP BY pp.user_id
ORDER BY next_post_sc_rate ASC;
- The post_seq CTE assigns a sequential ID, post_seq_id, to each post per user based on post_date.
- The post_pairings CTE identifies the posts that follow an unsuccessful post: next_post_id is the sequential ID of the post immediately after the failed one. Because it's a per-user sequence number rather than a post_id, the final query joins back to post_seq on both user_id and post_seq_id.
- The final SELECT looks up the outcome of each following post. SUM() of successes divided by the total count gives the success rate, and ROUND(..., 2) rounds it to two decimal places.
- GROUP BY groups by user_id, and the results are ordered by next_post_sc_rate in ascending order.

Identify any payments made at the same merchant using the same credit card with the same amount within 10 minutes of each other. Such transactions should be considered as repeat payments.
The first transaction of such payments shouldn't be considered as a repeat payment. This means that if there are three transactions by a merchant with the same credit card, for the same amount within 10 minutes, only 2 payments will be considered as repeated.
The transactions table contains the following columns: transaction_id, merchant_id, credit_card_id, amount, and transaction_timestamp.
Here's a sample solution:
WITH PreviousTransactions AS (
SELECT
transaction_id,
merchant_id,
credit_card_id,
amount,
transaction_timestamp,
LAG(transaction_timestamp) OVER (
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp
) AS previous_transaction
FROM transactions
),
RepeatPayments AS (
SELECT
transaction_id,
merchant_id,
credit_card_id,
amount,
transaction_timestamp
FROM PreviousTransactions
WHERE previous_transaction IS NOT NULL
AND transaction_timestamp <= previous_transaction + INTERVAL '10 MINUTE'
)
SELECT
COUNT(*) AS payment_count
FROM RepeatPayments;
- The PreviousTransactions CTE uses LAG() to get the timestamp of the previous transaction within each partition of merchant_id, credit_card_id, and amount, ordered by transaction_timestamp so that LAG() correctly identifies the preceding transaction.
- RepeatPayments keeps transactions that have a valid previous_transaction and whose transaction_timestamp falls within 10 minutes of it.

Given orders, departments, and customers tables, identify the customers who placed the most orders for each of the last 5 years.
The orders table has order_id, customer_id, order_date, order_amount, and department_id. The departments table has department_id and department_name. The customers table has customer_id, first_name, and last_name.

Output a list of the 5 top customers, one per year. Your output should have the following columns: customer_id, first_name, last_name, and years.
WITH orders_per_year AS (
SELECT
c.customer_id,
c.first_name,
c.last_name,
EXTRACT(YEAR FROM o.order_date) AS years, -- PostgreSQL function for extracting the year
COUNT(o.order_id) AS total_orders,
rank() OVER (PARTITION BY EXTRACT(YEAR FROM o.order_date) ORDER BY COUNT(o.order_id) DESC) AS order_rankings
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE EXTRACT(YEAR FROM o.order_date) IN (2019, 2020, 2021, 2022, 2023)
GROUP BY EXTRACT(YEAR FROM o.order_date), c.customer_id, c.first_name, c.last_name
)
SELECT
 customer_id,
 first_name,
 last_name,
 years
FROM orders_per_year
WHERE order_rankings = 1
ORDER BY years ASC;
A cloud service provider like AWS wants to optimize the usage of its fleet of servers. Calculate the total time that the fleet of servers was running, in days. The total running time is the sum of each server's uptime.
A server can start and stop several times. The server_utilization table has the following columns: server_id, status_time, and session_status ('start' or 'stop').

The expected output is total_uptime_days.
WITH SessionDurations AS (
SELECT
server_id,
status_time AS start_time,
LEAD(status_time) OVER (PARTITION BY server_id ORDER BY status_time) AS stop_time,
session_status
FROM server_utilization
),
FilteredSessions AS (
SELECT
server_id,
start_time,
stop_time
FROM SessionDurations
WHERE session_status = 'start' AND stop_time IS NOT NULL
),
CalculatedDurations AS (
SELECT
server_id,
EXTRACT(EPOCH FROM (stop_time - start_time)) AS uptime_seconds
FROM FilteredSessions
)
SELECT
ROUND(SUM(uptime_seconds) / (24 * 60 * 60), 0) AS total_uptime_days
FROM
CalculatedDurations;
- In the SessionDurations CTE, LEAD pairs each status_time with the next status_time for the same server_id, so every 'start' row is matched with the stop time that follows it.
- In the FilteredSessions CTE, we keep rows where session_status = 'start' and stop_time IS NOT NULL.
- The CalculatedDurations CTE computes the duration of each session in seconds.
- The final SELECT sums uptime_seconds and converts the total to days by dividing by 24 * 60 * 60.
.You work for a leading game development company where players can team up and compete. Each player's performance in different game sessions is recorded as distinct score entries in the database.
You're provided a players table with player_id, player_name, and team_id columns, and a scores table with score_id, player_id, and game_score.
Return the top 2 players from each team based on their single highest score across all sessions. If multiple players share the same highest score, include all of them, which may result in more than two top players for some teams.
Here's an example solution:
WITH PlayerMaxScores AS (
SELECT
p.team_id,
p.player_name,
MAX(s.game_score) AS max_score
FROM
players p
JOIN
scores s ON p.player_id = s.player_id
GROUP BY
p.team_id, p.player_name
),
RankedPlayers AS (
SELECT
team_id,
player_name,
max_score,
DENSE_RANK() OVER (PARTITION BY team_id ORDER BY max_score DESC) AS score_rank
FROM
PlayerMaxScores
)
SELECT
team_id,
player_name,
max_score
FROM
RankedPlayers
WHERE
score_rank <= 2
ORDER BY
team_id, max_score DESC, player_name;
- The PlayerMaxScores CTE aggregates the maximum score for each player.
- The DENSE_RANK() window function in the RankedPlayers CTE ranks each player within their team by that maximum score; players with the same score receive the same rank.
- The final SELECT keeps the top two ranks from each team, which can return more than two players when scores tie.

It's impossible to cover every possible question since SQL is a broad and varied technology!
Hopefully, these questions have given you a glimpse into what to expect in your data science interviews.
Good luck with your upcoming interview!