SQL is a fundamental topic in data science interviews at all levels. Querying, manipulating, and interpreting data are key skills for any data scientist.
For junior roles, you will likely be tested on basic querying and aggregation concepts. For more senior interviews, you should be familiar with advanced concepts like window functions, CTEs, and complex data transformations.
Below, we've compiled a list of example SQL interview questions indicative of the types of questions you can expect in real data science interviews. We created this list with input from data analysts, data scientists, and machine learning engineers at Meta, Google, Dropbox, Amazon, and top consulting firms.
Each question includes a sample answer and explanation of the underlying SQL concepts.
This guide was written and compiled by Derrick Mwiti, a senior data scientist and course instructor.
For junior roles, expect WHERE, INNER JOIN, ORDER BY, GROUP BY, and aggregations like COUNT(), AVG(), and MAX(). More senior interviews add CASE WHEN, date and time manipulation, WITH clauses, UNION, INTERSECT, EXCEPT, and string functions like CONCAT() and REPLACE().

Early in your data career, your SQL interviews will likely be focused on basic querying, syntax, and data retrieval, such as:
- WHERE clauses to filter rows based on specific conditions.
- INNER JOIN and LEFT JOIN to combine tables.
- COUNT(), SUM(), AVG(), MIN(), and MAX() to aggregate values.
- GROUP BY to aggregate data into summary rows.
- ORDER BY to sort query results.

Here are some examples of common SQL questions you might hear:
You are given an employee table with the following columns: employee_id, name, salary, department_id, and manager_id.
Identify the employees who earn more than their direct managers. The expected output is: employee_id, employee_name.
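Before reading the sample answer, it can help to test a query like this against toy data. The sketch below runs the self-join in SQLite via Python's sqlite3 module; all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    employee_id INTEGER, name TEXT, salary INTEGER,
    department_id INTEGER, manager_id INTEGER)""")
# Made-up rows: Ada manages Ben and Cara; only Ben out-earns her.
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", [
    (1, "Ada",  90000, 10, None),
    (2, "Ben",  95000, 10, 1),
    (3, "Cara", 80000, 10, 1),
])

# Self-join: alias employee once as e (employees) and once as m (managers).
rows = conn.execute("""
    SELECT e.employee_id, e.name AS employee_name
    FROM employee e
    JOIN employee m ON e.manager_id = m.employee_id
    WHERE e.salary > m.salary
""").fetchall()
print(rows)  # [(2, 'Ben')]
```

Note that Ada drops out of the join entirely because her manager_id is NULL, which is exactly the behavior you want here.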
Here is a sample solution:
SELECT
e.employee_id,
e.name AS employee_name
FROM
employee e
JOIN
employee m ON e.manager_id = m.employee_id
WHERE
e.salary > m.salary;

- The employee table is aliased as e for employees and m for managers.
- The tables are self-joined on e.manager_id = m.employee_id to pair each employee with their manager.
- The WHERE clause e.salary > m.salary keeps only employees who earn more than their manager.
- employee_name and employee_id are retrieved using SELECT.

You work for Lyft, a ride-sharing company, and are tasked with finding the average number of unmatched bookings per user. The expected output is: user_id, user_name, email, and avg_unmatched_bookings (rounded to 2 decimal places).
You have a users table with user_id, user_name, email. You also have a bookings table with booking_id, user_id, driver_id, booking_time, status.
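A query like this can be checked against made-up rows in SQLite (via Python's sqlite3). In the sketch below, an ORDER BY is added only to make the output deterministic; the table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, user_name TEXT, email TEXT)")
conn.execute("""CREATE TABLE bookings (
    booking_id INTEGER, user_id INTEGER, driver_id INTEGER,
    booking_time TEXT, status TEXT)""")
# Made-up rows: amy has one unmatched and one completed booking; bo has none.
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(101, "amy", "amy@example.com"), (102, "bo", "bo@example.com")])
conn.executemany("INSERT INTO bookings VALUES (?, ?, ?, ?, ?)",
                 [(1, 101, 7, "2024-01-01 09:00", "Unmatched"),
                  (2, 101, 8, "2024-01-02 09:00", "Completed")])

# LEFT JOIN keeps user 102 even though they have no bookings.
rows = conn.execute("""
    SELECT u.user_id, u.user_name, u.email,
           COALESCE(ROUND(AVG(CASE WHEN b.status = 'Unmatched' THEN 1 ELSE 0 END), 2), 0)
               AS avg_unmatched_bookings
    FROM users u
    LEFT JOIN bookings b ON u.user_id = b.user_id
    GROUP BY u.user_id, u.user_name, u.email
    ORDER BY u.user_id  -- added for deterministic output
""").fetchall()
print(rows)  # amy averages 0.5; bo averages 0.0
```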
Here's a sample solution:
SELECT
u.user_id,
u.user_name,
u.email,
COALESCE(ROUND(AVG(CASE WHEN b.status = 'Unmatched' THEN 1 ELSE 0 END), 2), 0) AS avg_unmatched_bookings
FROM users u
LEFT JOIN bookings b ON u.user_id = b.user_id
GROUP BY u.user_id, u.user_name, u.email;

- LEFT JOIN combines the users table with the bookings table on the user_id column. This ensures that even users with no bookings are included in the result.
- A CASE statement within the AVG() function computes the average number of unmatched bookings for each user. Unmatched bookings are counted as 1, and 0 otherwise.
- The COALESCE function together with the ROUND function ensures there are no null values. The average is then rounded to two decimal places.

Show the total number of successful posts each user_type made in November 2023. Then, order your output by descending success rate.
Your output should include: user_type, post_success (number of successful posts), post_attempt (number of total posts), and post_success_rate (ranging from 0.00 to 1.00).
You've been given a post table with post_id, post_date, user_id, interface, and is_successful_post. You also have a post_user table with user_id, user_type, and age.
A sample solution looks like:
SELECT
pu.user_type,
COUNT(CASE WHEN p.is_successful_post = 1 THEN 1 END) AS post_success,
COUNT(p.post_id) AS post_attempt,
ROUND(COUNT(CASE WHEN p.is_successful_post = 1 THEN 1 END) * 1.0 / COUNT(p.post_id), 2) AS post_success_rate
FROM
post p
JOIN
post_user pu ON p.user_id = pu.user_id
WHERE
p.post_date >= '2023-11-01' AND p.post_date < '2023-12-01'
GROUP BY
pu.user_type
ORDER BY
post_success_rate DESC;

- The user_id column joins the post table with the post_user table to get the user type for each post.
- The WHERE clause filters posts from November 2023.
- Successful posts are counted with COUNT(CASE WHEN p.is_successful_post = 1 THEN 1 END).
- The success rate is rounded with ROUND. To guarantee that the division is performed in floating point, the numerator is multiplied by 1.0.
- Results are grouped with GROUP BY pu.user_type.
- The output is sorted by post_success_rate in descending order with DESC.

Calculate the average duration of marketing campaigns, given three tables.
First, you have a campaign table with an upsell_campaign_id, date_start, and date_end. Then, you have a user table with user_id, name, and is_eligible_for_upsell_campaign. Finally, there's a transaction table with transaction_id, user_id, product_id, transaction_date, and quantity.
Here is a sample solution:
SELECT AVG(DATEDIFF(date_end, date_start)) AS avg_campaign_length
FROM campaign;

- DATEDIFF computes each campaign's length in days from date_start to date_end.
- AVG averages those lengths across all campaigns. Note that only the campaign table is needed for this calculation.

Imagine you work at Netflix. Calculate the total viewership from laptops compared to mobile devices.
There are three types of devices that viewers use: phone, tablet, and laptop. Mobile viewership is computed by summing the tablet and phone viewership. You also have a viewership table with user_id, device_type, and view_time.
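The conditional-SUM pattern is easy to try on toy data. The sketch below (rows invented for illustration) runs it in SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE viewership (user_id INTEGER, device_type TEXT, view_time TEXT)")
# Made-up rows: 2 laptop views and 3 phone/tablet views.
conn.executemany("INSERT INTO viewership VALUES (?, ?, ?)", [
    (1, "laptop", "2024-01-01 20:00"),
    (2, "laptop", "2024-01-01 21:00"),
    (3, "phone",  "2024-01-02 08:00"),
    (4, "tablet", "2024-01-02 09:00"),
    (5, "phone",  "2024-01-02 10:00"),
])

laptop_views, mobile_views = conn.execute("""
    SELECT
        SUM(CASE WHEN device_type = 'laptop' THEN 1 ELSE 0 END) AS laptop_views,
        SUM(CASE WHEN device_type IN ('phone', 'tablet') THEN 1 ELSE 0 END) AS mobile_views
    FROM viewership
""").fetchone()
print(laptop_views, mobile_views)  # 2 3
```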
Here's a sample solution:
SELECT
SUM(CASE WHEN device_type = 'laptop' THEN 1 ELSE 0 END) AS laptop_views,
SUM(CASE WHEN device_type IN ('phone', 'tablet') THEN 1 ELSE 0 END) AS mobile_views
FROM
viewership;

- SUM with CASE statements conditionally counts the viewership records for each device_type: laptop views in one column, and phone plus tablet views as mobile in the other.

Compute the average star rating for each product, grouped by month, given a reviews table with these columns: review_id, user_id, submit_date, product_id, and stars.
Display the month as a numerical value, product ID, and average star rating rounded to two decimal places. Sort the results by month first, followed by product ID.
For example:
SELECT
EXTRACT(MONTH FROM submit_date) AS mth,
product_id,
ROUND(AVG(stars), 2) AS avg_stars
FROM
reviews
GROUP BY
EXTRACT(MONTH FROM submit_date), product_id
ORDER BY
mth, product_id;

- EXTRACT pulls the month from the submit_date column.
- ROUND(AVG(stars), 2) computes the average star rating to 2 decimal places.
- GROUP BY aggregates by month and product ID, and ORDER BY sorts the results in the same order.

You work for Amazon. The company likes to provide seasonal discounts. This year, electronics are discounted at 10%, clothing at 20%, groceries at 5%, and books at 15%.
You are given a products table with name, type, and price. Generate a discounted product catalog that displays the name, type, and discounted_price for each item. Round the discounted prices to the nearest cent.
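You can experiment with the CASE-based discount logic on a tiny catalog before writing your answer. The sketch below uses Python's sqlite3 with made-up products; the type values mirror the CASE branches in the sample solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, type TEXT, price REAL)")
# Made-up catalog rows for illustration.
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("TV",    "Electronic", 100.0),
    ("Shirt", "Clothing",    50.0),
    ("Rice",  "Grocery",     10.0),
    ("Novel", "Book",        20.0),
])

rows = conn.execute("""
    SELECT name, type,
           ROUND(price *
               CASE
                   WHEN type = 'Electronic' THEN 0.90
                   WHEN type = 'Clothing'   THEN 0.80
                   WHEN type = 'Grocery'    THEN 0.95
                   WHEN type = 'Book'       THEN 0.85
                   ELSE 1.00
               END, 2) AS discounted_price
    FROM products
""").fetchall()
print(rows)  # TV -> 90.0, Shirt -> 40.0, Rice -> 9.5, Novel -> 17.0
```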
SELECT
name,
type,
ROUND(price *
CASE
WHEN type = 'Electronic' THEN 0.90
WHEN type = 'Clothing' THEN 0.80
WHEN type = 'Grocery' THEN 0.95
WHEN type = 'Book' THEN 0.85
ELSE 1.00
END, 2) AS discounted_price
FROM
products;

- CASE applies the appropriate discount multiplier based on the product type.
- ROUND(..., 2) rounds the price to the nearest cent.
- SELECT returns the name, type, and discounted price.

You work for a bank like Capital One that wants to launch a new credit card. You are given a table with the number of credit cards issued each month that includes card_name, issued_amount, issue_month, and issue_year.
Output the name of each credit card and the difference in the total number of credit cards between the highest and lowest issuance month. Arrange the output based on the highest disparity.
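The MAX-minus-MIN pattern can be sanity-checked on invented issuance data. A quick SQLite sketch (all numbers made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE monthly_cards_issued (
    card_name TEXT, issued_amount INTEGER, issue_month INTEGER, issue_year INTEGER)""")
# Made-up rows: Alpha Card swings from 100 to 500; Beta Card from 150 to 200.
conn.executemany("INSERT INTO monthly_cards_issued VALUES (?, ?, ?, ?)", [
    ("Alpha Card", 500, 1, 2023),
    ("Alpha Card", 300, 2, 2023),
    ("Alpha Card", 100, 3, 2023),
    ("Beta Card",  200, 1, 2023),
    ("Beta Card",  150, 2, 2023),
])

rows = conn.execute("""
    SELECT card_name, MAX(issued_amount) - MIN(issued_amount) AS difference
    FROM monthly_cards_issued
    GROUP BY card_name
    ORDER BY difference DESC
""").fetchall()
print(rows)  # [('Alpha Card', 400), ('Beta Card', 50)]
```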
SELECT
card_name,
MAX(issued_amount) - MIN(issued_amount) AS difference
FROM
monthly_cards_issued
GROUP BY
card_name
ORDER BY
difference DESC;

- MAX(issued_amount) - MIN(issued_amount) computes the difference between the highest and lowest issuance amounts for each card.
- GROUP BY card_name groups the results by card name to calculate the difference for each card.
- ORDER BY difference DESC sorts the results by the difference in descending order.

Fetch the number of transactions made by users in specific cities. Then, order by descending number of transactions.
You're given 2 tables. First, a users table with id, first_name, last_name, user_city, and email. You also have a transactions table with id, customer_id, product_id, currency_code, date, and amount.
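To check a join-and-count query like this, you can load a few invented users and transactions into SQLite (via Python's sqlite3) and confirm the per-city counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER, first_name TEXT, last_name TEXT, user_city TEXT, email TEXT)""")
conn.execute("""CREATE TABLE transactions (
    id INTEGER, customer_id INTEGER, product_id INTEGER,
    currency_code TEXT, date TEXT, amount REAL)""")
# Made-up rows: NYC users make 3 transactions in total, LA users make 1.
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", [
    (1, "Ann", "A", "NYC", "ann@example.com"),
    (2, "Ben", "B", "NYC", "ben@example.com"),
    (3, "Cy",  "C", "LA",  "cy@example.com"),
])
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)", [
    (10, 1, 5, "USD", "2024-01-01", 20.0),
    (11, 1, 5, "USD", "2024-01-02", 35.0),
    (12, 2, 6, "USD", "2024-01-03", 10.0),
    (13, 3, 6, "USD", "2024-01-04", 15.0),
])

rows = conn.execute("""
    SELECT u.user_city, COUNT(t.id) AS number_of_transactions
    FROM users u
    JOIN transactions t ON u.id = t.customer_id
    GROUP BY u.user_city
    ORDER BY number_of_transactions DESC
""").fetchall()
print(rows)  # [('NYC', 3), ('LA', 1)]
```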
Here's a sample solution:
SELECT
u.user_city,
COUNT(t.id) AS number_of_transactions
FROM
users u
JOIN
transactions t ON u.id = t.customer_id
GROUP BY
u.user_city
ORDER BY
number_of_transactions DESC;

- The users table is joined with the transactions table on customer_id to get the transactions for each user.
- COUNT(t.id) counts the number of transactions for each user city.
- Results are grouped by user_city to aggregate the transaction counts per city.
- The output is sorted by number_of_transactions in descending order.

Data scientists use SQL to clean, transform, and preprocess data. As your data science career advances, so too will your SQL interviews.
Your roles will begin to require you to be able to process, manipulate, and summarize larger datasets.
For this reason, mid-level data scientists should expect to get asked SQL questions about:
- Window functions like ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), and LAG().
- CASE WHEN for conditional logic in queries.
- WITH clauses to define temporary result sets.
- UNION, INTERSECT, and EXCEPT to combine results from multiple queries.
- String functions like CONCAT(), SUBSTRING(), and REPLACE().

Here are some questions you can practice:
Given a table called purchase_orders, with quantity, price, and country columns, compute the total_revenue. Group the total revenue by country, then order by the total revenue.
Here's a sample solution:
SELECT country, SUM(quantity * price) AS total_revenue
FROM purchase_orders
GROUP BY country
ORDER BY total_revenue DESC;

- Revenue for each row is the product of the quantity and price columns.
- SUM adds those products per country to find the total revenue.

Given a table of orders with quantity, categorize each order as "small" or "large" with a new size column using CASE WHEN.
Large orders have quantity >= 2.
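The categorization can be tried on a few invented orders in SQLite before checking the sample answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, quantity INTEGER)")
# Made-up rows: quantities 1, 2, and 3.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, 2), (3, 3)])

rows = conn.execute("""
    SELECT *,
           CASE WHEN quantity >= 2 THEN 'large' ELSE 'small' END AS size
    FROM orders
""").fetchall()
print(rows)  # [(1, 1, 'small'), (2, 2, 'large'), (3, 3, 'large')]
```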
SELECT
*,
CASE
WHEN quantity >= 2 THEN 'large'
ELSE 'small'
END AS size
FROM
orders;

- CASE WHEN is similar to an if-then-else statement. It evaluates one or more conditions and returns a value based on the result.

Given a table of orders with timestamps in the order_date column, aggregate and return the number of orders per month.
SELECT
DATE(order_date, 'start of month') AS order_month,
COUNT(*) AS no_orders
FROM orders
GROUP BY order_month;

- In SQLite, DATE(column, 'start of month') truncates a timestamp to the first day of its month.
- In PostgreSQL, the equivalent is DATE_TRUNC('month', column), while EXTRACT(MONTH FROM column) returns just the month number.

Imagine you work at a company with monthly subscriptions. Given a table of users, compute the average retention based on the signup_date and churn_date. Group the results by signup_month.
SELECT
date_trunc('month', signup_date) AS signup_month,
AVG(datediff('day', signup_date, churn_date)) AS avg_retention
FROM users
GROUP BY signup_month;

- Average retention is the mean number of days between signup_date and churn_date, computed with datediff, while date_trunc finds the signup month for grouping.

You're given tables for customers and orders. There's a foreign key customer_id, which references customers.id. Return all orders with their associated customer data from March 2022.
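A sketch like the one below (toy rows, SQLite via Python's sqlite3) is a quick way to check the date filtering; it uses a half-open date range so an order placed exactly on April 1 isn't accidentally included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, order_date TEXT)")
# Made-up rows: only order 10 falls inside March 2022.
conn.execute("INSERT INTO customers VALUES (1, 'Ann')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (10, 1, "2022-03-15"),
    (11, 1, "2022-04-01"),   # boundary case: excluded by the half-open range
    (12, 1, "2022-02-20"),
])

rows = conn.execute("""
    SELECT orders.id AS order_id, customers.*
    FROM orders
    INNER JOIN customers ON orders.customer_id = customers.id
    WHERE orders.order_date >= '2022-03-01' AND orders.order_date < '2022-04-01'
""").fetchall()
print(rows)  # [(10, 1, 'Ann')]
```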
Here's a sample solution:
SELECT orders.id as order_id, customers.*
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.id
WHERE orders.order_date >= '2022-03-01' AND orders.order_date < '2022-04-01';

- orders.id AS order_id renames the id column from the orders table to order_id for clarity.
- customers.* selects all columns from the customers table.
- INNER JOIN joins the orders and customers tables. The join condition is that the customer_id in the orders table matches the id in the customers table.
- The WHERE clause keeps orders from March 2022. The half-open date range avoids accidentally including April 1, which BETWEEN '2022-03-01' AND '2022-04-01' would.

Given customers and orders tables with id and customer_id columns, use subqueries to return all customers who have at least one order.
Here's a sample solution:
SELECT *
FROM customers
WHERE (SELECT COUNT(*)
FROM orders
WHERE customers.id = orders.customer_id) > 0;

- The correlated subquery SELECT COUNT(*) counts the number of orders for each customer.
- The outer WHERE clause keeps only customers with at least one order.

Alternatively:
SELECT *
FROM customers
WHERE id IN (SELECT customer_id FROM orders);

- The subquery SELECT customer_id FROM orders retrieves all customer_id values from the orders table.
- The IN condition keeps customers whose id matches any of the customer_id values returned by the subquery.

Solve the same problem as above. This time, use Common Table Expressions (CTEs) instead of subqueries.
CTEs break complex queries into more manageable parts. They are similar to subqueries, but a CTE is defined before the main query, whereas a subquery is defined within it.
A CTE can be referenced multiple times in a single query, whereas a subquery must be repeated each time it's needed. Performance differences between the two depend on how a particular database engine optimizes them, so neither form is universally faster.
Here's a sample solution:
WITH customer_orders AS (
SELECT customer_id, COUNT(*) as num_orders
FROM orders
GROUP BY customer_id
)
SELECT customers.*
FROM customers
JOIN customer_orders
ON customers.id = customer_orders.customer_id
WHERE customer_orders.num_orders > 0;

- The WITH keyword defines the CTE, named customer_orders, and the AS keyword introduces the query that populates it.
- The CTE is then used in the main query like any other table: joining on it and filtering num_orders > 0 finds customers with at least one order.

Given an orders table with price and quantity columns, write a query to return a list of orders ranked by total price.
A sample answer might look like:
SELECT
orders.*,
SUM(quantity * price) AS total_price,
RANK() OVER (ORDER BY SUM(quantity * price) DESC) AS rank
FROM
orders
GROUP BY
id, customer_id;

- Each order's total price is its quantity multiplied by its price, summed per order.
- The ranking comes from the RANK window function. Its general syntax is:

RANK() OVER ( [PARTITION BY partition_expression] [ORDER BY order_expression] )

- The PARTITION BY and ORDER BY clauses determine how the rows are partitioned and ordered for processing by the RANK function.
- RANK assigns the same rank when there is a tie, then skips the following rank values. DENSE_RANK also gives tied rows the same rank, but the next row gets the next consecutive rank instead of skipping values.

Senior data scientists are trusted to handle extremely large datasets.
Large companies like Meta, Microsoft, TikTok, and Google need data scientists who can clean, synthesize, and interpret large amounts of information.
In combination with machine learning, senior scientists help to make sense of trends and make predictions for the future.
Before stepping into a senior SQL interview in your data science loops, consider brushing up on concepts like:
- Window functions like RANK, DENSE_RANK, ROW_NUMBER, and LAG for complex calculations and analytics.
- Transaction control with COMMIT, ROLLBACK, and SAVEPOINT.

Here are some questions you can practice:
Write a SQL query to fetch the top-earning employee by department, ordered by department name.
You've been given the following schema.
employees projects
+---------------+---------+ +---------------+---------+
| id | int |<----+ +->| id | int |
| first_name | varchar | | | | title | varchar |
| last_name | varchar | | | | start_date | date |
| salary | int | | | | end_date | date |
| department_id | int |--+ | | | budget | int |
+---------------+---------+ | | | +---------------+---------+
| | |
departments | | | employees_projects
+---------------+---------+ | | | +---------------+---------+
| id | int |<-+ | +--| project_id | int |
| name | varchar | +-----| employee_id | int |
+---------------+---------+ +---------------+---------+

Your answer should be in the following format:
department_name | employee_id | first_name | last_name | salary
----------------+-------------+------------+-----------+--------
varchar | int | varchar | varchar | int

Here's a sample solution:
WITH ranked_employees AS (
SELECT
e.id AS employee_id,
e.first_name,
e.last_name,
e.salary,
d.name AS department_name,
ROW_NUMBER() OVER (PARTITION BY e.department_id ORDER BY e.salary DESC) AS rank
FROM
employees e
JOIN
departments d ON e.department_id = d.id
)
SELECT
department_name,
employee_id,
first_name,
last_name,
salary
FROM
ranked_employees
WHERE
rank = 1
ORDER BY
department_name;

- The ranked_employees CTE ranks employees within each department by salary in descending order. The ROW_NUMBER() function assigns a rank to each employee within their department.
- The employees table is joined with the departments table to get department_name. The outer query keeps only rank = 1 (the top earner per department) and orders the result by department name.

Given a tweets table with tweet_id, user_id, msg, and tweet_date, group the users by the number of tweets they posted in 2022. Count the number of users in each group.
WITH tweet_cte AS (
SELECT user_id, COUNT(*) as tweet_bucket
FROM tweets
WHERE EXTRACT(year from tweet_date) = 2022
GROUP BY user_id
)
SELECT tweet_bucket, COUNT(*) as users_num
FROM tweet_cte
GROUP BY tweet_bucket;

- The tweet_cte counts tweets per user for 2022, producing user_id and tweet_bucket.
- The outer query counts how many users fall into each tweet_bucket.

You have a table of Facebook user_actions. Obtain the number of monthly active users in July 2022 (users who performed an action in July and were also active in June), including the month in numerical format "1, 2, 3."
The user_actions table has user_id, event_id, event_type, and event_date.
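The two-CTE approach can be exercised on made-up rows in SQLite (via Python's sqlite3). With one user active in both months, the count should come out to 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_actions (
    user_id INTEGER, event_id INTEGER, event_type TEXT, event_date TEXT)""")
# Made-up rows: user 100 is active in June and July; 101 only June; 102 only July.
conn.executemany("INSERT INTO user_actions VALUES (?, ?, ?, ?)", [
    (100, 1, "sign-in", "2022-06-05"),
    (101, 2, "like",    "2022-06-10"),
    (100, 3, "comment", "2022-07-03"),
    (102, 4, "like",    "2022-07-08"),
])

row = conn.execute("""
    WITH june_actions AS (
        SELECT DISTINCT user_id FROM user_actions
        WHERE event_date BETWEEN '2022-06-01' AND '2022-06-30'
          AND event_type IN ('sign-in', 'like', 'comment')
    ),
    july_actions AS (
        SELECT DISTINCT user_id FROM user_actions
        WHERE event_date BETWEEN '2022-07-01' AND '2022-07-31'
          AND event_type IN ('sign-in', 'like', 'comment')
    )
    SELECT 7 AS month, COUNT(*) AS monthly_active_users
    FROM july_actions
    WHERE user_id IN (SELECT user_id FROM june_actions)
""").fetchone()
print(row)  # (7, 1)
```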
Here's a sample solution:
WITH june_actions AS (
SELECT DISTINCT user_id
FROM user_actions
WHERE event_date BETWEEN '2022-06-01' AND '2022-06-30'
AND event_type IN ('sign-in', 'like', 'comment')
),
july_actions AS (
SELECT DISTINCT user_id
FROM user_actions
WHERE event_date BETWEEN '2022-07-01' AND '2022-07-31'
AND event_type IN ('sign-in', 'like', 'comment')
)
SELECT 7 AS month, COUNT(*) AS monthly_active_users
FROM july_actions
WHERE user_id IN (SELECT user_id FROM june_actions);

- The june_actions and july_actions CTEs select distinct user_ids who performed relevant actions in June and July 2022, respectively.
- The final query counts the July users who also appear in june_actions, thus identifying users active in both months.

Given post and post_user tables, find the success rate of posts (%) when the user's previous post had failed.
The post table contains post_id, post_date, user_id, interface, and is_successful_post. The post_user table contains user_id, user_type, and age.
Your output should have the following columns: user_id and next_post_sc_rate (success rate of the post when the user’s previous post had failed). Order results by increasing next_post_sc_rate.
Here's a sample solution:
WITH post_seq AS (
SELECT
p.user_id,
p.post_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY post_date) AS post_seq_id,
is_successful_post
FROM post as p
),
post_pairings AS (
SELECT
ps.user_id,
ps.post_seq_id AS fail_post_id,
ps.post_seq_id + 1 AS next_post_id
FROM post_seq AS ps
WHERE ps.is_successful_post = 0
)
SELECT
pp.user_id,
ROUND(SUM(p2.is_successful_post) * 1.0 / COUNT(p2.is_successful_post), 2) AS next_post_sc_rate
FROM post_pairings AS pp
JOIN post_seq AS p2
ON pp.user_id = p2.user_id AND pp.next_post_id = p2.post_seq_id
GROUP BY 1
ORDER BY next_post_sc_rate ASC;

- The post_seq CTE assigns a sequential ID, post_seq_id, to each post per user based on the post_date.
- The post_pairings CTE identifies the position following each unsuccessful post: next_post_id is the sequential ID of the post that comes after the failed one.
- The final SELECT joins each failure's next position back to the posts to fetch the following post, then computes the success rate: SUM() over COUNT() gives the ratio of successful next posts to the total number of next posts.
- ROUND(..., 2) rounds the success rate to two decimal places.
- GROUP BY groups by user_id, and the results are ordered by next_post_sc_rate in ascending order.

Identify any payments made at the same merchant using the same credit card with the same amount within 10 minutes of each other. Such transactions should be considered as repeat payments.
The first transaction of such payments shouldn't be considered as a repeat payment. This means that if there are three transactions by a merchant with the same credit card, for the same amount within 10 minutes, only 2 payments will be considered as repeated.
The transactions table contains the following columns: transaction_id, merchant_id, credit_card_id, amount, and transaction_timestamp.
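The LAG() approach can be checked against invented transactions. SQLite doesn't support INTERVAL arithmetic, so the sketch below (via Python's sqlite3) uses the datetime() modifier '+10 minutes' as a stand-in; the overall logic is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    transaction_id INTEGER, merchant_id INTEGER, credit_card_id INTEGER,
    amount INTEGER, transaction_timestamp TEXT)""")
# Made-up rows: transactions 1 and 2 are 8 minutes apart (a repeat);
# transaction 3 is 20 minutes after 2 (not a repeat); 4 is a different merchant.
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 100, "2022-09-25 12:00:00"),
    (2, 1, 1, 100, "2022-09-25 12:08:00"),
    (3, 1, 1, 100, "2022-09-25 12:28:00"),
    (4, 2, 9, 300, "2022-09-25 13:00:00"),
])

payment_count = conn.execute("""
    WITH PreviousTransactions AS (
        SELECT transaction_timestamp,
               LAG(transaction_timestamp) OVER (
                   PARTITION BY merchant_id, credit_card_id, amount
                   ORDER BY transaction_timestamp
               ) AS previous_transaction
        FROM transactions
    )
    SELECT COUNT(*)
    FROM PreviousTransactions
    WHERE previous_transaction IS NOT NULL
      AND transaction_timestamp <= datetime(previous_transaction, '+10 minutes')
""").fetchone()[0]
print(payment_count)  # 1
```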
Here's a sample solution:
WITH PreviousTransactions AS (
SELECT
transaction_id,
merchant_id,
credit_card_id,
amount,
transaction_timestamp,
LAG(transaction_timestamp) OVER (
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp
) AS previous_transaction
FROM transactions
),
RepeatPayments AS (
SELECT
transaction_id,
merchant_id,
credit_card_id,
amount,
transaction_timestamp
FROM PreviousTransactions
WHERE previous_transaction IS NOT NULL
AND transaction_timestamp <= previous_transaction + INTERVAL '10 MINUTE'
)
SELECT
COUNT(*) AS payment_count
FROM RepeatPayments;

- The PreviousTransactions CTE uses LAG() to get the timestamp of the previous transaction within each partition of merchant_id, credit_card_id, and amount. Ordering by transaction_timestamp ensures that LAG() correctly identifies the previous transaction.
- RepeatPayments filters for transactions that have a valid previous_transaction and whose transaction_timestamp falls within 10 minutes of the previous_transaction.

Given orders, departments, and customers tables, identify the customers who placed the most orders for each of the last 5 years.
The orders table has order_id, customer_id, order_date, order_amount, and department_id. The departments table has department_id and department_name. And the customers table has customer_id, first_name, and last_name.
Output a list of the 5 top customers. Your output should have the following columns: customer_id, first_name, last_name, and years.
WITH orders_per_year AS (
SELECT
c.customer_id,
c.first_name,
c.last_name,
EXTRACT(YEAR FROM o.order_date) AS years, -- PostgreSQL function for extracting the year
COUNT(o.order_id) AS total_orders,
rank() OVER (PARTITION BY EXTRACT(YEAR FROM o.order_date) ORDER BY COUNT(o.order_id) DESC) AS order_rankings
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE EXTRACT(YEAR FROM o.order_date) IN (2019, 2020, 2021, 2022, 2023)
GROUP BY EXTRACT(YEAR FROM o.order_date), c.customer_id, c.first_name, c.last_name
)
SELECT
customer_id,
first_name,
last_name,
years
FROM orders_per_year
WHERE order_rankings = 1
ORDER BY years ASC;

- The orders_per_year CTE counts each customer's orders per year and ranks customers within each year using rank() OVER (PARTITION BY ... ORDER BY COUNT(...) DESC).
- The outer query keeps order_rankings = 1, returning the top customer for each of the five years.

A cloud service provider like AWS wants to optimize the usage of its fleet of servers. Calculate the total time, in days, that the fleet of servers was running. The total time a server fleet is running is the sum of each server's uptime.
A server can start and stop several times. The server_utilization table has the following columns: server_id, status_time, and session_status (start, stop).
The expected output is total_uptime_days.
WITH SessionDurations AS (
SELECT
server_id,
status_time AS start_time,
LEAD(status_time) OVER (PARTITION BY server_id ORDER BY status_time) AS stop_time,
session_status
FROM server_utilization
),
FilteredSessions AS (
SELECT
server_id,
start_time,
stop_time
FROM SessionDurations
WHERE session_status = 'start' AND stop_time IS NOT NULL
),
CalculatedDurations AS (
SELECT
server_id,
EXTRACT(EPOCH FROM (stop_time - start_time)) AS uptime_seconds
FROM FilteredSessions
)
SELECT
ROUND(SUM(uptime_seconds) / (24 * 60 * 60), 0) AS total_uptime_days
FROM
CalculatedDurations;

- In the SessionDurations CTE, LEAD pairs each status time with the next status time for the same server_id, so each 'start' row gets the following 'stop' time.
- In the FilteredSessions CTE, we keep rows where session_status = 'start' and stop_time is NOT NULL.
- The CalculatedDurations CTE gets the duration of each session in seconds.
- The final SELECT sums uptime_seconds and converts the total to days by dividing by 24 * 60 * 60.

You work for a leading game development company where players can team up and compete. Each player's performance in different game sessions is recorded as distinct score entries in the database.
You're provided a players table with player_id, player_name, and team_id columns and a scores table with score_id, player_id, and game_score.
Return the top 2 players from each team based on their single highest score across all sessions. If multiple players share the same highest score, include all of them, which may result in more than two top players for some teams.
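A tie-handling query like this is worth testing on toy data. The sketch below runs the two-CTE, DENSE_RANK() approach in SQLite via Python's sqlite3; the rank column is aliased rnk because rank can be reserved in some dialects, and the invented scores include a tie so that three players survive the rank <= 2 filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (player_id INTEGER, player_name TEXT, team_id INTEGER)")
conn.execute("CREATE TABLE scores (score_id INTEGER, player_id INTEGER, game_score INTEGER)")
# Made-up rows: Alice's best is 100; Bob and Cy tie at 90.
conn.executemany("INSERT INTO players VALUES (?, ?, ?)",
                 [(1, "Alice", 1), (2, "Bob", 1), (3, "Cy", 1)])
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)",
                 [(1, 1, 100), (2, 1, 80), (3, 2, 90), (4, 3, 90)])

rows = conn.execute("""
    WITH PlayerMaxScores AS (
        SELECT p.team_id, p.player_name, MAX(s.game_score) AS max_score
        FROM players p
        JOIN scores s ON p.player_id = s.player_id
        GROUP BY p.team_id, p.player_name
    ),
    RankedPlayers AS (
        SELECT team_id, player_name, max_score,
               DENSE_RANK() OVER (PARTITION BY team_id ORDER BY max_score DESC) AS rnk
        FROM PlayerMaxScores
    )
    SELECT team_id, player_name, max_score
    FROM RankedPlayers
    WHERE rnk <= 2
    ORDER BY team_id, max_score DESC, player_name
""").fetchall()
print(rows)  # [(1, 'Alice', 100), (1, 'Bob', 90), (1, 'Cy', 90)]
```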
Here's an example solution:
WITH PlayerMaxScores AS (
SELECT
p.team_id,
p.player_name,
MAX(s.game_score) AS max_score
FROM
players p
JOIN
scores s ON p.player_id = s.player_id
GROUP BY
p.team_id, p.player_name
),
RankedPlayers AS (
SELECT
team_id,
player_name,
max_score,
DENSE_RANK() OVER (PARTITION BY team_id ORDER BY max_score DESC) AS rank
FROM
PlayerMaxScores
)
SELECT
team_id,
player_name,
max_score
FROM
RankedPlayers
WHERE
rank <= 2
ORDER BY
team_id, max_score DESC, player_name;

- The PlayerMaxScores CTE aggregates the maximum score for each player.
- The DENSE_RANK() window function in the RankedPlayers CTE assigns a rank to each player within their team based on their maximum score. DENSE_RANK() ensures that players with the same score get the same rank.
- The final SELECT keeps the top two ranks from each team, which may return more than two players when scores tie.

It is impossible to cover every possible question since SQL is a broad and varied technology!
Hopefully, these questions have given you a glimpse into what to expect in your data science interviews.
Good luck with your upcoming interview!
Exponent is the fastest-growing tech interview prep platform. Get free interview guides, insider tips, and courses.