Top 25 SQL Data Engineering Interview Questions

Data Engineer
Exponent TeamExponent TeamLast updated

SQL is a fundamental topic in data engineering interviews at all levels.

Below, we've compiled a list of example SQL interview questions indicative of the types of questions you can expect in real data engineering interviews.

We created this list with insights from data engineers, data scientists, and machine learning engineers at Meta, Google, Dropbox, Amazon, and top startups.

ℹ️
We focus on SQL for relational databases, but the syntax can be slightly adjusted for distributed systems such as Athena, which works on data on top of AWS S3. The same queries can be used with minimal changes in a Spark SQL environment.

Beginner SQL

Early in your data career, your SQL interviews will likely be focused on basic querying, syntax, and data retrieval, such as:

  • Using WHERE clauses to filter rows based on specific conditions.
  • Combining data from multiple tables using different types of joins such as INNER JOIN and LEFT JOIN.
  • Using functions like COUNT(), SUM(), AVG(), MIN(), and MAX() to aggregate.
  • Using GROUP BY to aggregate data into summary rows.
  • Using ORDER BY to sort query results.

Here are some examples of common SQL questions you might hear:

1. LinkedIn Job Applicants

You are given a LinkedIn table of applicants and their skills. You are tasked with finding candidates with experience in Python, Tableau, and PostgreSQL.

The table contains candidate_id and skill columns. 

SQL
SELECT candidate_id FROM candidates WHERE skill IN ('Python', 'Tableau', 'PostgreSQL');
  • SELECT the candidate_id where the skill is Python, Tableau, or PostgreSQL. 

2. Facebook Pages with No Likes

You are given two tables with Facebook Pages and their respective likes.

You are tasked with getting the pages with no likes.

The pages table has page_id and page_name columns, while the page_likes table has user_id, page_id, and liked_date columns.

SQL
SELECT page_id FROM pages WHERE page_id NOT IN (SELECT page_id FROM page_likes);
  • SELECT page_id that is not in the page_likes table. 

3. Tesla Unfinished Parts 

You are given a Tesla parts_assembly table with the following columns: part, finish_date, and assembly_step. You are tasked with finding parts that have begun the production process but are not yet finished. 

Assuming the unfished parts lack a finish_date, the answer is: 

SQL
SELECT part, assembly_step FROM parts_assembly WHERE finish_date IS NULL;

4. Average Post Hiatus Facebook 

You are given a Facebook table with users who posted at least twice in 2021.

Find the number of days between each user’s first post of the year and the last post in 2021.

The output should be the days between each user’s first and last posts. The posts table has the following columns: user_id, post_id, post_content, and post_date.

SQL
SELECT user_id, DATE_PART('day', MAX(post_date) - MIN(post_date)) AS days_between_posts FROM posts WHERE EXTRACT(YEAR FROM post_date) = 2021 GROUP BY user_id HAVING COUNT(*) >= 2;
  • The first step is to filter the posts to include only those made in 2021.
  • Group by user_id to calculate values for each group. 
  • MIN(post_date) gives the date of the first post and MAX(post_date) gives the date for the last post. 
  • DATE_PART('day', ...) calculates the days between the first and last posts. 
  • HAVING COUNT(*) >= 2 ensures that only users with at least 2 posts in 2021 are returned. 

5. Microsoft Teams Power Users 

Given a  Microsoft messages table with the following columns, message_idsender_idreceiver_idcontent, and sent_date, find the two power users who sent the most messages in Microsoft Teams in August 2022.

The output should be ordered in descending order based on the number of messages. 

SQL
SELECT sender_id, COUNT(sender_id) AS message_count FROM messages WHERE sent_date BETWEEN '08/01/2022' AND '08/31/2022' GROUP BY sender_id ORDER BY message_count DESC LIMIT 2;
  • SELECT sender_id and count of messages as message_count.
  • Set the sent_date between the first and last day of August 2022.
  • GROUP the results by sender_id.
  • Order in descending order of the message count. 
  • LIMIT to 2 results. 

6. LinkedIn Duplicate Job Postings

Find duplicate job listings in a LinkedIn job_listings table with the following columns: job_id, company_id, title, and description.

A duplicate has the same title and description. Return the number of companies that have posted duplicate jobs. 

SQL
SELECT COUNT(DISTINCT company_id) AS company_count FROM ( SELECT company_id FROM job_listings GROUP BY company_id, title, description HAVING COUNT(*) > 1 ) AS duplicates;
  • The inner query finds duplicate jobs by grouping them by company_id, title, and description.
  • COUNT(DISTINCT company_id) counts the number of companies with duplicate job listings. 

7. Cities With Completed Trades Robinhood

You are given two tables from Robinhood's stock trading app: trades and users.

The trades table has the following columns: order_id, user_id, quantity, status ('Completed', 'Cancelled'), date, and price.

The users table has user_id, city, email, and signup_date.

Write a query to find the three cities with the highest number of orders arranged in descending order. 

SQL
SELECT city, COUNT(t.order_id) as total_orders FROM trades t INNER JOIN users u ON t.user_id=u.user_id WHERE t.status = 'Completed' GROUP BY u.city ORDER BY total_orders DESC LIMIT 3;
  • Join the users table and the trades table on the user_id.
  • Count the number of orders for each city. 
  • Filter the results by the 'Completed’ status. 
  • Group by the city. 
  • Order by total orders in descending order.
  • Limit to 3 results.

8. Facebook Click-through Rate 

Given the events table for Facebook app analytics, calculate the app's click-through rate for 2022 in 2 decimals.

The events table has the following columns: app_id, event_type, and timestamp.

SQL
SELECT app_id, ROUND( (100.0 * COUNT(CASE WHEN event_type = 'click' THEN 1 END) / COUNT(CASE WHEN event_type = 'impression' THEN 1 END)), 2 ) AS ctr FROM events WHERE EXTRACT(YEAR FROM timestamp) = 2022 GROUP BY app_id;
  • Use CASE WHEN and COUNT to compute the ctr by finding the click and impression events. 
  • Find the entries only from 2022 using EXTRACT.
  • Group by app_id. 

9. Second-Day Confirmation TikTok

Given two TikTok tables, texts, and emails, find the users who confirmed their sign-up on the second day.

The texts table contains the following columns: text_id, email_id, signup_action ('Confirmed', 'Not confirmed'), and action_date.

The emails table has email_id, user_id, and signup_date columns.

SQL
SELECT user_id FROM emails e INNER JOIN texts t ON e.email_id = t.email_id WHERE t.action_date = e.signup_date + INTERVAL '1 day' AND t.signup_action = 'Confirmed';
  • Join the texts and emails table on the email_id.  
  • Filter by action_date and signup_date having an interval of 1 day and sign-up action being Confirmed. 

Intermediate SQL

As your data engineer career advances, your SQL interviews will evolve to include more complex data manipulation, transformation, and preprocessing tasks.

Expect to be tested on larger datasets and more advanced SQL concepts.

At this stage, you should be comfortable answering SQL questions involving:

  • Complex joins between multiple tables.
  • Writing nested and correlated subqueries.
  • Using window functions like ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), and LAG().
  • Implementing CASE WHEN for conditional logic within queries.
  • Manipulating dates and times for calculations.
  • Using WITH clauses to define temporary result sets.
  • Combining results from multiple queries using UNION, INTERSECT, and EXCEPT.
  • Handling strings with functions like CONCAT(), SUBSTRING(), and REPLACE().

Here are some intermediate-level SQL questions to help you practice:

10. Uber Transactions

You are given a table of Uber transactions with columns: user_id, spend, and transaction_date. Find the third transaction made by every user. The output should include the user_id, spend, and transaction_date.

SQL
SELECT user_id, spend, transaction_date FROM ( SELECT user_id, spend, transaction_date, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS rn FROM transactions ) AS ranked_transactions WHERE rn = 3;
  • Use ROW_NUMBER() to assign a row number (rn) to each transaction within each user_id, ordered by transaction_date.
  • Filter the results to include only rows where rn = 3, corresponding to each user's third transaction.

11. Second Highest Salary

Given an employee table with columns: employee_id, name, salary, department_id, and manager_id, find the second highest salary.

SQL
WITH ranked_salaries AS ( SELECT salary, RANK() OVER (ORDER BY salary DESC) AS rank FROM employee ) SELECT salary FROM ranked_salaries WHERE rank = 2;
  • Use a CTE (WITH clause) and the RANK() window function to rank employees by salary.
  • Select the salary with the second-highest rank (rank = 2).

12. Snapchat Opening vs. Sending Snaps

You are given two tables: activities with columns activity_id, user_id, activity_type ('send', 'open', 'chat'), time_spent, and activity_date, and age_breakdown with user_id and age_bucket. Calculate the percentage of time spent sending vs. opening snaps, grouped by age group, rounded to two decimal places.

SQL
SELECT ab.age_bucket, ROUND( 100.0 * SUM(CASE WHEN a.activity_type = 'send' THEN a.time_spent ELSE 0 END) / SUM(CASE WHEN a.activity_type IN ('send', 'open') THEN a.time_spent ELSE 0 END), 2 ) AS send_perc, ROUND( 100.0 * SUM(CASE WHEN a.activity_type = 'open' THEN a.time_spent ELSE 0 END) / SUM(CASE WHEN a.activity_type IN ('send', 'open') THEN a.time_spent ELSE 0 END), 2 ) AS open_perc FROM activities a JOIN age_breakdown ab ON a.user_id = ab.user_id WHERE a.activity_type IN ('send', 'open') GROUP BY ab.age_bucket;
  • Use JOIN to combine the activities and age_breakdown tables on user_id.
  • Use CASE statements to calculate the percentage of time spent on each activity type.
  • Round the results to two decimal places and group by age_bucket.

13. Tweets' Rolling Averages

You are given a tweets table with columns: user_id, tweet_date, and tweet_count. Calculate the 3-day rolling average of tweets for each user. The output should include user_id, tweet_date, and the rolling average rounded to two decimal places.

SQL
SELECT user_id, tweet_date, ROUND( AVG(tweet_count) OVER ( PARTITION BY user_id ORDER BY tweet_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ), 2 ) AS rolling_avg_3d FROM tweets ORDER BY user_id, tweet_date;
  • Use the AVG() window function to calculate the rolling average of tweet_count for each user.
  • PARTITION BY user_id ensures the rolling average is calculated separately for each user.
  • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW defines the window for the rolling average, considering the current row and the two previous rows.

14. Amazon Highest-Grossing Items

The product_spend table contains data about customers' spending in different Amazon categories.

You are tasked with finding the two highest-grossing items per category in 2022.

The table has the following columns: category, product, user_id, spend, and transaction_date. Output the category, product, and total_spend.

SQL
WITH ranking_cte AS ( SELECT category, product, SUM(spend) AS total_spend, RANK() OVER (PARTITION BY category ORDER BY SUM(spend) DESC) AS rank FROM product_spend WHERE EXTRACT(YEAR FROM transaction_date) = 2022 GROUP BY category, product ) SELECT category, product, total_spend FROM ranking_cte WHERE rank <= 2 ORDER BY category, rank;
  • Use a WITH clause to compute the ranking of each product within its category based on total spend.
  • Use RANK() to assign a rank, with the highest-spending product ranked first.
  • Filter for transactions in 2022 using the EXTRACT() function and select the top two products per category.

15. MAANG Top Three Salaries

You are given two tables: employee with columns employee_id, name, salary, department_id, and manager_id, and department with columns department_id and department_name.

Find the top three employees by salary in each department. Output the department_name, name, and salary.

SQL
WITH salary_cte AS ( SELECT d.department_name, e.name, e.salary, DENSE_RANK() OVER (PARTITION BY d.department_name ORDER BY e.salary DESC) AS rank FROM employee e INNER JOIN department d ON d.department_id = e.department_id ) SELECT department_name, name, salary FROM salary_cte WHERE rank <= 3;
  • Use DENSE_RANK() to rank employees by salary in each department, ensuring that ties don't create gaps in the ranking.
  • Join the employee and department tables on department_id.
  • Select the top three employees per department based on salary.

16. Spotify Top Five Artists

You are given three tables: artists (with columns artist_id, artist_name, label_owner), songs (with columns song_id, artist_id, name), and global_song_rank (with columns day, song_id, and rank).

Find the top five artists whose songs appear most frequently in the Top 10 of the global chart. Output the artist_name and artist_rank.

SQL
WITH songs_cte AS ( SELECT a.artist_name, DENSE_RANK() OVER (ORDER BY COUNT(s.song_id) DESC) AS artist_rank FROM artists a INNER JOIN songs s ON a.artist_id = s.artist_id INNER JOIN global_song_rank g ON g.song_id = s.song_id WHERE g.rank <= 10 GROUP BY a.artist_name ) SELECT artist_name, artist_rank FROM songs_cte WHERE artist_rank <= 5;
  • Join the three tables and compute the number of times each artist's songs appear in the Top 10.
  • Use DENSE_RANK() to rank the artists based on how frequently their songs appear in the Top 10.
  • Select the top five ranked artists.

17. TikTok Activation Rate

You are given emails and texts tables from TikTok. The emails table contains information about user sign-ups, while the texts table contains information about user activations. The columns are as follows: emails table (email_id, user_id, signup_date), and texts table (text_id, email_id, signup_action).

Calculate the activation rate of users, rounded to two decimal places.

SQL
SELECT ROUND( COUNT(DISTINCT CASE WHEN t.signup_action = 'Confirmed' THEN e.user_id END) * 1.0 / COUNT(DISTINCT e.user_id), 2 ) AS confirm_rate FROM emails e LEFT JOIN texts t ON e.email_id = t.email_id;
  • Use a LEFT JOIN to ensure all users from the emails table are included.
  • Count the number of users who confirmed their sign-up using COUNT(DISTINCT CASE WHEN ...).
  • Divide the number of confirmed users by the total number of unique users and round the result to two decimal places.

18. Microsoft Supercloud Customer

You are given two tables: customer_contracts (with columns customer_id, product_id, and amount) and products (with columns product_id, product_category, and product_name).

Find the customer IDs of customers who have purchased at least one product from every product category.

SQL
SELECT cc.customer_id FROM customer_contracts cc JOIN products p ON cc.product_id = p.product_id GROUP BY cc.customer_id HAVING COUNT(DISTINCT p.product_category) = (SELECT COUNT(DISTINCT product_category) FROM products);
  • Join the customer_contracts and products tables on product_id.
  • Group by customer_id and use HAVING to filter for customers who have purchased at least one product from each category.
  • Compare the number of distinct product categories purchased by each customer with the total number of unique categories available.

Advanced SQL

Senior data engineers are trusted to handle extremely large datasets.

Large companies like Meta, Microsoft, TikTok, and Google need data engineers who can clean, synthesize, and interpret large amounts of information.

In combination with machine learning, senior engineers help to make sense of trends and make predictions for the future.

Before stepping into a senior SQL interview in your data engineer loops, consider brushing up on concepts like:

  • Window functions like RANK, DENSE_RANK, ROW_NUMBER, and LAG for complex calculations and analytics.
  • Efficient queries and execution plans.
  • Using recursive CTEs for hierarchical or recursive data structures.
  • Transforming rows into columns with pivot tables.
  • Implementing intricate data transformations and aggregations.
  • Understanding and using transactions, including COMMIT, ROLLBACK, and SAVEPOINT.
  • Understanding normalization, indexing, and schema design principles.

Here are some questions you can practice:

19. Wayfair Year-on-Year Growth Rate

You are given the Wayfair user_transactions table with the following columns, transaction_id, product_id, spend, and transaction_date. Compute the year-on-year growth rate for the total spend of each product, grouping the results by product ID. Include the year in ascending order, product ID, current year's spend, previous year's spend, and year-on-year growth percentage, rounded to 2 decimal places.

SQL
WITH yearly_spend AS ( SELECT EXTRACT(YEAR FROM transaction_date) AS year, product_id, SUM(spend) AS curr_year_spend FROM user_transactions GROUP BY year, product_id ), spend_with_prev AS ( SELECT ys1.year, ys1.product_id, ys1.curr_year_spend, LAG(ys1.curr_year_spend) OVER (PARTITION BY ys1.product_id ORDER BY ys1.year) AS prev_year_spend FROM yearly_spend ys1 ) SELECT year, product_id, curr_year_spend, prev_year_spend, ROUND( CASE WHEN prev_year_spend IS NULL THEN NULL ELSE ((curr_year_spend - prev_year_spend) / prev_year_spend) * 100 END, 2) AS yoy_rate FROM spend_with_prev ORDER BY product_id, year;
  • Compute the total spend for each product by year.
  • Join the current year's spend with the previous year's spend to calculate the YoY growth rate.
  • Format the results to show the year, product ID, current year's spend, previous year's spending, and the YoY growth rate rounded to two decimal places.
  • yearly_spend CTE extracts the year from transaction_date and calculates the total spend (curr_year_spend) for each product by year.
  • spend_with_prev CTE uses the LAG() window function to retrieve the previous year's spend for each product. This function helps in comparing the current year's spend with the previous year's spend by partitioning the data by product_id and ordering it by year.
  • The final SELECT computes the YoY growth rate. The CASE statement ensures that if there is no previous year's spend, the YoY rate is shown as NULL. The ROUND() function rounds the YoY growth rate to 2 decimal places. Results are ordered by product_id and year in ascending order.

20. Maximize Amazon Prime Item Inventory

Write a query to find the maximum number of Amazon Prime and non-Prime batches that can be stored in Amazon’s 500,000 square feet warehouse, prioritizing Prime items.

You are given an inventory table with item_id, item_type, item_category, and square_footage columns. Return the item_type with Prime-eligible items listed first, followed by non-Prime items, along with the maximum number of batches that can be stocked.

SQL
WITH summary AS ( -- Summarize total square footage and item count for both prime_eligible and not_prime items SELECT item_type, SUM(square_footage) AS total_sqft, COUNT(*) AS item_count FROM inventory GROUP BY item_type ), prime_occupied_area AS ( -- Calculate the number of prime batches we can store and how much space they occupy SELECT item_type, total_sqft, FLOOR(500000 / total_sqft) AS prime_batch_count, FLOOR(500000 / total_sqft) * item_count AS prime_item_count FROM summary WHERE item_type = 'prime_eligible' ) -- Output the maximum number of prime and non-prime items that can be stored SELECT item_type, CASE -- For prime_eligible, calculate maximum items using the full available warehouse space WHEN item_type = 'prime_eligible' THEN (FLOOR(500000 / total_sqft) * item_count) -- For not_prime, calculate remaining space and determine maximum items fitting in that space WHEN item_type = 'not_prime' THEN FLOOR((500000 - (SELECT prime_batch_count * total_sqft FROM prime_occupied_area)) / total_sqft) * item_count END AS item_count FROM summary ORDER BY item_type DESC;
  • Summarize the total square footage and item count for both prime_eligible and not_prime items, grouping by item_type from the inventory table.
  • SUM(square_footage) calculates the total square footage taken by all items of a given item_type (e.g., total space used by all prime_eligible items).
  • prime_occupied_area CTE calculates the space occupied by Prime-eligible items and determines how many full Prime item batches can be stored in the warehouse, given the total warehouse size of 500,000 square feet.
  • In the main query, the results are calculated to get the maximum number of Prime and non-Prime items that can be stored in the warehouse after accounting for the space taken by the Prime items.
  • Sort the results of the query in descending order by the item_type.

21. Median Google Search Frequency

You are given a search_frequency table with the following columns: searches and num_users. Find the median number of searches made by a user last year.

SQL
WITH searches_expanded AS ( SELECT searches FROM search_frequency CROSS JOIN GENERATE_SERIES(1, num_users) ) SELECT ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY searches)::numeric, 1) AS median FROM searches_expanded;
  • The searches_expanded CTE implements CROSS JOIN GENERATE_SERIES(1, num_users) which generates a series of numbers from 1 to num_users (inclusive). For each unique value in the search_frequency table, it generates a row for every user.
  • PERCENTILE_CONT(0.5) computes the median of the searches column.
  • WITHIN GROUP (ORDER BY searches) specifies the ordering of the data to compute the percentile. It orders the searches column before calculating the median.
  • ::numeric casts the result of PERCENTILE_CONT to a numeric type because the ROUND() function expects a numeric type.

22. Facebook Advertiser Status

You are provided with advertiser and daily_pay tables from Facebook. The advertiser table has information about advertisers, while the daily_pay table has information about their payments. Write a query to update the payment status for Facebook advertisers using data from the daily_pay table. The results should display the user ID and their updated payment status, sorted by user ID.

Advertisers' payment statuses can be classified into these categories:

  • New: Advertisers who are newly registered and have completed their first payment.
  • Existing: Advertisers with a history of payments who have recently made an additional payment.
  • Churn: Advertisers who have previously made payments but have not made any recent payments.
  • Resurrect: Advertisers who had a break in payments but have recently made a new payment after an earlier payment.
SQL
SELECT COALESCE(advertiser.user_id, daily_pay.user_id) AS user_id, CASE WHEN daily_pay.paid IS NULL THEN 'CHURN' WHEN daily_pay.paid IS NOT NULL AND advertiser.status IN ('NEW', 'EXISTING', 'RESURRECT') THEN 'EXISTING' WHEN daily_pay.paid IS NOT NULL AND advertiser.status = 'CHURN' THEN 'RESURRECT' WHEN daily_pay.paid IS NOT NULL AND advertiser.status IS NULL THEN 'NEW' END AS new_status FROM advertiser FULL OUTER JOIN daily_pay ON advertiser.user_id = daily_pay.user_id ORDER BY user_id;
  • Combine advertisers and payments using a FULL OUTER JOIN on user_id to ensure the inclusion of all advertisers and payments, even if there are no corresponding matches in the other table.
  • Select the non-null user_id from either table using COALESCE.
  • If paid is NULL, the status is set to CHURN.
  • If there is a payment and the current status is one of NEW, EXISTING, or RESURRECT, the updated status is EXISTING.
  • If there is a payment and no recorded status (NULL), the updated status is NEW.
  • Sort the results by user_id.

23. McKinsey 3-Topping Pizzas

You are given a pizza_toppings table with the following columns: topping_name and ingredient_cost.

Write a query to find all possible three-topping combinations and their total cost, sorted by the highest total cost.

Ingredients should be listed in alphabetical order.

SQL
WITH pizza_combinations AS ( -- Generate combinations of 3 toppings SELECT t1.topping_name AS topping1, t2.topping_name AS topping2, t3.topping_name AS topping3, t1.ingredient_cost AS cost1, t2.ingredient_cost AS cost2, t3.ingredient_cost AS cost3 FROM pizza_toppings t1 JOIN pizza_toppings t2 ON t1.topping_name < t2.topping_name JOIN pizza_toppings t3 ON t2.topping_name < t3.topping_name ), pizza_costs AS ( -- Calculate total cost and format the pizza name SELECT CONCAT(topping1, ',', topping2, ',', topping3) AS pizza, (cost1 + cost2 + cost3) AS total_cost FROM pizza_combinations ) SELECT pizza, total_cost FROM pizza_costs ORDER BY total_cost DESC, pizza;
  • The pizza_combinations CTE performs a self-join to generate all possible unique combinations of three different toppings.
  • The toppings are listed in alphabetical order by using < to avoid repeating the same combination in a different order.
  • The pizza_costs CTE concatenates the toppings into a pizza name and calculates the total cost.
  • The final SELECT retrieves the pizza name and total cost, ordering results first by the highest total cost and then alphabetically by pizza name.

24. MAANG Department vs. Company Salary

You are given employee and salary tables. You are tasked with finding out how the average salary in each department compares to the company's overall average salary for March 2024.

Provide the comparison results as 'higher', 'lower', or 'same' for each department.

Include the department ID, payment month (formatted as MM-YYYY), and the comparison result in the output.

SQL
WITH company_avg AS ( -- Calculate the company's overall average salary for March 2024 SELECT AVG(amount) AS avg_salary FROM salary WHERE EXTRACT(MONTH FROM payment_date) = 3 AND EXTRACT(YEAR FROM payment_date) = 2024 ), department_avg AS ( -- Calculate the average salary per department for March 2024 SELECT e.department_id, AVG(s.amount) AS avg_salary FROM employee e JOIN salary s ON e.employee_id = s.employee_id WHERE EXTRACT(MONTH FROM s.payment_date) = 3 AND EXTRACT(YEAR FROM s.payment_date) = 2024 GROUP BY e.department_id ), comparison AS ( -- Compare each department's average salary with the company's overall average salary SELECT d.department_id, '03-2024' AS payment_date, CASE WHEN d.avg_salary &gt; c.avg_salary THEN 'higher' WHEN d.avg_salary &lt; c.avg_salary THEN 'lower' ELSE 'same' END AS comparison FROM department_avg d CROSS JOIN company_avg c ) SELECT department_id, payment_date, comparison FROM comparison ORDER BY department_id;
  • The company_avg CTE calculates the company's overall average salary for March 2024.
  • The department_avg CTE calculates the average salary per department for March 2024.
  • The comparison CTE compares each department's average salary with the company's average salary.
  • The CROSS JOIN ensures that each department's salary is compared against the company's overall average.
  • The CASE statement determines whether the department’s salary is higher, lower, or same compared to the company's average.

25. JPMorgan Chase Card Launch Success

You are given a monthly_cards_issued table from JPMorgan Chase. The table has the following columns: issue_month, issue_year, card_name, and issued_amount.

Find the name of the credit card and the number of cards issued during its launch month. The launch month is the earliest record in the monthly_cards_issued table.

The results should be ordered from the largest issued amount.

SQL
WITH ranked_cards AS ( -- Assign a rank to each card based on the issue date (earliest issue_month and issue_year) SELECT card_name, issue_month, issue_year, issued_amount, ROW_NUMBER() OVER (PARTITION BY card_name ORDER BY issue_year, issue_month) AS rn FROM monthly_cards_issued ) -- Select only the launch month (rank = 1) records SELECT card_name, issued_amount FROM ranked_cards WHERE rn = 1 ORDER BY issued_amount DESC;
  • The ranked_cards CTE assigns a row number to each card based on the issue date, using ROW_NUMBER() to rank the earliest issue dates.
  • PARTITION BY card_name ensures row numbers are assigned independently for each card.
  • The WHERE rn = 1 clause filters for the first (launch) month for each card.
  • The final SELECT retrieves the card name and the number of cards issued during the launch, ordered by the highest issued amount.

Performance Optimization Queries

Finally, let’s look at performance optimization queries. 

26. Partial Index on Orders

  • A partial index in SQL is built over a table subset based on a condition specified in a WHERE clause. This means the index only includes rows that satisfy a certain condition, unlike a full index, which includes every table row.
  • Partial indexes can improve query performance by reducing the index size and focusing on frequently queried data. This is especially useful in cases where only a small portion of the table is commonly queried or updated.
  • For example, consider an orders table with millions of rows, but only recent orders (within the last month) are queried frequently:
SQL
CREATE INDEX idx_recent_orders ON orders (customer_id, order_date) WHERE order_date &gt;= NOW() - INTERVAL '1 month';
  • In this case, instead of indexing every row in the orders table, the partial index idx_recent_orders only includes orders from the past month. This reduces the index size and improves query performance when searching for recent orders.

27. Analyze JPMorgan Chase Card Launch Success

Write a query to analyze the query execution in question 25 (see above).

  • This can be achieved using the EXPLAIN ANALYZE command.
SQL
EXPLAIN ANALYZE WITH ranked_cards AS ( -- Assign a rank to each card based on the issue date (earliest issue_month and issue_year) SELECT card_name, issue_month, issue_year, issued_amount, ROW_NUMBER() OVER (PARTITION BY card_name ORDER BY issue_year, issue_month) AS rn FROM monthly_cards_issued ) -- Select only the launch month (rank = 1) records SELECT card_name, issued_amount FROM ranked_cards WHERE rn = 1 -- Filter to only include the launch month (the earliest) ORDER BY issued_amount DESC; -- Order by issued amount in descending order

The output obtained is:

Sort (cost=21.30..21.31 rows=1 width=520) (actual time=0.047..0.048 rows=2 loops=1) Sort Key: ranked_cards.issued_amount DESC Sort Method: quicksort Memory: 25kB -&gt; Subquery Scan on ranked_cards (cost=16.39..21.29 rows=1 width=520) (actual time=0.037..0.044 rows=2 loops=1) Filter: (ranked_cards.rn = 1) Rows Removed by Filter: 7 -&gt; WindowAgg (cost=16.39..19.54 rows=140 width=536) (actual time=0.035..0.042 rows=9 loops=1) -&gt; Sort (cost=16.39..16.74 rows=140 width=528) (actual time=0.028..0.029 rows=9 loops=1) Sort Key: monthly_cards_issued_174.card_name, monthly_cards_issued_174.issue_year, monthly_cards_issued_174.issue_month Sort Method: quicksort Memory: 25kB -&gt; Seq Scan on monthly_cards_issued_174 (cost=0.00..11.40 rows=140 width=528) (actual time=0.014..0.016 rows=9 loops=1) Planning Time: 0.239 ms Execution Time: 0.079 ms

The output from the EXPLAIN ANALYZE query provides insight into how PostgreSQL executes the query, showing each step, the cost estimates, and actual execution times:

  • PostgreSQL used quicksort, a memory-efficient algorithm. It only needed 25kB of memory, indicating a small result set.
  • The sorting took 0.048 milliseconds to complete.
  • The ranked_cards sub-query took around 0.037 to 0.044 milliseconds.
  • The ROW_NUMBER() window function calculation took 0.035 to 0.042 milliseconds.
  • The time PostgreSQL took to generate the execution plan was 0.239 milliseconds.
  • The total time to execute the query was swift, at 0.079 milliseconds.
  • The query uses a Sequential Scan because no index exists on the columns. This isn’t a problem because the table is tiny (9 rows), but indexing and optimizing the sequential scan should be considered for larger datasets.

28. MAANG Department vs. Company Salary Query Optimization

Write an optimized version of the query in Question 24.

SQL
WITH -- Calculate the company's overall average salary for March 2024 company_avg AS ( SELECT AVG(amount) AS avg_salary FROM salary WHERE payment_date &gt;= '2024-03-01' AND payment_date &lt; '2024-04-01' ), -- Calculate the average salary per department for March 2024 department_avg AS ( SELECT e.department_id, AVG(s.amount) AS avg_salary FROM employee e JOIN salary s ON e.employee_id = s.employee_id WHERE s.payment_date &gt;= '2024-03-01' AND s.payment_date &lt; '2024-04-01' GROUP BY e.department_id ), -- Compare each department's average salary with the company's average salary comparison AS ( SELECT d.department_id, '03-2024' AS payment_date, CASE WHEN d.avg_salary &gt; c.avg_salary THEN 'higher' WHEN d.avg_salary &lt; c.avg_salary THEN 'lower' ELSE 'same' END AS comparison FROM department_avg d CROSS JOIN company_avg c ) -- Select the results SELECT department_id, payment_date, comparison FROM comparison ORDER BY department_id;
  • Replace EXTRACT(MONTH FROM payment_date) with Date Range Filtering, the database can use any potential index in the payment_date column.
  • The query can be optimized by creating indexes on payment_date, employee.employee_id, and salary.employee_id.

SQL Interview Tips

Hopefully, these questions have given you a glimpse into what to expect in your data engineering interviews.

Good luck with your upcoming interview!

Your Exponent membership awaits.

Exponent is the fastest-growing tech interview prep platform. Get free interview guides, insider tips, and courses.

Create your free account