Do you have an upcoming machine learning system design interview?
ML system design interviews typically last 45 minutes to 1 hour.
During the interview, an ML engineer will ask you to design a system from start to finish.
These questions assess your ability to consider real-world aspects of productionizing an ML model, such as efficiency, monitoring, preventing harmful model outputs, and building inference infrastructure.
They also test your ability to model a business problem as an ML problem.
ML system design interview questions are challenging because they require you to synthesize many ML concepts into a working solution.
You have the added pressure of working within a limited time frame.
A framework helps you stay focused, budget your time strategically, and communicate with the interviewer.
Begin your ML system design by defining the problem, setting interview parameters, and aligning with the interviewer.
This step gauges your ability to scope problems and identify system requirements.
Specify the model and datasets needed for your system:
💬 In the Spotify example, you could say:
"We are trying to build an ML-based recommender system on Spotify that recommends artists to users based on their liked playlists, songs, and artists.
The success of this system will depend on user engagement, which we'll define by the number of clicks. If a user clicks on a recommendation, that's a point for the algorithm. If they don't, we can agree it was a bad recommendation.
We can go deeper and assess the amount of time they engaged with the recommendation, but to keep things simple for now, let's go with just a click."
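To make that concrete, here's a minimal sketch of the click-based engagement metric. The function and field names are illustrative assumptions, not part of the original problem statement:

```python
def click_through_rate(recommendations: list[dict]) -> float:
    """Engagement metric: the fraction of recommendations that were clicked.

    Each recommendation dict is assumed to carry a boolean "clicked" field.
    """
    if not recommendations:
        return 0.0
    clicks = sum(1 for rec in recommendations if rec["clicked"])
    return clicks / len(recommendations)


# Example: 2 of 4 recommendations clicked -> CTR = 0.5
recs = [{"clicked": True}, {"clicked": False},
        {"clicked": True}, {"clicked": False}]
print(click_through_rate(recs))  # 0.5
```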
Identify requirements and potential tradeoffs. Consider:
💬 In the Spotify example, you could say:
"I have two clarifying questions:
We’ll assume that click data from users will be one data source. The other source will be user metadata, such as age or location.
Understanding the condition of the raw data helps us plan for what pipelines and transformations are needed to convert it into a usable format.
Let’s assume we get click data in a JSON serialized format.
These are usually events that come in and land in an object store. The user metadata is simpler, as it's available directly within the Postgres account table. However, we must remember that it is PII data, so it must be used carefully."
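As a sketch of what that raw data might look like, here's a hypothetical click event; the field names are assumptions for illustration, not a schema given in the problem:

```python
import json

# Hypothetical click event, serialized as JSON and landed in an object store.
raw_event = '''{
    "user_id": "u_123",
    "item_id": "artist_456",
    "event_type": "click",
    "timestamp": "2024-01-15T12:34:56Z"
}'''

event = json.loads(raw_event)
print(event["user_id"], event["item_id"])

# User metadata would come from the Postgres account table instead; fields
# like age and location are PII, so they must be handled carefully.
```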
Designing a data pipeline shows your interviewer that you understand the importance of high-quality data, not just high-quality algorithms.
Show your interviewer that you’re thinking about data quality:
💬 In the Spotify example, you could say:
"Having clarified the data conditions and sources in the previous step, we’re ready to design a data processing pipeline. We’ll use the above two points to create data processing pipelines and fetch what we need to make our features.
Then, we’ll access the raw click data and the Postgres table for the account information. Afterwards, we’ll create our features.
We must decide between a batch-based or real-time solution to collect and process the data. A batch-based system is usually easier to manage, whereas inferencing and training in real-time are compute-intensive and expensive.
It’s usually better to have at least one in batch, preferably the training (as this takes the most time). However, we can do inferencing in real time if needed.
Ideally, both training and inferencing would be in batches. Some serverless jobs would pull the latest recommendations stored by the batch job in a cache. This way, the recommendations are always available but refreshed every few hours.
For this scenario, we’ll use a batch-based system for both training and inferencing.
Since click data is coming in as JSON events and landing in an object store, we’ll design the data pipeline by creating an ETL Pipeline.
We'll create an abstracted data model to illustrate how we want our data to look before feeding it into the model. Generally, we want our features to be as independent of one another as possible, since strong correlations between features complicate the model.
After applying our feature engineering steps, we'll store all the resulting features in a new table and then write them to a feature store for model consumption."
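A minimal batch ETL sketch with pandas, assuming hypothetical column names and a Parquet file standing in for the feature store; a production pipeline would run this as a scheduled job:

```python
import json
import pandas as pd

def extract(event_lines: list[str]) -> pd.DataFrame:
    """Extract: parse raw JSON click events pulled from the object store."""
    return pd.DataFrame(json.loads(line) for line in event_lines)

def transform(clicks: pd.DataFrame, accounts: pd.DataFrame) -> pd.DataFrame:
    """Transform: join click events with account metadata and build features."""
    features = clicks.merge(accounts, on="user_id", how="left")
    # Illustrative engineered features: a per-user click count, and an age
    # bucket that replaces the raw PII age column.
    features["user_click_count"] = (
        features.groupby("user_id")["item_id"].transform("count")
    )
    features["age_group"] = pd.cut(
        features["age"], bins=[0, 18, 30, 50, 120],
        labels=["<18", "18-29", "30-49", "50+"]
    )
    return features.drop(columns=["age"])  # drop raw PII once bucketed

def load(features: pd.DataFrame) -> None:
    """Load: write the feature table out for model consumption."""
    features.to_parquet("features.parquet")  # stand-in for a feature store
```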
Once you've got your data, select and train a suitable ML model. In this step, you need to justify your model choice considering:
💬 In the Spotify example, you could say:
"Now that we’ve created a data pipeline, we’ll consider the types of models typically used for recommendation systems.
Traditionally, recommendation systems take advantage of data from other users to recommend items to new or even existing users. This is known as collaborative filtering, and it becomes a challenge when there is little data from other users (the cold-start problem).
Recommendation systems also increasingly combine deep learning with traditional supervised techniques like decision trees and gradient-boosted models such as XGBoost.
There’s a massive library of paths to choose from."
Identify suitable model architectures that meet the system requirements, like latency or memory optimization.
Potential architectures for a classification task include logistic regression, a complex neural network, or a search-optimized two-tower architecture.
For example, you might choose a simpler neural network to speed up training and inference, even if it costs some predictive accuracy.
💬 In the Spotify example, you could say:
"To satisfy our current use case, let’s start with a simple architecture. Assuming we have the required data, we’ll move forward with the collaborative filtering element. With music, trends are traditionally developed through mutual sharing between listeners.
The simplest model we can select creates a feature vector for each user, keyed by a unique user ID and comprising the user's features (age group, location, and maps of favorite artists and favorite songs).
We'll score each of these vectors between -1 and 1, consolidating each vector into a single number that represents the user and their preferences. We'll also score each item we might recommend between -1 and 1, based on its popularity and play count, so that users and items can be compared on the same scale (normalization).
We’ll then organize these scores for each user into a user-item matrix. Each user is on a row, and each item is on a column. We’ll then compute the product of each feature vector’s score with the recommended song’s score and set a threshold between -1 and 1.
Depending on how close the product is to 1, we'll decide whether to recommend that item to the user. If we want to give fewer, more targeted recommendations, we can set the threshold high, and vice versa.
Generally, starting with a low threshold is better to collect as much information as possible. Then, we can begin to pinpoint the optimal threshold value for future recommendations."
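A small numpy sketch of the scoring idea above, assuming user and item scores have already been consolidated and normalized to [-1, 1] (all values are illustrative):

```python
import numpy as np

# Consolidated per-user and per-item scores, each normalized to [-1, 1].
user_scores = np.array([0.8, -0.2, 0.5])        # 3 users
item_scores = np.array([0.9, 0.1, -0.6, 0.4])   # 4 items

# User-item matrix: the outer product gives one score per (user, item) pair.
scores = np.outer(user_scores, item_scores)

# Recommend an item when its score clears the threshold. A low threshold
# yields broad recommendations (more feedback to learn from); a high one
# yields fewer, more targeted recommendations.
threshold = 0.2
recommend = scores >= threshold
print(recommend)
```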
Select a model and decide on an optimizer, the metrics to monitor, and a hyperparameter tuning approach.
Your training plan might change depending on your hardware availability, the number of parallel training jobs, and how data and model parameters are distributed across devices.
For some tasks, you can fine-tune a pre-trained model instead of training from scratch.
💬 In the Spotify example, you could say:
"To create training inputs, we’ll take the process data, code non-numerical data, and featurize the rest. The training will produce a user-item matrix. This matrix will then create a probabilistic prediction to recommend an item to the user.
The user is then presented with these recommendations. The click data is collected as positive feedback if a user clicks on any recommendation. Any items that have been recommended that were not clicked will be considered negative feedback. The number of clicks over the total number of recommendations is considered the accuracy metric for the model."
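One way to make the training step concrete is a logistic matrix factorization sketch over implicit click feedback. The dimensions, learning rate, and random data below are illustrative assumptions, not part of the original design:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8

# Implicit feedback matrix: 1 = recommendation clicked, 0 = shown, not clicked.
feedback = rng.integers(0, 2, size=(n_users, n_items)).astype(float)

# Factorize into user and item embeddings; their product approximates the
# user-item matrix, and a sigmoid turns it into a click probability.
U = rng.normal(scale=0.1, size=(n_users, dim))
V = rng.normal(scale=0.1, size=(n_items, dim))

lr = 0.05
for _ in range(200):                      # plain full-batch gradient descent
    pred = 1 / (1 + np.exp(-U @ V.T))     # predicted click probabilities
    grad = pred - feedback                # gradient of the log loss
    U, V = U - lr * (grad @ V), V - lr * (grad.T @ U)

pred = 1 / (1 + np.exp(-U @ V.T))
print("fraction of feedback reproduced:", ((pred >= 0.5) == feedback).mean())
```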
Present your evaluation plan to your interviewer, considering where your model will be used and how an incorrect prediction could impact users.
Your evaluation standards will depend on the task and where the model is deployed. Discuss the pros and cons of your chosen evaluation metrics, such as how precision@k compares to NDCG@k in a ranking task.
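For instance, here is a small sketch contrasting the two (illustrative data): precision@k ignores where in the top k the hits occur, while NDCG@k rewards placing them earlier.

```python
import math

def precision_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return sum(item in relevant for item in recommended[:k]) / k

def ndcg_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Rank-aware metric: relevant items near the top count for more."""
    dcg = sum(1 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Both rankings hit 2 of 3 relevant items, so precision@3 is identical,
# but NDCG@3 is higher when the first hit appears earlier.
rel = {"a", "b", "c"}
print(precision_at_k(["a", "x", "b"], rel, 3))  # 0.667
print(precision_at_k(["x", "a", "b"], rel, 3))  # 0.667
print(ndcg_at_k(["a", "x", "b"], rel, 3))       # ~0.70
print(ndcg_at_k(["x", "a", "b"], rel, 3))       # ~0.53
```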
💬 In the Spotify example, you could say:
"Once we’ve established the accuracy metric, we’ll use the features for the positive and negative recommendations to see the difference. This difference will indicate if certain features played a more significant role in affecting user behavior versus the other.
This data can then be used to create a feature weighting algorithm that learns to get better at weighing features.
Consequently, the collaborative filtering algorithm will also improve."
Understanding how the components fit into the overall picture is crucial. Address these three key points: success metrics and A/B testing, compute and storage resources, and scaling to production traffic.
💬 In the Spotify example, you could say:
"The last step in this process is to understand when and how best to deploy our model into production.
First, we'll define the success metric we discussed earlier: engagement. Then we can run an A/B test for this model to understand whether it improves the user experience.
Second, we must understand the compute and storage resources needed to train, test, validate, and serve the model. Let's say we're using a cloud provider like AWS.
We can take advantage of AWS SageMaker (to house, train, and test the model), Lambda (to serve recommendation requests), and ElastiCache (to store the recommendations), returning them to the application via an API endpoint.
We can then auto-scale the resources to handle changing traffic volumes from the application."
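As a sketch of the serving path, here's a Lambda-style handler that reads precomputed recommendations from a Redis-compatible ElastiCache endpoint. The cache key scheme, environment variable, and event shape are assumptions for illustration:

```python
import json
import os

import redis  # redis-py; ElastiCache exposes a Redis-compatible endpoint

# Connection details are illustrative; in practice they'd come from config.
cache = redis.Redis(host=os.environ.get("CACHE_HOST", "localhost"), port=6379)

def handler(event, context):
    """Return precomputed recommendations for a user.

    The batch job is assumed to have written recommendations to the cache
    under a key like "recs:<user_id>", refreshed every few hours.
    """
    user_id = event["queryStringParameters"]["user_id"]
    cached = cache.get(f"recs:{user_id}")
    recommendations = json.loads(cached) if cached else []
    return {"statusCode": 200,
            "body": json.dumps({"recommendations": recommendations})}
```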
In the last few minutes of the interview, review the problem scope, the data processing pipeline, and how you would train, evaluate, and deploy the model.
If there's time, discuss the main bottlenecks and tradeoffs of your overall system design.
Ending with a high-level overview and additional considerations shows the interviewer you have a comprehensive understanding of the system and how to move your ML model into a production environment.
Once you’ve wrapped up, check in with your interviewer to see if they have follow-up questions.
💬 In the Spotify example, you could say:
"To recap, we’ve just designed a high-level system to recommend artists on Spotify.
We first identified our data sources as user metadata and click data. We then opted for a batch-based system to process the data, used a collaborative filtering model to score each user’s feature vectors, and collected click data to train the model.
We then discussed the factors affecting model deployment, such as engagement and compute and storage resources.
The other consideration to shed additional light on is post-production work. Machine learning is dynamic because incoming data changes constantly, which affects the model and its performance, so it's important to monitor the model and watch for data and feature drift.
Observing the model's performance is essential to ensuring we meet our metric. We can check model performance continually by observing the metric we're testing against (engagement, measured by clicks)."
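One lightweight way to watch for feature drift is a two-sample Kolmogorov-Smirnov test between the training and live distributions; the significance threshold and sample sizes below are assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray,
                    live_values: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """Flag drift when the live feature distribution differs significantly
    from the training distribution (two-sample Kolmogorov-Smirnov test)."""
    _, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # True -> distributions likely diverged

# Example: a simulated shift in a live feature triggers the alert.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
live = rng.normal(0.5, 1.0, size=5000)
print(feature_drifted(train, live))  # True
```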
These are common mistakes we see candidates make.
Rushing to a solution. Rather than jumping into the design, first analyze the specific problem you’re trying to solve by clarifying the system requirements, the situation's context, the data's scale, etc. Once you develop a baseline model, get the interviewer's input about what pieces to focus on.
Looking for the “right” answer. In most cases, there are no strictly right or wrong answers. Some are better justified than others, and your interviewer expects you to thoroughly justify your answers by explaining why you chose your design over possible alternatives.
Defaulting to state-of-the-art (SotA) models. It's important to check ML benchmark leaderboards to identify the current SotA models for a given task. However, remember that SotA models are often less efficient to train and run inference with (requiring more compute or data).
They're also usually evaluated only on academic benchmarks rather than in real-world settings. Practice building your own models and research other models to have a holistic understanding of the available options.
Overcomplicating the model. Many things can go wrong when training models, so start with a low-capacity, v1 solution. Once you have a v1 solution for the system that works on clean data, expand the model capacity to account for additional complexity (e.g., messy data and corner cases).
Starting with a basic model also budgets time for the interviewer to identify the pieces of the ML design they’d like you to focus on. Those hints show you can collaborate and incorporate feedback on your design.
Overlooking model evaluation and validation. Model selection is just one part of the problem, so budget time for the other steps.
Clarify how you’ll initially validate a model learned from some data (your strategy should involve quantitative and qualitative analysis), and discuss how continual validation will happen (e.g., using a metrics dashboard).
In your ML interviews, be prepared to answer a mix of behavioral, coding, conceptual, and system design questions.
Here are some real machine learning system design interview questions that other candidates have heard recently.
They include questions asked in FAANG and other top companies.
It is impossible to cover all the possible questions since machine learning system design is a broad and varied topic!
Hopefully, though, this guide has given you a sense of what to expect in your interviews.
Good luck with your upcoming machine learning interview!