Career Nav #68 – How to Prepare for a Machine Learning Interview
Suhyun Kim, Co-founder of DSPractice, shares her talk, “How to Prepare for a Machine Learning Interview.” She discusses differences in using data in research versus production, career paths using machine learning, questions to prepare for interviews in these positions and resources for online practice.
I am a co-founder of DSPractice, a website that helps you prepare for ML coding interview questions. Previously, I worked as a machine learning engineer at Amazon Alexa. Before that, I was at Columbia University, studying for my Master's in Computer Science with an ML focus. I was also a mobile engineer, making Android applications for Nickelodeon.
The way ML is done in research is very different from the way it's done in production, and the kinds of interview questions they ask you are very different too. When it comes to requirements, in research, state-of-the-art model performance is really important. You want to make sure that you run the models against benchmark data sets. In production, the requirements differ from one product to the next. Oftentimes, a simple model might perform much better because you tend to have a lot of data.
Another thing is computational priority. In research, you want to train as fast as possible. In production, inference speed is much more important because you want to ensure customers have a low-latency application experience. When it comes to data, research data tends to be static because you're using open source data. In production, the data distribution is constantly shifting, so you want to monitor the shifts in the data. Fairness and interpretability of the model are also very important when the model is in production. You want to be able to debug how your model performed and check whether the model's decisions were fair and interpretable, so that you can improve your future models.
Sometimes people use ML engineers interchangeably with data scientists, so I want to be clear about which workflow we're covering and the kind of jobs we are talking about. It starts from data ingestion and the data pipeline. You're getting data from a web application, the cloud, a data warehouse or any other data store. You explore, validate, wrangle and clean your data, and then the data is split into train and test sets. From there, you're going into model development and model engineering, with different versions of the models. You want to evaluate how well your model is performing, and then comes model packaging. You want to have different versions of the models depending on where the model is running: maybe you want your model to run in the browser, on edge devices or just in the cloud.
The kind of model packaging depends on where the model is going to be deployed. We have regular software code as well as the machine learning pipeline, meaning the model development pipeline and the data pipeline, with something like feature stores to make sure the data is being ingested into the models. Most importantly, we need monitoring and logging tools. They can be attached to the app performance so that you can indirectly measure the model performance, they can be implemented directly on the model, or they can be implemented on both pipelines. That way you have the right feedback loop and can monitor how the software is performing.
The ML jobs that I'm going to talk about involve every single stage. They usually involve a large amount of data, drift in data (meaning the data distribution changes) and continuous model updates. Different companies might use different terminology. Machine learning engineers, or applied scientists as they're called at Amazon, are usually people who are working on the models or the model pipelines. ML platform engineers are working directly on creating the platform that I just described. Then there are machine learning researchers. They're doing research in ML, so their main focus is publication instead of making a product out of ML. Given the data, they might not make an ML model out of it, but they might just do the data analysis on what kind of model should be developed. Sometimes people refer to data scientists as machine learning engineers too. These are people who are working on the data pipeline. There are also prompt engineers, who are trying to hack ChatGPT.
To prepare for ML interviews, there are five different types of questions you can prepare for: data structures, machine learning algorithms, ML system design, SQL and research. Data structures are the de facto questions that you should always prepare for in any software-related job. Research-related questions are usually for PhDs, although for MLE jobs they might ask you about state-of-the-art research models. Data structures are something that we're very familiar with as software engineers, and this kind of interview sets the tone for the rest of the interview types. You want to start with arrays and hashing, and the topics that follow are in the order of how they're related to each other. If you know how to solve array questions, it's going to help you with two pointers. The two pointers questions will help you solve binary search, sliding window and linked list questions, and those will help you with tree questions. Trees are going to be useful in backtracking because every backtracking question can be turned into a tree. Backtracking is going to help you with graph questions and with dynamic programming. With dynamic programming, you can start with backtracking and add memoization to make it more efficient. From trees you can also move on to tries, because tries are extended versions of trees, and to heaps and priority queues, which use trees.
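As a minimal sketch of the step from backtracking to dynamic programming, assume a hypothetical staircase problem (not one named in the talk): count the ways to climb `n` stairs taking one or two steps at a time. The function names are illustrative only.

```python
from functools import lru_cache


def count_ways_backtracking(n: int) -> int:
    """Count the ways to climb n stairs using steps of 1 or 2 (plain backtracking)."""
    if n < 0:   # overshot the top: this branch of the choice tree is a dead end
        return 0
    if n == 0:  # landed exactly on the top: one valid path found
        return 1
    # Each call branches into two children, so the calls form a binary tree.
    return count_ways_backtracking(n - 1) + count_ways_backtracking(n - 2)


@lru_cache(maxsize=None)
def count_ways_memo(n: int) -> int:
    """Same recursion, but each subproblem is solved once and then cached."""
    if n < 0:
        return 0
    if n == 0:
        return 1
    return count_ways_memo(n - 1) + count_ways_memo(n - 2)
```

Both functions return the same counts. The plain version revisits the same subproblems many times, while the memoized version computes each value of `n` only once, which is the efficiency gain described above.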
I spend about 10 minutes coming up with edge cases and brainstorming solutions. The next 10 minutes I focus on writing code, then the last 10 minutes I work on a test case. Number one, and this step is valid for every single interview question type: start by asking questions instead of going straight into coding or into drawing a diagram for system design questions. Ask questions and come up with edge cases. Usually edge cases involve the input being null, the input size being one or many, duplicates, negative numbers or zero. When it comes to sorting numbers, you always want to consider an increasing sequence, a decreasing sequence and a zig-zag sequence, meaning one that alternates between increasing and decreasing. You also want to provide obvious solutions first. Suggest something like library functions.
Try to turn the problem into a different one by changing the data structure that represents the given input. That's going to help you look at the problem in a different way. I like to work through a small example set and see if there is a pattern, to figure out what kind of data structures or algorithms I can apply. Come up with at least two solutions. It's good to come up with as many solutions as possible and discuss their time complexity and space complexity, then you can choose the most efficient one and implement it. I like to spend about 10 minutes on steps one to five and then discuss with the interviewer which solution I should implement. Then, in the next 10 minutes, write the inputs and outputs of the function, calling out what specific values you want to return from it.
Start with something like input checks. What if I get a null value? What's the base case? Then, continue writing code. Next, go through a simple example. This step is important because ideally, you want to catch your own mistakes during this step. It's okay if you made a mistake earlier in your code, and if you catch it yourself, it's a huge plus. Ideally, all of those edge cases that you came up with in step one are small enough that you can go through them by hand first. I spend 10 minutes on coding, but the remaining 20 minutes are communication. Communication matters more than solving a problem or coming up with an "ideal" solution.
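The routine above (state inputs and outputs, start with input checks, then walk small edge cases by hand) might look like the following sketch. The two-sum problem and the `EDGE_CASES` table are assumptions chosen for illustration; they are not taken from the talk.

```python
from typing import Dict, List, Optional, Tuple


def two_sum(nums: Optional[List[int]], target: int) -> Optional[Tuple[int, int]]:
    """Return indices (i, j) with i < j and nums[i] + nums[j] == target, else None."""
    # Input checks first: null input, or too few elements to form a pair.
    if nums is None or len(nums) < 2:
        return None
    seen: Dict[int, int] = {}  # value -> index of its first occurrence
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return (i, j)
        seen.setdefault(value, j)
    return None


# Edge cases small enough to trace by hand: (input, target, expected result).
EDGE_CASES = [
    (None, 3, None),              # null input
    ([], 0, None),                # empty input
    ([5], 5, None),               # size one
    ([3, 3], 6, (0, 1)),          # duplicates
    ([-2, 0, 7], -2, (0, 1)),     # negative numbers and zero
    ([1, 2, 3, 4], 7, (2, 3)),    # increasing sequence
    ([4, 3, 2, 1], 3, (2, 3)),    # decreasing sequence
    ([1, 5, 2, 6], 8, (2, 3)),    # zig-zag sequence
]

for nums, target, expected in EDGE_CASES:
    assert two_sum(nums, target) == expected
```

Writing the edge cases down as a table before coding makes the final walk-through quick, and it gives the interviewer something concrete to discuss.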
The next type of question is ML coding questions. This is what makes ML interviews special. As software engineers, we don't really have to prepare for this type of question. An example is implementing K-Nearest Neighbors for binary classification. When you're given such a question, you want to apply a strategy similar to the one for software engineering interviews. Number one, ask clarifying questions. What are the edge cases? What kind of data are we talking about? Are we talking about images or text? Are we talking about regular scalar values? Talk about your knowledge of different distance functions. There is Euclidean distance, cosine similarity and Manhattan distance. Mention all of them, to let them know that you are knowledgeable in implementing KNN. Ask what kind of business problem they want to solve with KNN, and which distance function would be useful for the purposes of the interview. The second example is to implement and fit a linear regression model. Ask detailed questions about their business objectives, because the inputs and the data engineering required for a regression model depend completely on what kind of business problem you're trying to solve. Mention that you know the importance of background knowledge when it comes to building ML models.
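A minimal sketch of the K-Nearest Neighbors question, in plain Python with a pluggable distance function, is shown below. The function names, the 0/1 label encoding and the tie-breaking rule are assumptions made for illustration, and cosine similarity is converted to a distance as one minus the similarity.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; defined here as 1.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(
    train_x: List[Vector],
    train_y: List[int],
    query: Vector,
    k: int = 3,
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> int:
    """Predict a binary label (0 or 1) by majority vote among the k nearest points."""
    # Edge cases: empty training set, mismatched lengths, non-positive k.
    if not train_x or len(train_x) != len(train_y):
        raise ValueError("train_x and train_y must be non-empty and the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, len(train_x))  # cannot use more neighbors than there are points
    nearest = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie (possible with even k), the label of the closest tied neighbor wins.
    return max(votes, key=votes.get)
```

For example, with points clustered near the origin labeled 0 and points near (5, 5) labeled 1, a query at (0.5, 0.5) is classified as 0 under either Euclidean or Manhattan distance. An odd `k` avoids ties for binary labels.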
The third type of example question is loading data from a CSV file, cleaning and transforming it, and saving it into a new CSV. They might ask you about augmenting the features, and they might ask you about dimensionality reduction. You can talk about some numerical ways of approaching it, or you can talk about some software engineering ways of approaching it. These are the types of questions that you can actually see on dspractice.com. I also host a bi-weekly Meetup for ML tech interview prep, and you can attend my meetups too. I also recommend a course from Coursera, Applied Machine Learning in Python.
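The load, clean, transform and save question could be sketched with the standard-library `csv` module as follows. The column names (`name`, `height_cm`, `weight_kg`) and the derived `bmi` feature are hypothetical, chosen only to make the example concrete.

```python
import csv


def clean_csv(src_path: str, dst_path: str) -> int:
    """Copy src to dst, dropping unusable rows and adding a derived feature.

    Returns the number of rows written.
    """
    written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["name", "height_m", "weight_kg", "bmi"])
        writer.writeheader()
        for row in reader:
            try:
                name = (row.get("name") or "").strip()
                height_m = float(row["height_cm"]) / 100.0  # transform: cm -> m
                weight_kg = float(row["weight_kg"])
            except (KeyError, TypeError, ValueError):
                continue  # clean: skip rows with missing or non-numeric values
            if not name or height_m <= 0:
                continue  # clean: skip rows that would give a meaningless result
            writer.writerow({
                "name": name,
                "height_m": height_m,
                "weight_kg": weight_kg,
                "bmi": round(weight_kg / height_m ** 2, 1),  # augment: new feature
            })
            written += 1
    return written
```

In an interview it is worth saying out loud which rows get dropped and why, since silently discarding data is itself a design decision.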
The next one is ML system design. An example question would be implementing an online and offline training pipeline. When you're given a question like that, it's actually better to ask them what exactly you are trying to build. The rubric for such questions is: Have you described the problem exactly, with inputs and outputs? Have you figured out the throughput and latency for the system? Have you described the design with enough clarity and detail that the interviewer knows what you're going to build? Have you provided reasonable metrics so people know exactly what you want to measure? Have you described continuous model updates in detail? Continuous model updates are important. They are what makes machine learning operations different from regular software engineering DevOps. In regular DevOps, you don't necessarily update the code as often as model updates happen. Most importantly, a model can be gigabytes or maybe terabytes in size, and then there is also the data, which is going to be much bigger, and there are usually multiple models. How are you going to update them? How often are you going to update them? How are you going to detect data drift through monitoring? What kind of things are you going to look at to decide that it is time to retrain the models?
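One way to make "detect data drift through monitoring" concrete is to compare the distribution of a feature at training time with its distribution in live traffic. The sketch below uses the Population Stability Index, a technique swapped in here for illustration and not named in the talk. The 0.2 retraining threshold is a common rule of thumb and should be treated as an assumption.

```python
import math
from typing import List, Sequence


def _bin_shares(values: Sequence[float], lo: float, width: float, bins: int) -> List[float]:
    """Share of values falling in each of `bins` equal-width bins starting at `lo`."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
    eps = 1e-6  # smoothing so that empty bins do not produce log(0)
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def population_stability_index(
    reference: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training-time sample and a live sample of one feature."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant reference feature: fall back to a unit bin width
    ref = _bin_shares(reference, lo, width, bins)
    cur = _bin_shares(live, lo, width, bins)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_retraining(
    reference: Sequence[float], live: Sequence[float], threshold: float = 0.2
) -> bool:
    """Flag the feature when its PSI exceeds the (assumed) threshold."""
    return population_stability_index(reference, live) > threshold
```

In a real system this check would run on a schedule for each monitored feature, and an alert or retraining job would be triggered when the flag is raised.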
Have you described the fault tolerance of the system? Have you described the pros and cons of your current system? The first thing when it comes to ML system design is that you want to calculate latency and throughput. Just like in any other system design question, you want to calculate your metrics. In ML settings, there is a lot of batching. Sometimes people batch their queries. With single queries, if the latency is 10 milliseconds, the throughput is a hundred queries per second, and when the latency increases, the throughput actually decreases. With batched queries, as the latency increases, the throughput increases too. That's because if the data is available in a data warehouse and you batch it together using a tool like MapReduce or Spark, you can increase the throughput. The way MapReduce works is that the system splits up the data across different machines and each machine starts running the user-provided map program.
The map program takes some data and emits key-value pairs. The system then shuffles them around, reorganizing the data so that all key-value pairs associated with a given key go to the same machine to be processed by reduce. Reduce processes the data on that machine and emits a new key and value, so reduce is doing the work. The result might be fed back into the reduce program for more reducing, but usually the process ends after one pass from map to reduce. You're taking advantage of distributed systems and the clusters that you have. Instead of having one pipeline that's handling a single query, you're processing a lot of data at the same time.
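The map, shuffle and reduce phases described above can be imitated in a single process. The sketch below is a toy word count: each inner list stands in for the data held by one machine, and all function names are illustrative. It is not how a real MapReduce or Spark cluster is programmed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_words(line: str) -> Iterable[Tuple[str, int]]:
    """User-provided map program: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """User-provided reduce program: sum all values that share a key."""
    return key, sum(values)


def run_mapreduce(
    chunks: List[List[str]],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    # Map phase: every "machine" runs the mapper over its own chunk of the data.
    emitted: List[Tuple[str, int]] = []
    for chunk in chunks:
        for record in chunk:
            emitted.extend(mapper(record))
    # Shuffle phase: group the pairs so each key's values end up in one place.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    # Reduce phase: the reducer sees all the values for one key at a time.
    return dict(reducer(key, values) for key, values in grouped.items())
```

Calling `run_mapreduce([["a b a"], ["b c"]], map_words, reduce_counts)` groups the pairs emitted from both chunks by word before summing them, which mirrors the shuffle step that routes each key to a single machine.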
The way batch prediction works is this: because of batching, the latency is pretty high, and you want to make sure that everything stays within milliseconds for the UX, so predictions are usually made in advance from batch features. Batch features come from the data warehouse. A repository for storing structured data is called a data warehouse. A repository for storing unstructured data is called a data lake. Data lakes are usually used to store raw data before processing. Data warehouses are used to store data that has been processed into formats ready to be used. The prediction results are also stored back into the data warehouse, and the app gets its prediction results from there.
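A toy sketch of this flow follows: a slow batch job scores every user in advance, the results are written to a table, and the app only performs a fast lookup at request time. Plain dictionaries stand in for the data warehouse tables, and the model, user IDs and function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

FeatureTable = Dict[str, List[float]]   # stand-in for batch features in the warehouse
PredictionTable = Dict[str, float]      # stand-in for stored prediction results


def run_batch_job(features: FeatureTable, model: Callable[[List[float]], float]) -> PredictionTable:
    """Offline step: score every row in advance; high latency is acceptable here."""
    return {user_id: model(row) for user_id, row in features.items()}


def serve_prediction(
    predictions: PredictionTable, user_id: str, default: Optional[float] = None
) -> Optional[float]:
    """Online step: a lookup only, so the app stays within its latency budget."""
    return predictions.get(user_id, default)


def toy_model(row: List[float]) -> float:
    """Hypothetical model: the mean of the features."""
    return sum(row) / len(row)
```

The lookup returns the default for a user who was not in the last batch run, which shows the limitation of this approach: predictions only exist for inputs seen by the batch job and are only as fresh as its last run.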
The model can use the historical data. The problem with this is that it's really hard to update continuously. To do continuous updates of the models, there are multiple strategies. Number one is shadowing. All these principles are actually traditional software engineering principles, but they apply to model development as well. Shadow deployment means deploying the candidate model in parallel with the existing model. Each incoming request goes to both models to make predictions, but only the existing model's prediction is served to the user. You have two models deployed, and you check that the new model makes similar or pretty much the same predictions, to make sure that it doesn't perform badly compared to the previous one. When you think the model's ready, you can start deploying the new model. Other strategies are A/B testing of the model and interleaving experiments.
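The shadow deployment described above can be sketched as a small router. The `ShadowRouter` class, its method names and the agreement metric are assumptions for illustration; a production setup would typically log to a monitoring system instead of an in-memory list.

```python
from typing import Any, Callable, List, Tuple


class ShadowRouter:
    """Send each request to both models, but serve only the existing model's output."""

    def __init__(self, existing: Callable[[Any], Any], candidate: Callable[[Any], Any]):
        self.existing = existing
        self.candidate = candidate
        self.log: List[Tuple[Any, Any, Any]] = []  # (request, served, shadow)

    def predict(self, request: Any) -> Any:
        served = self.existing(request)
        try:
            shadow = self.candidate(request)
        except Exception:
            shadow = None  # a failing candidate must never affect the user
        self.log.append((request, served, shadow))
        return served  # the candidate's prediction is recorded, never returned

    def agreement_rate(self) -> float:
        """Fraction of logged requests on which both models agreed."""
        if not self.log:
            return 0.0
        agreed = sum(1 for _, served, shadow in self.log if served == shadow)
        return agreed / len(self.log)
```

Once the agreement rate (or whichever comparison metric is chosen) stays acceptable over enough traffic, the candidate can be promoted to serve real responses.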
To prepare for ML system design, try to make your own productionized ML app using AWS, GCP, Azure, Hugging Face or Relevance AI. There are a lot of cloud systems you can take advantage of. You can also take a look at Snowflake. You can learn from other people's examples and look at what other companies are doing. They usually have a company blog post about it. I actually made a blog post about Machine Learning Serverless, so you can check it out too. The last type of question is SQL. SQL questions are not necessarily asked for MLE jobs. They're usually geared towards data analysts or data scientists. If your job is going to involve using databases a lot, then they're going to ask you SQL questions. An example of a SQL question is combining two SQL tables. To practice more SQL questions, you can go to leetcode.com or stratascratch.com. They have a lot of SQL questions from multiple companies for you to practice.
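The "combining two SQL tables" example can be tried locally with Python's built-in `sqlite3` module. The `users` and `orders` tables, their columns and the sample rows below are hypothetical. A `LEFT JOIN` is used so that users without orders still appear in the result.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER);
        INSERT INTO users (id, name) VALUES (1, 'Ada'), (2, 'Bob');
        INSERT INTO orders (user_id, amount) VALUES (1, 10), (1, 20);
        """
    )
    return conn


def order_totals_per_user(conn: sqlite3.Connection) -> list:
    """Combine the two tables: one row per user with the sum of their orders."""
    query = """
        SELECT u.name, COALESCE(SUM(o.amount), 0) AS total
        FROM users AS u
        LEFT JOIN orders AS o ON o.user_id = u.id
        GROUP BY u.id, u.name
        ORDER BY u.id
    """
    return conn.execute(query).fetchall()
```

Swapping `LEFT JOIN` for `INNER JOIN` would drop the user with no orders, which is a typical follow-up question for this kind of problem.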