Preparing for Data Science Interviews

While preparing for data science/machine learning interviews, I realized there was a lot of knowledge to cover for the different rounds of the interview process. To avoid being overwhelmed by the volume of information, I decided to put together an article on the major focus points.

My study notes were divided into different focus areas:

  • Data analysis implementation
  • Machine learning and deep learning algorithms
  • Building and evaluating models
  • Machine Learning (ML) theory
  • Programming
  • Statistics and probability
  • Natural Language Processing (NLP)
  • Recommendation Systems
  • Project-based questions
  • Behavioral questions
  • Questions to ask your interviewer
  • Questions to ask your recruiter

Data analysis implementation: Here, the focus was on the syntax, methods, and processes for wrangling data.

  1. Exploratory Data Analysis (EDA): 
    1. Using methods and attributes such as df.head(), df.shape, and df.describe() for EDA of a data frame.
    2. Other important analyses are getting unique values in a column or the count of unique values, grouping by one or more columns, and getting aggregates.
  2. Handling missing values: 
    1. Use either df.dropna(axis=0) to drop rows with missing values or df.fillna() to fill them in, for example with the mean or median of the values in that column.
    2. Filling missing values with the mean does not work well when the column is skewed or has outliers. In this case, the median is more appropriate.
    3. If the data was collected in some sort of order, bfill might also be a good method to use. Only fill values that are missing because they were not recorded. Values that are missing because they do not exist should be kept as NaN.
  3. Data cleaning:
    1. Pandas methods for dropping columns, renaming columns, changing the data types of columns, and resetting the index.
    2. Subsetting a data frame based on specific row numbers, column names, or condition(s), and how to merge, join, or concatenate data frames along either the row or column axis.
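
To make the pandas points above concrete, here is a minimal sketch of the kind of snippet I would practice. The toy data frame and its column names (city, price, units, region) are made up for illustration, so treat this as one possible walk-through rather than a recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "city": ["Lagos", "Lagos", "Accra", "Accra", "Nairobi"],
    "price": [100.0, np.nan, 120.0, 5000.0, 110.0],
    "units": [3, 5, np.nan, 2, 3],
})

# Exploratory data analysis.
print(df.head())       # first rows
print(df.shape)        # (rows, columns); an attribute, not a method
print(df.describe())   # summary statistics for the numeric columns
print(df["city"].unique(), df["city"].nunique())
print(df.groupby("city")["price"].agg(["mean", "max"]))

# Missing values: drop the affected rows, or fill them with a statistic.
dropped = df.dropna(axis=0)
# The median is less sensitive than the mean to the 5000.0 outlier.
filled = df.fillna({"price": df["price"].median(),
                    "units": df["units"].median()})
# For ordered data, backward fill copies the next valid observation.
bfilled = df["price"].bfill()

# Cleaning: rename a column, change its dtype, drop a column, reset the index.
clean = (
    filled.rename(columns={"units": "quantity"})
          .astype({"quantity": "int64"})
          .drop(columns=["city"])
          .reset_index(drop=True)
)

# Subsetting by position, by column name, and by condition.
first_two = df.iloc[:2]
prices = df[["city", "price"]]
expensive = df[df["price"] > 115]

# Combining frames: merge on a key, or concatenate along an axis.
regions = pd.DataFrame({"city": ["Lagos", "Accra"], "region": ["West", "West"]})
merged = df.merge(regions, on="city", how="left")
stacked = pd.concat([df, df], axis=0, ignore_index=True)
```

Being able to say why the median was chosen over the mean here (the single very large price) is usually worth more in an interview than remembering the exact call.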

Machine learning and deep learning algorithms

You should be able to provide an in-depth explanation and give reasons for whatever answers you give to most questions. I’ll be sharing examples of areas you should consider focusing on and some questions you should think about.

  1. Machine learning algorithms:
    1. Neural networks (NN):
      1. Why do you scale the features?
      2. What is batch normalization?
      3. How do you choose layers to add?
    2. Support Vector Machines:
      1. What are support vectors?
      2. What are kernel functions and the types of kernel functions?
    3. Linear and logistic regression:
      1. Difference between linear and logistic regression.
      2. Handling bias in these algorithms.
      3. What are solvers for logistic regression?
    4. Naïve Bayes:
      1. Why is it naïve?
      2. When does it perform well?
  2. Compare machine learning algorithms:
    1. Classification vs regression algorithms.
    2. Supervised vs unsupervised algorithms.
    3. How do different factors affect these algorithms, e.g. decision boundaries, training data size, and speed of training?
    4. Different hyper-parameters for these algorithms.
    5. When to use which algorithm, and why (give different use cases and why they have advantages there)?
    6. The loss functions for different algorithms.
    7. Which algorithms require scaled data, and why do they need it?
    8. Difference between KNN and K-means clustering.
  3. Cross-validation:
    1. What is cross-validation and why is it used in machine learning?
    2. When is cross-validation not necessary?
  4. Normalizing vs scaling vs standardization: The difference between these three terms, when to use each of them, and why.
  5. Categorical encoding vs target encoding vs ordinal encoding:
    1. Definition of these encodings and differences between each.
    2. Use cases and examples where they could be applied.
  6. Generative and discriminative models:
    1. Explanation of these two types of models and their differences.
    2. Examples of machine learning algorithms under each of these categories.
    3. Use cases of when one of the categories works better than the other.
  7. Types of outputs in classification problems: 
    1. Class outputs vs probability outputs, and how they work.
    2. What models give what types of outputs?
  8. Tree-based algorithms:
    1. Which algorithms are built on the tree-based model?
    2. What kind of data do tree-based models have an advantage on (sparse vs dense data)?
  9. Ensemble models:
    1. How they work.
    2. Boosting vs bagging.
    3. Advantages of different ensemble methods, preferred data types, and example algorithms based on each method.
    4. RandomForest vs GradientBoosting.
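
Several of the questions above (scaling vs standardization, cross-validation, class vs probability outputs) are easier to answer once you have seen them in code. The sketch below uses scikit-learn with its bundled breast cancer dataset. The choice of dataset, logistic regression, and 5 folds is mine, purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Standardization: each feature ends up with zero mean and unit variance.
standardized = StandardScaler().fit_transform(X_train)
# Min-max normalization: each feature is rescaled to the [0, 1] range.
normalized = MinMaxScaler().fit_transform(X_train)

# Keeping the scaler inside the pipeline means it is re-fitted on the
# training part of every fold, so the validation part stays unseen.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
print("mean cross-validated accuracy:", scores.mean())

model.fit(X_train, y_train)
class_output = model.predict(X_test[:3])        # hard class labels
proba_output = model.predict_proba(X_test[:3])  # one probability per class
print(class_output)
print(proba_output)
```

The class output is simply the most probable class, which is why models that expose predict_proba let you move the decision threshold while models that only return labels do not.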

Building and evaluating models

  1. How to create features: Different techniques that can be used to create features from a data set.
  2. Build a simple model:
    1. Fit a model using a machine learning algorithm from scikit-learn.
    2. Predict the results on unseen test data.
  3. Tune hyperparameters: What hyperparameters are available for the selected algorithm and what do they do? Understand the different methods of tuning hyperparameters and the pros and cons of these different cross-validation techniques.
  4. Validating a model: Using a validation dataset.
  5. Feature engineering: What it is, how to do it, and why it is important.
  6. Confusion matrix:
    1. Using absolute numbers vs percentages.
    2. Calculating true positive rate, false positive rate, true negative rate, and false negative rate.
  7. Evaluation metrics for regression vs classification models:
    1. Mean Absolute Error (MAE), Root Mean Square Error (RMSE), R².
    2. Accuracy, precision, recall.
    3. Negative predictive value, specificity.
    4. F1-score.
    5. ROC, AUC, ROC-AUC.
    6. Diversity, coverage, serendipity, novelty, etc.
    7. When is accuracy not a good evaluation metric for a classification problem?
  8. Overfitting and underfitting:
    1. The theoretical definitions of both, with real-world examples.
    2. How to solve both problems.
    3. Bias-variance tradeoff and model complexity.
  9. Measuring feature importance.
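
The confusion matrix rates and the metrics above are simple enough to compute by hand, which is a common interview exercise. Here is a small sketch in plain Python. The labels, predictions, and regression values are invented for the example.

```python
import math

# Hypothetical labels and predictions for a binary classifier (1 = positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

tpr = tp / (tp + fn)        # true positive rate = recall = sensitivity
fpr = fp / (fp + tn)        # false positive rate
tnr = tn / (tn + fp)        # true negative rate = specificity
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)
npv = tn / (tn + fn)        # negative predictive value
accuracy = (tp + tn) / len(pairs)
f1 = 2 * precision * tpr / (precision + tpr)

# Why accuracy can mislead: with 5% positives, always predicting the
# majority class scores 95% accuracy while finding no positives at all.
imbalanced_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
naive_accuracy = sum(
    t == p for t, p in zip(imbalanced_true, always_negative)
) / len(imbalanced_true)
naive_recall = sum(
    t == 1 and p == 1 for t, p in zip(imbalanced_true, always_negative)
) / 5

# Regression metrics on made-up values.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

RMSE comes out larger than MAE here because squaring penalizes the single error of 2 more heavily than the error of 1.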

ML Theory

  1. Regularization:
    1. Lasso regularization.
    2. Ridge regularization.
    3. Elastic net regularization.
    4. Comparison of L1 vs L2 regularization, the way they work, and when to use one over the other.
  2. Loss functions, cost functions, objective functions:
    1. Definitions of each of these with examples.
    2. The problems these functions solve.
    3. How do they relate to one another?
  3. Entropy:
    1. Explain entropy in machine learning.
    2. What machine learning algorithms use entropy and how is it applied in the algorithms?
  4. How do you deal with outliers?
  5. How do you handle imbalanced datasets in a classification model?
  6. How can you avoid overfitting?
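
For the entropy question, it helps to be able to write the calculation down. The sketch below computes Shannon entropy and the information gain of a split, which is the criterion decision trees can use to choose where to split. The function names and the toy labels are my own.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# A perfectly mixed node: 4 "yes" and 4 "no" labels gives 1 bit of entropy.
parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely removes all uncertainty.
pure_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent tells us nothing.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(entropy(parent))
print(information_gain(parent, pure_split))
print(information_gain(parent, useless_split))
```

A tree that splits on entropy picks, at each node, the candidate split with the highest information gain.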

Programming

  1. Time complexity and efficiency of Python sort and other inbuilt methods.
  2. Data structures and algorithms.
    1. Which data structures or algorithms to apply to solve a problem optimally.
    2. Writing code for some data structures and algorithms from scratch, and not using inbuilt methods.
  3. Big-O notation:
    1. Space and time complexity for different data structures and algorithms.
    2. Which to optimize for given a specific problem and limitations.
  4. Building queries in SQL: Basic and advanced queries focusing on syntax, efficiency, and neatness of the query.
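
As an example of writing an algorithm from scratch and reasoning about its complexity, here is a sketch of binary search, which runs in O(log n) time and O(1) extra space on a sorted list. The function name and test values are mine.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1


# sorted() uses Timsort: O(n log n) in the worst case.
data = sorted([9, 1, 7, 3, 5])
index = binary_search(data, 7)

# Membership tests: O(n) on a list, O(1) on average for a set.
lookup = set(data)
found = 7 in lookup
```

If you only need to know whether a value is present and order does not matter, the set is usually the better choice. Binary search earns its place when the data is already sorted and you need the position.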

Statistics and probability

  1. Statistical distributions:
    1. Normal, uniform, binomial, Poisson, and Bernoulli distributions.
    2. Multinomial and multinoulli distributions.
    3. The behaviors, properties, basic calculations, and use cases of each of these distributions.
  2. Sampling:
    1. Population sampling.
    2. Central limit theorem.
  3. Random variables:
    1. Discrete variables.
    2. Continuous variables.
  4. Statistical analysis:
    1. Variance.
    2. Standard deviation.
    3. Covariance.
    4. Correlation.
    5. Regression.
  5. Probability theory:
    1. Properties of probability.
    2. Probability distributions (probability density function, probability mass function, cumulative distribution function).
    3. De Morgan’s law.
  6. Probability events:
    1. Independent/mutually exclusive events.
    2. Non-mutually exclusive events.
    3. Disjoint events.
    4. How are these all different or similar (overlapping)?
  7. Hypothesis and A/B testing:
    1. Statistical significance.
    2. Null and alternative hypotheses, with examples.
    3. Type I and II errors.
    4. What are the p-value, statistical power, and confidence level, and how are they calculated?
  8. Bayes’ rule:
    1. Definition of Bayes’ rule.
    2. Formula to calculate conditional probability based on this rule.
    3. Application of Bayes’ rule in real-world probability calculations.
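
Bayes' rule states that P(A|B) = P(B|A) × P(A) / P(B). A classic way to practice it is the medical test question, sketched below. The prevalence, sensitivity, and false positive rate are made-up numbers, and the function name is my own.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    # P(positive) by the law of total probability.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Hypothetical numbers: 1% of people have the condition, the test catches
# 95% of true cases, and 5% of healthy people test positive anyway.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, a positive result here means only about a 16% chance of having the condition, because the condition is rare to begin with. Explaining that intuition is usually what the interviewer is after.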

Natural Language Processing (NLP)

  1. Definitions:
    1. Vocabulary.
    2. Language model.
  2. Text pre-processing/analysis:
    1. Stemming.
    2. Lemmatization.
    3. Tokenization.
    4. Stop words.
    5. TF-IDF.
  3. Text vectorization:
    1. One-hot encoding.
    2. Bag of Words.
    3. Word embeddings/word vectors.
    4. Sub-words.
  4. Model architecture:
    1. Recurrent Neural Networks (RNNs).
    2. Long Short-Term Memory (LSTM).
    3. Transformers (self-attention).
    4. Pros and cons of each of them.
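
To tie the pre-processing and vectorization terms together, here is a from-scratch sketch of tokenization, stop-word removal, bag of words, and TF-IDF. The three documents and the stop-word list are invented, and the TF-IDF formula is the plain textbook variant, tf × log(N / df). Libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus and stop-word list.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
STOP_WORDS = {"the", "on", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({tok for doc in tokenized for tok in doc})


def bag_of_words(tokens):
    """Count of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(tokens):
    """Term frequency times log inverse document frequency, per term."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(tokens)
        df = sum(term in doc for doc in tokenized)
        vector.append(tf * math.log(n_docs / df))
    return vector


print(vocabulary)
print(bag_of_words(tokenized[0]))
print(tf_idf(tokenized[0]))
```

Two things are worth noticing. "cat" and "cats" are separate vocabulary entries because no stemming or lemmatization was applied. And "sat" gets a lower TF-IDF weight than "cat" in the first document because it also appears in the second.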

Recommendation Systems

  1. Types of recommendation systems:
    1. Content-based.
    2. Collaborative filtering.
    3. Knowledge-based.
    4. Hybrid recommender systems.
    5. Pros and cons of each, use-cases, and when they will not perform well.
    6. Explain the cold-start problem in collaborative filtering, when it could occur, and how to fix it.
  2. Ranking and clustering algorithms.
  3. Performance evaluation for recommender systems.
  4. What to consider and optimize for when designing a recommender system: Accuracy, relevance, speed, latency, diversity.
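
Here is a minimal sketch of user-based collaborative filtering that also shows the cold-start problem. The ratings matrix is invented, and treating 0 as "not rated" inside the cosine similarity is a simplification. Real systems typically mean-center ratings and use far larger, sparse matrices.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 0, 0, 0],   # a brand-new user with no history
], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is empty)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0


def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other in range(len(ratings)):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        numerator += sim * ratings[other, item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


# User 0 is most similar to user 1, who rated item 2 poorly.
print(predict_rating(0, 2))
# Cold start: the new user is similar to nobody, so there is no prediction.
print(predict_rating(3, 0))
```

The None result for the new user is the cold-start problem in miniature. Common answers are to fall back on popular items, on content-based features, or on a hybrid of the two until enough ratings accumulate.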

Project-based questions

  1. Explain your ML project process.
  2. What’s your favorite algorithm, and can you explain it in less than a minute?
  3. What are your favorite use cases of machine learning models?
  4. Specific questions about a project on your resume or portfolio: You should be able to talk about the end-to-end process, from the business needs, planning process, data collection, building the model, evaluating performance, deploying, and measuring performance in production.

Behavioral questions: It is very helpful to use the STAR (Situation, Task, Action, quantifiable Result) framework in answering these questions. Also, try to make it personal and take ownership of your work and contributions by saying "I" more than "we".

Here are some sample questions to consider while prepping:

  1. Tell me about yourself: Talk about your background. Describe your interests. Mention your experience. Explain why you’re excited about the opportunity.
  2. Why do you want to work with the company?
  3. What do you think is the most valuable data in business?
  4. What has been the most significant accomplishment in your career so far / biggest success?
  5. Where do you want to take your career? What do you want next?
  6. Describe the last time you had to adjust your course to reach a goal more effectively. What was the goal, how did you adjust, and what was the outcome?
  7. Would you prioritize speed of delivery or quality of the product?
  8. Talk about a project that you worked on that failed.
  9. Talk about a time when you didn’t think you could do something.
  10. Talk about a time when the people around you disagreed about something, and how you resolved it.

Questions to ask your interviewer

  1. Ask about the day-to-day responsibilities of the role for which you are interviewing.
  2. What makes the best intern (new hire) on their team stand out / what are the key values and characteristics that they look for in an employee?
  3. What are the metrics on which your performance will be evaluated while working in the company / what are the expectations for this role?
  4. Is there anything about your experience or skills that they have reservations about? If so, let them know that you would like to address their concerns.
  5. What is the typical career path for someone hired for this role?
  6. What is their favorite part about working in this company and what is the most challenging aspect of this job?

Questions to ask your recruiter

  1. How many steps are in the interview process and what are the next steps currently?
  2. Is it clear yet what team you will be placed in?
  3. Is there room to switch teams or shadow people on other teams?
  4. Is there relocation or accommodation assistance?
  5. How long have they been with the company and why do they like working there?
  6. What did they think about the company before they joined and what do they think now?

Conclusion

I hope this is helpful to you, not as a comprehensive syllabus, but as a guide to the various topics you could study for your interviews. Data science is a very broad field, and the role differs by company. It is important to do in-depth research on the specific company, read the job description carefully, and speak to your recruiter to understand the focus areas to lean more towards.

I also understand that the job application and interview process can be daunting. If you are faced with rejections, please remember that a rejection is not necessarily a reflection of a gap in your knowledge, nor is it a measure of your intelligence or your worth as a human being. Remember to give yourself grace, ask for feedback, make improvements in whatever way you can, and keep practicing and sending in those applications.

Please let me know what you think about this piece and kindly share this with anyone you think might benefit from this. I would also appreciate your suggestions on any other topic you might like me to write about in relation to job applications and getting data science roles.

Thank you for reading.
