
Data Science (English)

Updated: Jun 21, 2022

Data Science - Technical Interview Questions





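During a data science interview, your interviewer will assess your technical ability and conceptual understanding of data science based on the following factors.

  1. Framing: How you handle ambiguity. Can you see the underlying structure behind the data? Can you identify and apply the right data science methods? Can you distill an open-ended question?

  2. Operationalization: Can you derive actionable insights from data?

  3. Analytical Understanding: Your ability to articulate conclusions. Can you translate between numbers and words?

  4. Hypothesis Driven: Can you identify (and logically support) reasonable hypotheses? Can you look at data, make a decision that supports or refutes a product insight, and explain your reasoning?

As in most analytical and technical interviews, your interviewer is looking for insight into your thought process and problem-solving approach. Your creativity and your ability to articulate complex topics matter most; they matter more than arriving at the "right" answer.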


  1. Write a function that takes in two sorted lists and outputs a sorted list that is their union.

  2. Write a function to return the number of times a character appears in a string. The character can be the empty string.

  3. If 70% of Facebook users on iOS use Instagram, but only 35% of Facebook users on Android use Instagram, how would you investigate the discrepancy?

  4. Write a sorting algorithm for a numerical dataset in Python.

  5. You are given 40 cards in four different colors: 10 green cards, 10 red cards, 10 blue cards, and 10 yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find the probability that the cards picked are not of the same number and same color.

  6. List the differences between supervised and unsupervised learning. Give concrete examples.

  7. What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?

  8. How would you build and test a metric to compare two users' ranked lists of movie/TV show preferences?

  9. What is selection bias?

  10. What is a bias-variance trade-off?

  11. In data science, there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set. Given a dictionary consisting of many roots and a sentence, stem all the words in the sentence with the root forming it. If a word can be formed from more than one root, replace it with the shortest such root.

  12. What is a confusion matrix?

  13. How do you find RMSE and MSE in a linear regression model?

  14. How to deal with unbalanced binary classification?

  15. Describe different regularization methods, such as L1 and L2 regularization.

  16. What is cross-validation?

  17. How to define/select metrics?

  18. Explain what precision and recall are.

  19. Explain what a false positive and a false negative are. Why is it important to distinguish these from each other?

  20. Find the second largest element in a Binary Search Tree.

  21. Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model.

  22. When would you use Random Forest over SVM and vice versa? Explain your answer.

  23. Do you think 50 small decision trees are better than a large one? Why?

  24. Why is dimension reduction important?

  25. What is principal component analysis? Explain the sort of problems you would use PCA for.

  26. What are the drawbacks of a linear model?

  27. What are the assumptions required for linear regression? What if some of these assumptions are violated?

  28. What is collinearity, and how do you deal with it? How do you remove multicollinearity?

  29. How to check if the regression model fits the data well?

  30. What is a kernel? Explain the kernel trick.

  31. You randomly draw a coin from 100 coins (1 unfair coin with heads on both sides and 99 fair coins with a head and a tail) and toss it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?

  32. What is the difference between a convex and a non-convex cost function? What does it mean when a cost function is non-convex?

  33. How do you assess the statistical significance of an insight?

  34. What is Performance Tuning?

  35. How can you tell if a given coin is biased?

  36. There are 25 horses, and you want to rank the fastest 3 horses out of those 25. In each race, only 5 horses can run at the same time because there are only 5 tracks. What is the minimum number of races required to find the 3 fastest horses without using a stopwatch?

  37. Given two binary strings, write a function that adds them. You are not allowed to use any built-in string-to-int conversions or parsing tools. E.g., given "100" and "111" you should return "1011". What is the time and space complexity of your algorithm?

  38. What is Monkey Patching?

  39. What is the probability of drawing two cards (from the same deck of cards) that have the same suit?

  40. How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections?

  41. You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

  42. Assume the distribution of children per family is given by:

        # children | 0   | 1    | 2   | 3    | 4   | >=5
        p          | 0.3 | 0.25 | 0.2 | 0.15 | 0.1 | 0

      Consider a random girl in the population of children. What's the probability that she has a sister?

  43. A gas station has 30 gallons of gasoline worth 1.20 per gallon and some worth 1.40 per gallon. How many gallons of the 1.40 brand must the owner mix in to produce gasoline that costs 1.28 per gallon?

  44. How would you explain a confidence interval to a non-technical audience?

  45. How to sum all values in a range of values between A and B?

  46. How many Big Macs does McDonald's sell each year in the US?

  47. Draw a sample distribution of average daily views by users for Instagram.

  48. What’s the probability that in a room full of k people, at least 2 people will have the same birthday?

  49. How did you prevent overfitting when using Deep Learning models?

  50. How do you handle missing data? What imputation techniques do you recommend?

  51. You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?

  52. How can LinkedIn figure out when users falsify their attended schools?

  53. You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

  54. How would you build the recommendation algorithm for type-ahead search for Netflix?

  55. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.

  56. Facebook is rolling out a new feature called "Mentions", which is an app specifically for celebrities on Facebook to connect with their fans. How would you measure the health of the Mentions app?

  57. What is the Law of Large Numbers?

  58. How do you calculate the needed sample size?

  59. How do you prove that males are on average taller than females, given only gender and height data?

  60. Infection rates at a hospital above 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.

  61. You flip a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?

  62. If 70% of Facebook users on iOS use Instagram, but only 35% of Facebook users on Android use Instagram, how would you investigate the discrepancy?

  63. Likes per user and minutes spent on a platform are increasing, but the total number of users is decreasing. What could be the root cause?

  64. There are two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and created_at time of the friendship beginning/ending. Write a function to generate an output which lists the pairs of friends with their corresponding timestamps of the friendship beginning and then the timestamp of the friendship ending.

  65. Write a SQL query to get the second highest salary from the Employee table. For example, given the Employee table below, the query should return 200 as the second highest salary. If there is no second highest salary, then the query should return null.
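
As an illustration of the coding questions, question 1 (the union of two sorted lists) could be approached with a two-pointer merge. The sketch below is only one possible answer. It assumes both inputs are sorted in ascending order and that "union" means each value appears once in the output; the function name `sorted_union` is made up for this example.

```python
def sorted_union(a, b):
    """Merge two ascending lists into one ascending list without duplicates."""
    result = []
    i = j = 0

    def push(value):
        # Assumption: "union" drops repeated values, so skip a value
        # that equals the last one already written.
        if not result or result[-1] != value:
            result.append(value)

    # Walk both lists at once, always taking the smaller head element.
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            push(a[i])
            i += 1
        elif a[i] > b[j]:
            push(b[j])
            j += 1
        else:
            push(a[i])
            i += 1
            j += 1

    # At most one of the two lists still has elements left.
    for value in a[i:]:
        push(value)
    for value in b[j:]:
        push(value)
    return result


print(sorted_union([1, 2, 4], [2, 3, 5]))  # [1, 2, 3, 4, 5]
```

Each element is visited once, so the running time is linear in the combined length of the two lists. If the interviewer wants duplicates kept, the check inside `push` can be dropped.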


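
Question 31 (the double-headed coin) is a direct application of Bayes' rule. The sketch below works it through with exact fractions. It assumes the unfair coin always lands heads and each fair coin lands heads with probability 1/2; the function name and parameters are illustrative only.

```python
from fractions import Fraction


def posterior_unfair(n_coins=100, n_unfair=1, n_heads=10):
    """P(coin is unfair | n_heads heads in a row), via Bayes' rule."""
    prior_unfair = Fraction(n_unfair, n_coins)
    prior_fair = 1 - prior_unfair

    # Likelihood of seeing only heads under each hypothesis.
    like_unfair = Fraction(1)                # double-headed coin: always heads
    like_fair = Fraction(1, 2) ** n_heads    # fair coin: (1/2)^n

    evidence = prior_unfair * like_unfair + prior_fair * like_fair
    return prior_unfair * like_unfair / evidence


p = posterior_unfair()
print(p, float(p))  # 1024/1123, roughly 0.912
```

Under these assumptions the posterior is 1024/1123, so ten heads in a row makes the unfair coin far more likely than its 1% prior suggests.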


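
For question 37 (adding two binary strings without built-in conversions), one common approach is digit-by-digit addition with a carry, working from the rightmost character. This is a minimal sketch, assuming the inputs contain only the characters "0" and "1"; the name `add_binary` is chosen for this example.

```python
def add_binary(x, y):
    """Add two binary strings without int() or other parsing helpers."""
    i, j = len(x) - 1, len(y) - 1
    carry = 0
    digits = []  # result digits, least significant first

    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += 1 if x[i] == "1" else 0  # compare characters, no parsing
            i -= 1
        if j >= 0:
            total += 1 if y[j] == "1" else 0
            j -= 1
        digits.append("1" if total % 2 else "0")
        carry = total // 2

    # Two empty inputs would leave digits empty; treat that as zero.
    return "".join(reversed(digits)) or "0"


print(add_binary("100", "111"))  # 1011
```

Both time and extra space are O(max(n, m)) for inputs of length n and m, since each position is processed once and the result has at most one more digit than the longer input.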


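
Question 61 (three or more heads in five flips with p(head)=0.8) can be checked by summing binomial probabilities. The sketch below assumes the flips are independent; the helper name `prob_at_least` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), computed exactly."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


p_heads = Fraction(4, 5)  # 0.8 as an exact fraction
answer = prob_at_least(3, 5, p_heads)
print(answer, float(answer))  # 2944/3125, i.e. 0.94208
```

The three terms are 0.2048 (exactly three heads), 0.4096 (four heads), and 0.32768 (five heads), which sum to 0.94208.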












© 2022 青藤职业   |   Proudly created by IECG CAREER
