# Data Scientist Interview Questions

If you are looking for Data Scientist interview questions, whether you are experienced or a fresher, you are in the right place. Many reputed companies around the world have openings in this field. According to research, the data science market is expected to reach \$128.21 billion by 2022, at a 36.5% CAGR. So there is still room to move your career ahead as a Data Scientist. Mindmajix offers Advanced Data Scientist Interview Questions 2019 to help you crack your interview and land your dream career as a Data Scientist.

Q. What is a feature vector?
A feature vector is an n-dimensional vector of numerical features that represents a particular object. In machine learning, feature vectors describe the characteristics of objects in a numerical form that is easy to analyze and to feed into algorithms.
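
As a small illustration, here is a sketch of how an object could be turned into a feature vector. The fruit attributes, the `COLORS` list, and the `to_feature_vector` helper are all made up for this example.

```python
# Hypothetical example: encode a fruit as a fixed-length numeric feature vector.
COLORS = ["red", "green", "yellow"]  # assumed category list for one-hot encoding


def to_feature_vector(weight_g, diameter_cm, color):
    """Return [weight, diameter, one-hot color...] as a list of floats."""
    one_hot = [1.0 if color == c else 0.0 for c in COLORS]
    return [float(weight_g), float(diameter_cm)] + one_hot


apple = to_feature_vector(150, 7.5, "red")
print(apple)  # [150.0, 7.5, 1.0, 0.0, 0.0]
```

Every object ends up with the same number of dimensions (five here), which is what lets an algorithm compare one object with another.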

Q. What are the steps in building a decision tree?
The main steps in building a decision tree are:

* Take the entire data set as the input.
* Find the split that best separates the classes, and divide the data into two subsets at that split.
* Repeat the split on each subset, and stop when a stopping criterion is met.
* If the tree has grown past the point where it generalizes well, cut back the extra branches. This process is called pruning.
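
The splitting step can be sketched as follows. This is only an illustration: it assumes Gini impurity as the split criterion (one common choice, not the only one), a single numeric feature, and toy data invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature."""
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    # Candidate thresholds: every distinct value except the largest,
    # so that neither side of the split is empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))  # (3.0, 0.0)
```

A full tree would call `best_split` recursively on each subset until a stopping criterion (for example a maximum depth) is reached.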

Q. What is root cause analysis?
Root cause analysis is a problem-solving technique that identifies the factors responsible for a fault or an unexpected outcome. It was initially used to analyze industrial accidents but is now applied in almost every sector. All contributing factors are evaluated so that the underlying problem can be identified and mitigated.

Q. What is logistic regression?
Logistic regression is a regression technique that is best suited to cases where the dependent variable (DV) is binary. It is a form of predictive analysis: it describes the data and explains the relationship between one binary dependent variable and one or more independent variables.
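
A minimal sketch of how a fitted logistic regression model turns inputs into a binary prediction. The coefficients below are hypothetical and not fitted to any real data, and the 0.5 cut-off is an assumption.

```python
import math


def predict_proba(x, weights, bias):
    """P(y = 1 | x) under a logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z to the range (0, 1)


# Hypothetical coefficients, chosen only for illustration.
weights, bias = [0.8, -0.5], 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # assumed decision threshold of 0.5
print(p, label)
```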

Q. What are recommender systems?
Recommender systems are widely used these days. They are a subclass of information filtering systems that predict the rating or preference a user would give to a product.

Q. What is cross-validation?
Cross-validation is a model validation technique used to evaluate how the results of a statistical analysis will generalize to an independent data set. It is widely used where the goal is prediction and you want to estimate how well the model will perform in practice. The data is split so that the model is tested on data it was not trained on, which helps to detect problems such as overfitting and gives an idea of how well the model generalizes.
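
One common form is k-fold cross-validation. The sketch below only shows how the indices could be split into folds; the helper name `k_fold_indices` is made up, and training and scoring a model on each fold is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test set exactly once, so averaging the k test scores uses all of the data for evaluation. In practice the data is usually shuffled before splitting.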

Q. What is collaborative filtering?
Collaborative filtering (CF) is a technique widely used by recommender systems. The term has two senses:

* Narrow sense: This is the newer meaning of the term. Preference information collected from many users is used to automatically predict a given user's interest in a particular product or service.
* General sense: This is the broader meaning. It covers filtering information by applying techniques that involve multiple agents and data sources.

Collaborative filtering is used in many areas. A few of them are listed below:

* Monitoring data in mineral exploration
* Financial services
* E-commerce and web applications
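
To make the narrow sense concrete, here is a rough sketch of user-based collaborative filtering. The ratings are invented, and cosine similarity with a similarity-weighted average is only one of several possible choices.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 4},
    "cat": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    pairs = [
        (cosine(ratings[user], r), r[item])
        for name, r in ratings.items()
        if name != user and item in r
    ]
    total = sum(s for s, _ in pairs)
    return sum(s * x for s, x in pairs) / total if total else None


print(predict("ann", "D"))  # predicted rating for an item ann has not rated
```

The prediction leans towards the rating of the user whose tastes are more similar to ann's.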

Q. What is the goal of A/B testing?
A/B testing is also called split testing. It is a testing method that compares two versions of a web page to check which one performs better. It is an important process for any business that wants to get the maximum benefit from its online presence.

A business with an online presence has to focus on its conversion rate, i.e. the share of visitors to its web page who take the desired action.

The ultimate goal of A/B testing is to help businesses achieve a higher conversion rate and maximize their earnings.
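
Comparing two versions usually comes down to comparing two conversion rates. The sketch below assumes a two-proportion z-test, which is one common way to do this; the visitor and conversion counts are made up.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (rate_a, rate_b, z, p_value)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # conversion rate if A and B were identical
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return rate_a, rate_b, z, p_value


# Hypothetical counts: 5000 visitors per version.
rate_a, rate_b, z, p_value = ab_test(200, 5000, 250, 5000)
print(rate_a, rate_b, z, p_value)
```

A small p-value suggests the difference in conversion rate is unlikely to be due to chance alone. The function does not guard against a pooled rate of exactly 0 or 1, where the standard error is zero.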

Q. What are the main drawbacks of the linear model?
The major drawbacks of the linear regression model are listed below:

1. Linear regression assumes that the observations (more precisely, the errors) are independent. The model can be applied reliably only when this holds.
2. The output of a linear model is always a straight line, which in many cases does not match the real relationship. The model is therefore limited to linear relationships.
3. Linear regression only models the mean of the dependent variable.
4. Some overfitting problems cannot be solved using this model.

Q. What is the law of large numbers?
The law of large numbers is a theorem about performing the same experiment many times and aggregating the results. It is the basis of frequency-style thinking. According to this theorem, as the number of independent trials grows, the sample mean converges to the expected value; the sample variance likewise converges to the quantity it estimates.
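
A quick simulation shows the effect. This sketch assumes a fair six-sided die, whose expected value is 3.5; the function name and the fixed seed are choices made for the example.

```python
import random


def sample_mean_of_die(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls


for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_die(n))
```

With more rolls, the printed sample mean settles closer and closer to 3.5.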

Q. What is a star schema?
A star schema is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and are connected to the central fact table through the ID fields. These satellite tables are also known as lookup tables, and they are mostly useful in real-time applications because they save a lot of memory. Star schemas sometimes involve several layers of summarization so that information can be recovered faster.
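
As an illustration, the sketch below builds a tiny star schema in an in-memory SQLite database. The table names (`fact_sales`, `dim_product`, `dim_date`) and all the rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup (dimension) tables: map IDs to descriptions.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
-- Central fact table: stores only IDs and measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date VALUES (1, 2018), (2, 2019);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 2, 7.5);
""")

# Join the fact table to its lookup tables and summarize.
rows = conn.execute("""
SELECT p.name, d.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY p.name, d.year
ORDER BY p.name, d.year
""").fetchall()
print(rows)  # [('gadget', 2019, 7.5), ('widget', 2018, 10.0), ('widget', 2019, 15.0)]
```

The fact table stays compact because product names and years are stored once in the lookup tables rather than on every sales row.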

Q. How often should an algorithm be updated?
An algorithm should be updated when:

1. The model needs to evolve as data streams through the infrastructure
2. The underlying data source is changing
3. There is non-stationarity, i.e. the statistical properties of the variables (such as the variance) change over time

Q. What is resampling and when is it done?
Resampling is done in either of the scenarios below:

1. To estimate the accuracy of sample statistics, by drawing randomly with replacement from a set of data points or by using subsets of the available data.
2. To validate models by using random subsets, as in cross-validation.
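
For the first scenario, a common resampling method is the bootstrap. The sketch below uses it to estimate the standard error of a sample mean; the data values, the number of resamples, and the seed are arbitrary choices for the example.

```python
import random
import statistics


def bootstrap_se(sample, n_resamples=2000, seed=0):
    """Bootstrap estimate of the standard error of the sample mean."""
    rng = random.Random(seed)
    means = [
        # Draw len(sample) points with replacement and record their mean.
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)


data = [12, 15, 9, 20, 14, 11, 17, 13]  # hypothetical measurements
print(bootstrap_se(data))
```

The spread of the resampled means indicates how much the sample mean would vary if the data were collected again.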

Q. What types of bias can occur during sampling?
There are three types of bias that can occur during sampling:

1. Selection bias
2. Undercoverage bias
3. Survivorship bias

Q. How do you select the important variables in a data set? Explain the methods.
The following methods can be used to select important variables:

1. Fit a linear regression and select variables based on their p-values.
2. Use forward selection, backward selection, or stepwise selection.
3. Use a random forest and plot a variable importance chart.
4. Use Lasso regression.

Q. Can you capture the correlation between continuous and categorical variables? If yes, explain how.
Yes, it is possible. The ANCOVA (analysis of covariance) technique can be used to identify the association between continuous and categorical variables.

Q. Which technique is widely used to predict categorical responses?
Classification is the technique widely used to predict categorical responses; it is a standard tool in data mining for assigning observations to classes.

Q. What are interpolation and extrapolation?
Interpolation is the process of estimating a value that lies between two known values.
Extrapolation is the process of approximating a value by extending a known set of values beyond its range.
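
Both ideas can be sketched with the same straight-line formula. This assumes the relationship between the two known points is linear, which is the simplest case; the points and the helper name are invented for the example.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known points (hypothetical): (2, 20) and (3, 30)
inside = linear_estimate(2.5, 2, 20, 3, 30)  # interpolation: x lies between the known points
outside = linear_estimate(5, 2, 20, 3, 30)   # extrapolation: x lies beyond the known points
print(inside, outside)  # 25.0 50.0
```

Extrapolated values are generally less reliable, because nothing guarantees that the trend continues outside the observed range.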

Q. What is the main difference between supervised learning and unsupervised learning?
Supervised learning is a process where the algorithm learns from labeled training data and then applies that knowledge to test data. A typical example of supervised learning is “Classification”.

Unsupervised learning is a process where the training data has no labels, so the algorithm has to find structure in the data on its own. A typical example of unsupervised learning is “Clustering”.

Q. What are the steps involved in an analytics project?
Below are the steps involved in an analytics project:

1. Understand the business problem.
2. Explore the data and get familiar with it.
3. Prepare the data for modeling.
4. Run the model and analyze the results.
5. Validate the model using new data sets.
6. Implement the model, track the results, and analyze its performance over time.