If you are looking for Data Mining interview questions and answers for experienced professionals and freshers, you are in the right place. There are a lot of opportunities at many reputed companies around the world. According to research, the Artificial Intelligence (AI) market is expected to be worth USD 16.06 billion by 2022, growing at a CAGR of 62.9%. So you still have the opportunity to move ahead in your career in Data Mining. Mindmajix offers advanced Data Mining Interview Questions 2018 that help you crack your interview and acquire your dream career as an AI Developer.
Q: What are the different storage models available in OLAP?
The different storage models available in OLAP are as follows:
1. MOLAP - Multidimensional Online Analytical Processing
2. ROLAP - Relational Online Analytical Processing
3. HOLAP - Hybrid Online Analytical Processing
Each of these storage models has its own advantages and disadvantages.
Q: Explain in detail what MOLAP is. What are its advantages and disadvantages?
>> As the name suggests, the "M" in MOLAP stands for Multidimensional.
>> In this type of data storage, the data is stored in multidimensional cubes and not in standard relational databases.
The advantage of using MOLAP is:
Query performance is excellent because the data is stored in multidimensional cubes and the calculations are pre-generated when a cube is created.
The disadvantages of using MOLAP are:
1. Only a limited amount of data can be stored. Because the calculations are triggered during cube generation, it cannot handle huge amounts of data.
2. It takes a lot of skill to use.
3. It also has a licensing cost associated with it.
Q: Explain in detail what ROLAP is. What are its advantages and disadvantages?
As the name suggests, the data is stored in relational databases.
The advantages of using ROLAP are:
1. As the data is stored in relational databases, it can handle huge amounts of data.
2. All the functionality of a relational database is available.
The disadvantages of using ROLAP are:
1. It is comparatively slow.
2. All the limitations that apply to SQL apply to ROLAP too.
Q: Explain in detail what HOLAP is. What are the advantages of using this type of data storage?
HOLAP stands for Hybrid Online Analytical Processing.
It is a combination of MOLAP and ROLAP.
The advantages of using HOLAP are:
1. In this model, the cube is used to get summarized information.
2. For drill-down capabilities, it uses the ROLAP structure.
Q: Explain the main difference between Data Mining and Data Warehousing.
Data Warehousing:
It is a process where data is extracted from various sources, then cleansed and stored.
Data Mining:
1. It is a process where the data is explored using queries.
2. The queries are used to explore a particular data set and examine the results. This helps with reporting, strategy planning and visualizing meaningful data sets.
The difference can be explained with a simple example:
1. Take a software company where the information for all of its projects is stored. This is Data Warehousing.
2. Accessing a particular project and identifying the Profit and Loss statement for that project can be considered Data Mining.
Q: Explain in detail what Data Purging is.
Data purging is an important step in maintaining appropriate data in the database.
It means deleting unnecessary data, or rows that have NULL values, from the database. If fresh data needs to be loaded into a database table, a purging activity should be run first. This clears all unnecessary data from the database and helps in maintaining clean and meaningful data.
In short, data purging is a process where junk data that exists in the database gets cleared out.
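The purging idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `customers` rows, the field names and the `purge_rows` helper are hypothetical, and a real database would normally purge with a DELETE statement rather than in application code.

```python
def purge_rows(rows, required_fields):
    """Return only the rows in which every required field has a value."""
    return [
        row for row in rows
        if all(row.get(field) is not None for field in required_fields)
    ]


# Hypothetical table contents; None plays the role of a NULL value.
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": None, "email": "unknown@example.com"},
    {"id": 3, "name": "Ravi", "email": None},
]

# Rows 2 and 3 contain NULL values, so only row 1 survives the purge.
clean = purge_rows(customers, ["name", "email"])
print(clean)
```

The helper returns a new list instead of modifying the input, so the original rows remain available if the purge has to be reviewed.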
Q: Explain in detail what CUBE means.
A cube is a data storage structure in which data is stored in a way that makes reporting tasks easier for the user. It helps to expedite the data analysis process.
Let’s say the data related to an employee is stored in the form of a cube. If you are evaluating the employee's performance on a weekly and monthly basis, then week and month are considered to be the dimensions of the cube.
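The employee example above can be sketched as a small in-memory cube. This is a simplified illustration, not how an OLAP server stores cubes: the fact rows, the `tasks_completed` measure and the helper names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (employee, month, week, tasks_completed)
facts = [
    ("alice", "Jan", "W1", 5),
    ("alice", "Jan", "W2", 7),
    ("alice", "Feb", "W5", 6),
    ("bob", "Jan", "W1", 4),
]


def build_cube(rows):
    """Pre-aggregate the measure for every (employee, month, week) cell."""
    cube = defaultdict(int)
    for employee, month, week, tasks in rows:
        cube[(employee, month, week)] += tasks
    return cube


def roll_up_to_month(cube):
    """Summarize the cube along the week dimension."""
    monthly = defaultdict(int)
    for (employee, month, _week), tasks in cube.items():
        monthly[(employee, month)] += tasks
    return dict(monthly)


cube = build_cube(facts)
monthly = roll_up_to_month(cube)
print(monthly[("alice", "Jan")])  # 12
```

Because each cell is computed once when the cube is built, a weekly or monthly report becomes a lookup rather than a fresh scan of the raw rows.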
Q: What are the different problems that “Data Mining” can solve in general?
Data Mining is an important process that can be used to validate and screen data as it comes in, and processes can be defined based on the data mining results. Through these activities, an existing process can be modified.
It is widely used in the following industries:
4. Artificial Intelligence
5. Government intelligence
By following standard principles, a lot of illegal activities can be identified and dealt with. As the internet has evolved, a lot of loopholes have evolved at the same time.
Q: Explain the difference between OLAP and OLTP.
OLTP:
1. OLTP stands for Online Transaction Processing.
2. It is useful in applications that involve a lot of transactions and high volumes of data. Applications of this type are mainly found in banking, air ticketing and similar sectors. The architecture used in OLTP is client-server architecture. It also supports transactions across networks.
OLAP:
1. OLAP stands for Online Analytical Processing.
2. It is widely used in applications that need to support business data where complex calculations happen. Most of the time, the data is in low volumes. As it is a multidimensional database, the user gains insight into how the data is coming in from the various sources.
Q: Explain the different stages of “Data Mining”.
There are three different stages in Data Mining. They are as follows:
2. Model building and validation
Exploration is a stage where a lot of activities revolve around the preparation and collection of different data sets, so activities like cleaning and transformation are also included. Depending on the data sets available, different tools are needed to analyze the data.
Model building and validation:
In this stage, different models are applied to the data sets and compared for best performance. This particular step is called pattern identification. It is a tedious process because the user has to identify which pattern is best suited for easy predictions.
Based on the previous step, the best pattern is applied to the data sets and used to generate predictions, which helps in estimating expected outcomes.
Q: Explain the concepts of discrete and continuous data in the Data Mining world.
Discrete data is data with a defined, finite set of values, each of which has a meaning of its own. For example: mobile numbers, gender.
Continuous data is data that changes continuously in an orderly fashion. An example of continuous data is “Age”.
Q: Explain what a MODEL is in terms of Data Mining.
A model is an important factor in Data Mining activities: it defines and helps the algorithms in making decisions and matching patterns. The second step is to evaluate the different models that are available and select the most suitable model for validating the data sets.
Q: Explain what the Naive Bayes Algorithm is.
The Naive Bayes Algorithm is widely used to generate mining models. These models are generally used to identify the relationship between the input columns and the predictable columns that are available. The algorithm is widely used during the initial stages of exploration.
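To make the relationship between input columns and a predictable column concrete, here is a minimal Naive Bayes sketch for categorical data. It is an illustration written from scratch with add-one (Laplace) smoothing, not the SQL Server implementation; the column meanings (age group, home ownership, bike purchase) and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (column index, label) -> value counts
    vocab = defaultdict(set)             # column index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the highest log-probability for the given row."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)  # prior probability of the class
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(vocab[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical input columns: (age_group, owns_home); predictable column: buys_bike
rows = [("young", "no"), ("young", "no"), ("middle", "yes"),
        ("middle", "yes"), ("senior", "yes"), ("senior", "no")]
labels = ["no", "no", "yes", "yes", "yes", "no"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("middle", "yes")))  # yes
print(predict(model, ("young", "no")))    # no
```

The "naive" part is the assumption that input columns are independent given the class, which is why the per-column probabilities are simply multiplied (added in log space).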
Q: Explain the Clustering Algorithm in detail.
1. A clustering algorithm groups data that share common characteristics; these groups are called clusters.
2. Once the clusters are formed, decisions can be made faster, and exporting the data is also fast.
3. First, the algorithm identifies the relationships that exist in the data set and generates clusters based on them. The process of creating clusters is iterative.
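The iterative nature of cluster creation can be illustrated with k-means, one common clustering technique (the text above does not name a specific algorithm, so this choice is an assumption). The sketch below works on 2-D points and, to stay deterministic, seeds the centroids with the first k points rather than random ones.

```python
def k_means(points, k, iterations=10):
    """Group 2-D points into k clusters, starting from the first k points."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the clusters are final
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

Each pass repeats the same two steps until nothing changes, which is the repetition the third point above refers to.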
Q: Explain what the time series algorithm is in data mining.
1. This algorithm is a good fit for data whose values change continuously over time. For example: Age.
2. If the algorithm is trained and tuned on the data set, it can successfully keep track of the continuous data and predict the right values.
3. The algorithm generates a specific model that is capable of predicting future trends in the data based on the original data sets.
4. New data can also be added during the process as part of trend analysis.
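As a rough illustration of predicting future values from past ones, the sketch below fits a straight-line trend by least squares and extends it forward. This is a deliberately simple stand-in, not the algorithm used by any particular product; the `history` values and function names are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares fit of values against their positions 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps):
    """Extend the fitted trend line by the requested number of steps."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + i) for i in range(steps)]


# Hypothetical observations taken at regular time intervals.
history = [10, 12, 14, 16]
print(forecast(history, 2))  # [18.0, 20.0]

# New data can be appended and the model refitted for updated trends.
history.append(19)
updated = forecast(history, 1)
```

Refitting after each new observation mirrors the fourth point above: the model is rebuilt as fresh data arrives.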
Q: Explain the association algorithm in Data Mining in detail.
This algorithm is mainly used in recommendation engines for market basket analysis.
The input for this algorithm is the set of products or items bought by a specific customer. Based on those purchases, a recommendation engine predicts the most suitable products for the customer.
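A minimal sketch of the idea: count how often items are bought together, turn the counts into a confidence score for each rule "A implies B", and recommend the items whose confidence clears a threshold. The baskets, the 0.5 threshold and the function names are assumptions for illustration; production engines use algorithms such as Apriori over far larger data.

```python
from collections import Counter
from itertools import permutations


def pair_confidence(baskets):
    """confidence(A -> B) = baskets containing A and B / baskets containing A."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))  # ordered pairs (A, B)
    return {
        (a, b): count / item_counts[a]
        for (a, b), count in pair_counts.items()
    }


def recommend(baskets, item, min_confidence=0.5):
    """Items most often bought together with `item`, best first."""
    rules = pair_confidence(baskets)
    candidates = [
        (b, conf) for (a, b), conf in rules.items()
        if a == item and conf >= min_confidence
    ]
    return [b for b, _ in sorted(candidates, key=lambda pair: (-pair[1], pair[0]))]


# Hypothetical purchase history, one list per customer basket.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread", "milk"],
    ["milk", "jam"],
]
print(recommend(baskets, "bread"))  # ['butter']
```

Butter appears in two of the three baskets that contain bread, giving a confidence of about 0.67, while jam and milk each appear in only one and fall below the threshold.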
Q: What is the sequence clustering algorithm?
As the name suggests, the data is collected at different points in a sequence of events. The data sets are analyzed based on the order in which those events occur, and the best possible data input for clustering is then determined.
A sequence clustering algorithm can help an organization specify a particular path for introducing a new product with similar characteristics in a retail warehouse.
Q: What are the different concepts and capabilities of Data Mining?
Data Mining is primarily about understanding and extracting meaningful data from the data sets that are stored in the database.
Exploring the data through data mining is helpful in the following areas:
4. Meaningful Patterns etc.
A large amount of data is cleaned as per the requirement and can be transformed into meaningful data that supports decision making at the executive level.
Data mining is especially helpful with the following types of data:
1. Data sets which are in the form of sales figures
2. Forecast values for the business projection
4. Metadata etc
Based on the data analyzed, the information can be interpreted and appropriate relationships can be defined.
Q: What is the best way to work with the data mining algorithms that are included in SQL Server data mining?
SQL Server data mining offers an add-in for MS Office 2007. It helps to identify and discover relationships in the data, which is useful later for enhanced analysis.
The add-in is called “Data Mining Client for Excel”. With it, users can first prepare the data, then build, manage and evaluate models, with predicted results as the final output.
Q: How is DMX, the data mining query language, used in detail?
DMX consists of two types of statements in general: Data Definition and Data Manipulation.
Data Definition:
This is used to define and create new models and structures.
Data Manipulation:
As the name suggests, the data is manipulated based on the requirement.
The usage is explained with an example:
1. Create Mining Structure
2. Create Mining Model
3. Data Manipulation that is used on existing structures and models.
The syntax is:
SELECT FROM <model>.CONTENT (DMX)
Q: What are the different functions of data mining?
The different functions of data mining are as follows:
2. Association and correlation analysis
5. Cluster analysis
6. Evolution analysis
7. Sequence analysis
Q: Explain in detail what data aggregation and generalization are.
Data aggregation:
As the name suggests, the data is aggregated so that a cube can be constructed for data analysis purposes.
Generalization:
It is a process where low-level data is replaced by higher-level concepts so that the data becomes general and meaningful.
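Both ideas can be shown together in a short sketch: generalization replaces a low-level value (a city) with a higher-level concept (its country), and aggregation then sums the measure per group. The city-to-country mapping and the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: low-level city -> high-level country
CITY_TO_COUNTRY = {"Pune": "India", "Delhi": "India", "Lyon": "France"}

# Hypothetical fact rows: (city, sales_amount)
sales = [("Pune", 120), ("Delhi", 80), ("Lyon", 50), ("Pune", 30)]


def generalize(rows, hierarchy):
    """Replace each low-level key with its higher-level concept."""
    return [(hierarchy[city], amount) for city, amount in rows]


def aggregate(rows):
    """Sum the measure for every distinct key."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


by_city = aggregate(sales)
by_country = aggregate(generalize(sales, CITY_TO_COUNTRY))
print(by_city)     # {'Pune': 150, 'Delhi': 80, 'Lyon': 50}
print(by_country)  # {'India': 230, 'France': 50}
```

The same `aggregate` function serves both levels; only the key changes, which is what makes generalization useful as a step before building a cube.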
Q: Explain Learning and Classification in detail.
Learning:
This is a model that is primarily used to analyze a particular training data set, which consists of training samples selected from a chosen population.
Classification:
This model is primarily used to provide an estimate for a particular class by selecting test samples randomly. Classification means assigning a known class to specific unknown data.
Q: Explain in detail what Cluster Analysis is.
Cluster analysis is an important activity that is widely used in different applications. To be specific, this type of analysis is used in market research, pattern recognition, data analysis and image processing.
Q: Explain the data mining interface.
The data mining interface is usually used to improve the quality of the queries that are used.
It is the GUI for data mining activities.
Q: Why is tuning a data warehouse needed? Explain in detail.
The main characteristic of a data warehouse is that the data evolves over time, and its behaviour is difficult to predict because of its ad hoc environment. Database tuning is very difficult in an OLTP environment because of its ad hoc and real-time transaction loads. Given this nature, data warehouse tuning is necessary, and it changes how the data is utilized based on need.