If you're looking for Pentaho BI interview questions for experienced professionals or freshers, you are in the right place. There are plenty of opportunities at many reputed companies around the world. According to research, Pentaho BI has a market share of about 3.7%, so you still have the opportunity to move ahead in your career in Pentaho BI development. Mindmajix offers advanced Pentaho BI interview questions (2019) that help you crack your interview and acquire your dream career as a Pentaho BI developer.
It addresses the barriers that block an organization's ability to get value from all of its data. Pentaho is designed to ensure that each member of a team, from developers to business users, can easily convert data into value.
Direct Analytics on MongoDB: It enables business analysts and IT to access, analyze, and visualize MongoDB data.
Science Pack: Pentaho’s Data Science Pack operationalizes analytical modeling and machine learning while allowing data scientists and developers to offload the labor of data preparation to Pentaho Data Integration.
Full YARN Support for Hadoop: Pentaho’s YARN integration enables organizations to exploit the full computing power of Hadoop while leveraging existing skill sets and technology investments.
The Pentaho BI Project is an ongoing effort by the open source community to provide organizations with best-in-class solutions for their enterprise Business Intelligence (BI) needs.
The Pentaho BI Project encompasses the following major application areas: reporting, analysis (OLAP), dashboards, data mining, and data integration (ETL).
Yes, Pentaho is a trademark.
Pentaho Metadata is a piece of the Pentaho BI Platform designed to make it easier for users to access information in business terms.
With the help of Pentaho’s open source metadata capabilities, administrators can define a layer of abstraction that presents database information to business users in familiar business terms.
Pentaho Reporting Evaluation is a particular package of a subset of the Pentaho Reporting capabilities, designed for typical first-phase evaluation activities such as accessing sample data, creating and editing reports, and viewing and interacting with reports.
Multidimensional Expressions (MDX) is a query language for OLAP databases, much like SQL is a query language for relational databases. It is also a calculation language, with syntax similar to spreadsheet formulas.
A finite ordered list of elements is called a tuple.
The cube will contain the following data (an example MDX query against it follows the list):
1. 3 Fact fields – Sales, Costs and Discounts
2. Time Dimension – with the following hierarchy: Year, Quarter and Month
3. 2 Customer Dimensions – one with location (Region, Country) and the other with Customer Group and Customer Name
4. Product Dimension – containing a Product Name
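As a rough illustration of how such a cube can be queried, the sketch below runs an MDX statement through olap4j against Mondrian (Pentaho Analysis). The connection string, catalog path, and cube name [SalesCube] are hypothetical; only the measure and dimension names come from the cube described above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;

public class MdxExample {
  public static void main(String[] args) throws Exception {
    // Mondrian's olap4j driver; the JDBC URL and catalog path are placeholders.
    Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
    Connection connection = DriverManager.getConnection(
        "jdbc:mondrian:Jdbc=jdbc:hsqldb:hsql://localhost/sampledata;"
        + "Catalog=/path/to/SalesCube.mondrian.xml");
    OlapConnection olapConnection = connection.unwrap(OlapConnection.class);
    OlapStatement statement = olapConnection.createStatement();

    // Sales and Costs per Year: measures on columns, the Time hierarchy's Year level on rows.
    CellSet result = statement.executeOlapQuery(
        "SELECT {[Measures].[Sales], [Measures].[Costs]} ON COLUMNS, "
        + "[Time].[Year].Members ON ROWS "
        + "FROM [SalesCube]");
    System.out.println(result.getAxes().size() + " axes returned");
  }
}
```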
Transformations are about moving and transforming rows from source to target.
Jobs are more about high level flow control.
If we want to join 2 tables from the same database, we can use a “Table Input” step and do the join in SQL itself.
If we want to join 2 tables that are not in the same database, we can use the “Database Join” step.
It is not possible, as in PDI transformations all of the steps run in parallel, so we cannot run them sequentially.
We can create a new transformation or close and re-open the ones we have loaded in Spoon.
BIT is not a standard SQL data type. It’s not even standard on MySQL as the meaning (core definition) changed from MySQL version 4 to 5.
Also a BIT uses 2 bytes on MySQL. That’s why in PDI we made the safe choice and went for a char(1) to store a boolean. There is a simple workaround available: change the data type with a Select Values step to “Integer” in the metadata tab. This converts it to 1 for “true” and 0 for “false”, just like MySQL expects.
This is not possible, as in PDI transformations all the steps run in parallel, so we cannot run them sequentially. This would require architectural changes to PDI, and sequential processing would also result in very slow processing.
We can’t have duplicate field names. Before PDI v2.5.0 we were able to force duplicate fields, but even then only the first value of the duplicate fields could ever be used.
1. Open Source
2. Has a community that supports its users
3. Runs well on multiple platforms (Windows, Linux, Macintosh, Solaris, Unix, etc.)
4. Offers a complete package, from reporting and ETL for data warehousing to data management
5. Includes an OLAP server, data mining, and dashboards
Arguments are command line arguments that we would normally specify during batch processing.
Variables are environment or PDI variables that we would normally set in a previous transformation in a job.
1. The Pentaho Suite
– BI Platform (JBoss Portal)
– Pentaho Dashboard
2. All built on the Java platform
Pentaho Dashboards give business users the critical information they need to understand and improve organizational performance.
Pentaho Reporting allows organizations to easily access, format and deliver information to employees, customers and partners.
Pentaho Schema Workbench offers a graphical interface for designing OLAP cubes for Pentaho Analysis.
Pentaho Data Mining uses the Waikato Environment for Knowledge Analysis (Weka) to search data for patterns. It has functions for data processing, regression analysis, classification methods, etc.
It is a visual, banded report writer. It has various features like using subreports, charts, and graphs.
It is an entry-level tool for data manipulation.
A hierarchical navigation menu allows the user to go directly to a section of the site several levels below the top.
It is the technology which enables files to be transparently encrypted to secure personal data from attackers with physical access to the computer.
A repository is a storage location where we can store data safely without any harm.
An ETL tool is used to get data from many source systems, like RDBMS, SAP, etc., and convert it based on user requirements. It is required when data flows across many systems.
ETL is the extraction, transformation, and loading process. The steps are (see the sketch after this list):
1 – define the source
2 – define the target
3 – create the mapping
4 – create the session
5 – create the workflow
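In PDI these steps are configured graphically, but the underlying idea can be sketched in plain Java with JDBC. In the hedged example below, the connection URLs, table names, and the currency conversion are all made up; it simply walks through defining a source, defining a target, and applying a mapping:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SimpleEtl {
  public static void main(String[] args) throws Exception {
    // Source and target URLs, tables and columns are hypothetical.
    try (Connection source = DriverManager.getConnection("jdbc:postgresql://src-host/sales", "user", "pass");
         Connection target = DriverManager.getConnection("jdbc:postgresql://dwh-host/warehouse", "user", "pass")) {

      // 1. Define the source: extract the raw rows.
      PreparedStatement extract = source.prepareStatement(
          "SELECT order_id, amount, currency FROM orders");

      // 2. Define the target: an insert into the warehouse table.
      PreparedStatement load = target.prepareStatement(
          "INSERT INTO fact_orders (order_id, amount_eur) VALUES (?, ?)");

      // 3. The mapping: convert every amount to EUR before loading.
      try (ResultSet rows = extract.executeQuery()) {
        while (rows.next()) {
          double amountEur = "EUR".equals(rows.getString("currency"))
              ? rows.getDouble("amount")
              : rows.getDouble("amount") * 0.9;   // hypothetical fixed rate
          load.setLong(1, rows.getLong("order_id"));
          load.setDouble(2, amountEur);
          load.addBatch();
        }
      }
      load.executeBatch();   // 4./5. In a real tool, the session/workflow would schedule and run this.
    }
  }
}
```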
Metadata is stored in the repository by associating information with individual objects in the repository.
Snapshots are read-only copies of a master table located on a remote node which can be periodically refreshed to reflect changes made to the master table.
Data staging is actually a group of procedures used to prepare source system data for loading a data warehouse.
Full Load means completely erasing the contents of one or more tables and reloading them with fresh data.
Incremental Load means applying ongoing changes to one or more tables based on a predefined schedule.
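A minimal sketch of the two strategies, again with hypothetical JDBC connections and table names (a real incremental load would read the last-run timestamp from a control table rather than hard-coding it):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.sql.Timestamp;

public class LoadStrategies {
  public static void main(String[] args) throws Exception {
    // Hypothetical warehouse connection, tables, and watermark column.
    try (Connection dwh = DriverManager.getConnection("jdbc:postgresql://dwh-host/warehouse", "user", "pass")) {

      // Full load: erase the target table completely, then reload everything.
      try (Statement stmt = dwh.createStatement()) {
        stmt.executeUpdate("TRUNCATE TABLE fact_orders");
        stmt.executeUpdate("INSERT INTO fact_orders SELECT * FROM staging_orders");
      }

      // Incremental load: apply only the rows changed since the last scheduled run.
      Timestamp lastRun = Timestamp.valueOf("2019-01-01 00:00:00");  // normally read from a control table
      try (PreparedStatement delta = dwh.prepareStatement(
          "INSERT INTO fact_orders SELECT * FROM staging_orders WHERE updated_at > ?")) {
        delta.setTimestamp(1, lastRun);
        delta.executeUpdate();
      }
    }
  }
}
```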
The data flow from source to target is called a mapping.
It is a set of instructions that tells when and how to move data from the respective sources to the targets.
It is a set of instructions that tells the Informatica server how to execute the tasks.
It creates and configures the set of transformations.
A data warehouse is said to be a three-tier system in which a middle tier provides usable data in a secure way to end users. On either side of this middle tier are the end users and the back-end data stores.
ODS stands for Operational Data Store, which sits between the staging area and the data warehouse.
An ETL tool is used for extracting data from legacy systems and loading it into the specified database, with some processing such as data cleansing.
An OLAP tool is used for the reporting process. Here data is available in a multidimensional model, so we can write simple queries to extract data from the database.
XML is the Extensible Markup Language, which defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
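As a small illustration of the machine-readable side, the sketch below parses a tiny XML document with the standard Java DOM API; the document and its element names are made up:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlExample {
  public static void main(String[] args) throws Exception {
    // A tiny, human-readable XML document with hypothetical element names.
    String xml = "<customer><name>Jane Doe</name><country>US</country></customer>";

    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));

    // The same encoding rules that make it readable to humans let the parser navigate it.
    System.out.println(doc.getDocumentElement()
        .getElementsByTagName("name").item(0).getTextContent());  // prints "Jane Doe"
  }
}
```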
Informatica PowerCenter 4.1, Informatica PowerCenter 5.1, Informatica PowerCenter 6.1.2, Informatica PowerCenter 7.1.2, etc.
Ab Initio, DataStage, Informatica, Cognos DecisionStream, etc.
MDX stands for Multidimensional Expressions; it is the main query language implemented by Mondrian.
It is a cube for viewing data, where we can slice and dice the data. It has time dimensions, locations, and figures.
Several solutions exist:
Use a “Select Values” step, renaming a field while also selecting the original one. The result is that the original field is duplicated under another name; for example, this can duplicate fieldA to fieldB and fieldC.
Use a Calculator step with, e.g., the NVL(A,B) operation. This has the same effect as the first solution: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
You can’t. PDI will complain in most cases if you have duplicate field names. Before PDI v2.5.0 you were able to force duplicate fields, but even then only the first value of the duplicate fields could ever be used.
Transformations stream data through their steps.
That means that the slowest step is going to determine the speed of a transformation.
So you optimize the slowest steps first. How can you tell which step is the slowest: look at the size of the input buffer in the log view.
In the latest 3.1.0-M1 nightly build you will also find a graphical overview of this: HTTP://WWW.IBRIDGE.BE/?P=92
(the “graph” button at the bottom of the log view will show the details).
A slow step will have consistently large input buffer sizes. A fast step will consistently have low input buffer sizes.
If you look in the PDI main directory you will see a sub-directory “simple-jndi”, which contains a file called “jdbc.properties”. You should change this file so that the JNDI information matches the one you use in your application server.
After that, in the connection tab of Spoon, you set the “Method of access” to JNDI, the “Connection type” to the type of database you’re using, and the “Connection name” to the name of the JNDI data source (as used in “jdbc.properties”).
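For reference, entries in “jdbc.properties” follow the simple-jndi convention of one property per connection attribute, keyed by the data source name. The data source name and connection details below are illustrative only:

```
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password
```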
The catch is to specifically restrict the file list to the files inside the compressed collection. Some examples:
Suppose you have a tarball (such as access.logs.tar.gz) containing several files. To read each of these files in a File Input step, restrict the file specification to the entries inside the archive rather than the archive itself.
You might have a simpler file, fat-access.log.gz. You could use the Compression option of the File Input step to deal with this simple case, or you could point the step at the file through a VFS specification instead. Note: If you only want certain files in the tarball, you can use a wildcard like access.log..* or something similar. .+ is the magic if you don’t want to specify the children filenames; .* will not work because it will also include the folder itself (i.e. tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!/).
Finally, you might have a zip file containing several entries and want to access all of the files; the same kind of restriction to the entries inside the archive applies.
Note: For some reason, the .+ pattern does not work in subdirectories; they still show the directory entries.
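These specifications are Apache Commons VFS URLs, which is also what PDI uses internally. As a rough, hedged illustration outside of PDI, the Java sketch below uses commons-vfs2 to list the entries inside such an archive (the path is hypothetical):

```java
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class VfsListing {
  public static void main(String[] args) throws Exception {
    FileSystemManager fsManager = VFS.getManager();

    // Points inside the tarball, mirroring the tar:gz: specification quoted above.
    FileObject archive = fsManager.resolveFile(
        "tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!/");

    // List the entries a File Input step would match with the ".+" pattern.
    for (FileObject child : archive.getChildren()) {
      System.out.println(child.getName().getBaseName());
    }
  }
}
```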
Spoon is the design interface for building ETL jobs and transformations. Spoon provides a drag-and-drop interface that allows you to graphically describe what you want to take place in your transformations. Transformations can then be executed locally within Spoon, on a dedicated Data Integration Server, or a cluster of servers.
The Data Integration Server is a dedicated ETL server whose primary functions are executing transformations and jobs, scheduling, security integration, and content management.
Pentaho Data Integration is composed of the following primary components:
Spoon. Introduced earlier, Spoon is a desktop application that uses a graphical interface and editor for transformations and jobs. Spoon provides a way for you to create complex ETL jobs without having to read or write code. When you think of Pentaho Data Integration as a product, Spoon is what comes to mind because, as a database developer, this is the application on which you will spend most of your time. Any time you author, edit, run or debug a transformation or job, you will be using Spoon.
Pan. A standalone command line process that can be used to execute transformations you created in Spoon. The data transformation engine Pan reads data from and writes data to various data sources. Pan also allows you to manipulate data.
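Besides the Pan command line, a transformation can also be run programmatically through the PDI (Kettle) Java API. This is only a minimal sketch, assuming the Kettle libraries are on the classpath and using a hypothetical .ktr path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();                                // initialize the PDI engine
    TransMeta meta = new TransMeta("/path/to/sample.ktr");   // hypothetical transformation file
    Trans trans = new Trans(meta);
    trans.execute(null);                                     // no command-line arguments
    trans.waitUntilFinished();
    if (trans.getErrors() > 0) {
      System.err.println("Transformation finished with errors");
    }
  }
}
```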
Kitchen. A standalone command line process that can be used to execute jobs. It executes the jobs designed in the Spoon graphical interface, stored either as XML or in a database repository. Jobs are usually scheduled to run in batch mode at regular intervals.
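Jobs can be run programmatically in the same way, using JobMeta and Job instead of TransMeta and Trans; again a hedged sketch with a hypothetical .kjb path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunJob {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();
    JobMeta jobMeta = new JobMeta("/path/to/nightly_load.kjb", null);  // hypothetical job file, no repository
    Job job = new Job(null, jobMeta);
    job.start();
    job.waitUntilFinished();
    System.out.println("Job errors: " + job.getResult().getNrErrors());
  }
}
```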
Carte. Carte is a lightweight web container that allows you to set up a dedicated, remote ETL server. It provides remote execution capabilities similar to the Data Integration Server, but does not provide scheduling, security integration, or a content management system.