What is Apache Hadoop Sqoop

Hadoop Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational databases.

We can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS).

We can also use Sqoop to transform the data with Hadoop MapReduce and then export the data back to an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported.

Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

After manipulating the imported records with MapReduce or Hive, you may have a result data set which you can then export back to the relational database.

Sqoop also includes a primitive SQL execution shell.
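
For example, a minimal import of one table into HDFS, and a quick query through Sqoop's SQL shell (the eval tool), might look like the following sketch; the host, database, table, and user names here are placeholders:

$ sqoop import --connect jdbc:mysql://db.example.com/testdb --table customers --username dbuser -P
$ sqoop eval --connect jdbc:mysql://db.example.com/testdb --username dbuser -P --query "SELECT * FROM customers LIMIT 10"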

Direct-mode Imports:

When using direct mode, you can specify additional arguments which should be passed to the underlying tool.

If the argument -- is given on the command line, then subsequent arguments are sent directly to the underlying tool.

MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly.

By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel; this channel may be of higher performance than using JDBC.

For example, the following adjusts the character set used by mysqldump:

# sqoop import --connect jdbc:mysql://server.foo.com/db \
    --table bar --direct -- --default-character-set=latin1

Note:- When using direct mode imports, certain database client utilities are expected to be present in the shell path of the task process.

MySQL: mysqldump, mysqlimport
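
A quick way to confirm these utilities are available on the path (assuming a Unix shell) is:

$ which mysqldump
$ which mysqlimport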

Exports and transactions:-

Export is performed by multiple writers in parallel.

Each writer uses a separate connection to the database; these have separate transactions from one another.

Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement.

Every 100 statements, the current transaction within a writer's task is committed, causing a commit every 10,000 rows.

This ensures that transaction buffers do not grow without bound and cause out-of-memory conditions.

Therefore, an export is not an atomic process, and partial results of the export may become visible before the export is complete.

Ex:- $ sqoop export --connect jdbc:mysql://db.example.com/foo \
    --table bar --export-dir /results/bar-data
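
As a rough sketch of how the parallel writers map onto the command line, the number of export tasks (and therefore writers and connections) can be set with -m; the connect string, table, and directory below are placeholders:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar-data -m 4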

Incremental imports:-

Sqoop provides an incremental import mode, which can be used to retrieve only rows newer than some previously imported set of rows.

The following arguments control incremental imports:


1) --check-column (col): Specifies the column to be examined when determining which rows to import.

2) --incremental (mode): Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.

3) --last-value (value): Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

Ex:- $ sqoop import --connect jdbc:mysql://db.foo.com/somedb \
    --table sometable --where "id > 10000" \
    --target-dir /incremental-dataset --append
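
The same import can also be expressed with the incremental arguments listed above; a sketch, with the check column and last value used only as placeholders:

$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
    --incremental append --check-column id --last-value 10000 \
    --target-dir /incremental-dataset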

Sqoop internals:-

This section describes the internal architecture of Sqoop:

  1. General program flow
  2. Sub-packages
  3. Interfacing with MapReduce

The Sqoop program is driven by the com.cloudera.sqoop.Sqoop class.

A limited number of additional classes are in the same package.

  1. General program flow

com.cloudera.sqoop.Sqoop is the main class and implements Tool.

A new instance is launched with ToolRunner.

The first argument to Sqoop is a string identifying the name of a SqoopTool to run.

The SqoopTool itself drives the execution of the user's requested operation (e.g., import, export, codegen, etc.).
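
To illustrate this dispatch from the command line: the tool name is simply the first word after sqoop, and sqoop help lists the available tools (the connect string and table below are placeholders):

$ sqoop help
$ sqoop codegen --connect jdbc:mysql://db.example.com/testdb --table customers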

The SqoopTool parses the arguments, sets the appropriate fields in the SqoopOptions class, and then runs its body.

ConnFactory is used to get a ConnManager from a ManagerFactory.

The ConnManager.importTable() method is left to determine how best to run the import.

Each main action is actually controlled by the ConnManager, except for the generation of code, which is done by the CompilationManager and ClassWriter (both in the com.cloudera.sqoop.orm package).

Importing into Hive is also taken care of via the com.cloudera.sqoop.hive.HiveImport class after importTable() has completed.

The importTable() method receives a single argument of type ImportJobContext, which contains parameters to the method.

Similarly, the exportTable() method receives an argument of type ExportJobContext.

2) Sub-packages:-

The following sub-packages exist under com.cloudera.sqoop:

  1. hive - Facilitates importing data into Hive.
  2. io - Implementations of java.io.* interfaces, namely OutputStream and Writer.
  3. lib - The external public API (described earlier).
  4. manager - The ConnManager and ManagerFactory interfaces and their implementations.
  5. mapreduce - Classes interfacing with the new (0.20+) MapReduce API.
  6. orm - Code auto-generation; depends on the JDK's tools.jar, which provides the com.sun.tools package.
  7. tool - Implementations of SqoopTool.
  8. util - Miscellaneous utility classes.

3) Interfacing with MapReduce:-

Sqoop schedules MapReduce jobs to effect imports and exports.

Configuration and execution of MapReduce jobs follows a few common steps. They are:

  1. Configuring the InputFormat
  2. Configuring the OutputFormat
  3. Setting the Mapper implementation

These steps are formalized in the com.cloudera.sqoop.mapreduce.JobBase class.

JobBase allows a user to specify the InputFormat, OutputFormat, and Mapper to use.

JobBase itself is subclassed by ImportJobBase and ExportJobBase, which offer better support for the particular configuration steps common to import-related or export-related jobs, respectively.

ImportJobBase.runImport() will call the configuration steps and run a job to import a table to HDFS.

For example, DataDrivenImportJob uses DataDrivenDBInputFormat to run an import.

MySQL uses a different class (MySQLDumpImportJob) to run a direct-mode import.

Importing data into Hive:-

We can create the Hive schema using Sqoop.

The main function of Sqoop's import tool is to upload your data into files in HDFS.

If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data's layout in Hive.

Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced.

After your data is imported into HDFS, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.

To create a Hive table without data:

Ex: # sqoop create-hive-table --connect jdbc:mysql://localhost:3306/test --table sample

To create a table in Hive with data:

Ex: # sqoop import --connect jdbc:mysql://localhost:3306/test --table sample -m 1 --hive-import
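
If the Hive table already exists and should be replaced, the overwrite option mentioned above can be added; a sketch using the same placeholder table:

# sqoop import --connect jdbc:mysql://localhost:3306/test --table sample -m 1 --hive-import --hive-overwrite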

Importing data into HBase:-

Sqoop supports additional import targets beyond HDFS and Hive.

Sqoop can also import records into a table in HBase.

By specifying --hbase-table, you instruct Sqoop to import into a table in HBase rather than a directory in HDFS.

Sqoop will import data to the table specified as the argument to --hbase-table.

Each row of the input table will be transformed into an HBase Put operation to a row of the output table.

The key for each row is taken from a column of the input.

You can manually specify the row key column with --hbase-row-key.

Each output column will be placed in the same column family, which must be specified with --column-family.

You should create the target table and column family before running an import.

If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist.

Ex: # sqoop import --connect jdbc:mysql://localhost:3306/test --table ORDERS \
    --hbase-create-table --hbase-table ORDERS --column-family mysql
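
If the default key column is not suitable, the row key column can be named explicitly as described above; a sketch, where ORDER_ID is a placeholder column name:

# sqoop import --connect jdbc:mysql://localhost:3306/test --table ORDERS \
    --hbase-create-table --hbase-table ORDERS --column-family mysql --hbase-row-key ORDER_ID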



