Sqoop2 7 Minutes Demo with DB2

In this post, I’ll walk you through the Sqoop 1.99.6 “5 Minutes Demo” with an emphasis on working with DB2. I hope this post saves you some time on Google (believe me, I spent quite a while figuring everything out when doing this hands-on practice on my own). Let’s get started.

Preparation phase

Installation

I highly recommend using the Cloudera Hadoop distribution (CDH) for this hands-on lab, because CDH ships with Sqoop2 pre-installed and everything well configured. I tried installing Sqoop2 on Hortonworks HDP, and the experience was not fun. So I will assume you are working with CDH for the rest of this walkthrough.

I’m working with Cloudera CDH 5.7.0 and “sqoop-1.99.6-bin-hadoop200”.

  1. Once you have Sqoop downloaded, untar it on the machine where your DB2 instance is installed.
  2. Download the JDBC 4.0 driver from IBM that matches your DB2 version.
  3. Install the driver onto the CDH Sqoop2 server and restart the Sqoop2 service (a quick sanity check follows the commands below):
sudo cp db2jcc4.jar db2jcc.jar /var/lib/sqoop2/
sudo /sbin/service sqoop2-server stop
sudo /sbin/service sqoop2-server start
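
As a quick sanity check, confirm the driver jars landed where Sqoop2 expects them (the same path used in the copy command above); you should see both db2jcc.jar and db2jcc4.jar listed:

$ ls /var/lib/sqoop2/ | grep db2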

Create a test database and table in DB2 to work with
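
If you don’t already have a database and table to work with, the following DB2 CLP commands create ones matching the output below (the database name HZY and the one-row table TEST1 are simply what I used; adjust to taste):

$ db2 create database hzy
$ db2 connect to hzy
$ db2 "create table test1 (c1 int)"
$ db2 "insert into test1 values (4)"

With everything in place, verify the setup: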

$ db2 list db directory

System Database Directory

Number of entries in the directory = 1

Database 1 entry:

Database alias                       = HZY
Database name                        = HZY
Local database directory             = /home/iidev20
Database release level               = 14.00
Comment                              =
Directory entry type                 = Indirect
Catalog database partition number    = 0
Alternate server hostname            =
Alternate server port number         =

$ db2 connect to hzy

  Database Connection Information

Database server        = DB2/LINUXX8664 11.1.0
SQL authorization ID   = IIDEV20
Local database alias   = HZY

$ db2 list tables

Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------
TEST1                           IIDEV20         T     2016-07-12-20.33.38.801694

  1 record(s) selected.

$ db2 "select * from test1"

C1
-----------
          4

  1 record(s) selected.

Creating Link Object

Check the registered connectors on your Sqoop server:

sqoop:000> show connector
+----+------------------------+-----------------+------------------------------------------------------+----------------------+
| Id |          Name          |     Version     |                        Class                         | Supported Directions |
+----+------------------------+-----------------+------------------------------------------------------+----------------------+
| 1  | kite-connector         | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.kite.KiteConnector        | FROM/TO              |
| 2  | kafka-connector        | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.kafka.KafkaConnector      | TO                   |
| 3  | hdfs-connector         | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.hdfs.HdfsConnector        | FROM/TO              |
| 4  | generic-jdbc-connector | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO              |
+----+------------------------+-----------------+------------------------------------------------------+----------------------+

The Generic JDBC Connector in our example has persistent id 4, and we will use this value to create a new link object for this connector:

sqoop:000> create link -c 4
Creating link for connector with id 4
Please fill following values to create new link object
Name: First Link

Link configuration

JDBC Driver Class: com.ibm.db2.jcc.DB2Driver
JDBC Connection String: jdbc:db2://9.112.250.80:50591/HZY
Username: iidev20
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and persistent id 4

There are two things worth mentioning about DB2:

  1. The “JDBC Driver Class” for DB2 is “com.ibm.db2.jcc.DB2Driver”.
  2. In the “JDBC Connection String”, I specify the IP address of the machine where the DB2 instance resides, then the port DB2 is listening on, and finally the name of the database we just created (the full URL template is shown after the listing below). The port number “50591” can be obtained by:
$ db2 get dbm cfg | grep SVCE
 TCP/IP Service name                          (SVCENAME) = iidev20
 SSL service name                         (SSL_SVCENAME) =

$ grep xiidev20 /etc/services
xiidev20                      50591/tcp
xiidev20_int                  50592/tcp
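
Putting the pieces together, the connection string follows the standard DB2 JDBC URL template (substitute your own host, port, and database name):

jdbc:db2://<hostname-or-IP>:<port>/<database>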

Our new link object was created with assigned id 4.

Let us create another link object, this time for the hdfs-connector:

sqoop:000> create link -c 3
Creating link for connector with id 3
Please fill following values to create new link object
Name: Second Link

Link configuration

HDFS URI: hdfs://quickstart.cloudera:8020/
New link was successfully created with validation status OK and persistent id 5
Note: “quickstart.cloudera” is the hostname, which you can obtain by running the hostname command on the CDH machine.
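For example, on the CDH QuickStart VM:

[cloudera@quickstart ~]$ hostname
quickstart.cloudera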

Now we have created two links. You can check them by running show link --all:

sqoop:000> show link --all
2 link(s) to show:
link with id 4 and name First Link (Enabled: true, Created by iidev20 at 8/4/16 11:57 AM, Updated by iidev20 at 8/4/16 11:57 AM)
Using Connector generic-jdbc-connector with id 4
  Link configuration
      JDBC Driver Class: com.ibm.db2.jcc.DB2Driver
      JDBC Connection String: jdbc:db2://9.112.250.80:50591/HZY
      Username: iidev20
      Password:
      JDBC Connection Properties:
link with id 5 and name Second Link (Enabled: true, Created by iidev20 at 8/4/16 12:36 PM, Updated by iidev20 at 8/4/16 12:36 PM)
Using Connector hdfs-connector with id 3
  Link configuration
    HDFS URI: hdfs://quickstart.cloudera:8020/

Creating Job Object

Now, we create the job object:

sqoop:000> create job -f 4 -t 5
Creating job for links with from id 4 and to id 5
Please fill following values to create new job object
Name: sqoopy

From database configuration

Schema name: IIDEV20
Table name: TEST1
Table SQL statement:
Table column names:
Partition column name: C1
Null value allowed for the partition column:
Boundary query:

ToJob configuration

Override null value:
Null value:
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format:
Output directory: /cloudera/

Throttling resources

Extractors: 2
Loaders: 2
New job was successfully created with validation status OK and persistent id 3

sqoop:000> show job
+----+--------+----------------+--------------+---------+
| Id |  Name  | From Connector | To Connector | Enabled |
+----+--------+----------------+--------------+---------+
| 3  | sqoopy | 4              | 3            | true    |
+----+--------+----------------+--------------+---------+

The idea here is quite simple: -f 4 tells Sqoop to read from link 4 (our DB2 JDBC link), and -t 5 tells it to write to link 5 (our HDFS link).

Here, pay attention to “Partition column name: C1”. If you don’t specify a partition column, you may hit the following error when starting the job:

Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0001:Server has returned exception
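
By default the shell prints only this one-line summary. To see the underlying server-side stack trace (which should tell you what actually went wrong), you can turn on verbose mode in the Sqoop2 shell and re-run the command:

sqoop:000> set option --name verbose --value true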

Our new job object was created with assigned id 3.

Start Job (Data Transfer)

Now we are ready to start the sqoop job we just created:

sqoop:000> start job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
        http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 11:57:43 CDT: BOOTING  - Progress is not available

You can repeatedly check the status of your running job with the status job command:

sqoop:000> status job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
        http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 11:58:30 CDT: RUNNING  - 0.00 %

sqoop:000> status job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
        http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 11:58:47 CDT: RUNNING  - 50.00 %

sqoop:000> status job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
        http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 12:00:02 CDT: SUCCEEDED
Counters:
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_LARGE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                HDFS_READ_OPS: 1
                HDFS_BYTES_READ: 110
                HDFS_LARGE_READ_OPS: 0
                FILE_READ_OPS: 0
                FILE_BYTES_WRITTEN: 373985
                FILE_BYTES_READ: 17
                HDFS_WRITE_OPS: 2
                HDFS_BYTES_WRITTEN: 2
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 0
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.JobCounter
                MB_MILLIS_MAPS: 20448256
                TOTAL_LAUNCHED_MAPS: 1
                VCORES_MILLIS_REDUCES: 82496
                TOTAL_LAUNCHED_REDUCES: 2
                VCORES_MILLIS_MAPS: 19969
                SLOTS_MILLIS_REDUCES: 82496
                MB_MILLIS_REDUCES: 84475904
                SLOTS_MILLIS_MAPS: 19969
                MILLIS_REDUCES: 82496
                MILLIS_MAPS: 19969
                OTHER_LOCAL_MAPS: 1
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_READ: 1
                ROWS_WRITTEN: 1
        Shuffle Errors
                CONNECTION: 0
                WRONG_LENGTH: 0
                BAD_ID: 0
                WRONG_MAP: 0
                WRONG_REDUCE: 0
                IO_ERROR: 0
        org.apache.hadoop.mapreduce.TaskCounter
                MAP_OUTPUT_MATERIALIZED_BYTES: 17
                MERGED_MAP_OUTPUTS: 2
                SPILLED_RECORDS: 2
                REDUCE_INPUT_RECORDS: 1
                VIRTUAL_MEMORY_BYTES: 4516757504
                MAP_INPUT_RECORDS: 0
                SPLIT_RAW_BYTES: 110
                FAILED_SHUFFLE: 0
                REDUCE_SHUFFLE_BYTES: 17
                MAP_OUTPUT_BYTES: 3
                PHYSICAL_MEMORY_BYTES: 466186240
                GC_TIME_MILLIS: 1305
                REDUCE_INPUT_GROUPS: 1
                COMBINE_OUTPUT_RECORDS: 0
                SHUFFLED_MAPS: 2
                REDUCE_OUTPUT_RECORDS: 1
                MAP_OUTPUT_RECORDS: 1
                COMBINE_INPUT_RECORDS: 0
                CPU_MILLISECONDS: 7470
                COMMITTED_HEAP_BYTES: 393216000
Job executed successfully
Note: start job -j 3 -s starts a Sqoop job and follows its running status in the shell; stop job -j 3 lets you stop a running job at any time.
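For instance, to cancel a long-running transfer:

sqoop:000> stop job -j 3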

 

Check Result

We can check the result in CDH. Note that one of the two output files is empty: we ran with two extractors, but the source table has only one row, so one partition had nothing to write.

[cloudera@quickstart ~]$ hdfs dfs -ls /cloudera/
Found 2 items
-rw-r--r--   1 sqoop2 supergroup          0 2016-08-04 09:59 /cloudera/33895be5-a670-4e25-aada-a66fc2cf1919.txt
-rw-r--r--   1 sqoop2 supergroup          2 2016-08-04 09:59 /cloudera/ffe359d6-afe9-40e9-baf9-d2e29937a86c.txt
[cloudera@quickstart ~]$ hdfs dfs -cat /cloudera/33895be5-a670-4e25-aada-a66fc2cf1919.txt
[cloudera@quickstart ~]$ hdfs dfs -cat /cloudera/ffe359d6-afe9-40e9-baf9-d2e29937a86c.txt
4

 

Extra 2 Minutes …

In this section, we transfer the data back from HDFS to the DB2 table:

sqoop:000> create job -f 5 -t 4
Creating job for links with from id 5 and to id 4
Please fill following values to create new job object
Name: h2d

From Job configuration

Input directory: /cloudera/
Override null value:
Null value:

To database configuration

Schema name: IIDEV20
Table name: TEST1
Table SQL statement:
Table column names:
Stage table name:
Should clear stage table:

Throttling resources

Extractors: 2
Loaders: 2
New job was successfully created with validation status OK and persistent id 4

Then we start the job, this time with -s so the shell follows its status through to completion:

sqoop:000> start job -j 4 -s
Submission details
Job ID: 4
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 14:53:20 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0134
        http://quickstart.cloudera:8088/proxy/application_1469730693462_0134/
2016-08-04 14:53:20 CDT: BOOTING  - Progress is not available
2016-08-04 14:53:37 CDT: BOOTING  - 0.00 %
2016-08-04 14:53:48 CDT: BOOTING  - 0.00 %
2016-08-04 14:53:58 CDT: BOOTING  - 0.00 %
2016-08-04 14:54:11 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:21 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:32 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:43 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:54 CDT: RUNNING  - 0.00 %
2016-08-04 14:55:05 CDT: RUNNING  - 0.00 %
2016-08-04 14:55:16 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:27 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:38 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:49 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:59 CDT: RUNNING  - 50.00 %
2016-08-04 14:56:10 CDT: RUNNING  - 50.00 %
2016-08-04 14:56:21 CDT: RUNNING  - 100.00 %
2016-08-04 14:56:35 CDT: SUCCEEDED
Counters:
      org.apache.hadoop.mapreduce.FileSystemCounter
              FILE_LARGE_READ_OPS: 0
              FILE_WRITE_OPS: 0
              HDFS_READ_OPS: 8
              HDFS_BYTES_READ: 427
              HDFS_LARGE_READ_OPS: 0
              FILE_READ_OPS: 0
              FILE_BYTES_WRITTEN: 494580
              FILE_BYTES_READ: 17
              HDFS_WRITE_OPS: 0
              HDFS_BYTES_WRITTEN: 0
      org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
              BYTES_WRITTEN: 0
      org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
              BYTES_READ: 0
      org.apache.hadoop.mapreduce.JobCounter
              MB_MILLIS_MAPS: 109466624
              TOTAL_LAUNCHED_MAPS: 2
              VCORES_MILLIS_REDUCES: 135729
              TOTAL_LAUNCHED_REDUCES: 2
              VCORES_MILLIS_MAPS: 106901
              SLOTS_MILLIS_REDUCES: 135729
              MB_MILLIS_REDUCES: 138986496
              SLOTS_MILLIS_MAPS: 106901
              MILLIS_REDUCES: 135729
              MILLIS_MAPS: 106901
              OTHER_LOCAL_MAPS: 2
      org.apache.sqoop.submission.counter.SqoopCounters
              ROWS_READ: 1
              ROWS_WRITTEN: 1
      Shuffle Errors
              CONNECTION: 0
              WRONG_LENGTH: 0
              BAD_ID: 0
              WRONG_MAP: 0
              WRONG_REDUCE: 0
              IO_ERROR: 0
      org.apache.hadoop.mapreduce.TaskCounter
              MAP_OUTPUT_MATERIALIZED_BYTES: 29
              MERGED_MAP_OUTPUTS: 4
              SPILLED_RECORDS: 2
              REDUCE_INPUT_RECORDS: 1
              VIRTUAL_MEMORY_BYTES: 6016950272
              MAP_INPUT_RECORDS: 0
              SPLIT_RAW_BYTES: 420
              FAILED_SHUFFLE: 0
              REDUCE_SHUFFLE_BYTES: 29
              MAP_OUTPUT_BYTES: 3
              PHYSICAL_MEMORY_BYTES: 670957568
              GC_TIME_MILLIS: 4852
              REDUCE_INPUT_GROUPS: 1
              COMBINE_OUTPUT_RECORDS: 0
              SHUFFLED_MAPS: 4
              REDUCE_OUTPUT_RECORDS: 1
              MAP_OUTPUT_RECORDS: 1
              COMBINE_INPUT_RECORDS: 0
              CPU_MILLISECONDS: 9510
              COMMITTED_HEAP_BYTES: 453255168
Job executed successfully

Finally, we verify the result from DB2’s perspective:

$ db2 "select * from test1"

C1
-----------
          4
          4

  2 record(s) selected.
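
As an extra sanity check, you can also count the rows directly; it should now report 2:

$ db2 "select count(*) from test1"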