In this post, I’m walking you through the Sqoop 1.99.6 “5 Minutes Demo”, with an emphasis on working with DB2. I hope this post saves you some time on Googling (believe me, I spent quite some time figuring everything out when doing this hands-on practice on my own). Let’s get started.
Preparation Phase
Installation
I highly recommend using the Cloudera Hadoop distribution (CDH) for this hands-on lab, because CDH comes with Sqoop2 pre-installed and everything well configured. I have tried installing Sqoop2 on Hortonworks HDP, and the experience was not fun. So, I will assume you are working with CDH for the rest of this walkthrough.
I’m working with Cloudera CDH 5.7.0 and “sqoop-1.99.6-bin-hadoop200”.
- Once you have Sqoop downloaded, untar it on the client machine where the DB2 instance is installed.
- Download the JDBC 4.0 driver from IBM that matches your DB2 version.
- Install the driver onto CDH’s Sqoop2 and restart the Sqoop2 service:
```
sudo cp db2jcc4.jar db2jcc.jar /var/lib/sqoop2/
sudo /sbin/service sqoop2-server stop
sudo /sbin/service sqoop2-server start
```
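To double-check that the driver landed where Sqoop2 can pick it up, a quick sanity check like this should work (the paths are the CDH defaults used above; `status` is supported by most CDH init scripts, but if yours lacks it, check the server log instead):

```
# Confirm the DB2 JDBC jars are in Sqoop2's lib directory
ls -l /var/lib/sqoop2/ | grep db2
# Confirm the server came back up after the restart
sudo /sbin/service sqoop2-server status
```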
Create a test database and table in DB2 to work with
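My database HZY and table TEST1 already exist. If you need to build an equivalent test setup from scratch, something like this should do (a minimal sketch using the DB2 command line, run as the instance owner):

```
$ db2 create database hzy
$ db2 connect to hzy
$ db2 "create table test1 (c1 int)"
$ db2 "insert into test1 values (4)"
```

You can verify what you have: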
```
$ db2 list db directory

 System Database Directory

 Number of entries in the directory = 1

Database 1 entry:

 Database alias                       = HZY
 Database name                        = HZY
 Local database directory             = /home/iidev20
 Database release level               = 14.00
 Comment                              =
 Directory entry type                 = Indirect
 Catalog database partition number    = 0
 Alternate server hostname            =
 Alternate server port number         =

$ db2 connect to hzy

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.1.0
 SQL authorization ID   = IIDEV20
 Local database alias   = HZY

$ db2 list tables

Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------
TEST1                           IIDEV20         T     2016-07-12-20.33.38.801694

  1 record(s) selected.

$ db2 "select * from test1"

C1
-----------
          4

  1 record(s) selected.
```
Creating Link Object
Check for the registered connectors on your sqoop server:
```
sqoop:000> show connector
+----+------------------------+-----------------+------------------------------------------------------+----------------------+
| Id | Name                   | Version         | Class                                                | Supported Directions |
+----+------------------------+-----------------+------------------------------------------------------+----------------------+
| 1  | kite-connector         | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.kite.KiteConnector        | FROM/TO              |
| 2  | kafka-connector        | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.kafka.KafkaConnector      | TO                   |
| 3  | hdfs-connector         | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.hdfs.HdfsConnector        | FROM/TO              |
| 4  | generic-jdbc-connector | 1.99.5-cdh5.7.0 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO              |
+----+------------------------+-----------------+------------------------------------------------------+----------------------+
```
The generic-jdbc-connector in our example has persistent id 4, and we will use this value to create a new link object for this connector:
```
sqoop:000> create link -c 4
Creating link for connector with id 4
Please fill following values to create new link object
Name: Fist Link

Link configuration

JDBC Driver Class: com.ibm.db2.jcc.DB2Driver
JDBC Connection String: jdbc:db2://9.112.250.80:50591/HZY
Username: iidev20
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and persistent id 4
```
There are two things worth mentioning about DB2:
- The “JDBC Driver Class” for DB2 is `com.ibm.db2.jcc.DB2Driver`.
- The “JDBC Connection String” follows the pattern `jdbc:db2://<host>:<port>/<database>`: I specify the IP address of the machine where the DB2 instance resides, then the port number DB2 listens on, and lastly the name of the database we just created. The port number “50591” can be obtained by:
```
$ db2 get dbm cfg | grep SVCE
 TCP/IP Service name                          (SVCENAME) = iidev20
 SSL service name                         (SSL_SVCENAME) =
$ grep xiidev20 /etc/services
xiidev20        50591/tcp
xiidev20_int    50592/tcp
```
Our new link object was created with assigned id 4.
Let us create another link object but this time for the hdfs-connector instead:
```
sqoop:000> create link -c 3
Creating link for connector with id 3
Please fill following values to create new link object
Name: Second Link

Link configuration

HDFS URI: hdfs://quickstart.cloudera:8020/
New link was successfully created with validation status OK and persistent id 5
```
- “quickstart.cloudera” is the hostname, which can be obtained by running the `hostname` command in CDH.
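Alternatively, you can read the default filesystem URI straight from the cluster configuration, which is handy if you are not sure about the port either:

```
# Prints the default filesystem URI, e.g. hdfs://quickstart.cloudera:8020
hdfs getconf -confKey fs.defaultFS
```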
Now, we have created two links. You can check them by running `show link --all`:
```
sqoop:000> show link --all
2 link(s) to show:
link with id 4 and name Fist Link (Enabled: true, Created by iidev20 at 8/4/16 11:57 AM, Updated by iidev20 at 8/4/16 11:57 AM)
Using Connector generic-jdbc-connector with id 4
  Link configuration
    JDBC Driver Class: com.ibm.db2.jcc.DB2Driver
    JDBC Connection String: jdbc:db2://9.112.250.80:50591/HZY
    Username: iidev20
    Password:
    JDBC Connection Properties:
link with id 5 and name Second Link (Enabled: true, Created by iidev20 at 8/4/16 12:36 PM, Updated by iidev20 at 8/4/16 12:36 PM)
Using Connector hdfs-connector with id 3
  Link configuration
    HDFS URI: hdfs://quickstart.cloudera:8020/
```
Creating Job Object
Now, we create the job object:
```
sqoop:000> create job -f 4 -t 5
Creating job for links with from id 4 and to id 5
Please fill following values to create new job object
Name: sqoopy

From database configuration

Schema name: IIDEV20
Table name: TEST1
Table SQL statement:
Table column names:
Partition column name: C1
Null value allowed for the partition column:
Boundary query:

ToJob configuration

Override null value:
Null value:
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format:
Output directory: /cloudera/

Throttling resources

Extractors: 2
Loaders: 2
New job was successfully created with validation status OK and persistent id 3
sqoop:000> show job
+----+--------+----------------+--------------+---------+
| Id | Name   | From Connector | To Connector | Enabled |
+----+--------+----------------+--------------+---------+
| 3  | sqoopy | 4              | 3            | true    |
+----+--------+----------------+--------------+---------+
```
The idea here is quite simple: we use the JDBC connector to read data from our DB2 table (indicated by `-f 4`), and we write the data to HDFS (indicated by `-t 5`).
Here, you need to pay attention to “Partition column name: C1”. If you don’t specify the partition column, you may hit the following error when starting the job:
```
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException
Message: CLIENT_0001:Server has returned exception
```
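The partition column is what Sqoop uses to split the table across extractors: roughly, the connector first queries the minimum and maximum of that column and then hands each extractor a slice of the range. You can preview the boundaries yourself (a sketch; the exact SQL the generic JDBC connector generates internally may differ):

```
$ db2 "select min(c1), max(c1) from test1"
```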
Our new job object was created with assigned id 3.
Start Job (Data Transfer)
Now we are ready to start the sqoop job we just created:
```
sqoop:000> start job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
    http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 11:57:43 CDT: BOOTING  - Progress is not available
```
You can iteratively check your running job status with the `status job` command:
```
sqoop:000> status job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
    http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 11:58:30 CDT: RUNNING  - 0.00 %
sqoop:000> status job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
    http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 11:58:47 CDT: RUNNING  - 50.00 %
sqoop:000> status job -j 3
Submission details
Job ID: 3
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 11:57:43 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0125
    http://quickstart.cloudera:8088/proxy/application_1469730693462_0125/
2016-08-04 12:00:02 CDT: SUCCEEDED
Counters:
    org.apache.hadoop.mapreduce.FileSystemCounter
        FILE_LARGE_READ_OPS: 0
        FILE_WRITE_OPS: 0
        HDFS_READ_OPS: 1
        HDFS_BYTES_READ: 110
        HDFS_LARGE_READ_OPS: 0
        FILE_READ_OPS: 0
        FILE_BYTES_WRITTEN: 373985
        FILE_BYTES_READ: 17
        HDFS_WRITE_OPS: 2
        HDFS_BYTES_WRITTEN: 2
    org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
        BYTES_WRITTEN: 0
    org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
        BYTES_READ: 0
    org.apache.hadoop.mapreduce.JobCounter
        MB_MILLIS_MAPS: 20448256
        TOTAL_LAUNCHED_MAPS: 1
        VCORES_MILLIS_REDUCES: 82496
        TOTAL_LAUNCHED_REDUCES: 2
        VCORES_MILLIS_MAPS: 19969
        SLOTS_MILLIS_REDUCES: 82496
        MB_MILLIS_REDUCES: 84475904
        SLOTS_MILLIS_MAPS: 19969
        MILLIS_REDUCES: 82496
        MILLIS_MAPS: 19969
        OTHER_LOCAL_MAPS: 1
    org.apache.sqoop.submission.counter.SqoopCounters
        ROWS_READ: 1
        ROWS_WRITTEN: 1
    Shuffle Errors
        CONNECTION: 0
        WRONG_LENGTH: 0
        BAD_ID: 0
        WRONG_MAP: 0
        WRONG_REDUCE: 0
        IO_ERROR: 0
    org.apache.hadoop.mapreduce.TaskCounter
        MAP_OUTPUT_MATERIALIZED_BYTES: 17
        MERGED_MAP_OUTPUTS: 2
        SPILLED_RECORDS: 2
        REDUCE_INPUT_RECORDS: 1
        VIRTUAL_MEMORY_BYTES: 4516757504
        MAP_INPUT_RECORDS: 0
        SPLIT_RAW_BYTES: 110
        FAILED_SHUFFLE: 0
        REDUCE_SHUFFLE_BYTES: 17
        MAP_OUTPUT_BYTES: 3
        PHYSICAL_MEMORY_BYTES: 466186240
        GC_TIME_MILLIS: 1305
        REDUCE_INPUT_GROUPS: 1
        COMBINE_OUTPUT_RECORDS: 0
        SHUFFLED_MAPS: 2
        REDUCE_OUTPUT_RECORDS: 1
        MAP_OUTPUT_RECORDS: 1
        COMBINE_INPUT_RECORDS: 0
        CPU_MILLISECONDS: 7470
        COMMITTED_HEAP_BYTES: 393216000
Job executed successfully
```
`start job -j 3 -s` allows you to start a sqoop job and observe its running status in one step. `stop job -j 3` will stop the running job at any time.
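For example, if a transfer misbehaves and you need to kill it (this assumes job id 3 is the one currently running):

```
sqoop:000> stop job -j 3
```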
Check Result
We can check the result in CDH:
```
[cloudera@quickstart ~]$ hdfs dfs -ls /cloudera/
Found 2 items
-rw-r--r--   1 sqoop2 supergroup          0 2016-08-04 09:59 /cloudera/33895be5-a670-4e25-aada-a66fc2cf1919.txt
-rw-r--r--   1 sqoop2 supergroup          2 2016-08-04 09:59 /cloudera/ffe359d6-afe9-40e9-baf9-d2e29937a86c.txt
[cloudera@quickstart ~]$ hdfs dfs -cat /cloudera/33895be5-a670-4e25-aada-a66fc2cf1919.txt
[cloudera@quickstart ~]$ hdfs dfs -cat /cloudera/ffe359d6-afe9-40e9-baf9-d2e29937a86c.txt
4
```
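Since the output file names are generated UUIDs, you can also cat everything in the output directory at once:

```
[cloudera@quickstart ~]$ hdfs dfs -cat /cloudera/*.txt
4
```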
Extra 2 Minutes …
In this section, we transfer data back from HDFS to DB2 table:
```
sqoop:000> create job -f 5 -t 4
Creating job for links with from id 5 and to id 4
Please fill following values to create new job object
Name: h2d

From Job configuration

Input directory: /cloudera/
Override null value:
Null value:

To database configuration

Schema name: IIDEV20
Table name: TEST1
Table SQL statement:
Table column names:
Stage table name:
Should clear stage table:

Throttling resources

Extractors: 2
Loaders: 2
New job was successfully created with validation status OK and persistent id 4
```
Then, we start the job:
```
sqoop:000> start job -j 4 -s
Submission details
Job ID: 4
Server URL: http://9.181.139.59:12000/sqoop/
Created by: iidev20
Creation date: 2016-08-04 14:53:20 CDT
Lastly updated by: iidev20
External ID: job_1469730693462_0134
    http://quickstart.cloudera:8088/proxy/application_1469730693462_0134/
2016-08-04 14:53:20 CDT: BOOTING  - Progress is not available
2016-08-04 14:53:37 CDT: BOOTING  - 0.00 %
2016-08-04 14:53:48 CDT: BOOTING  - 0.00 %
2016-08-04 14:53:58 CDT: BOOTING  - 0.00 %
2016-08-04 14:54:11 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:21 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:32 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:43 CDT: RUNNING  - 0.00 %
2016-08-04 14:54:54 CDT: RUNNING  - 0.00 %
2016-08-04 14:55:05 CDT: RUNNING  - 0.00 %
2016-08-04 14:55:16 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:27 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:38 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:49 CDT: RUNNING  - 50.00 %
2016-08-04 14:55:59 CDT: RUNNING  - 50.00 %
2016-08-04 14:56:10 CDT: RUNNING  - 50.00 %
2016-08-04 14:56:21 CDT: RUNNING  - 100.00 %
2016-08-04 14:56:35 CDT: SUCCEEDED
Counters:
    org.apache.hadoop.mapreduce.FileSystemCounter
        FILE_LARGE_READ_OPS: 0
        FILE_WRITE_OPS: 0
        HDFS_READ_OPS: 8
        HDFS_BYTES_READ: 427
        HDFS_LARGE_READ_OPS: 0
        FILE_READ_OPS: 0
        FILE_BYTES_WRITTEN: 494580
        FILE_BYTES_READ: 17
        HDFS_WRITE_OPS: 0
        HDFS_BYTES_WRITTEN: 0
    org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
        BYTES_WRITTEN: 0
    org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
        BYTES_READ: 0
    org.apache.hadoop.mapreduce.JobCounter
        MB_MILLIS_MAPS: 109466624
        TOTAL_LAUNCHED_MAPS: 2
        VCORES_MILLIS_REDUCES: 135729
        TOTAL_LAUNCHED_REDUCES: 2
        VCORES_MILLIS_MAPS: 106901
        SLOTS_MILLIS_REDUCES: 135729
        MB_MILLIS_REDUCES: 138986496
        SLOTS_MILLIS_MAPS: 106901
        MILLIS_REDUCES: 135729
        MILLIS_MAPS: 106901
        OTHER_LOCAL_MAPS: 2
    org.apache.sqoop.submission.counter.SqoopCounters
        ROWS_READ: 1
        ROWS_WRITTEN: 1
    Shuffle Errors
        CONNECTION: 0
        WRONG_LENGTH: 0
        BAD_ID: 0
        WRONG_MAP: 0
        WRONG_REDUCE: 0
        IO_ERROR: 0
    org.apache.hadoop.mapreduce.TaskCounter
        MAP_OUTPUT_MATERIALIZED_BYTES: 29
        MERGED_MAP_OUTPUTS: 4
        SPILLED_RECORDS: 2
        REDUCE_INPUT_RECORDS: 1
        VIRTUAL_MEMORY_BYTES: 6016950272
        MAP_INPUT_RECORDS: 0
        SPLIT_RAW_BYTES: 420
        FAILED_SHUFFLE: 0
        REDUCE_SHUFFLE_BYTES: 29
        MAP_OUTPUT_BYTES: 3
        PHYSICAL_MEMORY_BYTES: 670957568
        GC_TIME_MILLIS: 4852
        REDUCE_INPUT_GROUPS: 1
        COMBINE_OUTPUT_RECORDS: 0
        SHUFFLED_MAPS: 4
        REDUCE_OUTPUT_RECORDS: 1
        MAP_OUTPUT_RECORDS: 1
        COMBINE_INPUT_RECORDS: 0
        CPU_MILLISECONDS: 9510
        COMMITTED_HEAP_BYTES: 453255168
Job executed successfully
```
Finally, we verify the result from DB2’s perspective:
$ db2 "select * from test1" C1 ----------- 4 4 2 record(s) selected.