In this article, Apache Sqoop Introduction, we will primarily discuss why this tool exists. Apache Sqoop is part of the Hadoop ecosystem rather than the Hadoop core project. Sqoop is the big data tool we use for transferring data between Hadoop and relational database servers; the name is short for "SQL-to-Hadoop".
In addition, Apache Sqoop automates several steps of this process, such as relying on the database to describe the schema of the data to be imported. To import and export the data, Sqoop uses MapReduce, which provides parallel operation as well as fault tolerance. Sqoop is maintained by the Apache Software Foundation.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool. It offers the following capabilities:
- Import individual tables or entire databases to files in HDFS
- Generate Java classes that let you interact with your imported data
- Import from SQL databases straight into your Hive data warehouse
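Each of the capabilities above maps to a Sqoop command. The sketch below shows their general shape; the JDBC URL, credentials, table name, and paths are hypothetical placeholders, and the commands are echoed here for illustration rather than run against a live cluster.

```shell
# Hypothetical connection details -- replace with your own database.
CONNECT="jdbc:mysql://dbserver.example.com/shop"
AUTH="--username sqoop_user --password-file /user/sqoop/.password"

# 1. Import a single table into files in HDFS
#    (sqoop import-all-tables handles an entire database).
IMPORT_CMD="sqoop import --connect $CONNECT $AUTH \
  --table orders --target-dir /data/shop/orders"

# 2. Generate a Java class that wraps one record of the table.
CODEGEN_CMD="sqoop codegen --connect $CONNECT $AUTH --table orders"

# 3. Import straight into the Hive warehouse instead of plain HDFS files.
HIVE_CMD="sqoop import --connect $CONNECT $AUTH --table orders --hive-import"

# Echoed for illustration; on a real cluster you would run the commands.
echo "$IMPORT_CMD"
echo "$CODEGEN_CMD"
echo "$HIVE_CMD"
```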
Why Apache Sqoop
For a Hadoop developer, the actual game starts after the data is loaded into HDFS. Developers explore this data to extract the insights hidden in it.
For this analysis, the data residing in relational database management systems needs to be transferred to HDFS. Writing MapReduce code to import and export data between a relational database and HDFS is tedious and repetitive. This is where Apache Sqoop comes to the rescue: it automates the process of importing and exporting the data.
Sqoop makes developers' lives easier by providing a CLI for importing and exporting data. They just supply basic information such as database authentication, source, destination, and the operation to perform; Sqoop takes care of the rest.
Internally, Sqoop converts the command into MapReduce tasks, which are then executed over HDFS. It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
Key Features of Sqoop
Sqoop provides many salient features, including:
- Full load: Apache Sqoop can load a whole table with a single command.
- Incremental load: Apache Sqoop can also load parts of a table whenever it is updated, importing only the new or changed rows.
- Parallel import/export: Sqoop uses YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
- Import results of SQL query: You can also import the result returned from an SQL query in HDFS.
- Compression: You can compress your data with the deflate (gzip) algorithm using the `--compress` argument, or by specifying the `--compression-codec` argument. You can also load a compressed table into Apache Hive.
- Connectors for all major RDBMSs: Apache Sqoop provides connectors for multiple RDBMS databases, covering nearly all of the widely used ones.
- Kerberos Security Integration: Kerberos is a computer network authentication protocol which works on the basis of ‘tickets’ to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner. Sqoop supports Kerberos authentication.
- Load data directly into Hive/HBase: You can load data directly into Apache Hive for analysis, or dump it into HBase, a NoSQL database.
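The flags behind three of these features can be sketched as below. The database URL, column names, and paths are hypothetical placeholders, and the commands are echoed rather than run so the sketch stands on its own.

```shell
# Hypothetical connection details -- replace with your own database.
CONNECT="jdbc:mysql://dbserver.example.com/shop"
AUTH="--username sqoop_user --password-file /user/sqoop/.password"

# Incremental load: append only rows whose id is greater than the
# last value imported in the previous run.
INCR_CMD="sqoop import --connect $CONNECT $AUTH --table orders \
  --incremental append --check-column id --last-value 10000"

# Import the result of a free-form SQL query; Sqoop substitutes
# \$CONDITIONS with per-mapper range predicates to split the work.
QUERY_CMD="sqoop import --connect $CONNECT $AUTH \
  --query 'SELECT id, total FROM orders WHERE \$CONDITIONS' \
  --split-by id --target-dir /data/shop/order_totals"

# Compressed import using the default deflate (gzip) codec.
GZIP_CMD="sqoop import --connect $CONNECT $AUTH --table orders --compress"

# Echoed for illustration; on a real cluster you would run the commands.
echo "$INCR_CMD"
echo "$QUERY_CMD"
echo "$GZIP_CMD"
```

Note that a free-form `--query` import requires `--split-by` (or `-m 1`), since Sqoop cannot infer a primary key from an arbitrary query.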
Sqoop Design Approach & Architecture
The import tool imports individual tables from the database into HDFS, where each row of the table is treated as a record. When we submit a Sqoop command, the main task is divided into subtasks, each handled internally by an individual map task. Each map task imports a part of the data into the Hadoop ecosystem; collectively, the map tasks import the whole dataset.
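The split into map tasks is controlled from the command line: `-m` sets the number of parallel mappers, and `--split-by` names the column whose value range is partitioned among them. A sketch with hypothetical names, echoed rather than run:

```shell
# Import one table with 4 parallel map tasks. Sqoop partitions the
# range of the id column into 4 slices, one per mapper; together the
# mappers write part-m-00000 .. part-m-00003 under the target dir.
PARALLEL_CMD="sqoop import \
  --connect jdbc:mysql://dbserver.example.com/shop \
  --username sqoop_user --password-file /user/sqoop/.password \
  --table orders \
  --split-by id \
  -m 4 \
  --target-dir /data/shop/orders"

# Echoed for illustration; on a real cluster you would run the command.
echo "$PARALLEL_CMD"
```

Without `--split-by`, Sqoop falls back to the table's primary key to decide how to divide the rows among the mappers.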