Downloading And Installing Apache Sqoop

This guide helps you download and install Apache Sqoop. Apache Sqoop supports the Linux operating system, and there are several installation options. One option is the source tarball that is provided with every release. This tarball contains only the source code of the project. You can't use it directly; you first need to compile the sources into binary executables.

For your convenience, the Sqoop community provides a binary tarball for each major supported version of Hadoop along with the source tarball. In addition to the tarballs, there are open source projects and commercial companies that provide operating system-specific packages. One such project, Apache Bigtop, provides rpm packages for Red Hat, CentOS, and SUSE, and deb packages for Ubuntu and Debian. The biggest benefit of using packages over tarballs is their seamless integration with the operating system: for example, configuration files are stored in /etc/ and logs in /var/log.

You can download the binary tarballs from the Apache Sqoop website. All binary tarballs contain a .bin__hadoop string embedded in their name, followed by the Apache Hadoop major version that was used to generate them. For Hadoop 1.x, the archive name will include the string .bin__hadoop-1.0.0. While the naming convention suggests this tarball only works with version 1.0.0, it is in fact fully compatible not only with the entire 1.0.x release branch but also with version 1.1.0. It's very important to download the binary tarball created for your Hadoop version. Hadoop has changed internal interfaces between some of the major versions; therefore, a Sqoop tarball that was compiled against Hadoop version 1.x will not work with, say, Hadoop version 2.x.
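To check which Hadoop version a downloaded archive was built against, you can read it straight from the file name. The following is a minimal sketch of that idea; the function name hadoop_version_of is made up for illustration, the file name is only an example, and the sketch assumes the archive ends in .tar.gz.

```shell
# Print the Hadoop version embedded in a Sqoop binary tarball name.
# Returns non-zero when the name has no .bin__hadoop- marker
# (for example, a source tarball).
hadoop_version_of() {
  name=${1##*/}                      # strip any leading directory
  case "$name" in
    *.bin__hadoop-*) ;;              # binary tarball, keep going
    *) return 1 ;;                   # marker missing
  esac
  version=${name#*.bin__hadoop-}     # drop everything up to and including the marker
  version=${version%.tar.gz}         # drop the archive extension (assumed .tar.gz)
  printf '%s\n' "$version"
}

# Illustrative file name; prints 2.6.0
hadoop_version_of sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
```

Compare the printed version with the major version of the Hadoop installation you plan to run Sqoop against before unpacking the archive.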

Installing packages is simpler than using tarballs. Packages are already integrated with the operating system, and the package manager will automatically download and install most of the required dependencies during the Sqoop installation.

Installing Apache Sqoop via Bigtop

Bigtop provides repositories that can be easily added to your system in order to find and install the dependencies. Bigtop installation instructions can be found in the Bigtop project documentation. Once Bigtop is successfully deployed, installing Sqoop is very simple and can be done with one of the following commands:

To install Sqoop on a Red Hat, CentOS, or other yum-based system:

$ sudo yum install sqoop

To install Sqoop on an Ubuntu, Debian, or other deb-based system:

$ sudo apt-get install sqoop

To install Sqoop on a SLES system:

$ sudo zypper install sqoop
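If you script the installation across machines running different distributions, one option is to check which package manager is present and choose the matching command. The sketch below is not part of Bigtop; the function name sqoop_install_command is made up for illustration, and it assumes the Bigtop repository is already configured. It only prints the command instead of running it.

```shell
# Print the Sqoop install command for the first package manager found.
# Checks yum, apt-get, and zypper, in that order.
sqoop_install_command() {
  for pm in yum apt-get zypper; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "sudo $pm install sqoop"
      return 0
    fi
  done
  echo "no supported package manager found" >&2
  return 1
}

# Show the command that would be used on this machine (if any).
sqoop_install_command || true
```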

Sqoop's main configuration file, sqoop-site.xml, is available in the configuration directory (conf/ when using the tarball or /etc/sqoop/conf when using Bigtop packages). While you can further customize Sqoop, the defaults will suffice in the majority of cases. All available properties are documented in the sqoop-site.xml file.

Apache Sqoop & Cloudera QuickStart

The Cloudera QuickStart VM already has Sqoop available, so you don't need to install anything. The VM also comes with the MySQL JDBC driver installed, so you can start practicing right away.

Installing JDBC Drivers

Sqoop requires the JDBC drivers for your specific database server (MySQL, Oracle, etc.) in order to transfer data. They are not bundled in the tarball or packages.

You need to download the JDBC drivers and then install them into Sqoop. JDBC drivers are usually available free of charge from the database vendors' websites. Some enterprise data stores might bundle the driver with the installation itself. After you've obtained the driver, you need to copy the driver's JAR file(s) into Sqoop's lib/ directory. If you're using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unpacking the tarball. If you're using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory.
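The copy step can be wrapped in a small helper that checks both paths first. This is only a sketch: the function name install_jdbc_driver and the JAR file name in the usage comment are made up for illustration, and the real file name depends on the vendor and driver version.

```shell
# Copy a JDBC driver JAR into Sqoop's lib directory.
# Usage: install_jdbc_driver /path/to/driver.jar /path/to/sqoop/lib
install_jdbc_driver() {
  jar=$1
  lib_dir=$2
  if [ ! -f "$jar" ]; then
    echo "driver JAR not found: $jar" >&2
    return 1
  fi
  if [ ! -d "$lib_dir" ]; then
    echo "Sqoop lib directory not found: $lib_dir" >&2
    return 1
  fi
  cp "$jar" "$lib_dir/"
}

# Example with a hypothetical JAR name (package installs may require root):
# install_jdbc_driver ./mysql-connector-java.jar /usr/lib/sqoop/lib
```

For a tarball installation, pass the lib/ directory inside the unpacked Sqoop directory as the second argument instead.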

Each database vendor has a slightly different method for retrieving the JDBC driver. Most of them make it available as a free download from their websites.

Installing Specialized Connectors

Some database systems provide special connectors that are not part of the Sqoop distribution and that take advantage of advanced database features. If you want to take advantage of these optimizations, you will need to download and install those specialized connectors individually.

On the node running Sqoop, you can install the specialized connectors anywhere on the local filesystem. If you plan to run Sqoop from multiple nodes, you have to install the connector on all of those nodes. To be clear, you do not have to install the connector on all nodes in your cluster, as Sqoop will automatically propagate the appropriate JARs as needed throughout your cluster.

In addition to installing the connector JARs on the local filesystem, you also need to register them with Sqoop. First, create a directory named manager.d in the Sqoop configuration directory (if it does not exist already). The configuration directory might be in a different location, depending on how you've installed Sqoop. With packages, it's usually in the /etc/sqoop directory, and with tarballs, it's usually in the conf/ directory. Then, inside the manager.d directory, you need to create a file (naming it after the connector is a recommended best practice) that contains the following line:

connector.fully.qualified.class.name=/full/path/to/the/jar
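The registration steps above can be sketched as a short shell helper. Treat it as an illustration rather than an official tool: the function name register_connector, the class name, and the JAR path in the usage comment are all hypothetical placeholders, so substitute the values from your connector's documentation.

```shell
# Register a connector with Sqoop by writing a manager.d entry.
# Usage: register_connector CONF_DIR ENTRY_NAME CLASS_NAME JAR_PATH
register_connector() {
  conf_dir=$1   # Sqoop configuration directory (e.g. conf/ or /etc/sqoop)
  entry=$2      # file name for the entry, ideally named after the connector
  class=$3      # fully qualified class name supplied by the connector
  jar=$4        # full path to the connector JAR on the local filesystem
  mkdir -p "$conf_dir/manager.d" || return 1
  printf '%s=%s\n' "$class" "$jar" > "$conf_dir/manager.d/$entry"
}

# Example with hypothetical names:
# register_connector conf example-connector \
#   com.example.sqoop.ExampleManagerFactory /opt/connectors/example-connector.jar
```

The helper overwrites an existing entry of the same name, which keeps each file down to the single line Sqoop expects.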

In addition to the built-in connectors, there are many specialized connectors available for download. Some of them, such as OraOop and the Cloudera Connector for Teradata, are further described in this book. More advanced users can develop their own connectors by following the guidelines listed in the Sqoop Developer's Guide.