Apache Hive Best Practices

As a big data engineer, you must know the Apache Hive best practices. As you know, Apache Hive is not an RDBMS, but it pretends to be one most of the time: it has tables, it runs SQL, and it supports both JDBC and ODBC. Hive lets you use SQL on Hadoop, but tuning SQL on a distributed system is different, and Apache Hive doesn’t run queries the way an RDBMS does. Here is a list of best practices.

Best Practice Tip 1: Don’t Use MapReduce

Apache MapReduce is slow on its own, and it’s really slow under Hive. Though Apache Hive builds and writes a very efficient MapReduce program, it is still MapReduce. If you’re on Hortonworks’ distribution, you can throw set hive.execution.engine=tez; at the top of a script. On Cloudera, use Impala. If neither is an option, use Apache Spark.
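A minimal sketch of such a script preamble; the table name is a placeholder, and which engines are available depends on how your cluster is configured:

```sql
-- Pick a faster execution engine before running anything heavy.
set hive.execution.engine=tez;      -- Hortonworks / any Hive with Tez installed
-- set hive.execution.engine=spark; -- if Hive-on-Spark is configured instead

-- my_table is a hypothetical table used only for illustration.
select count(*) from my_table;
```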

Best Practice Tip 2: Don’t do a join on a subquery

You’re better off creating a temporary table, then joining against the temp table, instead of asking Hive to be smart about how it handles subqueries. In other words, don’t do this:

select a.* from tbl1 a
inner join (
  select ... from somethingelse
  union
  select ... from anotherthing
) d
  on a.key1 = d.key1
  and a.key2 = d.key2
where a.condition = 1

Instead, do this:

create temporary table var_temp as
  select ... from somethingelse
  union
  select ... from anotherthing;

select a.* from tbl1 a
inner join var_temp b
  on a.key1 = b.key1
  and a.key2 = b.key2
where a.condition = 1;

It shouldn’t be much faster at this point in Hive’s evolution, but it generally is.

Best Practice Tip 3: Use hashes for column comparisons

If you’re comparing the same 10 fields in every query, consider using hash() and comparing the results. These are sometimes so useful you might shove them in an output table. Note that the hash in Hive 0.12 is low-resolution, but better hashes are available in 0.13.
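A sketch of the idea, assuming two hypothetical tables t1 and t2 that share the same columns; one hash comparison replaces a long list of equality predicates:

```sql
-- Find rows whose key matches but whose compared fields differ,
-- using one hash() call per side instead of N column comparisons.
select t1.id
from t1
join t2 on t1.id = t2.id
where hash(t1.col1, t1.col2, t1.col3) <> hash(t2.col1, t2.col2, t2.col3);
```

Because hash() is low-resolution in older Hive versions, treat a matching hash as "probably equal" and a differing hash as "definitely different."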

Best Practice Tip 4: Partitioning

If one item repeats across many WHERE clauses, like a date (but ideally not a range) or a location, you might have your partition key! Partitions basically mean “split this into its own directory.” Instead of scanning one big file, Hive reads only the files for the partition named in your join/WHERE clause: if you say location=’NC’, a small subset of your data, only that directory is touched. Also, unlike with column values, you can push partition values in your LOAD DATA statements. However, remember that HDFS does not love small files.
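A sketch under assumed table and path names (sales, /staging/nc_sales are placeholders):

```sql
-- Partition by location so a filter on location prunes to one directory.
create table sales (id bigint, amount double)
partitioned by (location string)
stored as orc;

-- The partition value is pushed in LOAD DATA, not read from the rows:
load data inpath '/staging/nc_sales' into table sales partition (location='NC');

-- Reads only the location=NC directory, not the whole table.
select sum(amount) from sales where location = 'NC';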

Best Practice Tip 5: If you can, put the largest table last

In a join query, put the smallest table in the first position and the largest table in the last position. Hive buffers the rows of every table except the last one in the join and streams the last table, so the largest table belongs at the end.
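When you can’t reorder the query itself, Hive’s STREAMTABLE hint tells it which table to stream. A sketch with placeholder table names:

```sql
-- Buffer small_dim in memory, stream big_fact (the large table).
select /*+ STREAMTABLE(f) */ f.id, d.name
from small_dim d
join big_fact f on d.id = f.dim_id;
```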

Best Practice Tip 6: Consider MapJoin optimizations

If you do an explain on your query, you may find that recent versions of Hive are smart enough to apply the optimization automatically. But you may need to tweak them.
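The usual knobs look something like this; the threshold value is illustrative, and the table names are placeholders:

```sql
-- Let Hive convert joins to map joins when one side is small enough.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=10000000; -- small-table threshold, bytes

-- Or force it with the older hint style:
select /*+ MAPJOIN(d) */ f.id, d.name
from big_fact f
join small_dim d on f.dim_id = d.id;
```

Run EXPLAIN before and after to confirm the plan actually changed to a map join.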

Best Practice Tip 7: Enable statistics

Hive does somewhat boneheaded things with joins unless statistics are enabled. You may also want to use query hints in Impala.
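A sketch of enabling and backfilling statistics; my_table is a placeholder:

```sql
-- Let the optimizer use statistics, and gather them on insert.
set hive.compute.query.using.stats=true;
set hive.stats.autogather=true;

-- Backfill stats on an existing table:
analyze table my_table compute statistics;
analyze table my_table compute statistics for columns;
```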

Best Practice Tip 8: Check your container size

You may need to increase your container size for Impala or Tez. Also, the “recommended” sizes may not apply to your system if you have larger node sizes. Make sure your YARN queue and general YARN memory are appropriate. You might also want to peg it to something that isn’t the default queue all the peasants use.
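An example of the Tez-side settings; the numbers are illustrative only and must be tuned to your node sizes, and the queue name is hypothetical:

```sql
set hive.tez.container.size=4096;   -- MB; should fit within your YARN container limits
set hive.tez.java.opts=-Xmx3276m;   -- heap, roughly 80% of the container
set tez.queue.name=etl;             -- a dedicated queue, not 'default'
```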

Best Practice Tip 9: Don’t use structs in a join

We all have to admit our native-brain SQL syntax is about SQL-92 era, so I don’t tend to use structs anyhow. But if you’re doing something super-repetitive like ON clauses for compound PKs, structs are handy. Unfortunately, Hive chokes on them, particularly in the ON clause. Of course, it doesn’t do so at smaller data sets and yields no errors much of the time. In Tez, you get a fun vector error. This limitation isn’t documented anywhere that I know of. Consider this a fun way to get to know the innards of your execution engine!
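The workaround is simply to spell the compound key out. A sketch with placeholder tables t_a and t_b:

```sql
-- Instead of comparing structs in the ON clause:
--   on named_struct('k1', a.k1, 'k2', a.k2) = named_struct('k1', b.k1, 'k2', b.k2)
-- write out each key column, which every engine handles:
select a.*
from t_a a
join t_b b
  on a.k1 = b.k1
  and a.k2 = b.k2;
```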

Best Practice Tip 10: Try turning vectorization on and off

Add the following at the top of your scripts:

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

Try your queries with these on and off, because vectorization seems problematic in recent versions of Hive.

Best Practice Tip 11: Use ORC (Better) or Parquet

Use Parquet or ORC, but don’t convert to them for sport.

That is, use Parquet or ORC as opposed to, say, TEXTFILE. However, if you have text data coming in and are massaging it into something slightly more structured, do the conversion to the target tables. If your system cannot LOAD DATA from a text file directly into an ORC table, do the initial load into a text table.

When you create the other tables against which you’ll ultimately run most of your analysis, do your ORCing there, because converting to ORC or Parquet takes time and isn’t worth it as step one in your ETL process. If you have simple flat files coming in and aren’t doing any tweaking, then you’re stuck loading into a temporary table and doing a CREATE TABLE ... AS SELECT into an ORC or Parquet table.
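A sketch of that two-step pattern; the table names, schema, and path are placeholders:

```sql
-- Step 1: land the raw text as-is.
create table raw_events (line string)
row format delimited
stored as textfile;

load data inpath '/staging/events.txt' into table raw_events;

-- Step 2: one-time conversion; analysis queries hit the ORC table.
create table events_orc stored as orc as
select line from raw_events;
```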

Best Practice Tip 12: Don’t do string matching in SQL

This matters especially in Apache Hive. If you stick a LIKE string match where a join clause should be, you’ll generate a cross-product warning, and a query that runs in seconds will take minutes. Your best alternative is one of the many tools that add search to Hadoop: look at Elasticsearch’s Hive integration or Lucidworks’ integration for Solr. Also, there’s Cloudera Search. RDBMSes were never good at this, but Hive is worse.
