As big data engineer, you must know the apache hive best practices. As you know Apache Hive is not an RDBMS, but it pretends to be one most of the time. It has tables, it runs SQL, and it supports both JDBC and ODBC. Hive lets you use SQL on Hadoop, but tuning SQL on a distributed system is different. Apache Hive doesn’t run queries the way an RDBMS does. Here are few the list of best practices
Best Practice Tip 1: Don’t Use Map Reduce
Apache MapReduce It is slow on its own, and it’s really slow under Hive. Though Apache Hive builds and writes a very efficient MapReduce program, after all, it is MapReduce. If you’re on Hortonwork’s distribution, you can throw set hive.execution.engine=tez at the top of a script. On Cloudera, use Impala. If these are not your choice, use Apache Spark
Best Practice Tip 2: Don’t do a join on a subquery
You’re better off creating a temporary table, then joining against the temp table instead of asking Hive to be smart about how it handles subqueries. Meaning doesn’t do this:
select a.* from tbl1 a inner join ( select ... from somethingelse union b select ... from anotherthing c) d on a.key1 = d.key1 and a.key2 = b.key2 where a.condition=1
Instead, do this:
create var_temp as select ... from somethingelse b union select ... from anotherthing c and then select a.* from something a inner join from var_temp b where a.key1=b.key1 and a.key2=b.key2 where a.condition=1
It really shouldn’t be tons faster at this point in Hive’s evolution, but it is, generally.
Best Practice Tip 3: Use hashes for column comparisons
If you’re comparing the same 10 fields in every query, consider using hash() and comparing the sums. These are sometimes so useful you might shove them in an output table. Note that the hash in Hive 0.12 is a low resolution, but better hashes are available in 0.13.
Best Practice Tip 4: Partitioning
If you have this one item in many places where clauses like a date (but ideally not a range) or a location repeat, you might have your partition key! Partitions basically mean “split this into its own directory,” which means instead of looking at a big file, Hive looks at the one file because you have it in your join/where clause saying you’re only looking at location=’NC’, which is a small subset of your data. Also, unlike with column values, you can push partitions in your LOAD DATA statements. However, remember that HDFS does not love small files.
Best Practice Tip 5: If you can, put the largest table last
Part Of Optimizing The Queries In Hive, What Should Be The Order Of Table Size In A Join Query. In a join query, the smallest table to be taken in the first position and largest table should be taken in the last position.
Best Practice Tip 6: Consider MapJoin optimizations
If you do an explain on your query, you may find that recent versions of Hive are smart enough to apply the optimization automatically. But you may need to tweak them.
Best Practice Tip 6: Enable statistics
Hive does somewhat boneheaded things with joins unless statistics are enabled. You may also want to use query hints in Impala.
Best Practice Tip 7: Check your container size
You may need to increase your container size for Impala or Tez. Also, the “recommended” sizes may not apply to your system if you have larger node sizes. You might want to make sure your YARN queue and general YARN memory are appropriate. You might also want to peg it to something that isn’t the default queue all the peasants use.
Best Practice Tip 8: Don’t use structs in a join
We all have to admit our native-brain SQL syntax is about SQL-92 era, so I don’t tend to use structs anyhow. But if you’re doing something super-repetitive like ON clauses for compound PKs, structs are handy. Unfortunately, Hive chokes on them — particularly in the ON clause. Of course, it doesn’t do so at smaller data sets and yields no errors much of the time. In Tez, you get a fun vector error. This limitation isn’t documented anywhere that I know of. Consider this a fun way to get to know the innards of your execution engine!
Best Practice Tip 9: Try turning vectorization on and off
Add following on the top of your scripts.
set hive.vectorized.execution.enabled = true set hive.vectorized.execution.reduce.enabled = true
Try it with them on and off because vectorization seems problematic in recent versions of Hive
Best Practice Tip 10: Use ORC (Better) or Parquet
Use Parquet or ORC, but don’t convert to them for sport
That is, use Parquet or ORC as opposed to, say, TEXTFILE. However, if you have text data coming in and are massaging it into something slightly more structured, do the conversion to the target tables. If your system can not LOAD DATA from a text file into an ORC then do the initial load into a text.
When you create other tables against which you’ll ultimately run most of your analysis, do your ORCing there because converting to ORC or Parquet takes time and isn’t worth it as step one in your ETL process. If you have simple flat files coming in and aren’t doing any tweaking, then you’re stuck loading into a temporary table and doing a select create into an ORC or Parquet.
Best Practice Tip 2: Don’t do string matching in SQL
Importantly in Apache Hive. If you stick a like string match where a clause should be, you’ll generate a cross-product warning. If you have a query that runs in seconds, with string matching it will take minutes. Your best alternative is to use one of many tools that allow you to add search to Hadoop. Look at Elasticsearch’s Hive integration or Lucidwork’s integration for Solr. Also, there’s Cloudera Search. RDBMSes were never good at this, but Hive is worse.