Apache Hive vectorization was introduced to improve query performance. By default, the Apache Hive query execution engine processes one row of a table at a time. That single row of data passes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, … read the rest
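The row-at-a-time versus batched model can be sketched in plain Python. This is a conceptual illustration only, not Hive's actual engine; the batch size of 1024 mirrors Hive's default vectorized row batch size:

```python
# Conceptual sketch of row-at-a-time vs. vectorized (batched) execution.
# Plain Python for illustration only -- not Hive's implementation.

rows = list(range(10_000))

def row_at_a_time(rows):
    # Each row passes through every operator before the next row starts.
    out = []
    for r in rows:
        r = r * 2               # operator 1: a projection
        if r % 3 == 0:          # operator 2: a filter
            out.append(r)
    return out

def vectorized(rows, batch_size=1024):
    # Operators work on whole batches (Hive's default batch is 1024 rows),
    # amortizing per-row overhead into tight inner loops.
    out = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        batch = [r * 2 for r in batch]              # operator 1 over the batch
        out.extend(r for r in batch if r % 3 == 0)  # operator 2 over the batch
    return out

assert row_at_a_time(rows) == vectorized(rows)  # same results, different execution model
```

In real Hive, vectorized execution is switched on with `SET hive.vectorized.execution.enabled = true;`.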
Apache Hive Release 3.1.1 is the version compatible with Hadoop 3.x.y; it delivers four bug fixes and one new feature.
Apache Hive Release 3.1.1 Release Note
The following bug fixes are part of this release:
- [HIVE-18767] – Some alterPartitions invocations throw ‘NumberFormatException: null’
- [HIVE-18778] – Needs to capture input/output entities in explain
- [HIVE-20906] –
The Apache Hive Cheat Sheet is a summary of functions and syntax, intended as a quick reference for big data engineers and developers. It is divided into 5 parts.
Apache Hive Cheat Sheet – Query Syntax
Apache Hive Cheat Sheet – Metadata
Apache Hive Cheat Sheet – Query Compatibility
Apache Hive Cheat Sheet – Command Line
Apache Hive Cheat Sheet – Shell & CLI… read the rest
As a big data engineer, you must know the Apache Hive best practices. As you know, Apache Hive is not an RDBMS, but it pretends to be one most of the time. It has tables, it runs SQL, and it supports both JDBC and ODBC. Hive lets you use SQL on Hadoop, but tuning SQL on a distributed system is different. … read the rest
Apache Hive development has shifted from the original Hive server (HiveServer1) to the new server (HiveServer2), and hence users and developers need to move to the new access tool. However, there’s more to this process than simply switching the executable name from “hive” to “beeline”. The Apache Hive CLI was a heavyweight command-line tool that accepted commands and ran … read the rest
Immutability and the RDD interface in Spark are key concepts and must be understood in detail. Spark defines an RDD interface with the properties that each type of RDD must implement. These properties include the RDD’s dependencies and information about data locality that are needed for the execution engine to compute that RDD. Since RDDs are statically typed and immutable, calling a transformation on … read the rest
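The immutability idea can be sketched in plain Python. The `TinyRDD` class below is a hypothetical illustration, not Spark's API: a transformation never mutates the collection, it returns a new object that remembers its parent, the way an RDD's lineage records its dependencies:

```python
# Conceptual sketch of RDD-style immutability (plain Python, not Spark's API).

class TinyRDD:
    def __init__(self, data, parent=None):
        self._data = tuple(data)   # immutable storage
        self.parent = parent       # dependency pointer, like RDD lineage

    def map(self, fn):
        # Transformation: builds a NEW TinyRDD; self is left untouched.
        return TinyRDD((fn(x) for x in self._data), parent=self)

    def collect(self):
        # Action: materialize the results on the "driver".
        return list(self._data)

base = TinyRDD([1, 2, 3])
doubled = base.map(lambda x: x * 2)

assert base.collect() == [1, 2, 3]      # original is unchanged
assert doubled.collect() == [2, 4, 6]   # transformation produced a new collection
assert doubled.parent is base           # lineage records the dependency
```

In real Spark the same shape appears as `rdd.map(f)` returning a new RDD while the source RDD stays untouched.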
Spark In-Memory Persistence and Memory Management must be understood by engineering teams. Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark’s use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory. That way, … read the rest
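Why persistence pays off for repeated computation can be sketched in plain Python. This is a conceptual illustration, not Spark's API; the recomputation counter stands in for re-reading and re-transforming the data from disk:

```python
# Conceptual sketch of in-memory persistence (plain Python, not Spark's API).
# Without caching, every "action" recomputes the whole pipeline; with a
# cached intermediate result, later passes reuse the in-memory copy.

compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1          # count how often the pipeline is recomputed
    return [x * x for x in data]

data = list(range(5))

# No persistence: two "actions" trigger two full recomputations.
total = sum(expensive_transform(data))
count = len(expensive_transform(data))
assert compute_calls == 2

# With "persistence": compute once, keep the result in memory, reuse it.
compute_calls = 0
cached = expensive_transform(data)   # analogous to persisting then running a first action
total = sum(cached)
count = len(cached)
assert compute_calls == 1
```

In real Spark the equivalent is calling `rdd.persist()` or `rdd.cache()` before running multiple actions over the same RDD.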
The Spark model of parallel computing, often discussed in terms of the RDD, is an important API. The model is built on RDDs internally, and RDDs are part of the Spark Core library.
Spark allows users to write a program for the driver (or master node) on a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs—immutable, distributed collections of … read the rest
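The driver/executor shape of that model can be sketched in plain Python with a thread pool. This is a conceptual illustration, not Spark: the "driver" splits a dataset into partitions, farms each partition out to a worker, and merges the partial results:

```python
# Conceptual sketch of driver-coordinated parallel computation
# (plain Python with threads -- not Spark's API).

from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    # Driver side: split the data into n roughly equal partitions.
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def work_on_partition(part):
    # "Executor" side: a local computation over one partition.
    return sum(x * x for x in part)

data = list(range(100))
parts = partition(data, 4)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work_on_partition, parts))

result = sum(partials)                      # driver merges the partial results
assert result == sum(x * x for x in data)
```

In real Spark the analogous steps are `sc.parallelize(data)`, a transformation such as `map`, and an action such as `reduce` or `collect`.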
Why should you care about how Apache Spark works? To get the most out of Spark, it is important to understand some of the principles used in its design and, at a cursory level, how Spark programs are executed. This article introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered … read the rest
In this article, Apache Sqoop Introduction, we will primarily discuss why this tool exists. Apache Sqoop is part of the Hadoop ecosystem rather than Hadoop core.
Sqoop is the big data tool we use for transferring data between Hadoop and relational database servers. The name Sqoop comes from “SQL-to-Hadoop”.
In addition, there are … read the rest