Hadoop 3.0 GPU : Hadoop is still behind high performance capacity due to CPUs’ limited parallelism, though. GPU (Graphical Processing Unit) accelerated computing involves the use of a GPU together with a CPU to accelerate applications to data processing on GPU cluster toward higher efficiency. However, GPU cluster has low level data storage capacity
Leveraged Hadoop 3.0 GPU Computing
MapReduce programming model has few constraints and it looks like this
- The Hadoop MapReduce model is applied to large batch-oriented computations that are primarily concerned with time to job completion and not real-time computations
- All the intermediate output from each map and reduce stage is materialised to disk before it can be consumed by the next stage or produce output. This strategy is a simple and elegant checkpoint/restart fault tolerance mechanism that is essential for large-scale commodity machines with high-failure rates but comes with a performance price
- Each node within a MapReduce cluster is typically a 2 – 4 core CPU and the network link between nodes are typically 1Gb/s Ethernet links. No GPU co-processors are attached to the nodes and in order to get the processing speed-up hundreds or thousands of servers must be used. This leads to an initial substantial up-front investment required to build your own private large-scale MapReduce cluster plus high on-going power consumption costs
- Moving in-house MapReduce computations to an external MapReduce cloud service, whilst eliminating initial upfront hardware build and operational costs, may still result in substantial usage fees if hundreds or thousands of machines are required. Furthermore there is still the problem of moving large data sets to the cloud if your MapReduce jobs consume hundreds of terabytes of data. This is a data transfer issue that is sometimes overlooked
There are many approaches to tackle those Hadoop 3.0 GPU constraints,e.g. modified MapReduce for real-time analysis, graph and matrix processing, but maybe the fast involvements in GPU Computing would give us some insights.
GPU (Graphics Processing Unit), was designed originally for graphics processing. Now, because of its highly parallel feature and has developed towards a more general purpose processor, GPGPU(General Purpose GPU), for scientific and engineering applications.
Hadoop 3.0 GPU : Challenges Ahead & Approaches
Many past research works have been done to explore and implement MapReduce framework on GPUs, such as Mars, StreamMR, MapCG, GPMR, etc. They have encountered several common challenges and come with their own approaches, but we may wonder whether their concerns and assumptions are still valid if we move the implementations of MapReduce to the GPU cluster cloud?
The main challenges for implementation of MapReduce on GPU and possible approaches are listed below:
- The current GPU provides no lock support, which imposes a challenge for synchronization among large number of processors. Most of the MapReduce implementations provide atomic free or lock free scheme to tackle this issue. For Mars, it has a two-step scheme to prefix sum and count the intermediate results and then write results into the output array. Both StreamMR and MapCG argued that this scheme had a large overhead and inefficiency. StreamMR proposed opportunistic preprocessing to reduce unnecessary counting computation and implemented atomic-free hash tables to group intermediate results. MapCG also used hash table to group intermediate results and implemented a specialized memory allocator composed of global buffer with atomicAdd() operation provided by GPU to ensure parallelism. This challenge still stays with us when we implement in GPU cluster on Amazon EC2.
- It is difficult to ensure even workload allocation across GPU threads to maximize GPU’s utility. This challenge has been addressed by Mars, GPMR, etc. In order to achieve massive threads parallelism, Mars would initialize a large number of threads, and assign a certain number of key value pairs automatically to each thread while GPMR implemented concepts of chunk, partial reduction and accumulation for MapReduce.
- The commonly used GPU library, CUBLAS, is still at relatively low level and lacks the most conventional operations for MapReduce such as string processing and files manipulation. Many past papers have provide basic API, library and configurations for MapReduce.
- Handling datasets greater than the device memory is another concern of Mars and GPMR in the past, but will the GPU on Amazon EC2 relieve this concern? The experiment metrics for the past research could be modified in terms of size and scale. For example, the largest data size for StringMatch(SM) is 256MB and for matrix multiplication is 2048 * 2048, and this may not be suitable for GPU on Amazon EC2.
- Most of the implementations of MapReduce are for solo GPU, not GPU cluster. In Amazon EC2, there are 2 x NVIDIA Tesla ”Fermi” M2050 GPUs in the cluster. The synchronization challenge has been addressed by GPMR model.
- Co-prossessing GPU and CPU is another synchronization challenge. In MapCG, a new MapReduce framework was designed to have MapCG API to translate to CPU code and GPU cod at runtime. MapCG evaluated the performance for CPU+GPU co-processing was not significant better than same type of processors due to application types and overheads. We could explore how to reduce the overhead and have a classifier to assign applications or tasks to the most suitable processors.