Understanding Data Parallelism in MapReduce

To understand the goals of MapReduce, it helps to know which scenarios it is optimized for. The MapReduce programming model was created for processing data that exhibits “data parallelism”: the ability to compute multiple independent operations in any order (King). In parallel processing, commutative operations are those whose order of execution does not affect the result. Commutativity can apply to complex operations and even whole processes, as long as they don’t manipulate the same memory. For example, as long as foo(a) and bar(b) don’t manipulate the same variable, they can run in parallel in different threads. The write operation, however, must wait for both foo() and bar() to complete. The figure below illustrates the dependency graph between foo(a), bar(b) and the write command.

Figure 1 – Parallelism Dependency Graph
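
The dependency graph in Figure 1 can be sketched in a few lines of Python. This is only an illustration: the bodies of foo, bar and write are hypothetical placeholders (the article does not define them), and a thread pool from the standard library stands in for whatever threading mechanism a real program would use.

```python
from concurrent.futures import ThreadPoolExecutor


def foo(a):
    # Hypothetical independent task: reads only its own argument.
    return a * 2


def bar(b):
    # Hypothetical independent task: reads only its own argument.
    return b + 10


def write(x, y):
    # Depends on both results, so it can only run after foo and bar finish.
    return f"foo={x}, bar={y}"


def run(a, b):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # foo and bar share no state, so they may execute in either order.
        future_foo = pool.submit(foo, a)
        future_bar = pool.submit(bar, b)
        # result() blocks until each task is done: this is the
        # synchronization point that the write step has to wait on.
        return write(future_foo.result(), future_bar.result())


if __name__ == "__main__":
    print(run(3, 4))
```

The only ordering constraint in the sketch is the pair of result() calls before write; nothing forces foo to run before bar or the other way around.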

One of the goals of parallelism is identifying the logical “tasks”, or units of work, which can be run in parallel as threads. Parallel programming techniques require developers to implement dependency graphs, which can become much more complex as the amount of shared information and the number of sequenced operations increase. Techniques such as locks and barriers, critical sections, semaphores, monitors, RPC and rendezvous have been proposed to aid in the design of multithreaded and distributed systems. In parallel and distributed processing, intelligent task design attempts to eliminate as many synchronization points as possible, but some will still be required. Patterns such as “Master/Worker” and “Producer/Consumer” give developers established ways to implement parallel thread processing.
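
As a rough illustration of the “Producer/Consumer” pattern, the sketch below uses Python's standard queue.Queue, which handles the locking internally. The squaring step, the SENTINEL marker and the function names are assumptions made for the example, not part of any particular framework.

```python
import queue
import threading

# Hypothetical end-of-stream marker; a unique object cannot collide with data.
SENTINEL = object()


def producer(items, q):
    # Hands each work item to the consumer through the shared queue.
    for item in items:
        q.put(item)
    q.put(SENTINEL)  # signal that no more items will arrive


def consumer(q, results):
    # Pulls items until the sentinel arrives; the work here is a placeholder.
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * item)


def run_pipeline(items):
    q = queue.Queue(maxsize=4)  # bounded, so the producer cannot run far ahead
    results = []
    threads = [
        threading.Thread(target=producer, args=(items, q)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # synchronization point: wait for both sides to finish
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

The queue is the only shared structure, so the two threads need no explicit locks of their own; the remaining synchronization points are the queue operations and the final join calls.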

MapReduce provides a programming model which abstracts many of the aforementioned complexities of parallel processing away from the software engineer. The MapReduce implementation performs much of the “wiring” associated with parallel processing, leaving the developer to implement relatively simple methods. The use of MapReduce does come with some constraints, making it less appropriate for some tasks. MapReduce models are optimized for tasks where a large number of key/value input lists must be processed largely independently of one another. The map() method must be commutative, meaning that its calls can be executed in any order without changing the result, so that the MapReduce implementation can make use of parallelization. MapReduce enables parallelization across hundreds or even thousands of CPUs.
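
To make the division of labour concrete, here is a minimal single-process sketch of the map and reduce steps for a word count. It is not the Hadoop API: map_fn, reduce_fn and map_reduce are hypothetical names, and a thread pool stands in for the cluster workers, so it shows the structure of the model rather than its performance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(line):
    # Emits (key, value) pairs; each call depends only on its own input line.
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(key, values):
    # Combines all values that were emitted for one key.
    return key, sum(values)


def map_reduce(lines, workers=4):
    # Map phase: the calls are independent, so they can run in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, lines))

    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce phase: one call per key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(map_reduce(["the quick fox", "the lazy dog", "The end"]))
```

The developer writes only map_fn and reduce_fn; the scheduling, grouping and collection in map_reduce correspond to the “wiring” that a real MapReduce implementation supplies. Feeding the lines in a different order yields the same counts, which is the property the model relies on.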

Last updated: 30 March 2023
About Author
Ravindra Savaram

Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
