Introduction
In the ancient ruins of a lost civilization, a group of modern-day explorers stumbled upon a hidden temple dedicated to the god of knowledge and wisdom. The temple’s walls were adorned with intricate hieroglyphs, holding the secrets of an advanced data processing system used by the ancient priests.
One of the explorers, a skilled Hadoop engineer, took on the role of a High Priest, deciphering the hieroglyphs and unlocking the temple’s mysteries. The goal was to reconstruct the ancient data processing system, leveraging the power of Hadoop’s distributed cache to efficiently process large datasets, just as the ancient priests did centuries ago.
Prepare the Dataset and Code
In this step, we’ll set up the necessary files and code to simulate the ancient data processing system.
First, switch to the hadoop user and move to its home directory (su - starts a login shell, which already places you there):
su - hadoop
Create a new directory called distributed-cache-lab and navigate to it:
mkdir distributed-cache-lab
cd distributed-cache-lab
Next, create a text file named ancient-texts.txt with the following content:
The wisdom of the ages is eternal.
Knowledge is the path to enlightenment.
Embrace the mysteries of the universe.
This file will represent the ancient texts we want to process.
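You can create the file with any text editor; one way to do it directly from the shell is a here-document:

cat > ancient-texts.txt << 'EOF'
The wisdom of the ages is eternal.
Knowledge is the path to enlightenment.
Embrace the mysteries of the universe.
EOF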
Now, create a Java file named AncientTextAnalyzer.java with the following code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AncientTextAnalyzer {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: AncientTextAnalyzer <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Ancient Text Analyzer");
        job.setJarByClass(AncientTextAnalyzer.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This code is a simple MapReduce program that counts the occurrences of each word in the input file. We’ll use this code to demonstrate the usage of the distributed cache in Hadoop.
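Note that the mapper above only counts words from the job input; it never opens the copy of ancient-texts.txt shipped through the distributed cache. As a hedged sketch (not part of the lab's required code), this is roughly how a setup() method inside TokenizerMapper could read a file passed with -files, which Hadoop symlinks into each task's working directory; the BufferedReader and FileReader imports are already present in the file:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files passed with -files are symlinked into the task's working directory,
        // so they can be opened by their plain file name.
        try (BufferedReader reader = new BufferedReader(new FileReader("ancient-texts.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Illustrative only: load reference data (e.g., a stop-word list)
                // into an in-memory structure for use in map().
            }
        }
    }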
Compile and Package the Code
In this step, we’ll compile the Java code and create a JAR file for deployment.
First, make sure the Hadoop client JARs are on your classpath; the compile command below uses hadoop-common and hadoop-mapreduce-client-core from the local Hadoop installation, but you can also download them from the Apache Hadoop website.
Compile the AncientTextAnalyzer.java file:
javac -source 8 -target 8 -classpath "/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" AncientTextAnalyzer.java
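The JAR paths above assume Hadoop 3.3.6 installed under /home/hadoop/hadoop. If your version or installation path differs, a common alternative is to let the hadoop command supply the classpath:

javac -source 8 -target 8 -classpath "$(hadoop classpath)" AncientTextAnalyzer.java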
Now, create a JAR file with the compiled class files:
jar -cvf ancient-text-analyzer.jar AncientTextAnalyzer*.class
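If you want to confirm what went into the archive, list its contents; you should see AncientTextAnalyzer.class along with the TokenizerMapper and IntSumReducer inner classes:

jar -tf ancient-text-analyzer.jar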
Run the MapReduce Job With Distributed Cache
In this step, we’ll run the MapReduce job and leverage the distributed cache to provide the input file to all nodes in the cluster.
First, copy the input file ancient-texts.txt to the Hadoop Distributed File System (HDFS):
hadoop fs -mkdir /input
hadoop fs -put ancient-texts.txt /input/ancient-texts.txt
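You can verify the upload before launching the job:

hadoop fs -ls /input
hadoop fs -cat /input/ancient-texts.txt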
Next, run the MapReduce job with the distributed cache option:
hadoop jar ancient-text-analyzer.jar AncientTextAnalyzer -files ancient-texts.txt /input/ancient-texts.txt /output
This command runs the AncientTextAnalyzer MapReduce job and uses the -files option to distribute ancient-texts.txt to all nodes in the cluster; the option is recognized because the driver parses generic options with GenericOptionsParser. The input path is /input/ancient-texts.txt, and the output path is /output.
After the job completes, you can check the output:
hadoop fs -cat /output/part-r-00000
You should see the word count output, similar to:
Embrace 1
Knowledge 1
The 1
ages 1
enlightenment. 1
eternal. 1
is 2
mysteries 1
of 2
path 1
the 4
to 1
universe. 1
wisdom 1
Summary
In this lab, we explored the power of Hadoop’s distributed cache feature by implementing an ancient text analysis system. By leveraging the distributed cache, we were able to efficiently distribute the input file to all nodes in the cluster, enabling parallel processing and reducing the overhead of transferring data across the network.
Through this hands-on experience, we gained a deeper understanding of how Hadoop's distributed cache can optimize data processing in distributed computing environments. By caching frequently accessed data across the cluster, we can significantly improve performance and reduce network traffic, especially when dealing with large datasets or complex computations.
Additionally, this lab provided practical experience in working with Hadoop MapReduce, Java programming, and executing jobs on a Hadoop cluster. The combination of theoretical knowledge and hands-on practice builds proficiency in big data processing and prepares us for more advanced Hadoop-related challenges.
🚀 Practice Now: Ancient Wisdom of Distributed Cache
Want to Learn More?
- 🌳 Learn the latest Hadoop Skill Trees
- 📖 Read More Hadoop Tutorials
- 💬 Join our Discord or tweet us @WeAreLabEx