Filtering Data using MapReduce, PIG & HIVE:
In this post, I will show how to filter the data in a file using MapReduce, Pig, and Hive. The sample data looks like the following. Let's filter out all the "Bees" rows in the file and get the id as output.
1,Bat
2,Bed
3,Bees
4,Beetles
5,Birch
6,Black
7,Bluegrass
8,Booklouse
9,Borers
10,Borer
11,Boxelder
12,Bristly
13,Brown
14,Budworms
15,Bumblebees
16,Butterflies
17,bugs
18,ticks
19,asparagus
20,bark
21,bean
Filtering data using MapReduce:
I started from the familiar WordCount program and modified the mapper and reducer to filter the data. It is usually easier to modify existing MapReduce code than to write it from scratch, and the final code looks like this:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterMREx {

    // Mapper: emits (id, line) only for rows whose second column is "Bees".
    static class FilterMapTask extends Mapper<Object, Text, DoubleWritable, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split(",");
            // Guard against malformed lines before indexing into the array.
            if (cols.length == 2 && cols[1].equals("Bees")) {
                DoubleWritable id = new DoubleWritable(Integer.parseInt(cols[0]));
                context.write(id, value);
            }
        }
    }

    // Reducer: writes only the id; the line itself is discarded via NullWritable.
    static class FilterReduceTask extends Reducer<DoubleWritable, Text, DoubleWritable, NullWritable> {
        @Override
        public void reduce(DoubleWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: FilterMREx <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "KEYVALUE-JOB");
        job.setJarByClass(FilterMREx.class);
        job.setMapperClass(FilterMapTask.class);
        job.setReducerClass(FilterReduceTask.class);
        // The map output value (Text) differs from the reduce output value
        // (NullWritable), so both pairs of types must be set explicitly.
        job.setMapOutputKeyClass(DoubleWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(DoubleWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
I exported FilterMREx.jar and placed it in the Hadoop_jar folder.
Then place the source file in HDFS using the put command.
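For example (the local and HDFS paths here are assumptions; adjust them to your environment):

hdfs dfs -mkdir -p /user/training/input
hdfs dfs -put /home/training/input/FilterMRExSampleData.csv /user/training/input/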
Running FilterMREx.jar using Hadoop:
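A command along these lines runs the job (the jar location and HDFS paths are assumptions based on the folders above):

hadoop jar /home/training/Hadoop_jar/FilterMREx.jar FilterMREx /user/training/input/FilterMRExSampleData.csv /user/training/output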
Let's see the output:
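Assuming the output path above, the result can be viewed with cat. Because the id is emitted as a DoubleWritable, it prints with a decimal point:

hdfs dfs -cat /user/training/output/part-r-00000
3.0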
We got the required output: there is only one "Bees" row in the file, and MapReduce returned its id.
Filter data using PIG:
Note: Pig will consume whatever we pass to it, so it is our responsibility to take care of:
1. The input path
2. The input file format
3. The format in which the data should be stored
Connect to Pig in local mode and load FilterMRExSampleData.csv as shown below. When connecting in local mode, the input path should be a local folder, so I copied the data to /home/training/input.
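Starting the Grunt shell in local mode:

pig -x local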
Load and filter the data using the Pig commands below.
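A sketch of those four lines of Pig Latin (the relation names are my own; the input path matches the local folder above):

raw = LOAD '/home/training/input/FilterMRExSampleData.csv' USING PigStorage(',') AS (id:int, name:chararray);
bees = FILTER raw BY name == 'Bees';
ids = FOREACH bees GENERATE id;
DUMP ids;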
We got the required output with four lines of Pig code.
Storing the filtered data to a local path:
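A STORE statement along these lines writes the result out (the output path is an assumption, and the target folder must not already exist):

STORE ids INTO '/home/training/output/bees' USING PigStorage(',');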
Filtering Data using HIVE:
As the data is already available in HDFS, I am creating an external Hive table on top of it and filtering with a simple query, as sketched below.
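A minimal sketch of the Hive side (the table name and the HDFS location are assumptions; point LOCATION at the directory holding the file):

CREATE EXTERNAL TABLE filter_ex (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/training/input';

SELECT id FROM filter_ex WHERE name = 'Bees';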