Foro Formación Hadoop

Apache Tika: Procesamiento de ficheros con MapReduce

 
by Fernando Agudo - Monday, 23 February 2015, 12:50
 

Apache Tika

The Apache Tika toolkit is a free, open source project used to read and extract text and other metadata from various types of digital documents, such as Word documents, PDF files, or files in rich text format. For a basic example of how the API works, create an instance of the Tika class, open a stream, and parse it with that instance (Listing 1).

Listing 1. Example of Tika
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
...
private String read() throws IOException, TikaException
{
	// Let Tika detect the document type and extract its text content
	Tika tika = new Tika();
	try (FileInputStream stream = new FileInputStream("/path_to_input_file.PDF"))
	{
		return tika.parseToString(stream);
	}
}

If your document format is not supported by Tika (Outlook PST files are not supported, for example), you can substitute a different Java library in the previous code listing. Tika can also extract metadata, but that is outside the scope of this article; it is relatively simple to add that capability to the code.
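As a rough illustration of what that addition could look like, the sketch below uses Tika's AutoDetectParser together with the Metadata and BodyContentHandler classes; the class name TikaMetadataExample and the input path are placeholders, not code from the article.

Listing 2. Extracting text and metadata with Tika (illustrative sketch)
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaMetadataExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 removes the default output limit
        Metadata metadata = new Metadata();

        try (InputStream stream = new FileInputStream("/path_to_input_file.PDF")) {
            // The parser fills the handler with the text and the Metadata object with properties
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        System.out.println(handler.toString());            // extracted text
        for (String name : metadata.names()) {              // extracted metadata (author, title, ...)
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}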

Jaql

Jaql is primarily a query language for JSON, but it supports more than just JSON. It enables you to process structured and non-traditional data. Using Jaql, you can select, join, group, and filter data stored in HDFS in a manner similar to a blend of Pig and Hive. The Jaql query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into low-level queries consisting of Java MapReduce jobs. This article demonstrates how to create a Jaql I/O adapter over Apache Tika to read various document formats, and to analyze and transform them all within this one language.

MapReduce classes used to analyze small files

Typically, MapReduce works on large files stored on HDFS. When writing to HDFS, files are broken into smaller pieces (blocks) according to the configuration of your Hadoop cluster. These blocks reside on this distributed file system. But what if you need to efficiently process a large number of small files (specifically, binary files such as PDF or RTF files) using Hadoop?

Several options are available. In many cases, you can merge the small files into a big file by creating a sequence file, which is the native storage format for Hadoop. However, creating sequence files in a single thread can be a bottleneck, and you risk losing the original files. This article offers a different approach: customize a few of the Java classes used in MapReduce. With the traditional classes, each individual file needs a dedicated mapper, which is inefficient when there are many small files.
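To make the sequence-file option concrete, the following sketch packs a directory of small local files into a single sequence file, using the file name as the key and the raw bytes as the value. The class name, input directory, and output path are illustrative, and the single-threaded loop is exactly the potential bottleneck mentioned above.

Listing 3. Packing small files into a sequence file (illustrative sketch)
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path("/tmp/smallfiles.seq");        // illustrative output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            // Append one record per small file: key = file name, value = raw bytes
            for (File f : new File("/path_to_small_files").listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        }
    }
}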

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

As an alternative, you can process small files in Hadoop by creating a set of custom classes that tell the job the files are small enough to be treated differently from the traditional approach.

At the mapping stage, logical containers called splits are defined, and a map processing task takes place at each split. Use custom classes to define a fixed-sized split, which is filled with as many small files as it can accommodate. When the split is full, the job creates a new split and fills that one as well, until it's full. Then each split is assigned to one mapper.

MapReduce classes for reading files

Three main MapReduce Java classes are used to define splits and read data during a MapReduce job: InputSplit, InputFormat, and RecordReader.

When you transfer a file from a local file system to HDFS, it is converted to blocks of 128 MB. (This default value can be changed in InfoSphere BigInsights.) Consider a file big enough to consume 10 blocks. When you read that file from HDFS as the input for a MapReduce job, the same blocks are usually mapped, one by one, to splits. In this case, the file is divided into 10 splits (which means 10 map tasks) for processing. By default, the block size and the split size are equal, but the split size depends on the configuration settings the InputFormat uses when it creates the InputSplit instances.
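The sketch below shows one way to tune the split size from the job driver; the sizes are illustrative. For FileInputFormat, the effective split size is max(minSplitSize, min(maxSplitSize, blockSize)), so raising the minimum produces splits larger than a block, and lowering the maximum produces smaller ones.

Listing 4. Tuning the split size (illustrative sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size example");

        // With 128 MB blocks, a 256 MB minimum makes each split cover two blocks
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Alternatively, a 64 MB maximum would cut each 128 MB block into two splits
        // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}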

From a Java programming perspective, the class that holds the responsibility of this conversion is called an InputFormat, which is the main entry point into reading data from HDFS. From the blocks of the files, it creates a list of InputSplits. For each split, one mapper is created. Then each InputSplit is divided into records by using the RecordReader class. Each record represents a key-value pair.

FileInputFormat vs. CombineFileInputFormat

Before a MapReduce job is run, you can specify the InputFormat class to be used. An implementation of FileInputFormat requires you to create an instance of a RecordReader, and as mentioned previously, the RecordReader creates the key-value pairs for the mappers.

FileInputFormat is an abstract class that is the basis for a majority of the implementations of InputFormat. It contains the location of the input files and an implementation of how splits must be produced from these files. How the splits are converted into key-value pairs is defined in the subclasses. Some examples of its subclasses are TextInputFormat, KeyValueTextInputFormat, and CombineFileInputFormat.

Hadoop works more efficiently with large files (files that occupy more than 1 block). FileInputFormat converts each large file into splits, and each split is created in a way that contains part of a single file. As mentioned, one mapper is generated for each split. Figure 1 depicts how a file is treated using FileInputFormat and RecordReader in the mapping stage.

Figure 1. FileInputFormat with a large file


However, when the input files are smaller than the default block size, many splits (and therefore, many mappers) are created. This arrangement makes the job inefficient. Figure 2 shows how too many mappers are created when FileInputFormat is used for many small files.

Figure 2. FileInputFormat with many small files

(Image: each block becomes its own split, producing many splits and many mappers)

To avoid this situation, CombineFileInputFormat was introduced. This InputFormat works well with small files because it packs many of them into one split, so there are fewer mappers and each mapper has more data to process. Unlike the other subclasses of FileInputFormat, CombineFileInputFormat is an abstract class that requires additional code before it can be used (a sketch follows Figure 3). In addition to these changes, you must ensure that splitting of the individual input files is prevented. Figure 3 shows how CombineFileInputFormat treats the small files so that fewer mappers are created.

Figure 3. CombineFileInputFormat with many small files

(Image: fewer splits, each packed with several small files, with a 1:1 correlation of splits to mappers)
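A minimal sketch of those changes is shown below, assuming the newer org.apache.hadoop.mapreduce API; the class names SmallFilesInputFormat and WholeFileRecordReader are chosen here for illustration and are not from the article. The subclass overrides isSplitable() to keep every small file whole, caps the size of each combined split, and plugs a whole-file RecordReader into CombineFileRecordReader.

Listing 5. A CombineFileInputFormat for small files (illustrative sketch)
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class SmallFilesInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    public SmallFilesInputFormat() {
        // Fill each combined split with small files up to roughly one HDFS block
        setMaxSplitSize(128L * 1024 * 1024);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Prevent splitting: every small file must be read whole by a single mapper
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<>((CombineFileSplit) split, context,
                WholeFileRecordReader.class);
    }

    // Emits one record per packed file: key = file path, value = raw file bytes
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final Configuration conf;
        private Text key;
        private BytesWritable value;
        private boolean processed = false;

        // CombineFileRecordReader instantiates readers through this exact constructor signature
        public WholeFileRecordReader(CombineFileSplit split, TaskAttemptContext context,
                Integer index) {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.conf = context.getConfiguration();
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) length];
            try (FSDataInputStream in = path.getFileSystem(conf).open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key = new Text(path.toString());
            value = new BytesWritable(contents);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}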

MapReduce classes used for writing files

You need to save the text content of the documents in files that are easy to process in Hadoop. You can use sequence files, but in this example, you create delimited text files that contain the contents of each file in one record. This method makes the content easy to read and easy to use in downstream MapReduce jobs. The Java classes used for writing files in MapReduce are OutputFormat and RecordWriter. These classes are similar to InputFormat and RecordReader, except that they are used for output. FileOutputFormat is the file-based implementation of OutputFormat: it holds the path of the output directory and files and defines how the output must be written.

RecordWriter, which is created within the OutputFormat class, defines the way each record passed from the mappers is to be written in the output path.
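To tie the pieces together, the following map-only job sketch reads whole files through the SmallFilesInputFormat from Listing 5, extracts their text with Tika in the mapper, and writes one delimited record per document through TextOutputFormat, whose RecordWriter separates key and value with a tab by default (the separator can be changed with the mapreduce.output.textoutputformat.separator property in Hadoop 2). Class names and paths are illustrative; the article's own OutputFormat implementation may differ.

Listing 6. Map-only Tika extraction job (illustrative sketch)
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaExtractionJob {

    // Each input record is one whole small file: key = file path, value = raw bytes
    public static class TikaMapper extends Mapper<Text, BytesWritable, Text, Text> {
        private final Tika tika = new Tika();

        @Override
        protected void map(Text path, BytesWritable contents, Context context)
                throws IOException, InterruptedException {
            try {
                // Extract the text and flatten whitespace so each document fits on one output line
                String text = tika.parseToString(new ByteArrayInputStream(contents.copyBytes()));
                context.write(path, new Text(text.replaceAll("\\s+", " ")));
            } catch (TikaException e) {
                // Skip documents that Tika cannot parse
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tika text extraction");
        job.setJarByClass(TikaExtractionJob.class);

        job.setInputFormatClass(SmallFilesInputFormat.class);   // from Listing 5
        job.setMapperClass(TikaMapper.class);
        job.setNumReduceTasks(0);                               // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}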

 

Source: http://www.ibm.com/developerworks/library/ba-mapreduce-biginsights-analysis/