Foro Formación Hadoop
Apache Tika: Processing files with MapReduce
Apache Tika
The Apache Tika toolkit is a free, open source project used to read and extract text and other metadata from various types of digital documents, such as Word documents, PDF files, or rich text files. For a basic example of how the API works, create an instance of the Tika class, open a stream on the input file, and pass the stream to that instance.
Listing 1. Example of Tika
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

...

private String read() throws IOException, TikaException {
    Tika tika = new Tika();
    try (FileInputStream stream = new FileInputStream("/path_to_input_file.PDF")) {
        // Extract the document's text content as a single string
        return tika.parseToString(stream);
    }
}
If your document format is not supported by Tika (Outlook PST files, for example, are not supported), you can substitute a different Java library in the previous code listing. Tika also supports extracting metadata, but that is outside the scope of this article; it is relatively simple to add that function to the code.
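Although metadata extraction is outside the scope of the article, a minimal sketch of how it could be added is shown below. It assumes the parseToString(InputStream, Metadata) overload of the Tika facade; the helper name readWithMetadata and the path parameter are illustrative.

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper: extracts the text of a document and prints the
// metadata that Tika collects while parsing it.
private String readWithMetadata(String path) throws IOException, TikaException {
    Tika tika = new Tika();
    Metadata metadata = new Metadata();
    try (FileInputStream stream = new FileInputStream(path)) {
        // parseToString fills the Metadata object as a side effect of parsing
        String text = tika.parseToString(stream, metadata);
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
        return text;
    }
}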
Jaql
Jaql is primarily a query language for JSON, but it supports more than just JSON. It enables you to process structured and non-traditional data. Using Jaql, you can select, join, group, and filter data stored in HDFS in a manner similar to a blend of Pig and Hive. The Jaql query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into low-level queries consisting of Java MapReduce jobs. This article demonstrates how to create a Jaql I/O adapter over Apache Tika to read various document formats, and to analyze and transform them all within this one language.
MapReduce classes used to analyze small files
Typically, MapReduce works on large files stored on HDFS. When writing to HDFS, files are broken into smaller pieces (blocks) according to the configuration of your Hadoop cluster. These blocks reside on this distributed file system. But what if you need to efficiently process a large number of small files (specifically, binary files such as PDF or RTF files) using Hadoop?
Several options are available. In many cases, you can merge the small files into a larger file by creating a sequence file, which is a native storage format for Hadoop. However, creating sequence files in a single thread can be a bottleneck, and you risk losing the original files. This article offers a different approach: customize a few of the Java classes used in MapReduce. The traditional classes require a dedicated mapper for each individual file, which is inefficient when there are many small files.
As an alternative, you can process small files in Hadoop by creating a set of custom classes that tell the job the files are small enough to be handled differently from the traditional approach.
At the mapping stage, logical containers called splits are defined, and a map task runs on each split. Use custom classes to define a fixed-size split, which is filled with as many small files as it can accommodate. When the split is full, the job creates a new split and fills that one as well, and so on. Then each split is assigned to one mapper.
MapReduce classes for reading files
Three main MapReduce Java classes are used to define splits and read data during a MapReduce job: InputSplit, InputFormat, and RecordReader.
When you transfer a file from a local file system to HDFS, it is converted to blocks of 128 MB. (This default value can be changed in InfoSphere BigInsights.) Consider a file big enough to consume 10 blocks. When you read that file from HDFS as input for a MapReduce job, the same blocks are usually mapped, one by one, to splits. In this case, the file is divided into 10 splits (which implies 10 map tasks) for processing. By default, the block size and the split size are equal, but the sizes depend on the configuration settings for the InputSplit class.
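For reference, the split size bounds can be adjusted from the job driver. The fragment below is a sketch that uses Hadoop's FileInputFormat helpers; the job name and the 128 MB upper bound are assumed values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size example");
        // Lower and upper bounds, in bytes, for the computed split size;
        // by default the split size follows the HDFS block size.
        FileInputFormat.setMinInputSplitSize(job, 1);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}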
From a Java programming perspective, the class that holds the responsibility of this conversion is called an InputFormat, which is the main entry point into reading data from HDFS. From the blocks of the files, it creates a list of InputSplits. For each split, one mapper is created. Then each InputSplit is divided into records by using the RecordReader class. Each record represents a key-value pair.
FileInputFormat vs. CombineFileInputFormat
Before a MapReduce job is run, you can specify the InputFormat class to be used. The implementation of FileInputFormat requires you to create an instance of the RecordReader, and as mentioned previously, the RecordReader creates the key-value pairs for the mappers.
FileInputFormat is an abstract class that is the basis for a majority of the implementations of InputFormat. It contains the location of the input files and an implementation of how splits must be produced from these files. How the splits are converted into key-value pairs is defined in the subclasses. Some examples of its subclasses are TextInputFormat, KeyValueTextInputFormat, and CombineFileInputFormat.
Hadoop works more efficiently with large files (files that occupy more than one block). FileInputFormat converts each large file into splits, and each split is created in a way that contains part of a single file. As mentioned, one mapper is generated for each split. Figure 1 depicts how a file is treated using FileInputFormat and RecordReader in the mapping stage.
Figure 1. FileInputFormat with a large file
However, when the input files are smaller than the default block size, many splits (and therefore, many mappers) are created. This arrangement makes the job inefficient. Figure 2 shows how too many mappers are created when FileInputFormat is used for many small files.
Figure 2. FileInputFormat with many small files
To avoid this situation, CombineFileInputFormat is introduced. This InputFormat works well with small files, because it packs many of them into one split so there are fewer mappers, and each mapper has more data to process. Unlike other subclasses of FileInputFormat, CombineFileInputFormat is an abstract class that requires additional changes before it can be used. In addition to these changes, you must ensure that you prevent splitting the input. Figure 3 shows how CombineFileInputFormat treats the small files so that fewer mappers are created.
Figure 3. CombineFileInputFormat with many small files
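To show what those additional changes could look like, here is a hedged sketch of a CombineFileInputFormat subclass together with a per-file reader. The class names TikaInputFormat and TikaRecordReader, the 128 MB split cap, and the (file path, extracted text) record layout are assumptions of this sketch rather than the article's exact code; overriding isSplitable is what prevents splitting the individual input files.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Packs many small files into each split and hands every file in a split
// to the per-file reader below. The 128 MB cap is an assumed value.
public class TikaInputFormat extends CombineFileInputFormat<Text, Text> {

    public TikaInputFormat() {
        setMaxSplitSize(128L * 1024 * 1024);  // fixed-size combined splits
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // keep each small document whole
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader calls TikaRecordReader once per file in the split
        return new CombineFileRecordReader<Text, Text>(
                (CombineFileSplit) split, context, TikaRecordReader.class);
    }
}

// Emits one (file path, extracted text) record per small file, using Tika.
// In a real project this class would live in its own source file.
class TikaRecordReader extends RecordReader<Text, Text> {

    private final Path path;
    private final Configuration conf;
    private final Text key = new Text();
    private final Text value = new Text();
    private boolean processed = false;

    // CombineFileRecordReader requires exactly this constructor signature
    public TikaRecordReader(CombineFileSplit split, TaskAttemptContext context,
                            Integer index) {
        this.path = split.getPath(index);
        this.conf = context.getConfiguration();
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        // Nothing to do: the constructor already identified the file to read
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        FileSystem fs = path.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(path)) {
            key.set(path.toString());
            value.set(new Tika().parseToString(in));  // whole document as one value
        } catch (TikaException e) {
            throw new IOException(e);
        }
        processed = true;
        return true;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}

In the job driver, this input format would be selected with job.setInputFormatClass(TikaInputFormat.class), just as TextInputFormat was selected in the earlier fragment.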
MapReduce classes used for writing files
You need to save the text content of the documents in files that are easy to process in Hadoop. You can use sequence files, but in this example, you create delimited text files that contain the contents of each file in one record. This method makes the content easy to read and easy to use in downstream MapReduce jobs. The Java classes used for writing files in MapReduce are OutputFormat and RecordWriter. These classes are similar to InputFormat and RecordReader, except that they are used for output. FileOutputFormat implements OutputFormat; it contains the path of the output files and directory and includes instructions for how the write job must be run. RecordWriter, which is created within the OutputFormat class, defines the way each record passed from the mappers is written to the output path.
Source: http://www.ibm.com/developerworks/library/ba-mapreduce-biginsights-analysis/