Tuesday, June 10, 2014

Use of Java IO/NIO Packages in HDFS 2.0



Java I/O and NIO

First, some background on Java I/O. Java models input and output as streams. The abstract class InputStream is the superclass of every input source that can be modeled as a stream of bytes. FileInputStream is the InputStream subclass for file I/O; it is usually constructed from a File, although a path string or a FileDescriptor also works. A File object exposes many filesystem properties, such as whether the path is a regular file (isFile) and what a directory contains (listFiles).
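As a quick illustration of the stream model, here is a minimal sketch that reads a file through the InputStream abstraction. The class name StreamReadExample and the helper countBytes are made up for this post; only File, FileInputStream and FileOutputStream come from the JDK.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example class, not part of the JDK or Hadoop.
public class StreamReadExample {
    // Reads the whole file through the InputStream abstraction and returns the byte count.
    public static long countBytes(File file) throws IOException {
        long total = 0;
        byte[] buf = new byte[4096];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("io-demo", ".txt");
        file.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write("hello".getBytes("UTF-8"));
        }
        // File exposes filesystem properties; the stream only exposes bytes.
        System.out.println("isFile = " + file.isFile());
        System.out.println("bytes  = " + countBytes(file));
    }
}
```

Note that the caller of countBytes only ever sees an InputStream, which is exactly why HDFS can later plug in its own stream types.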

Building on the I/O package, the NIO package (New I/O, often read as non-blocking I/O) provides richer features by exposing lower-level control. Its central abstraction is the Buffer class. Another important abstraction is the Channel, which is closely tied to non-blocking I/O.
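To make the Buffer and Channel pair concrete, here is a small sketch that writes a string through a FileChannel and reads it back into a ByteBuffer. ChannelBufferExample and roundTrip are names invented for this example; the channel here is a plain blocking FileChannel, so it shows the buffer mechanics rather than non-blocking behavior.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical example class for illustration only.
public class ChannelBufferExample {
    // Writes text to the file via a channel, then reads it back via a channel.
    public static String roundTrip(File file, String text) throws IOException {
        // Closing the channel also closes the stream it was obtained from.
        try (FileChannel out = new FileOutputStream(file).getChannel()) {
            ByteBuffer buf = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                out.write(buf); // drains the buffer from position to limit
            }
        }
        try (FileChannel in = new FileInputStream(file).getChannel()) {
            ByteBuffer buf = ByteBuffer.allocate((int) in.size());
            while (buf.hasRemaining() && in.read(buf) != -1) {
                // keep filling until the buffer is full or the file ends
            }
            buf.flip(); // switch the buffer from filling to draining
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }
}
```

The flip() call is the part that usually trips people up: a Buffer tracks position and limit, and the same object is used for both filling and draining.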

Starting with Java 7, the NIO2 package (java.nio.file) exposes even lower-level filesystem control. It introduces the Path class, which abstracts a file's location in the file system, and the Files class, which supports many kinds of file operations, including creating and managing symbolic links.
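Here is a minimal sketch of Path and Files in action, including the symbolic link case mentioned above. Nio2Example and readThroughLink are hypothetical names, and the example assumes a platform and user that are allowed to create symbolic links (this can fail on Windows without extra privileges).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class for illustration only.
public class Nio2Example {
    // Creates target.txt and link.txt (a symlink to it) under dir,
    // then reads the file content through the link.
    public static String readThroughLink(Path dir, String text) throws IOException {
        Path target = dir.resolve("target.txt");
        Path link = dir.resolve("link.txt");
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        // Assumption: the filesystem supports symbolic links.
        Files.createSymbolicLink(link, target);
        return new String(Files.readAllBytes(link), StandardCharsets.UTF_8);
    }
}
```

A typical call would pass a scratch directory, for example the result of Files.createTempDirectory("demo").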

Java I/O Packages in HDFS

HDFS defines its own abstract stream types, FSInputStream and FSOutputStream, to model HDFS stream input and output. The main purpose of these custom streams is better position tracking; beyond that they don't do much.
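To show what "position tracking" means in practice, here is a toy stream in plain Java that adds seek and getPos on top of InputStream. This is not Hadoop's FSInputStream; the class PositionTrackingStream and its in-memory backing array are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical class for illustration; it is NOT Hadoop's FSInputStream.
public class PositionTrackingStream extends InputStream {
    private final byte[] data; // stand-in for the underlying file content
    private int pos;           // current read offset

    public PositionTrackingStream(byte[] data) {
        this.data = data;
    }

    @Override
    public int read() {
        // Return the next byte as 0..255, or -1 at end of stream.
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    // Moves the read position to an absolute offset.
    public void seek(long target) throws IOException {
        if (target < 0 || target > data.length) {
            throw new IOException("Cannot seek to " + target);
        }
        pos = (int) target;
    }

    // Reports the current offset from the start of the stream.
    public long getPos() {
        return pos;
    }
}
```

A plain InputStream has no way to ask "where am I?" or to jump backwards reliably, which is the gap such a subclass fills.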

DFSInputStream and DFSOutputStream extend these FS stream superclasses. The DFS input/output streams handle the main HDFS logic, such as locating the local files on DataNodes.

Java NIO Packages in HDFS

HDFS uses only two types of NIO buffers: ByteBuffer and MappedByteBuffer.
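For readers who have not used MappedByteBuffer, here is a small plain-Java sketch of memory-mapping a file and reading it as a buffer. It does not show how HDFS itself uses these buffers; MappedBufferExample and sumBytes are hypothetical names for this post.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical example class for illustration only.
public class MappedBufferExample {
    // Maps the file into memory and sums its bytes without explicit read() calls.
    public static int sumBytes(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int sum = 0;
            while (mapped.hasRemaining()) {
                sum += mapped.get(); // MappedByteBuffer is a ByteBuffer, same API
            }
            return sum;
        }
    }
}
```

Since MappedByteBuffer extends ByteBuffer, code written against ByteBuffer works on mapped files unchanged.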

NIO2 is not used in HDFS.


Example: ingest a local file into HDFS with copyFromLocal

(shell) CopyCommands / CommandWithDestination

   * run
   * |-> {@link #processOptions(LinkedList)}
   * \-> {@link #processRawArguments(LinkedList)}
   *      |-> {@link #expandArguments(LinkedList)}
   *      |   \-> {@link #expandArgument(String)}*
   *      \-> {@link #processArguments(LinkedList)}
   *          |-> {@link #processArgument(PathData)}*
   *          |   |-> {@link #processPathArgument(PathData)}
   *          |   \-> {@link #processPaths(PathData, PathData...)}
   *          |        \-> {@link #processPath(PathData)}*
   *          \-> {@link #processNonexistentPath(PathData)}
   
   * |-> copyFileToTarget
   * \-> Open an InputStream from the path
   *  |-> copyStreamToTarget
   *  \-> Create a TargetFileSystem, which is a subclass of FilterFileSystem (subclass of FileSystem)
   *   |-> writeStreamToFile
   *   |-> create() an FSOutputStream from the path
   *   \-> IOUtils copyBytes() from input stream to output stream
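The last step of the call tree above boils down to a buffered copy loop between two streams. The sketch below is a plain-Java stand-in for that step, not Hadoop's IOUtils; the class CopyBytesSketch, its copyBytes signature and the returned byte count are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical stand-in for the stream copy step; NOT Hadoop's IOUtils.
public class CopyBytesSketch {
    // Copies everything from in to out with a fixed-size buffer and
    // returns the number of bytes copied. The caller closes both streams.
    public static long copyBytes(InputStream in, OutputStream out, int bufferSize)
            throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // write only the bytes actually read
            total += n;
        }
        out.flush();
        return total;
    }
}
```

In the HDFS case the input side would be the local file's stream and the output side the stream returned by create(), so all the HDFS-specific work hides behind the OutputStream interface.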

DFSInputStream/DFSOutputStream

DFSOutputStream is a subclass of FSOutputSummer (weird name!). FSOutputSummer takes care of checksumming the data packets: its write() method copies the incoming bytes into buf[] one chunk at a time and updates the checksum values. The writeChunk() method, overridden in DFSOutputStream, then places the data on dataQueue by calling waitAndQueueCurrentPacket().
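The buffer, checksum, then queue pattern can be sketched in plain Java as follows. This is a heavily simplified toy, not Hadoop's code: ChunkSummingStream, its Chunk holder, the use of CRC32 and the one-chunk-per-queue-entry layout are all assumptions for illustration (the real classes group chunks into packets and hand them to a separate streamer thread).

```java
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;
import java.util.zip.CRC32;

// Hypothetical sketch; names and structure are simplified, NOT Hadoop's actual code.
public class ChunkSummingStream extends OutputStream {
    // A chunk of data together with its checksum.
    public static final class Chunk {
        public final byte[] data;
        public final long checksum;
        Chunk(byte[] data, long checksum) {
            this.data = data;
            this.checksum = checksum;
        }
    }

    private final byte[] buf; // holds at most one chunk of pending bytes
    private int count;        // number of valid bytes in buf
    public final Queue<Chunk> dataQueue = new ArrayDeque<>();

    public ChunkSummingStream(int bytesPerChecksum) {
        buf = new byte[bytesPerChecksum];
    }

    @Override
    public void write(int b) {
        buf[count++] = (byte) b;
        if (count == buf.length) {
            flushChunk(); // a full chunk is ready to be checksummed
        }
    }

    @Override
    public void flush() {
        if (count > 0) {
            flushChunk(); // emit the trailing partial chunk
        }
    }

    private void flushChunk() {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, count);
        writeChunk(Arrays.copyOf(buf, count), crc.getValue());
        count = 0;
    }

    // Hook that a subclass could override; here it just queues the chunk.
    protected void writeChunk(byte[] data, long checksum) {
        dataQueue.add(new Chunk(data, checksum));
    }
}
```

Writing 5 bytes with a 2-byte chunk size and then calling flush() leaves three entries on dataQueue: two full chunks and one 1-byte chunk, each carrying its own checksum.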