Zhe's thoughts

Wednesday, July 23, 2014

Leaving IBM, joining Cloudera

Starting this August I'll join Cloudera as a Software Engineer in the HDFS team. Waving good-bye to my 4.5 year tenure as a Researcher, first at Oak Ridge National Lab and then at IBM Watson Lab. At this point I'd like to log my thoughts behind this choice. Around every job change there are two factors: compensation and happiness. The second factor, which includes sense of accomplishment, recognition from other people, and pure interest, will be the covered in this post.

Politics etc.
I didn't actually get particularly bad big company politics at IBM. Most of my Research or non-Research colleagues are very reasonable. However, in a company with 400,000 employees, it is very easy to step on other people's toes. You'll find yourself spending more time than you want on planning rather than doing the real work.

Another consequence of the large size is that executives know relatively little about the technical details, including the state-of-the-art. Meanwhile your performance score depends on how they like your demo. Therefore, the incentive mechanism is sometimes misaligned with what you know as the right things to do.

Breadth vs. Depth
During my 3.5 years at IBM Research I have worked on VM provisioning, cloud BFT, VM caching, software license management, and a little bit of HDFS as a side project. From my observation that's more or less the case for many researchers here. IBM's business model decides that we do a lot of "integration of XXX with YYY", and "XXX-as-a-service" here. I totally see the value-add and the technical challenges of those projects. However, I just prefer to go "crazy deep" on the XXX problem itself before moving to the next subject.

Things I will miss about IBM Research:

The "never-jam" Taconic Parkway. Seriously, I will miss Westchester -- easy commute to NYC, (relatively) cheap housing, Bear Mountain, ...
The "real scientists" running around the lab in white coats. Seriously, I will miss the resourcefulness -- you can grab an expert in any area (you name it, material science, partial differential equations, ...) if your project needs.
The close connections with academia.
The T. J. Watson Center. The remarkable building, the beautiful library, and neat offices.

Things I won't miss about IBM Research:

"Welcome to the teleconference service, please enter your access code" ... "There are 17 participants on the call, you are joining as a participant". Seriously, there are just so many hour-long meetings where you only speak for 5 seconds.
Having to pay for coffee, having to use 4 year old laptop, having to try so hard to get a 23'' monitor. The HR department just doesn't consider it a priority to enhance employee morale and productivity.

Things I look forward to about Cloudera:

The "hacker mentality" and pride as a hardcore engineer.
A zoomed-in view of how people are using large distributed storage systems in production.

Tuesday, June 10, 2014

Use of Java IO/NIO Packages in HDFS 2.0

Java I/O and NIO

First, some background about Java I/O. Java models input/output as streams. InputStream (abstract) is the superclass of all input types that can be modeled as a stream. FileInputStream is a subclass of InputStream representing file I/O. A FileInputStream needs to be created based on a File. A File object contains many filesystem properties, including file type (isFile), directory structure (listFiles), etc.

Extending from the I/O package, NIO (new I/O, or non-blocking I/O) package provides richer features by exposing lower level control. The central abstraction is a Buffer class. Another interesting abstraction is Channels, which are closely related to non-blocking I/O.

Starting from Java 7, the NIO2 package (java.nio.file) is available to expose even lower level filesystem control. A Path class is presented, abstracting a file's path in the file system. The Files class is capable of many types of file operations such as creating and managing symbolic links.

Java I/O Packages in HDFS

HDFS uses a new type of input/output stream named FSInputStream/FSOutputStream (abstract). They model HDFS stream input/output. The main purpose of having custom file input/output stream is for better position tracking (they don't do much).

DFSInputStream/DFSOutputStream further extends their FS stream superclasses. DFS input/output streams handle the main HDFS logic of locating local files on DataNodes etc.

Java NIO Packages in HDFS

HDFS only uses 2 types of NIO buffers: ByteBuffer and MappedByteBuffer.

NIO2 is not used in HDFS.

Example: ingest a local file into HDFS with copyFromLocal

(shell) CopyCommands / CommandsWithDestination

* run

* |-> {@link #processOptions(LinkedList)}

* \-> {@link #processRawArguments(LinkedList)}

* |-> {@link #expandArguments(LinkedList)}

* | \-> {@link #expandArgument(String)}*

* \-> {@link #processArguments(LinkedList)}

* |-> {@link #processArgument(PathData)}*

* | |-> {@link #processPathArgument(PathData)}

* | \-> {@link #processPaths(PathData, PathData...)}

* | \-> {@link #processPath(PathData)}*

* \-> {@link #processNonexistentPath(PathData)}

* |-> copyFileToTarget

* \-> Open an InputStream from the path

* |-> copyStreamToTarget

* \-> Create a TargetFileSystem, which is subclass if FilterFileSystem (subclass of FileSystem)

* |-> writeStreamToFile
* |-> create() a FSOutputStream from the path
* \-> IOUtils copyBytes() from input stream to output stream

DFSInputStream/DFSOutputStream

DFSOutputStream is a subclass of FSOutputSummer (weird name!). FSOutputSummer takes care of checksumming the data packets. The write() method in FSOutputSummer basically writes each chunk into the buf[] and updates the checksum values. Then the writeChunk() function (overridden in DFSOutputStream) takes care of putting the data in dataQueue in the waitAndQueueCurrentPacket() method.

Monday, March 3, 2014

Choice of Java Containers in HDFS 2.0

When programming in a high-level language (Java, Python), we face the choice among numerous types of containers. What's the pros and cons?

I did a quick analysis on the choice of containers in HDFS 2.0 code.

There are 935 classes in total
211 of them have imported List
137 have imported ArrayList
116 have imported Map
80 have imported HashMap
31 have imported Set
27 have imported LinkedList
17 have imported HashSet
14 have imported TreeMap
10 have imported SortedTree
5 have imported Queue
4 have imported SortedMap
3 have imported Stack
3 have imported CopyOnWriteArrayList
4 have imported SortedSet
1 has imported Deque

Plain Array
INodeFile just uses a plain Array blocks to store the set of blocks. Whenever more blocks are added, it discards the current array and use a new one.

ArrayList
First, let's look at the popular ArrayList. It is an implementation of the List interface based on an array. It should be used if the contained items are not updated frequently. For instance, in DFSClient, the list of usable local interface addresses were initialized once, and used by random (getLocalInterfaceAddrs). After all it's the fastest to access an item of an array by index.

HashMap
HashMap should be used when insert and lookup are the most common operations. For instance, in DFSClient it is used to store the list of blocks stored at each datanode (getBlockStorageLocations). It is also used frequently to store additional information of a basic data structure.

HashSet
HashSet comes handy when you only want to operate on values without worrying about keys. For example, in FSNameSystem->getNamespaceEditsDirs, it is used to deduplicate a list. Note that LinkedHashSet is used in this example, to preserver the ordering of inserted elements.

Misc -- Sorting
Collection.sort() has been used by 7 classes.

Python list and dictionary are similar to Java List and Set. But you can easily sort a Python list with either sorted function (new list) or the sort method (in place).

Misc -- Superclass vs. Subclass
Why are some collections declared as superclass and initialized as a concrete class? Like in LeaseManager, private final Collection<String> paths = new TreeSet<String>(); Reasons found here.

Popular Container Syntax

ArrayList

add(e): constant time, resizing is optimized by doubling the size every time
add(i, e): linear time complexity
contains(e), get(i), set(i, e)
remove(i), remove(e) -- first occurrence
toArray()
size()

LinkedList

addFirst(e), addLast(e), removeFirst(), removeLast()
getFirst(), getLast()

Python List

append()

Python Set

Learning Erlang

It is based on the concept of Lambda calculus (http://en.wikipedia.org/wiki/Lambda_calculus). A key distinction is the use of nested functions.

It is an appropriate abstraction when processing SPMD workloads. Compared to OpenMP/MPI it has built-in support for load balancing and fault tolerance; hence easier to develop.

Monday, February 3, 2014

Day project - webbot (automated web surfing)

Some web pages requires mouse clicks to display contents. So I'm writing a small robot to do the job:

http://www.geekorgy.com/index.php/2010/06/python-mouse-click-and-move-mouse-in-apple-mac-osx-snow-leopard-10-6-x/

Sunday, January 26, 2014

Corrupted /boot partition

A lot of sys admin lessons learnt recently. Today I just managed to corrupt the /boot partition of 6 physical servers (32 core and 96GB RAM each)! Personal record on biggest mess-up. What I did was to to 'make install' -> no space left on /dev -> deleted old kernel entries. At one time I forgot to delete old entries and did 'make install' 2 or 3 times, I guess that made /boot unhappy.

But compared to Google's mess-up, I feel little again...