Monday, March 3, 2014

Choice of Java Containers in HDFS 2.0



When programming in a high-level language (Java, Python), we face the choice among numerous types of containers. What's the pros and cons?

I did a quick analysis on the choice of containers in HDFS 2.0 code.


  • There are 935 classes in total
  • 211 of them have imported List
  • 137 have imported ArrayList
  • 116 have imported Map
  • 80 have imported HashMap
  • 31 have imported Set
  • 27 have imported LinkedList
  • 17 have imported HashSet
  • 14 have imported TreeMap
  • 10 have imported SortedTree
  • 5 have imported Queue
  • 4 have imported SortedMap
  • 3 have imported Stack
  • 3 have imported CopyOnWriteArrayList
  • 4 have imported SortedSet
  • 1 has imported Deque
Plain Array
INodeFile just uses a plain Array blocks to store the set of blocks. Whenever more blocks are added, it discards the current array and use a new one.

ArrayList
First, let's look at the popular ArrayList. It is an implementation of the List interface based on an array. It should be used if the contained items are not updated frequently. For instance, in DFSClient, the list of usable local interface addresses were initialized once, and used by random (getLocalInterfaceAddrs). After all it's the fastest to access an item of an array by index.


HashMap
HashMap should be used when insert and lookup are the most common operations. For instance, in DFSClient it is used to store the list of blocks stored at each datanode (getBlockStorageLocations). It is also used frequently to store additional information of a basic data structure. 

HashSet
HashSet comes handy when you only want to operate on values without worrying about keys. For example, in FSNameSystem->getNamespaceEditsDirs, it is used to deduplicate a list. Note that LinkedHashSet is used in this example, to preserver the ordering of inserted elements.

Misc -- Sorting
Collection.sort() has been used by 7 classes.

Python list and dictionary are similar to Java List and Set. But you can easily sort a Python list with either sorted function (new list) or the sort method (in place).

Misc -- Superclass vs. Subclass
Why are some collections declared as superclass and initialized as a concrete class? Like in LeaseManager, private final Collection<String> paths = new TreeSet<String>(); Reasons found here.

Popular Container Syntax

  • ArrayList
    • add(e): constant time, resizing is optimized by doubling the size every time
    • add(i, e): linear time complexity
    • contains(e), get(i), set(i, e)
    • remove(i), remove(e) -- first occurrence 
    • toArray()
    • size()
  • LinkedList
    • addFirst(e), addLast(e), removeFirst(), removeLast()
    • getFirst(), getLast()
  • Python List
    • append()
  • Python Set
    • add




Learning Erlang

It is based on the concept of Lambda calculus (http://en.wikipedia.org/wiki/Lambda_calculus). A key distinction is the use of nested functions.

It is an appropriate abstraction when processing SPMD workloads. Compared to OpenMP/MPI it has built-in support for load balancing and fault tolerance; hence easier to develop.