tag:blogger.com,1999:blog-81537440224076464612024-03-05T03:12:20.327-08:00Zhe's thoughtsAnonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-8153744022407646461.post-10689620114360643902014-07-23T12:22:00.000-07:002014-07-23T12:27:54.772-07:00Leaving IBM, joining ClouderaStarting this August I'll join Cloudera as a Software Engineer in the HDFS team. Waving good-bye to my 4.5 year tenure as a Researcher, first at Oak Ridge National Lab and then at IBM Watson Lab. At this point I'd like to log my thoughts behind this choice. Around every job change there are two factors: compensation and happiness. The second factor, which includes sense of accomplishment, recognition from other people, and pure interest, will be covered in this post.<br />
<b><br /></b>
<b>
Politics etc.</b><br />
I didn't actually run into particularly bad big-company politics at IBM. Most of my colleagues, in Research or elsewhere, are very reasonable. However, in a company with 400,000 employees, it is very easy to step on other people's toes. You'll find yourself spending more time than you want on planning rather than doing the real work.<br />
<br />
Another consequence of the large size is that executives know relatively little about the technical details, including the state-of-the-art. Meanwhile your performance score depends on how they like your demo. Therefore, the incentive mechanism is sometimes misaligned with what you know as the right things to do.<br />
<b><br /></b>
<b>
Breadth vs. Depth</b><br />
During my 3.5 years at IBM Research I have worked on VM provisioning, cloud BFT, VM caching, software license management, and a little bit of HDFS as a side project. From my observation that's more or less the case for many researchers here. IBM's business model dictates that we do a lot of "integration of XXX with YYY" and "XXX-as-a-service" here. I totally see the value-add and the technical challenges of those projects. However, I just prefer to go "crazy deep" on the XXX problem itself before moving to the next subject.<br />
<b><br /></b>
<b>
Things I will miss about IBM Research:</b><br />
<ol>
<li>The "never-jam" Taconic Parkway. Seriously, I will miss Westchester -- easy commute to NYC, (relatively) cheap housing, Bear Mountain, ...</li>
<li>The "real scientists" running around the lab in white coats. Seriously, I will miss the resourcefulness -- you can grab an expert in any area (you name it, material science, partial differential equations, ...) if your project needs one. </li>
<li>The close connections with academia.</li>
<li>The T. J. Watson Center. The remarkable building, the beautiful library, and neat offices.</li>
</ol>
<div>
<b>
Things I won't miss about IBM Research:</b></div>
<div>
<ol>
<li>"Welcome to the teleconference service, please enter your access code" ... "There are 17 participants on the call, you are joining as a participant". Seriously, there are just so many hour-long meetings where you only speak for 5 seconds.</li>
<li>Having to pay for coffee, having to use a 4-year-old laptop, having to try so hard to get a 23'' monitor. The HR department just doesn't consider it a priority to enhance employee morale and productivity.</li>
</ol>
<b>
Things I look forward to about Cloudera:</b></div>
<div>
<ol>
<li>The "hacker mentality" and pride as a hardcore engineer.</li>
<li>A zoomed-in view of how people are using large distributed storage systems in production.</li>
</ol>
</div>
Anonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.com0tag:blogger.com,1999:blog-8153744022407646461.post-88945769808205376432014-06-10T08:28:00.002-07:002014-06-16T11:58:11.326-07:00Use of Java IO/NIO Packages in HDFS 2.0<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFBIiK89_v705BuU02zN2I8rgBLrvquN8L20z9pt2ImflKoR2PsQDKRW3T52OFVilPPR5_Bo2bPUSpj5eoK_6apSS1k5C36D4ufjRIImewPJ96wSoK5wDODkkUcczmbstfniQZQdtQRfO3/s1600/IBM_0665-30_HDD_1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Times, Times New Roman, serif;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFBIiK89_v705BuU02zN2I8rgBLrvquN8L20z9pt2ImflKoR2PsQDKRW3T52OFVilPPR5_Bo2bPUSpj5eoK_6apSS1k5C36D4ufjRIImewPJ96wSoK5wDODkkUcczmbstfniQZQdtQRfO3/s1600/IBM_0665-30_HDD_1.jpg" height="191" width="320" /></span></a></div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<br />
<h3>
<span style="font-family: Times, Times New Roman, serif;">
Java I/O and NIO</span></h3>
<span style="font-family: Times, Times New Roman, serif;">First, some background on <a href="http://docs.oracle.com/javase/7/docs/api/java/io/package-summary.html">Java I/O</a>. Java models input/output as streams. <a href="http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html">InputStream</a> (abstract) is the superclass of all input types that can be modeled as a stream. <a href="http://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html">FileInputStream</a> is the subclass of InputStream for reading files. A FileInputStream is typically created from a <a href="http://docs.oracle.com/javase/7/docs/api/java/io/File.html">File</a> (a path name or a FileDescriptor also works). A File object exposes many filesystem properties, including file type (isFile), directory contents (listFiles), etc.</span><br />
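To make the stream model concrete, here is a minimal sketch that drains a FileInputStream created from a File. The class and method names (StreamReadSketch, countBytes, writeTempFile) are made up for illustration, and wrapping IOException in UncheckedIOException is just a convenience of this sketch.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class StreamReadSketch {
    /** Counts the bytes in a file by reading its FileInputStream to the end. */
    static long countBytes(File file) {
        long total = 0;
        byte[] buf = new byte[4096];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) { // -1 signals end of stream
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Creates a temporary file holding {@code size} zero bytes (demo helper). */
    static File writeTempFile(int size) {
        try {
            File tmp = File.createTempFile("stream-sketch", ".bin");
            tmp.deleteOnExit();
            try (OutputStream out = new FileOutputStream(tmp)) {
                out.write(new byte[size]);
            }
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File tmp = writeTempFile(10000);
        System.out.println(tmp.isFile() + " " + countBytes(tmp));
    }
}
```

Note that the File object answers the filesystem questions (isFile), while the stream only knows how to hand out bytes.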
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">Extending the I/O package, the <a href="http://en.wikipedia.org/wiki/Non-blocking_I/O_(Java)">NIO</a> (new I/O, or non-blocking I/O) package provides richer features by exposing lower level control. The central abstraction is the <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/Buffer.html">Buffer</a> class. Another interesting abstraction is <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/channels/package-summary.html">Channels</a>, which are closely related to non-blocking I/O.</span><br />
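A small sketch of how a Buffer and a Channel work together: the channel fills the buffer, flip() switches it to draining mode, and clear() makes it ready for the next fill. The class name BufferChannelSketch and the tiny 8-byte buffer are choices made for this illustration only; the channel here wraps an in-memory stream rather than a socket or file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

class BufferChannelSketch {
    /** Reads everything from a channel through a small, reused ByteBuffer. */
    static String readAll(ReadableByteChannel channel) {
        ByteBuffer buf = ByteBuffer.allocate(8); // deliberately tiny to force several rounds
        StringBuilder sb = new StringBuilder();
        try {
            while (channel.read(buf) != -1) {
                buf.flip();                      // switch from filling to draining
                while (buf.hasRemaining()) {
                    sb.append((char) buf.get()); // ASCII-only demo
                }
                buf.clear();                     // ready to be filled again
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Pushes an ASCII string through a channel and returns what came out. */
    static String demo(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return readAll(Channels.newChannel(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) {
        System.out.println(demo("hello, nio buffers"));
    }
}
```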
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">Starting from Java 7, the <a href="http://docs.oracle.com/javase/tutorial/essential/io/fileio.html">NIO2</a> package (java.nio.file) is available to expose even lower level filesystem control. A <a href="http://docs.oracle.com/javase/tutorial/essential/io/pathClass.html">Path</a> class is presented, abstracting a file's path in the file system. The <a href="http://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html">Files</a> class is capable of many types of file operations such as creating and managing symbolic links.</span><br />
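For comparison, a minimal NIO2 sketch using Path and Files. The class name Nio2Sketch and the file names are hypothetical. Symbolic links (Files.createSymbolicLink) are left out because they are not supported on every platform.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Nio2Sketch {
    /** Writes text to a file in a fresh temp directory, reads it back, cleans up. */
    static String roundTrip(String text) {
        try {
            Path dir = Files.createTempDirectory("nio2-sketch");
            Path file = dir.resolve("note.txt"); // Path composition, no string concatenation
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            String back = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            Files.delete(dir);
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the last element of a path built from separate components. */
    static String lastElement(String first, String... more) {
        return Paths.get(first, more).getFileName().toString();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello") + " " + lastElement("a", "b", "c.txt"));
    }
}
```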
<h3>
<span style="font-family: Times, Times New Roman, serif;">
Java I/O Packages in HDFS</span></h3>
<span style="font-family: Times, Times New Roman, serif;">HDFS uses a new type of input/output stream named FSInputStream/FSOutputStream (abstract). They model HDFS stream input/output. The main purpose of having custom file input/output streams is better position tracking (they don't do much else).</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">DFSInputStream/DFSOutputStream further extend their FS stream superclasses. DFS input/output streams handle the main HDFS logic of locating local files on DataNodes etc. </span><br />
<h3>
<span style="font-family: Times, Times New Roman, serif;">
Java NIO Packages in HDFS</span></h3>
<span style="font-family: Times, Times New Roman, serif;">HDFS only uses 2 types of NIO buffers: <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html">ByteBuffer</a> and <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html">MappedByteBuffer</a>.</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">NIO2 is not used in HDFS.</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<br />
<h3>
<span style="font-family: Times, Times New Roman, serif;">Example: ingest a local file into HDFS with <i>copyFromLocal</i></span></h3>
<h4>
<span style="font-family: Times, Times New Roman, serif;"><u>(shell) CopyCommands / CommandsWithDestination</u></span></h4>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * run
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * |-> {@link #processOptions(LinkedList)}
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * \-> {@link #processRawArguments(LinkedList)}
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * |-> {@link #expandArguments(LinkedList)}
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * | \-> {@link #expandArgument(String)}*
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * \-> {@link #processArguments(LinkedList)}
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * |-> {@link #processArgument(PathData)}*
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * | |-> {@link #processPathArgument(PathData)}
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * | \-> {@link #processPaths(PathData, PathData...)}
</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<span style="font-family: Times, Times New Roman, serif;"> * | \-> {@link #processPath(PathData)}*
</span></div>
<br />
<div>
<span style="font-family: Times, Times New Roman, serif;"> * \-> {@link #processNonexistentPath(PathData)} </span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"> </span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"> * |-> copyFileToTarget</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"> * \-> Open an InputStream from the path</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"> * |-> copyStreamToTarget</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"> * \-> Create a TargetFileSystem, which is a subclass of FilterFileSystem (subclass of FileSystem)</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"> * |-> writeStreamToFile</span><br />
<span style="font-family: Times, Times New Roman, serif;"> * |-> create() a FSOutputStream from the path</span><br />
<span style="font-family: Times, Times New Roman, serif;"> * \-> IOUtils copyBytes() from input stream to output stream</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<h4>
<span style="font-family: Times, Times New Roman, serif;"><u>DFSInputStream/DFSOutputStream</u></span></h4>
<span style="font-family: Times, Times New Roman, serif;">DFSOutputStream is a subclass of FSOutputSummer (weird name!). FSOutputSummer takes care of checksumming the data packets. The write() method in FSOutputSummer basically writes each chunk into the buf[] and updates the checksum values. Then the writeChunk() function (overridden in DFSOutputStream) takes care of putting the data in dataQueue in the </span>waitAndQueueCurrentPacket() method. </div>
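The per-chunk checksumming idea can be sketched in plain Java. This is not the HDFS implementation: the class name ChunkChecksumSketch is made up, CRC32 from java.util.zip stands in for whatever checksum type a cluster is configured with, and the chunk size is simply a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

class ChunkChecksumSketch {
    /** Splits data into fixed-size chunks and returns one CRC32 value per chunk. */
    static List<Long> checksums(byte[] data, int bytesPerChunk) {
        List<Long> sums = new ArrayList<Long>();
        CRC32 crc = new CRC32();
        for (int off = 0; off < data.length; off += bytesPerChunk) {
            int len = Math.min(bytesPerChunk, data.length - off); // last chunk may be short
            crc.reset();
            crc.update(data, off, len);
            sums.add(crc.getValue());
        }
        return sums;
    }

    public static void main(String[] args) {
        // 1300 bytes with 512-byte chunks -> 3 chunks (512 + 512 + 276)
        System.out.println(checksums(new byte[1300], 512).size());
    }
}
```

In the real write path the checksums travel with the data in packets; the sketch only shows how a byte stream is cut into chunks that each get their own checksum.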
Anonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.com0tag:blogger.com,1999:blog-8153744022407646461.post-12603577356920150102014-03-03T14:24:00.000-08:002014-04-14T09:43:35.074-07:00Choice of Java Containers in HDFS 2.0<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://cincinnati.com/blogs/newintown/files/2011/10/container-store.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://cincinnati.com/blogs/newintown/files/2011/10/container-store.jpg" height="213" width="320" /></a></div>
<br />
When programming in a high-level language (Java, Python), we face a choice among numerous types of containers. What are the pros and cons?<br />
<br />
I did a quick analysis on the choice of containers in HDFS 2.0 code.<br />
<br />
<br />
<ul>
<li>There are 935 classes in total</li>
<li><i><span style="color: blue;">211</span></i> of them have imported <i><span style="color: blue;">List</span></i></li>
<li><i><span style="color: blue;">137</span></i> have imported <i><span style="color: blue;">ArrayList</span></i></li>
<li><i><span style="color: blue;">116</span></i> have imported <i><span style="color: blue;">Map</span></i></li>
<li><i><span style="color: blue;">80</span></i> have imported <i><span style="color: blue;">HashMap</span></i></li>
<li><span style="color: blue;"><i>31</i></span> have imported <span style="color: blue;"><i>Set</i></span></li>
<li><i><span style="color: blue;">27</span></i> have imported <i><span style="color: blue;">LinkedList</span></i></li>
<li><i><span style="color: blue;">17</span> </i>have imported<i> <span style="color: blue;">HashSet</span></i></li>
<li><i><span style="color: blue;">14</span> </i>have imported<i> <span style="color: blue;">TreeMap</span></i></li>
<li><i><span style="color: blue;">10</span> </i>have imported <i><span style="color: blue;">SortedTree</span></i></li>
<li><i><span style="color: blue;">5</span> </i>have imported <i><span style="color: blue;">Queue</span></i></li>
<li><i><span style="color: blue;">4</span> </i>have imported <i><span style="color: blue;">SortedMap</span></i></li>
<li><i><span style="color: blue;">3</span> </i>have imported <i><span style="color: blue;">Stack</span></i></li>
<li><i><span style="color: blue;">3</span> </i>have imported <i><span style="color: blue;">CopyOnWriteArrayList</span></i></li>
<li><i><span style="color: blue;">4</span> </i>have imported <i><span style="color: blue;">SortedSet</span></i></li>
<li><i><span style="color: blue;">1</span> </i>has imported <i><span style="color: blue;">Deque</span></i></li>
</ul>
<b>Plain Array</b><br />
<i>INodeFile</i> just uses a plain array, <i>blocks</i>, to store the set of blocks. Whenever more blocks are added, it discards the current array and uses a new one.<br />
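A sketch of that "replace the array on every add" pattern, with strings standing in for block objects. ArrayAppendSketch and append are hypothetical names, not the INodeFile code.

```java
import java.util.Arrays;

class ArrayAppendSketch {
    /** Returns a new array one element longer; the old array is left untouched. */
    static String[] append(String[] blocks, String newBlock) {
        String[] grown = Arrays.copyOf(blocks, blocks.length + 1);
        grown[blocks.length] = newBlock;
        return grown;
    }

    public static void main(String[] args) {
        String[] blocks = {"blk_0", "blk_1"};
        blocks = append(blocks, "blk_2"); // the caller swaps in the new array
        System.out.println(Arrays.toString(blocks));
    }
}
```

Each append copies the whole array, so this only pays off when appends are rare and a compact memory footprint matters more than append speed.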
<br />
<b>
ArrayList</b><br />
<span style="font-weight: normal;">First, let's look at the popular ArrayList. It is an implementation of the List interface based on an array. It fits best when the contained items are not updated frequently. For instance, in DFSClient, the list of usable local interface addresses is initialized once and then read at random positions (</span><span style="font-weight: normal;">getLocalInterfaceAddrs). After all, accessing an array item by index is as fast as it gets.</span><span style="font-weight: normal;"><br /></span><br />
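The "initialize once, then pick by random index" usage looks roughly like this. RandomPickSketch and the sample addresses are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class RandomPickSketch {
    /** Picks one element uniformly at random; get(i) is O(1) on an ArrayList. */
    static <T> T pick(List<T> items, Random rng) {
        return items.get(rng.nextInt(items.size()));
    }

    public static void main(String[] args) {
        List<String> addrs = new ArrayList<String>(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        System.out.println(pick(addrs, new Random()));
    }
}
```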
<b><br /></b>
<b>
HashMap</b><br />
<span style="font-weight: normal;">HashMap should be used when <i>insert</i> and <i>lookup</i> are the most common operations. For instance, in DFSClient it is used to store the list of blocks stored at each datanode (getBlockStorageLocations). It is also used frequently to store additional information of a basic data structure. </span><br />
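As an illustration of the insert-and-lookup pattern, the sketch below groups block ids by datanode. The names (BlocksByNodeSketch, group) and the use of plain strings and longs are assumptions for the example; the real DFSClient code works with richer types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BlocksByNodeSketch {
    /** Groups (datanode, blockId) pairs into a map of datanode -> block ids. */
    static Map<String, List<Long>> group(String[] nodes, long[] blockIds) {
        Map<String, List<Long>> byNode = new HashMap<String, List<Long>>();
        for (int i = 0; i < nodes.length; i++) {
            List<Long> blocks = byNode.get(nodes[i]);   // lookup
            if (blocks == null) {
                blocks = new ArrayList<Long>();
                byNode.put(nodes[i], blocks);           // insert
            }
            blocks.add(blockIds[i]);
        }
        return byNode;
    }

    public static void main(String[] args) {
        System.out.println(group(new String[] {"dn1", "dn2", "dn1"}, new long[] {1L, 2L, 3L}));
    }
}
```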
<span style="font-weight: normal;"><br /></span><b>
HashSet</b><br />
<span style="font-weight: normal;">HashSet comes in handy when you only want to operate on values without worrying about keys. For example, in FSNameSystem->getNamespaceEditsDirs, it is used to deduplicate a list. Note that this example actually uses LinkedHashSet, to preserve the insertion order of the elements.</span><br />
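Order-preserving deduplication with LinkedHashSet fits in one line. DedupSketch is a made-up name and the directory strings are sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class DedupSketch {
    /** Removes duplicates while keeping the first occurrence of each element in place. */
    static List<String> dedup(List<String> dirs) {
        return new ArrayList<String>(new LinkedHashSet<String>(dirs));
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("/a", "/b", "/a", "/c", "/b")));
    }
}
```

A plain HashSet would also remove the duplicates, but it gives no guarantee about iteration order.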
<span style="font-weight: normal;"><br /></span><b>
Misc -- Sorting</b><br />
<i>Collections.sort()</i> has been used by <i><span style="color: blue;">7</span></i> classes.<br />
<br />
Python list and dictionary are similar to Java List and Map. But you can easily sort a Python list with either the <i>sorted</i> function (new list) or the <i>sort</i> method (in place).<br />
<br />
<b>Misc -- Superclass vs. Subclass</b><br />
Why are some collections declared as a supertype but initialized as a concrete class? Like in LeaseManager, <i>private final Collection<String> paths = new TreeSet<String>(); </i>Reasons found <a href="http://stackoverflow.com/questions/9852831/polymorphism-why-use-list-list-new-arraylist-instead-of-arraylist-list-n">here</a>.<br />
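The usual argument is that code written against the supertype keeps working when the concrete class changes. A small sketch (InterfaceTypeSketch and its methods are hypothetical, not LeaseManager code):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.TreeSet;

class InterfaceTypeSketch {
    /** Depends only on Collection, so any implementation can be passed in. */
    static String describe(Collection<String> paths) {
        return paths.size() + " paths: " + paths;
    }

    static String withTreeSet() {
        Collection<String> paths = new TreeSet<String>(); // sorted, no duplicates
        paths.add("/b");
        paths.add("/a");
        paths.add("/b");
        return describe(paths);
    }

    static String withArrayList() {
        Collection<String> paths = new ArrayList<String>(); // insertion order, duplicates kept
        paths.add("/b");
        paths.add("/a");
        paths.add("/b");
        return describe(paths);
    }

    public static void main(String[] args) {
        System.out.println(withTreeSet());
        System.out.println(withArrayList());
    }
}
```

Only the right-hand side of the declaration differs between the two methods; everything that consumes <i>paths</i> is unchanged.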
<span style="font-weight: normal;"><br /></span>
<b>Popular Container Syntax</b><br />
<br />
<ul>
<li>ArrayList</li>
<ul>
<li>add(e): amortized constant time; the backing array grows geometrically when it fills up (by about 1.5x in OpenJDK, not 2x)</li>
<li>add(i, e): linear time complexity</li>
<li>contains(e), get(i), set(i, e)</li>
<li>remove(i), remove(e) -- first occurrence </li>
<li>toArray()</li>
<li>size()</li>
</ul>
<li>LinkedList</li>
<ul>
<li>addFirst(e), addLast(e), removeFirst(), removeLast()</li>
<li>getFirst(), getLast()</li>
</ul>
<li>Python List</li>
<ul>
<li>append()</li>
</ul>
<li>Python Set</li>
<ul>
<li>add</li>
</ul>
</ul>
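The Java operations listed above, exercised in a short sketch. ContainerOpsSketch is a made-up name; the comments show the list contents after each call.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ContainerOpsSketch {
    static List<String> arrayListOps() {
        List<String> list = new ArrayList<String>();
        list.add("a");      // [a]
        list.add("c");      // [a, c]
        list.add(1, "b");   // [a, b, c]  (shifts later elements: linear time)
        list.set(2, "z");   // [a, b, z]
        list.remove(0);     // [b, z]     (remove by index)
        list.remove("z");   // [b]        (remove first occurrence of the object)
        return list;
    }

    static LinkedList<Integer> linkedListOps() {
        LinkedList<Integer> q = new LinkedList<Integer>();
        q.addLast(2);       // [2]
        q.addFirst(1);      // [1, 2]
        q.addLast(3);       // [1, 2, 3]
        q.removeFirst();    // [2, 3]
        q.removeLast();     // [2]
        return q;
    }

    public static void main(String[] args) {
        System.out.println(arrayListOps() + " " + linkedListOps());
    }
}
```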
<br />
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.com0tag:blogger.com,1999:blog-8153744022407646461.post-37910411872137778552014-03-03T11:23:00.001-08:002014-03-03T11:42:31.150-08:00Learning ErlangIt is based on the concept of Lambda calculus (http://en.wikipedia.org/wiki/Lambda_calculus). A key distinction is the use of nested functions.<br />
<br />
It is an appropriate abstraction when processing SPMD workloads. Compared to OpenMP/MPI it has built-in support for load balancing and fault tolerance; hence it is easier to develop with.Anonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.com0tag:blogger.com,1999:blog-8153744022407646461.post-23813446840575419622014-02-03T11:33:00.001-08:002014-02-03T11:33:47.328-08:00Day project - webbot (automated web surfing)Some web pages require mouse clicks to display contents. So I'm writing a small robot to do the job:<br />
<br />
http://www.geekorgy.com/index.php/2010/06/python-mouse-click-and-move-mouse-in-apple-mac-osx-snow-leopard-10-6-x/Anonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.com0tag:blogger.com,1999:blog-8153744022407646461.post-73622451972415773292014-01-26T21:12:00.001-08:002014-01-27T12:32:22.689-08:00Corrupted /boot partitionA lot of sys admin lessons learnt recently. Today I just managed to corrupt the /boot partition of 6 <i>physical</i> servers (32 cores and 96GB RAM each)! Personal record on biggest mess-up. What I did was to 'make install' -> no space left on /dev -> deleted old kernel entries. At one point I forgot to delete old entries and did 'make install' 2 or 3 times; I guess that made /boot unhappy.<br />
<br />
But compared to Google's mess-up, I feel little again...Anonymoushttp://www.blogger.com/profile/07051741483059231362noreply@blogger.com0