
Hadoop – too much hype

These days one hears about Hadoop all the time. Everyone is trying to do everything with Hadoop, whether it fits the model or not. It has become fashionable, much like all the hype surrounding so-called “Cloud Computing”. Hadoop is in many instances touted as a key enabler of big data computing in the cloud. Now, what does the “cloud” really stand for? What is big data? These are open-ended questions that mean different things to different people. To me, “Cloud Computing” is primarily an operational and management model built around a set of pre-existing technologies put together in a certain way. It is not a technology in itself. That is the distinction, I feel, that most people miss.

Hadoop sits bang in the middle of all this jargon and is hyped well beyond its point of usefulness. There are a few issues I can see:

  1. Hadoop treats the computer as a black box. This is very, very bad.
    • In the name of scale-out, the algorithms take decades of R&D that have gone into computer hardware and architecture and throw them out of the window (a sketch of what the programmer actually sees follows this list).
    • Just write high-level algorithms, let the Java Virtual Machine handle all the machine-level crap, and everything is hunky-dory. I’m sorry, but the real world is a tad more complex than that. The Java junkies do not realize that, do they? Real-world Java applications end up dealing with the memory-leak issues the garbage collector was designed to avoid, and on top of that they have to deal with the scalability issues the GC itself introduces. Ain’t it fun? For large-heap applications the GC simply does not scale, so we fall back on off-heap managed storage (a small sketch follows this list). What problem did the garbage collector really solve? Does anyone remember the unassuming malloc() and free()? Oops, I’m beginning to troll … back to the point.
    • I feel parallel and distributed software should take a holistic view of the real world and optimize at every level of the stack, right from instruction-level parallelism up to large-scale distributed MapReduce algorithms.
    • Google, the progenitor of MapReduce, does just that. They even go one step further and tune upward from the hardware: their motherboards are custom, their CPUs are specially made, and so on.
    • Intel and AMD (and now ARM) keep enhancing their CPUs with every generation, and the Hadoop gang is oblivious to all of it. Do they even understand vectorization? (There is a small loop sketch after this list.)
    • The JVM JIT (and, for that matter, any compiler) cannot do every kind of optimization; some things have to be designed and implemented by the software engineer. A JIT also has to make time and space trade-offs while optimizing that an offline compiler does not face, though the run-time profile data it collects helps to some extent.
  2. Sticking only to high-level algorithms and scaling horizontally by adding more nodes means you are using compute resources inefficiently. A lot of compute cycles on each node go to waste, which in turn means energy wasted just keeping those unused cycles available. In the end it wastes electricity. Where is my green datacenter?
  3. Many jobs that supposedly require hundreds if not thousands of nodes can be done efficiently with a fraction of those compute resources, wasting very little energy. Want proof? Check this out: S3071 – Computing the Quadrillionth Digit of Pi: A Supercomputer in the Garage. The URL takes you to the sessions page, where you can press CTRL+F and type Hadoop. In short, using a single CUDA-enabled computer the presenter was able to do double the amount of processing of an 8000-core Hadoop cluster. If we assume dual-socket servers with 12 cores per socket, that 8000-core cluster comes down to about 333 nodes!
  4. One-size-fits-all solutions do not work. The real world is a heterogeneous place: we have RISC and CISC CPUs, FPGAs, Intel’s Xeon Phi, GPGPUs from Nvidia and ATI, and so on. We need to leverage and exploit the full capabilities of all this hardware.
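
To make the “black box” complaint in item 1 concrete, here is roughly what a Hadoop job looks like to the programmer: a mapper written purely in terms of key/value records, with no visibility into caches, NUMA, SIMD units or anything else below the JVM. This is the stock word-count mapper, sketched from memory against the org.apache.hadoop.mapreduce API (driver and reducer omitted), so treat it as an illustration rather than a reference; details vary across Hadoop versions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The classic word-count mapper: for every input line, emit (word, 1).
// Everything below this level -- scheduling, placement, the machine itself --
// is hidden behind the framework and the JVM.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // key = word, value = count of 1
        }
    }
}
```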
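
On the garbage-collection point in item 1: below is a minimal sketch of the “off-heap managed storage” idea, using a direct ByteBuffer so that a large block of data lives outside the GC-managed heap. The class and method names (OffHeapCounters and friends) are made up for illustration; a direct ByteBuffer is only one of several ways to go off-heap, and real off-heap stores are considerably more elaborate.

```java
import java.nio.ByteBuffer;

// A toy fixed-size array of 64-bit counters kept in native (off-heap) memory.
// The GC tracks only the small ByteBuffer object, not the region behind it,
// so a large counter store adds nothing to what the collector has to scan.
public class OffHeapCounters {

    private static final int LONG_SIZE = 8;   // bytes per 64-bit counter

    private final ByteBuffer buffer;          // backed by native memory
    private final int slots;

    public OffHeapCounters(int slots) {
        this.slots = slots;
        // allocateDirect() requests memory outside the Java heap.
        this.buffer = ByteBuffer.allocateDirect(slots * LONG_SIZE);
    }

    public void increment(int slot) {
        int offset = checkedOffset(slot);
        buffer.putLong(offset, buffer.getLong(offset) + 1);
    }

    public long get(int slot) {
        return buffer.getLong(checkedOffset(slot));
    }

    private int checkedOffset(int slot) {
        if (slot < 0 || slot >= slots) {
            throw new IndexOutOfBoundsException("slot " + slot);
        }
        return slot * LONG_SIZE;
    }

    public static void main(String[] args) {
        OffHeapCounters counters = new OffHeapCounters(1_000_000);
        counters.increment(42);
        counters.increment(42);
        System.out.println(counters.get(42));   // prints 2
    }
}
```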
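
And on the vectorization point: the loop below is the kind of straight-line, data-parallel code that SIMD hardware eats for breakfast. It is written in Java only to stay in the post’s context; whether HotSpot’s C2 compiler actually emits SSE/AVX instructions for it depends on the JVM version, flags and hardware, so take it as an illustration of the shape of a vectorizable loop, not a guarantee.

```java
// y[i] = a * x[i] + y[i]: a classic SAXPY loop. A simple body with no branches
// and no cross-iteration dependence is exactly what an auto-vectorizer (or a
// hand-written intrinsic/CUDA kernel) can exploit.
public class SaxpyDemo {

    static void saxpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        float[] x = new float[n];
        float[] y = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = i;
            y[i] = 1.0f;
        }
        saxpy(2.0f, x, y);
        System.out.println(y[3]);   // 2 * 3 + 1 = 7.0
    }
}
```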

For the record, I have worked extensively with Java (along with C/C++, Python, Tcl/Tk, Unix shell, Borland Pascal, AWK and even Visual C++, among others) for several years, and I did not like the shell that the JVM inserts between the software development process and an understanding of fundamental computer architecture. I am sick of interviewing so-called J2EE and Java Beans developers who do not know what HotSpot is or how it works. In some cases they did not even know what the “CLASSPATH” is! These people are not Computer Engineers; they are Java Engineers.

Hadoop takes all of this to the next level. The computer is a black box … my foot.