Hadoop – too much hype

These days one hears about Hadoop all the time. Everyone is trying to do everything with Hadoop, whether it fits the model or not. It has become fashionable, much like all the hype surrounding so-called “Cloud Computing”. It is touted in many instances as a key enabler of big-data computing in the cloud. But what does the “cloud” really stand for? What is big data? These are open-ended questions that mean different things to different people. To me, “Cloud Computing” is primarily an operational and management model built around a set of pre-existing technologies put together in a certain way. It is not a technology in itself. That, I feel, is the difference most people miss.

Hadoop sits bang in the middle of all this jargon being thrown around, hyped well beyond its point of usefulness. There are a few issues I can see:

  1. Hadoop treats the computer as a black box. This is very bad, very very very bad.
    • In the name of scale-out, algorithms throw all the decades and decades of R&D that have gone into computer hardware and architecture out of the window.
    • Just consider high-level algorithms, let the Java Virtual Machine handle all the machine-level crap, and everything is hunky dory. I’m sorry, but the real world is a tad more complex than that. The Java junkies do not realize that, do they? Real-world Java applications still deal with the memory-leak issues the garbage collector was designed to avoid, and on top of that have to deal with the scalability issues the GC itself introduces. Ain’t it fun? For large-heap applications the GC simply does not scale, so we resort to off-heap managed storage. What problem did the garbage collector really solve? Does anyone remember the unassuming malloc() and free()? Oops, I’m beginning to troll … back to the point.
    • I feel parallel and distributed software should take a holistic view of the real world and optimize at every level of the stack, from instruction-level parallelism right up to large-scale distributed MapReduce algorithms.
    • Google, the progenitor of MapReduce, does just that. They even go one step further and tune upwards from the hardware: their motherboards are custom, their CPUs are specially made, and so on.
    • Intel and AMD (and now ARM) keep enhancing their CPUs to no end, and the Hadoop gang is oblivious to all of it. Do they even understand vectorization? (See the sketch after this list.)
    • The JVM JIT (and, for that matter, any compiler) cannot do every kind of optimization. Some things have to be designed and implemented by the software engineer. A JIT also faces time and space trade-offs when optimizing that an offline compiler does not; the runtime profile data helps only to some extent.
  2. Sticking only to high-level algorithms and scaling horizontally by adding more nodes means you are using compute resources inefficiently. A lot of compute cycles on each node go to waste. That also means energy wasted keeping those unused compute cycles available, which in the end wastes electricity. Where is my green datacenter?
  3. Many jobs that require hundreds if not thousands of nodes can be done efficiently with a fraction of those compute resources, wasting very little energy. Want proof? Check this out: S3071 – Computing the Quadrillionth Digit of Pi: A Supercomputer in the Garage. The URL takes you to the sessions page, where you can press CTRL+F and type “Hadoop”. In short, using a single CUDA-enabled computer, the presenter was able to do double the amount of processing compared to an 8000-core Hadoop cluster. If we consider dual-socket servers with 12 cores per socket, that comes down to about 333 nodes!
  4. One-size-fits-all solutions do not work. The real world is a heterogeneous place: we have RISC and CISC CPUs, FPGAs, Intel’s Xeon Phi, GPGPUs from Nvidia and ATI, and so on. We need to leverage and exploit the full capabilities of all this hardware.
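
To make the vectorization point concrete, here is a minimal Java sketch (my own illustration, nothing to do with Hadoop’s code base) of the kind of detail that stays invisible at the MapReduce level: HotSpot’s SuperWord pass may auto-vectorize a simple loop whose iterations are independent, while a loop-carried dependency defeats it, and whether SIMD instructions are actually emitted depends on the JVM version, the CPU and flags such as -XX:+UseSuperWord.

```java
// Minimal sketch: two loops over float arrays. HotSpot's SuperWord pass
// may compile axpy() to SIMD instructions because its iterations are
// independent; prefixSum() has a loop-carried dependency and generally
// stays scalar. Whether vector code is emitted depends on JVM/CPU/flags.
public class VectorizeSketch {
    static void axpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];   // independent iterations: SIMD candidate
        }
    }

    static float prefixSum(float[] x) {
        float s = 0f;
        for (int i = 0; i < x.length; i++) {
            s += x[i];                // each iteration depends on the previous one
            x[i] = s;
        }
        return s;
    }

    public static void main(String[] args) {
        float[] x = new float[1 << 20];
        float[] y = new float[1 << 20];
        java.util.Arrays.fill(x, 1.0f);
        axpy(2.0f, x, y);
        System.out.println(y[0] + " " + prefixSum(x));
    }
}
```

Whether the first loop actually ends up as packed SSE/AVX instructions is exactly the sort of thing one only finds out by looking below the JVM, e.g. with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly.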

For the record, I have worked extensively with Java (along with C/C++, Python, Tcl/Tk, Unix shell, Borland Pascal, AWK and even Visual C++, among others) for several years, and I did not like the shell that the JVM inserts between the software development process and an understanding of fundamental computer architecture. I am sick of interviewing so-called J2EE and Java Beans developers who do not know what HotSpot is or how it works. In some cases they did not even know what the CLASSPATH is! These people are not Computer Engineers; they are Java Engineers.

Hadoop takes all of this to the next level. The computer is a black box … my foot.

5 thoughts on “Hadoop – too much hype”

  1. Shiv

    My observations, with limited/no experience of Hadoop:
    Hadoop has been used by Facebook since 2007, with data running into several hundred PBs by 2012. The scheduling & MapReduce parts of Hadoop have recently been replaced by their own “Corona”.

    If Hadoop is enabling companies like Facebook/Yahoo to deal with their data, it is solving important problems.

    Current contributors take the project in the direction they deem fit and are able to. It is open for anyone to contribute and take it in other directions.

    Given that it is open source, people will attempt to use it in ways not imagined before. Users (such as Yahoo/Facebook) will eventually test the limits, make improvements and contribute back to the project or create new projects.
    In some cases it might be a misfit, ending in a negative experience for the user. The onus in such cases would be on the user.

    A reasonable way for a project to evolve, I would say.

    1. moinakg Post author

      True, but the question here is different. The question is not whether Hadoop works; MapReduce is a beautiful algorithm, it works very well and it is proven. The question is one of efficiency and of the ability to leverage advancements in processors and computer hardware architecture in general. Treating the computer as a black box separates us from that aspect entirely. If we just focus on high-level algorithms, we might as well be happy with tens of thousands of 80386 processors, since Hadoop will scale horizontally quite easily.

      I have interacted with Hadoop folks from Yahoo and I know some of them personally. They are far removed from the systems aspect. It is quite easy to fall into the horizontal-scalability trap, since Hadoop will scale across nodes quite easily while ignoring systems advancements. However, if only 50% of each box’s capability is utilized, then 50% of the datacenter’s capability is simply wasted. With systems advancing by the day, this can potentially get worse.

      I do not know how Facebook is approaching all this, but I think they might be in a better position with respect to systems, since they are the ones who initiated OpenCompute. However, as long as Hadoop remains fixated on pure Java, it is losing out on some of the architectural advancements as far as parallel and vector computing go. The example of one CUDA computer doing double the work of 8000 Hadoop cores may be a very specific one, but it demonstrates the point amply.

      The final question is that of hype, which can destroy even an excellent product. Just look at how many companies are offering a Hadoop big-data analytics component in their portfolios.

  2. Jean-Francois Im (@jeanfrancoisim)

    It’s certainly possible to write relatively high-performance code on the JVM; just because there is a crapton of poor Java developers doesn’t mean that the technology itself sucks. It’s not perfect (I don’t think it does SIMD ops on x86/x64 yet), but it certainly does allow one to write reasonably fast code in a fraction of the time it would take in C/C++ or assembly (for example, I added clustering and multi-node distributed processing to an app in a short afternoon on the JVM, which would’ve been a major pain in C++, even with 0MQ).

    As we both agree, though, people who don’t understand how the computer works underneath and all the layers in between will write crap code that’s suboptimal.

    1. moinakg Post author

      Yes, I agree that the HotSpot JIT does a darn good job, though it can require tweaking via the code-cache size parameter. However, my big beef is with the GC piece. I see a significant amount of effort in most Java software R&D being directed at handling GC-related issues, eventually leading to fancy off-heap storage for big datasets, which looks like new/delete in disguise. I have the odd wish of seeing a Java variant with explicit memory management and no GC at all. The VM could be made a lot more lightweight in that case.
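
      As a minimal sketch of what “off-heap storage as new/delete in disguise” looks like in practice (the class name and record layout below are made up purely for illustration), here is a fixed-size record store kept in a direct ByteBuffer: its memory lives outside the GC-managed heap, the collector never walks it, and slot lifetimes have to be managed entirely by hand.

      ```java
      import java.nio.ByteBuffer;

      // Sketch of an off-heap record store: a direct ByteBuffer holds the data
      // outside the Java heap, so the GC never scans these bytes. The price is
      // manual layout and manual lifetime management -- malloc()/free() in disguise.
      public class OffHeapRecords {
          private static final int RECORD_SIZE = 16; // two longs per record
          private final ByteBuffer buf;

          public OffHeapRecords(int capacity) {
              this.buf = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
          }

          public void put(int slot, long key, long value) {
              int off = slot * RECORD_SIZE;
              buf.putLong(off, key);
              buf.putLong(off + 8, value);
          }

          public long getValue(int slot) {
              return buf.getLong(slot * RECORD_SIZE + 8);
          }

          public static void main(String[] args) {
              OffHeapRecords store = new OffHeapRecords(1_000_000);
              store.put(42, 7L, 99L);
              System.out.println(store.getValue(42)); // prints 99
          }
      }
      ```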
