PyData Stack

By Anand Bisen

It’s really amazing how many data analysis toolkits are now available in the Python ecosystem. The supply of these tools seems never-ending; I encounter a neat new module every few weeks. But for the longest time, the tool I yearned for most was a good bridge to interface with Hadoop seamlessly.

For the longest time, the only clean mechanism was Hadoop Streaming, which was complicated and time-consuming at best. There is a new entrant in this arena that fills this void and more. Apache Spark, with its Python API PySpark, provides a seamless interface to access and process data on Hadoop in a distributed manner. It actually goes a step further and provides in-memory computing, which can drastically speed up algorithms that need to iterate over the dataset many times.
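For readers who have never used Hadoop Streaming, here is a minimal sketch of what a word count looks like in that style. It is only an illustration: the function names (`mapper`, `reducer`, `run_locally`) are my own, and in a real job the mapper and reducer would be separate scripts reading from standard input and launched through the Hadoop Streaming jar. Here everything runs in one process to show the map, sort, reduce flow.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word.

    Assumes records arrive sorted by key, which is what Hadoop
    provides between the map and reduce phases.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(int(count) for _, count in group)


def run_locally(lines):
    """Hypothetical helper: simulate map -> sort -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["Spark and Hadoop", "spark on Hadoop"]
    for word, count in sorted(run_locally(sample).items()):
        print(word, count)
```

Even for something this small, you have to handle the text serialization and the sorted-key contract yourself. With PySpark, the same job is typically a short chain of calls such as `flatMap`, `map`, and `reduceByKey` on an RDD, and the intermediate data can stay in memory between steps.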

With Apache Spark in place, I believe the “PyData” stack is really complete and could be called a Swiss Army knife for a data scientist. Below is a diagram that I think provides a high-level overview of the stack.


Am I missing something from this picture that I have not encountered yet?
