Calling Spark from IPython Notebook

Thu 03 April 2014

Another note to self: Apache Spark is a powerful platform for data analysis, and its Python integration adds a great deal of oomph to the PyData ecosystem. I really like the fact that it's very easy to configure (QuickStart), both on a cluster and locally in standalone mode.

Here are the instructions for instantiating Spark from the IPython notebook, bypassing the need to call the pyspark shell script.

The advantage of this method is that you don't have to instantiate Spark unless you actually want to use it from your session. Additionally, if you are like me, you may have multiple Python installations on your laptop (Anaconda, virtualenv, etc.), and this approach lets you call Spark from any of them.

import os
import sys

# Set the path for spark installation
# this is the path where you have built spark using sbt/sbt assembly
os.environ['SPARK_HOME'] = "/home/abisen/opt/spark"

# Append to PYTHONPATH so that pyspark could be found
sys.path.append("/home/abisen/opt/spark/python/")


# Now we are ready to import Spark Modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
except ImportError as e:
    print("Error importing Spark modules:", e)
    sys.exit(1)
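
Depending on which Spark build you are using, the bundled Py4J library may not be picked up just by adding the python/ directory. As a hedged addition (not part of the original post), the snippet below appends any Py4J zip shipped under $SPARK_HOME/python/lib/ to sys.path; the glob pattern is an assumption about the bundled file name and may differ across Spark versions.

import glob

# Assumption: some Spark builds ship Py4J as a zip under $SPARK_HOME/python/lib/
# rather than installing it into site-packages. If the pyspark import above fails
# with "No module named py4j", appending that zip to sys.path usually fixes it.
spark_home = os.environ['SPARK_HOME']
for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
    sys.path.append(py4j_zip)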

Once the modules are loaded successfully, you can proceed as you would with your regular Spark code. When you instantiate a SparkContext, the Spark subsystem fires up in the background, and you should be able to access the web console at http://<local-ip>:4040

# Optionally configure Spark settings
conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("My App")

# Initialize SparkContext
sc = SparkContext('local', conf=conf)
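
To confirm that the context really is up, you can run a trivial job against it; the small range of numbers below is just a placeholder dataset for illustration, not something from the original setup.

# Quick sanity check: distribute a small range of numbers and sum them.
# If this prints 499500 and a completed job appears in the web console
# at http://<local-ip>:4040, the SparkContext is working.
data = sc.parallelize(range(1000))
print(data.sum())

# Stop the context when you are done with the session.
# sc.stop()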

