Calling Spark from IPython Notebook

By Anand Bisen

Another note to self: Apache Spark is a powerful platform for data analysis and adds a great deal of oomph to the PyData ecosystem with its Python integration. I really like the fact that it's very easy to configure a quick start, either on a cluster or locally in standalone mode.

Here are instructions for instantiating Spark from the IPython notebook, bypassing the need to call the pyspark shell script.

The advantage of this method is that you don't have to instantiate Spark unless you actually want to use it in your session. Additionally, if you are like me, you may have multiple Python installations on your laptop (Anaconda, virtualenv, etc.), and you can call Spark from any of them.

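import os
import sys

# Set the path for the Spark installation.
# This is the path where you have built Spark using sbt/sbt assembly.
# NOTE: "/path/to/spark" is a placeholder; substitute your own location.
spark_home = "/path/to/spark"
os.environ["SPARK_HOME"] = spark_home

# Append to PYTHONPATH so that pyspark can be found
sys.path.append(os.path.join(spark_home, "python"))

# Now we are ready to import Spark modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
except ImportError as e:
    print("Error importing Spark modules:", e)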
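If you switch between several Spark builds, the path setup can be wrapped in a small helper. The following is only a sketch: `add_spark_to_path` is a hypothetical name, not part of Spark, and the `python/lib/py4j-*.zip` lookup assumes your Spark build bundles Py4J as a zip in that directory (if it does not, the glob simply matches nothing).

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Point this interpreter at a Spark install (hypothetical helper).

    Returns the list of entries that were newly appended to sys.path.
    """
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")

    # Assumption: Py4J may ship as a zip under python/lib.
    # If the layout differs, this list is just empty.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))

    added = []
    for path in [python_dir] + py4j_zips:
        if path not in sys.path:  # avoid duplicates on repeated calls
            sys.path.append(path)
            added.append(path)
    return added
```

Calling `add_spark_to_path` with your own Spark directory before the `from pyspark import ...` lines should have the same effect as setting the paths by hand.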

Once the modules have loaded successfully, you can proceed as you would with any Spark code. Once you instantiate SparkContext, the Spark subsystem should fire up in the background and you should be able to access the web console at http://<local-ip>:4040

# Optionally configure Spark settings
conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("My App")

# Initialize SparkContext
sc = SparkContext('local', conf=conf)
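
Since the point of this approach is that Spark only starts when you actually need it, one way to keep that property is to defer creating the context until first use. This is a minimal sketch, not the only way to do it: `get_spark_context` and `smoke_test` are hypothetical helper names, and the defaults simply mirror the settings used above.

```python
_spark_context = None  # cached so repeated calls reuse one context


def get_spark_context(app_name="My App", master="local"):
    """Create the SparkContext on first use (hypothetical helper)."""
    global _spark_context
    if _spark_context is None:
        # Imported here so that defining the helper does not require pyspark.
        from pyspark import SparkConf, SparkContext
        conf = SparkConf().setMaster(master).setAppName(app_name)
        _spark_context = SparkContext(conf=conf)
    return _spark_context


def smoke_test(sc, n=100):
    """Sum 0..n-1 on Spark and compare with the closed-form result."""
    total = sc.parallelize(range(n)).sum()
    return total == n * (n - 1) // 2
```

In a notebook you would call `sc = get_spark_context()` in the first cell that needs Spark, run `smoke_test(sc)` to confirm the context works, and call `sc.stop()` when you are done.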