Another note to self: Apache Spark is a powerful platform for data analysis, and its Python integration adds a great deal of oomph to the PyData ecosystem. I really like the fact that it's very easy to get going with the Quick Start, either on a cluster or locally in standalone mode.
Here are the instructions for instantiating Spark from the IPython notebook, bypassing the need to call the pyspark shell script.
The advantage of this method is that you don't have to instantiate Spark unless you actually want to use it in your session. Additionally, if you are like me, you may have multiple Python installations on your laptop (Anaconda, virtualenv, etc.), and you can call Spark from any of them.
import os
import sys

# Set the path for the Spark installation.
# This is the path where you have built Spark using sbt/sbt assembly.
os.environ['SPARK_HOME'] = "/home/abisen/opt/spark"

# Append to PYTHONPATH so that pyspark can be found
sys.path.append("/home/abisen/opt/spark/python/")

# Now we are ready to import the Spark modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
except ImportError as e:
    print("Error importing Spark Modules", e)
    sys.exit(1)
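If the import still fails with a complaint about py4j, it may be because the pyspark shell script normally puts that library on the path for you. Here is a rough sketch of a helper that does the same by hand. The name `add_spark_to_path` is my own invention, and the layout it relies on (a `py4j-*.zip` under `python/lib` in the Spark directory) is an assumption that you should check against your own build.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Set SPARK_HOME and put Spark's Python sources on sys.path.

    Hypothetical helper, not part of Spark. Returns the list of
    sys.path entries that were newly added.
    """
    os.environ["SPARK_HOME"] = spark_home

    candidates = [os.path.join(spark_home, "python")]
    # Assumption: py4j ships as a zip under python/lib. The file name
    # varies between Spark versions, hence the glob.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*.zip")
    candidates += sorted(glob.glob(pattern))

    added = []
    for path in candidates:
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added
```

Calling `add_spark_to_path("/home/abisen/opt/spark")` before the pyspark imports would stand in for the two path-setting lines above. Calling it a second time adds nothing, so it is safe to re-run the notebook cell.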
Once the modules have loaded successfully, you can proceed as you would with your Spark code. Once you instantiate SparkContext, the Spark subsystem
should fire up in the background and you should be able to access the
# Optionally configure Spark settings
conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("My App")

# Initialize SparkContext
sc = SparkContext('local', conf=conf)
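Since the whole point is to start Spark only when a session needs it, one way to make that explicit is to hide the constructor behind a small lazy getter. This is only a sketch: `get_spark_context` and the `factory` argument are names I made up, and it simply caches whatever the factory returns the first time it is called.

```python
_spark_context = None  # module-level cache; stays None until first use


def get_spark_context(factory):
    """Return the cached context, calling factory() only on first use."""
    global _spark_context
    if _spark_context is None:
        _spark_context = factory()
    return _spark_context


# In a notebook, with the pyspark imports and conf from above:
# sc = get_spark_context(lambda: SparkContext('local', conf=conf))
```

Passing the constructor in as a callable keeps the helper free of any pyspark import, so nothing Spark-related runs until the first call.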