In my post a few days ago, I provided an example kernel.json file to get PySpark working with Jupyter notebooks. Then I realized magics like %%sql were not working for me. It turned out I was missing some other configuration and code that the SparkMagic library already provides. Its GitHub repository has great installation instructions, but since it took me a while to get everything working, I'm sharing what I learned.
The first step was to install sparkmagic using pip, but I soon ran into errors due to other missing libraries. I ended up installing the following before I could install sparkmagic, though your system may already have these:
yum install krb5-devel
python2.7 -m pip install decorator --upgrade
Then I was able to install sparkmagic. (I did it for both Python 2 and 3, but I'm not sure both were necessary.)
python2.7 -m pip install sparkmagic
python3.6 -m pip install sparkmagic
In my case, I was getting a new error:
cannot import name 'DataError'
This is caused by an incompatibility between the latest version of the pandas library (0.23.0) and sparkmagic. After downgrading pandas to 0.22.0, things started working:
python2.7 -m pip install pandas==0.22.0
python3.6 -m pip install pandas==0.22.0
Make sure to follow the instructions on the sparkmagic GitHub page to set up and configure it. It creates the kernels needed for Spark and PySpark, and even R.
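For reference, the kernel registration steps from the sparkmagic README look roughly like this; run them from wherever pip installed sparkmagic (the "Location" line in the output of pip show sparkmagic), and note that exact paths may vary by version:

```shell
# Register the Spark, PySpark, and SparkR kernels that ship with sparkmagic.
# Run from the sparkmagic installation directory (see `pip show sparkmagic`).
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

# Enable the sparkmagic server extension in Jupyter
jupyter serverextension enable --py sparkmagic
```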
Another issue I had to fix was correctly defining the PYTHONPATH environment variable. In my previous post, I defined a bunch of these path variables in the kernel.json file, but since I'm now using sparkmagic, I had to make sure the needed variables are defined somewhere else. Looking at my "~/.bash_profile", I noticed the only variable missing (compared to my old kernel.json) was PYTHONPATH, so I added it:
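Here is the kind of line I mean; the Spark location and the py4j zip name below are from my setup and are assumptions you should adjust to match your own install:

```shell
# Added to ~/.bash_profile -- SPARK_HOME and the py4j version are
# assumptions; point them at your own Spark installation.
export SPARK_HOME=/home/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
```

Remember to run `source ~/.bash_profile` (or open a new shell) so the change takes effect before starting Jupyter.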
Running on Yarn
If, instead of local mode, Yarn is used to run the Spark job from your notebook, add the following to your .sparkmagic/config.json:
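I no longer have the exact snippet handy, but a fragment along these lines should do it. The session_configs key comes from sparkmagic's example_config.json; the memory and core values here are placeholders, and the spark.master setting is my assumption about how the session is pointed at Yarn:

```json
"session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2,
    "conf": {
        "spark.master": "yarn-cluster"
    }
}
```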
Oh, BTW, you need Livy
sparkmagic needs Livy to talk to Spark, so if you do not have Livy installed, do that first. I installed Livy in the same /home/hadoop folder where I installed Hadoop and Spark:
mv livy-0.5.0-incubating-bin livy
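For completeness, the download and unzip steps that precede that mv look something like this (the URL is an assumption based on the Apache archive layout for the 0.5.0-incubating release; adjust for a different version or mirror):

```shell
# Download and unpack the Livy binary release into /home/hadoop
cd /home/hadoop
wget https://archive.apache.org/dist/incubator/livy/0.5.0-incubating/livy-0.5.0-incubating-bin.zip
unzip livy-0.5.0-incubating-bin.zip
```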
If running in Yarn mode, modify livy.conf:
mv livy/conf/livy.conf.template livy/conf/livy.conf
and set correct execution mode:
livy.spark.master = yarn-cluster
And finally, start Livy:
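Assuming the /home/hadoop/livy layout above, the start command is the livy-server script that ships in Livy's bin directory:

```shell
# Start the Livy server in the background (logs go to livy/logs by default)
/home/hadoop/livy/bin/livy-server start
```

Once it's up, Livy listens on port 8998 by default, which is where sparkmagic's default config points.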