Massoud Mazar

Sharing The Knowledge


Getting Jupyterhub 0.9.1 to work with my Spark cluster and Python 3.6

My 4th of July week project was to build a Spark cluster on my home server so I can start experimenting with PySpark. I built a small server in 2014 which I have not been utilizing recently, so I decided to use that. It has 32 GB RAM, a 1 TB SSD and a quad-core Xeon processor. I decided to use the latest software, so I upgraded everything from the IPMI firmware and server firmware to the VMware ESXi server. Then I created 3 CentOS 7 VMs, each with 1 CPU, 8 GB RAM and 50 GB of SSD storage.

I ended up installing the following for my cluster:

  • Java 8
  • Hadoop 3.1.0
  • Spark 2.3.1
  • Python 3.6
  • Jupyterhub 0.9.1

Configuring Hadoop and Spark was not too complicated. I had done a Hadoop setup back when Hadoop was v0.20.0 and, interestingly, it is not much different now. I followed instructions from the following sources (with slight modifications) to get my Hadoop and Spark cluster up and running:

It is a lot of steps, but nothing really complicated. One big difference is that the "slaves" config file is now called "workers" (I guess the reason was to be politically correct).
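For reference, the workers file is just a list of hostnames, one per line. The names below are placeholders for my three VMs, not the actual ones:

```
# $HADOOP_HOME/etc/hadoop/workers — hypothetical hostnames, one worker per line
node1
node2
node3
```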

The problems started when I decided to add Jupyterhub so I could use PySpark in Jupyter notebooks. There are some good instructions in the links above on how to configure Jupyterhub for Spark, but it still took me a lot of time to get everything working correctly. Having multiple versions of Python resulted in some conflicts, and I overlooked the fact that the py4j library in my Spark version had a different file name (d'oh!).

Here is what I did on my master node:

yum install yum-utils
yum groupinstall development
# (package argument missing on the next line; python36u is provided by the IUS repository)
yum install
yum install python36u
python3.6 -V
yum install python36u-pip
yum install python36u-devel

yum install npm
npm install -g configurable-http-proxy

python3.6 -m pip install jupyter
python3.6 -m pip install jupyterhub


Don't forget to install Python 3.6 on all worker nodes:

# (package argument missing on the next line; python36u is provided by the IUS repository)
yum install
yum install python36u
python3.6 -V

On master node, create the PySpark kernel for Jupyter:

mkdir -p /usr/share/jupyter/kernels/pyspark2
nano /usr/share/jupyter/kernels/pyspark2/kernel.json

and make sure to adjust the contents of your kernel.json to match the paths of your installations. Note that the py4j file name in PYTHONPATH has to match the one shipped with your Spark version (Spark 2.3.1 ships py4j-0.10.7-src.zip):

{
  "display_name": "Python3.6 + Pyspark(Spark 2.3.1)",
  "language": "python",
  "argv": [
    "/usr/bin/python3.6",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/usr/bin/python3.6",
    "SPARK_HOME": "/home/hadoop/spark",
    "HADOOP_CONF_DIR": "/home/hadoop/hadoop/etc/hadoop",
    "PYTHONPATH": "/home/hadoop/spark/python/:/home/hadoop/spark/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/home/hadoop/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell"
  }
}

To run Jupyterhub using sudo, I ended up doing this:

mkdir /etc/jupyterhub
sudo jupyterhub --generate-config -f /etc/jupyterhub/
sudo jupyterhub -f /etc/jupyterhub/

I noticed the first login to Jupyterhub works, but if I log out and try to log in again, it fails. It turns out you need to set the following in your Jupyterhub config file:

c.PAMAuthenticator.open_sessions = False
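In context, the relevant part of the Jupyterhub config file looks like this (the ip and port values are illustrative, not from my setup):

```python
# excerpt from the Jupyterhub config file
c.JupyterHub.ip = '0.0.0.0'                 # illustrative bind address
c.JupyterHub.port = 8000                    # Jupyterhub's default port
c.PAMAuthenticator.open_sessions = False    # fixes the failing second login
```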

At this point Jupyterhub works nicely. To run it as a service on my CentOS 7 machine, I followed the instructions from the link above. Create a systemd service file:

nano /lib/systemd/system/jupyterhub.service

With this content (a minimal unit file; adjust the paths to your setup):

[Unit]
Description=Jupyterhub

[Service]
User=root
ExecStart=/usr/bin/jupyterhub --JupyterHub.spawner_class=sudospawner.SudoSpawner
WorkingDirectory=/etc/jupyterhub

[Install]
WantedBy=multi-user.target


And set up my service:

python3.6 -m pip install sudospawner
sudo systemctl daemon-reload
sudo systemctl enable jupyterhub
sudo systemctl start jupyterhub

