14. July 2018 08:45 / Administrator / Comments (0)
In my post a few days ago, I provided an example kernel.json file to get PySpark working with Jupyter notebooks. Then I realized magics like %%sql were not working for me. It turns out I was missing some other configuration and code that the SparkMagic library already provides. Its GitHub repository has great instructions on how to install it, but since it took me a little while to get it working, I'm sharing what I learned. More...
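To give a flavor of what SparkMagic adds once it is installed, here is roughly what the notebook workflow from its README looks like; the session name and the Livy endpoint below are placeholders, not values from my setup:

# Rough sketch of the sparkmagic notebook workflow (each magic goes in its own cell).
# The session name "demo" and the Livy URL are placeholders, not my actual configuration.

# Cell 1: load the extension inside a regular IPython kernel
%load_ext sparkmagic.magics

# Cell 2: attach a PySpark session through a Livy endpoint
%spark add -s demo -l python -u http://localhost:8998

# Cell 3: once a session is attached, cell magics such as %%sql become available
%%sql
SELECT 1 AS sanity_check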
9. July 2018 15:32 / Administrator / Comments (0)
When displaying graphs and charts in a PySpark Jupyter notebook, you have to jump through some hoops. To demonstrate, I'm assuming I have my K-Means clustering results as follows:
from pyspark.ml.clustering import KMeans

model = KMeans(k=5, seed=1).fit(features.select('features'))
predictions = model.transform(features)
You have to create a Temp View for this data, so you can run SQL on it: More...
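For context, here is a minimal sketch of that approach: register the predictions as a temp view, aggregate with SQL, and pull the small result back into pandas for plotting. The view name and the aggregation are my own illustration, not necessarily what the full post uses:

# A minimal sketch of the temp-view-plus-SQL approach; the view name and the
# aggregation below are illustrative assumptions.
import matplotlib.pyplot as plt

predictions.createOrReplaceTempView("predictions")

# Aggregate with SQL, then bring the small result set back to the driver as pandas
cluster_sizes = spark.sql(
    "SELECT prediction, COUNT(*) AS n FROM predictions GROUP BY prediction ORDER BY prediction"
).toPandas()

cluster_sizes.plot.bar(x="prediction", y="n", legend=False)
plt.show()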
8. July 2018 12:12 / Administrator / Comments (0)
My 4th of July week project was to build a Spark cluster on my home server so I can start doing experiments with PySpark. I built a small server in 2014 that I have not been utilizing recently, so I decided to use that. It has 32 GB of RAM, a 1 TB SSD, and a quad-core Xeon processor. I decided to use the latest software, so I upgraded everything from the IPMI firmware and server firmware to the VMware ESXi server, then created three CentOS 7 VMs with 1 CPU, 8 GB of RAM, and 50 GB of SSD storage each. More...
1. January 2018 15:32 / Administrator / Comments (0)
When finishing my capstone project for Coursera's Data Science Certificate track, I needed to load a relatively large amount of data (more than 400 MB once loaded). This load operation was taking a few seconds, so to let users of my Shiny app know they need to wait, I decided to use a progress bar. More...
In a common IoT scenario, millions of devices will be sending data to your back end. It is possible that a large percentage of these devices could flood your back end for whatever reason. A few years back, I experienced this first hand when a bad software update created a tsunami of requests towards a relatively scalable back end and caused an outage that lasted a whole weekend.
One approach to preventing such a flood on the back end (which I have been dealing with in the past few months) is to create a so-called "shock absorber" using a queue-like message delivery system. More...
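As a rough illustration of the pattern (not the actual system from the post), here is the shock absorber reduced to standard-library Python: a bounded queue in front of the back end, drained only as fast as the back end can sustain. Every name and number here is made up:

# Illustration only: a "shock absorber" shrunk down to the standard library.
# The real thing sits behind a message-delivery/queue service; every name and
# number here (DRAIN_RATE_PER_SEC, handle_message, ...) is hypothetical.
import queue
import threading
import time

DRAIN_RATE_PER_SEC = 100                 # assumed sustainable back-end throughput
buffer = queue.Queue(maxsize=10_000)     # bounded, so a flood cannot grow without limit

def ingest(message):
    """Called for every incoming device message; never blocks the ingestion path."""
    try:
        buffer.put_nowait(message)
    except queue.Full:
        pass  # shed (or dead-letter) excess load instead of passing it to the back end

def handle_message(message):
    print("processed", message)          # stand-in for the real back-end write

def drain():
    """Feed the back end only as fast as it can actually handle."""
    while True:
        handle_message(buffer.get())
        time.sleep(1 / DRAIN_RATE_PER_SEC)

threading.Thread(target=drain, daemon=True).start()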
Disclaimer: this is my personal opinion and not the opinion of my colleagues or my employer.
History
We used to store terabytes of data in Elasticsearch in the form of JSON documents. As the amount of data in the cluster grew, we had to create new clusters with lots of nodes, and it turned into a maintenance and cost nightmare. The Microsoft Azure team suggested we move to DocumentDB to reduce the cost, and since it can scale infinitely, there won't be any maintenance needed. More...
29. May 2017 19:37 / Administrator / Comments (0)
When querying Azure DocumentDB (recently renamed Cosmos DB, to make you really believe it can handle anything you throw at it), best practice is to have a partition key. In high-volume scenarios, partitioning your data is mandatory.
You may select a good partition key, but there are always scenarios where you need to query your data without knowing the partition key value. These cross-partition queries, although slower, are possible by specifying EnableCrossPartitionQuery = true in your FeedOptions when you create your query: More...
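The post describes the .NET SDK's FeedOptions; purely for illustration, here is the same idea expressed with the azure-cosmos Python SDK, where the equivalent switch is enable_cross_partition_query. The account URL, key, database, and container names are placeholders:

# Illustration with the azure-cosmos Python SDK rather than the .NET SDK the post
# targets; the endpoint, key, database, and container names are all placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("mydb").get_container_client("orders")

# Without a partition key value the query fans out to every physical partition,
# so it has to be explicitly allowed (slower, but possible).
items = container.query_items(
    query="SELECT * FROM c WHERE c.status = @status",
    parameters=[{"name": "@status", "value": "open"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"])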
9. January 2017 13:25 / Administrator / Comments (0)
If you are using the Go extension for Visual Studio Code to develop Google App Engine backend code on a Windows machine, you may have encountered strange problems related to the GOROOT and GOPATH environment variables, especially when you want to use third-party libraries and expect the extension to correctly highlight code errors.
After lots of experiments, I ended up with the following setup, which seems to work as desired: More...
23. November 2016 14:35 / Administrator / Comments (0)
ReactJS seems to have picked up some die-hard fans, and recently I was looking at how to use D3 for charting in a React-based UI. There are a few implementations of some of the D3 libraries, and I picked reactd3 for my experiment. The documentation site has some examples of how this can be done, but they all use ReactDOM.render to render directly into a DOM element you have put in your HTML template. My preferred approach is to not rely on the existence of a predefined HTML tag in the template, but to use an element that is rendered by my chart component. Here is what I ended up doing: More...
I know I have been stuck with SQL Server for too long and am well behind in adopting newer database technologies, but better late than never. While updating a web application that uses SQL Server, I realized that the current relational structure of my database is not really the optimal solution. I have more than 120 million rows of data in one of the tables, representing option chains for stocks, one row per option. I store this data as snapshots in time and do not change it after it is stored. Anyone familiar with this type of data knows that these individual rows are not really interesting by themselves; they are normally looked at alongside others that belong to the same stock and have the same expiration. In the real world you are presented with the whole chain (see an example of such data here). More...
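To make the "whole chain" point concrete, here is a hedged sketch of what one snapshot could look like when grouped into a single document per stock and expiration; the ticker, fields, and numbers are invented for illustration:

# Invented example data: one document per (ticker, expiration, snapshot time),
# holding the entire chain instead of one relational row per option.
chain_document = {
    "ticker": "XYZ",
    "expiration": "2016-12-16",
    "snapshot": "2016-11-23T14:35:00Z",
    "underlying_price": 101.25,
    "options": [
        {"strike": 100.0, "type": "call", "bid": 2.10, "ask": 2.25, "open_interest": 1520},
        {"strike": 100.0, "type": "put",  "bid": 0.95, "ask": 1.05, "open_interest": 980},
        {"strike": 105.0, "type": "call", "bid": 0.40, "ask": 0.55, "open_interest": 640},
    ],
}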