14. July 2018 08:45 / Administrator / Comments (0)
In my post a few days ago, I provided an example kernel.json file to get PySpark working with Jupyter notebooks. Then I realized magics like %%sql were not working for me. It turns out I was missing some other configuration and code that the SparkMagic library already provides. Its GitHub repository has great instructions on how to install it, but since it took me a little while to get it working, I'm sharing what I learned. More...
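To give a flavor of what SparkMagic adds once it is installed, here is roughly what the notebook workflow from its README looks like; the session name and the Livy endpoint below are placeholders, not values from my setup:

# Rough sketch of the sparkmagic notebook workflow (each magic goes in its own cell).
# The session name "demo" and the Livy URL are placeholders, not my actual configuration.

# Cell 1: load the extension inside a regular IPython kernel
%load_ext sparkmagic.magics

# Cell 2: attach a PySpark session through a Livy endpoint
%spark add -s demo -l python -u http://localhost:8998

# Cell 3: once a session is attached, cell magics such as %%sql become available
%%sql
SELECT 1 AS sanity_check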
9. July 2018 15:32 / Administrator / Comments (0)
When displaying graphs and charts in a PySpark Jupyter notebook, you have to jump through some hoops. To demonstrate, I'm assuming I have my K-Means clustering results as follows:
from pyspark.ml.clustering import KMeans

model = KMeans(k=5, seed=1).fit(features.select('features'))
predictions = model.transform(features)
You have to create a Temp View for this data, so you can run SQL on it: More...
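For context, here is a minimal sketch of that approach: register the predictions as a temp view, aggregate with SQL, and pull the small result back into pandas for plotting. The view name and the aggregation are my own illustration, not necessarily what the full post uses:

# A minimal sketch of the temp-view-plus-SQL approach; the view name and the
# aggregation below are illustrative assumptions.
import matplotlib.pyplot as plt

predictions.createOrReplaceTempView("predictions")

# Aggregate with SQL, then bring the small result set back to the driver as pandas
cluster_sizes = spark.sql(
    "SELECT prediction, COUNT(*) AS n FROM predictions GROUP BY prediction ORDER BY prediction"
).toPandas()

cluster_sizes.plot.bar(x="prediction", y="n", legend=False)
plt.show()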
8. July 2018 12:12 / Administrator / Comments (0)
My 4th of July week project was to build a Spark cluster on my home server so I can start doing experiments with PySpark. I built a small server in 2014 that I have not been utilizing recently, so I decided to use that. It has 32 GB of RAM, a 1 TB SSD, and a quad-core Xeon processor. I decided to use the latest software, so I upgraded everything from the IPMI firmware and server firmware to the VMware ESXi server, then created three CentOS 7 VMs with 1 CPU, 8 GB of RAM, and 50 GB of SSD storage each. More...
1. January 2018 15:32 / Administrator / Comments (0)
When finishing my capstone project for Coursera's Data Science Certificate track, I needed to load a relatively large amount of data (more than 400 MB once loaded). This load operation was taking a few seconds, so to let users of my Shiny app know they need to wait, I decided to use a progress bar. More...
In a common IoT scenario, millions of devices will be sending data to your back end. It is possible that a large percentage of these devices could flood your back end for whatever reason. A few years back, I experienced this first hand when a bad software update created a tsunami of requests towards a relatively scalable back end and caused an outage that lasted a whole weekend.
One approach to preventing such a flood on the back end (which I have been dealing with in the past few months) is to create a so-called "shock absorber" using a queue-like message delivery system. More...
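As a rough illustration of the pattern (not the actual system from the post), here is the shock absorber reduced to standard-library Python: a bounded queue in front of the back end, drained only as fast as the back end can sustain. Every name and number here is made up:

# Illustration only: a "shock absorber" shrunk down to the standard library.
# The real thing sits behind a message-delivery/queue service; every name and
# number here (DRAIN_RATE_PER_SEC, handle_message, ...) is hypothetical.
import queue
import threading
import time

DRAIN_RATE_PER_SEC = 100                 # assumed sustainable back-end throughput
buffer = queue.Queue(maxsize=10_000)     # bounded, so a flood cannot grow without limit

def ingest(message):
    """Called for every incoming device message; never blocks the ingestion path."""
    try:
        buffer.put_nowait(message)
    except queue.Full:
        pass  # shed (or dead-letter) excess load instead of passing it to the back end

def handle_message(message):
    print("processed", message)          # stand-in for the real back-end write

def drain():
    """Feed the back end only as fast as it can actually handle."""
    while True:
        handle_message(buffer.get())
        time.sleep(1 / DRAIN_RATE_PER_SEC)

threading.Thread(target=drain, daemon=True).start()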
Disclaimer: this is my personal opinion and not the opinion of my colleagues or my employer.
History
We used to store terabytes of data in Elasticsearch in the form of JSON documents. As the amount of data in the cluster grew, we had to create new clusters with lots of nodes, and it turned into a maintenance and cost nightmare. The Microsoft Azure team suggested we move to DocumentDB to reduce the cost, and since it can scale infinitely, there won't be any maintenance needed. More...
29. May 2017 19:37 / Administrator / Comments (0)
When querying Azure DocumentDB (recently renamed Cosmos DB, to make you really believe it can handle anything you throw at it), best practice is to have a partition key. In high-volume scenarios, partitioning your data is mandatory.
You may select a good partition key, but there are always scenarios where you need to query your data without knowing the partition key value. These cross-partition queries, although slower, are possible by specifying EnableCrossPartitionQuery = true in your FeedOptions when you create your query: More...
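The post describes the .NET SDK's FeedOptions; purely for illustration, here is the same idea expressed with the azure-cosmos Python SDK, where the equivalent switch is enable_cross_partition_query. The account URL, key, database, and container names are placeholders:

# Illustration with the azure-cosmos Python SDK rather than the .NET SDK the post
# targets; the endpoint, key, database, and container names are all placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("mydb").get_container_client("orders")

# Without a partition key value the query fans out to every physical partition,
# so it has to be explicitly allowed (slower, but possible).
items = container.query_items(
    query="SELECT * FROM c WHERE c.status = @status",
    parameters=[{"name": "@status", "value": "open"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"])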
9. January 2017 13:25 / Administrator / Comments (0)
If you are using the Go extension for Visual Studio Code to develop Google App Engine backend code on a Windows machine, you may have encountered strange problems related to the GOROOT and GOPATH environment variables, especially when you want to use third-party libraries and expect the extension to correctly highlight code errors.
After lots of experiments, I ended up with the following setup, which seems to work as desired: More...
23. November 2016 14:35 / Administrator / Comments (0)
ReactJS seems to have picked up some die-hard fans, and recently I was looking at how to use D3 for charting in a React-based UI. There are a few implementations of some of the D3 libraries, and I picked reactd3 for my experiment. The documentation site has some examples of how this can be done, but they all use ReactDOM.render to render directly into a DOM element you have put in your HTML template. My preferred approach is to not rely on the existence of a predefined HTML tag in the template, but to use an element that is rendered by my chart component. Here is what I ended up doing: More...
I know I have been stuck with SQL Server for too long and am well behind in adopting newer database technologies, but better late than never. While updating a web application that uses SQL Server, I realized that the current relational structure of my database is not really the optimal solution. I have more than 120 million rows of data in one of the tables, representing option chains for stocks, one row per option. I store this data as snapshots in time and do not change it after it is stored. Anyone familiar with this type of data knows that these individual rows are not really interesting by themselves; they are normally looked at alongside others that belong to the same stock and have the same expiration. In the real world you are presented with the whole chain (see an example of such data here). More...
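To make the "whole chain" point concrete, here is a hedged sketch of what one snapshot could look like when grouped into a single document per stock and expiration; the ticker, fields, and numbers are invented for illustration:

# Invented example data: one document per (ticker, expiration, snapshot time),
# holding the entire chain instead of one relational row per option.
chain_document = {
    "ticker": "XYZ",
    "expiration": "2016-12-16",
    "snapshot": "2016-11-23T14:35:00Z",
    "underlying_price": 101.25,
    "options": [
        {"strike": 100.0, "type": "call", "bid": 2.10, "ask": 2.25, "open_interest": 1520},
        {"strike": 100.0, "type": "put",  "bid": 0.95, "ask": 1.05, "open_interest": 980},
        {"strike": 105.0, "type": "call", "bid": 0.40, "ask": 0.55, "open_interest": 640},
    ],
}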