Massoud Mazar

Sharing The Knowledge


Setting Postgres transaction isolation level through Flask-SqlAlchemy

Recently I had to deal with a concurrency issue in a legacy Flask application which did not enforce uniqueness. Multiple API clients trying to create a record in the database with the same value in the 'name' column would end up creating duplicate entries. Applying a unique constraint to the name column required removing the duplicates first, and cleaning up the data was not easy, so I decided to implement a workaround first to prevent more duplicates, and then go about cleaning up the data.

The solution was to use the SET TRANSACTION ISOLATION LEVEL SERIALIZABLE statement in Postgres, so that checking whether the record exists and inserting a new one happen inside a single serializable transaction. More...
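A minimal sketch of that check-then-insert pattern, not the code from the full post. It assumes a DB-API style connection (for example psycopg2) with autocommit off, and a hypothetical `items` table with a `name` column; the function name `create_if_missing` is also made up for illustration. With Flask-SQLAlchemy the same statements could presumably be sent through `db.session.execute(text(...))` instead.

```python
def create_if_missing(conn, name):
    """Insert `name` into the hypothetical `items` table unless it exists.

    `conn` is assumed to be a DB-API connection with autocommit off, so the
    first execute() implicitly opens a new transaction.
    Returns True if a row was inserted, False if one was already there.
    """
    cur = conn.cursor()
    try:
        # Postgres only accepts this before the first query of the transaction.
        cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
        cur.execute("SELECT 1 FROM items WHERE name = %s", (name,))
        if cur.fetchone() is not None:
            conn.rollback()  # nothing to write, just end the transaction
            return False
        cur.execute("INSERT INTO items (name) VALUES (%s)", (name,))
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Under SERIALIZABLE, Postgres may abort one of two conflicting transactions with a serialization failure (SQLSTATE 40001) instead of letting both commit, so a caller would normally catch that error and retry the whole function.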

Ship Prefect logs using filebeat

After a few years of working with Airflow for job scheduling, I have settled on Prefect these days. It fixes all the problems I have had with Airflow and is, in my opinion, the best upgrade.

After implementing our logging and monitoring using Logz.io, which is a hosted ELK solution, I found it useful to ship the logs from our Prefect flow executions to Logz.io to benefit from centralized logging, monitoring and alerting. More...

VPN DNS resolution problem with CNAME

This problem took me a day to fix, and it is worth documenting here for others.

On my Mac laptop, I connect to a VPN, which has its own DNS server defined to resolve internal host names. For example, I could use 

nslookup host.foobar.net

and get a good response. So far so good. But when I tried 

nslookup host.sub.foobar.net

I did not get a response. After talking to the IT team, it turned out this address was a CNAME record and not an A record. Still, I expected the DNS resolution to correctly return an IP address for this host, but it didn't. More...

Raspberry Pi SMB server to use with Time Machine

Last week I had a bit of free time (which is very rare these days), and decided it was finally time to build a file server for backing up my laptops (both Mac and PC), and also to serve as a general purpose shared drive. After doing some research I learned that Apple supports the SMB protocol for Time Machine, and SMB is obviously compatible with Windows as well.

My criteria for selecting the hardware were simple:

  • Gigabit Ethernet
  • USB 3
  • Support for Ubuntu

More...

Mobile Sensors: Easy data collection, labeling and model deployment

Disclaimer: My target audience for this article is data scientists who may not necessarily come from a software engineering background. Pardon me if you find this oversimplified.

Mobile devices provide a rich set of sensors that let us get a feel for where and how the device is being used. Sensors can tell us about the environment, motion and orientation of the device, among other things. A list of sensors supported by Android can be found here.

There are a lot of cool applications for data coming from these sensors and a lot of those applications could benefit from machine learning models to infer more meaning from the sensor data. A famous example is the classic Human Activity Detection using mobile phones which can be found easily on the internet. But what if you need to collect data and label it for a different purpose? Here I show an easy way to build a mobile app which can run on both Android and iOS for both data collection and testing of the trained model. More...

Kafka stream processing: lookup against hive data

Here is a scenario which in my opinion should be very common:

Suppose you need to build an ETL Kafka stream that reads data from one stream and checks it against a blacklist before writing to a destination stream. This blacklist gets updated daily, and has the same key as your source stream. One way to implement this is to use a Kafka table (KTable) and join your stream with the table to find the matches. More...
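The lookup described above can be illustrated in plain Python. This is only a sketch of the join semantics, not Kafka Streams code: the stream is modeled as an iterable of (key, value) pairs, the table as a dict keyed the same way, and the names `join_with_blacklist` and `allowed` are hypothetical.

```python
def join_with_blacklist(stream, blacklist):
    """Left-join each (key, value) record with the blacklist table.

    Yields (key, value, match), where match is the blacklist entry for
    that key, or None when the key is not blacklisted.
    """
    for key, value in stream:
        yield key, value, blacklist.get(key)


def allowed(stream, blacklist):
    """Yield only the records that may go on to the destination stream."""
    for key, value, match in join_with_blacklist(stream, blacklist):
        if match is None:
            yield key, value


# Example data, made up for illustration.
blacklist = {"dev-2": "blocked"}
stream = [("dev-1", {"t": 20}), ("dev-2", {"t": 21}), ("dev-3", {"t": 19})]

for key, value in allowed(stream, blacklist):
    print(key, value)
```

Because the table shares the stream's key, each record needs a single lookup, and a daily blacklist update amounts to replacing entries in the table.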

Ingesting 250 million daily IoT messages with Hadoop and Hive 3.0 in Azure: Lessons Learned

250 million records per day may not be a lot of data for large environments with billions of users, but those companies have huge budgets and countless servers to handle it. It's a different story in the startup world, where you have to squeeze the resources to get the job done on a smaller budget. I will highlight what I learned while optimizing an analytics backend that was built on Azure HDInsight. Whether Hadoop+Hive is the most suitable choice for this purpose is a question for another time, and I'm not advocating these technologies, but if you are dealing with them, especially on the Azure cloud, I hope this post saves you some time. More...

Custom Hadoop RecordReader to read JSON with no line breaks

This past week I had to deal with loading a few terabytes of data into our Spark cluster. This data is stored as a JSON array, with no line breaks to separate the individual JSON objects. Spark can easily deal with JSON, but your JSON must be one object per line. I had to write a custom Hadoop RecordReader to work around this issue. More...
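To illustrate the problem, here is a small Python sketch that walks a JSON array with no line breaks and emits one object per line. It is not the Hadoop RecordReader from the post: it works on an in-memory string, and the function names are hypothetical. A real reader would also have to cope with input splits that start in the middle of an object.

```python
import json


def iter_json_array(text):
    """Yield each element of a top-level JSON array, one at a time."""
    decoder = json.JSONDecoder()
    pos = text.index("[") + 1  # step past the opening bracket
    while True:
        # Skip whitespace and the comma that separates elements.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            return
        # raw_decode parses one value and reports where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_json_lines(text):
    """Re-serialize the array as one compact JSON object per line."""
    return "\n".join(
        json.dumps(obj, separators=(",", ":")) for obj in iter_json_array(text)
    )


print(to_json_lines('[{"a":1},{"b":[2,3]},{"c":"x,y"}]'))
```

Letting the JSON decoder find the end of each object avoids counting braces by hand, which would break on braces or commas inside string values.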

Azure HDInsight performance benchmarking

I did a brief performance benchmark of Spark execution time in Azure HDInsight a couple of months ago, and the result was very disappointing. Recently I did a much deeper investigation, benchmarking and cost analysis of Azure HDInsight to see whether it makes ANY sense to use it, and the results do not surprise me at all. More...

GPU assisted Machine Learning: Benchmark

A recent project at work, a binary classification using a Keras LSTM layer with 1000 nodes that took almost an hour to run, prompted my effort to speed up this type of problem. In my previous post, I explained the hardware and software configuration I'm about to use for this benchmark. Now I'm going to run the same training exercise with and without a GPU and compare the runtimes. More...