250 million records per day may not be a lot of data for large environments with billions of users, but those companies have huge budgets and countless servers to throw at the problem. It's a different story in the startup world, where you have to squeeze every resource to get the job done on a smaller budget. In this post I will highlight what I learned while optimizing an analytics backend designed around Azure HDInsight. Whether Hadoop and Hive are the most suitable tools for this purpose is a question for another time, and I'm not advocating these technologies, but if you are dealing with them, especially on the Azure cloud, I hope this post saves you some time.
Before this optimization, our team was using Azure HDInsight, which is based on the Hortonworks Hadoop distribution. Time-series data was loaded into a Hive table on an hourly schedule. For GDPR reasons, Hive transactional (ACID) tables were chosen so we could delete user data on request. To store 10 TB of new data each month, Azure Blob storage was picked over local HDFS storage. Just to clarify: Azure was chosen for the backend for historical reasons and I'm not advocating for it; in fact, if it were up to me, I would not use Azure.
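Transactional tables are what make row-level erasure possible in Hive; on a non-ACID table, the statement below would simply fail. A minimal sketch of a GDPR deletion, with hypothetical table and column names:

```sql
-- Requires TBLPROPERTIES ('transactional'='true') on the table.
-- 'events' and 'user_id' are illustrative names, not our actual schema.
DELETE FROM events WHERE user_id = 'user-to-forget';
```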
Problems with Azure
1. HDInsight is not cost effective
It is easy to deploy an HDInsight Hadoop cluster with just a few clicks, but you do not get the opportunity to optimize it for your needs. The number and size of the different node types cannot be fully adjusted, so you end up with something that costs more and provides less capacity. An HDInsight cluster with 4 x D13 v2 worker nodes gives you 32 worker cores and 224 GB of worker memory for about $4,000 a month. An unmanaged Hortonworks cluster with 27 D4s v3 worker nodes gives you 108 worker cores and 432 GB of worker memory for the same $4,000 a month. Which one would you pick?
2. HDInsight software is old
As of this writing, HDInsight 4.0, which is based on Hadoop and Hive 3.0, is still in preview, and when I tried it, basic Hive-Spark integration features did not work.
3. HDInsight forces you to have multiple clusters
Hive LLAP is not offered on the same cluster as Spark, so you end up running multiple clusters. For LLAP to work at all, you have to accept the predefined cluster and node sizes; if you try to change them, it breaks.
4. Azure Blob Storage is 3x slower than HDFS on managed disks
In a previous post, I benchmarked Azure Blob storage performance against local HDFS on managed disks, and Hive queries were at least 3 times slower. If it were not for the scalability of Blob storage, we would not use it.
5. Azure Blob storage is not designed for hierarchical folder structure
Hive uses partitions to speed up data retrieval, and the more partitions you have, the more folders and files are created. When the number of partition levels grows (in our case, beyond 2), Hive query performance drops exponentially. This is due to Blob storage being inefficient at handling directory hierarchies. Azure is introducing "Azure Data Lake Storage Gen 2" and claims it is designed for hierarchical storage, but it is still in preview and not available in the region where our data is located, so I have not tested it yet.
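To see why partition levels multiply into folders: each partition column becomes one directory level in storage, and Blob storage has to emulate that hierarchy over a flat namespace via REST calls. A hypothetical three-level scheme shows the blow-up:

```sql
-- Hypothetical table; each partition column becomes a folder level, e.g.
--   .../dt=2019-01-01/country=de/event_type=click/
CREATE TABLE events_3lvl (user_id STRING, payload STRING)
PARTITIONED BY (dt STRING, country STRING, event_type STRING)
STORED AS ORC;
-- 365 days x 50 countries x 20 event types = 365,000 directories per year,
-- each of which Blob storage must enumerate without a real filesystem.
```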
Our solutions
1. We built a custom cluster
The latest version of Hadoop and Hive was installed directly using Ambari. It is not something you can do with a few clicks, but in the long run it is worth it: you get to build a cluster with the latest stable code, configure it as you please, and at the same time get a bigger bang for your buck.
2. Use of Hive streaming
We load large amounts of data into a Hive transactional table with more than 100 columns, half of which are complex types. Hortonworks provides the HiveWarehouseConnector library as part of their distribution, but this library was not able to handle complex types, so I modified the code and submitted it to their repository for approval. That modified code is currently running on our cluster.
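For context, this is the kind of schema involved: a transactional ORC table where many columns are complex types such as arrays, maps, and structs. The names below are illustrative, not our real schema:

```sql
-- Illustrative sketch of a transactional table with complex-typed columns.
CREATE TABLE metrics (
  user_id   STRING,
  tags      ARRAY<STRING>,
  counters  MAP<STRING, BIGINT>,
  device    STRUCT<os:STRING, version:STRING>
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```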
3. Limit number of partition levels
Since we are stuck with Azure Blob storage for now, our only option was to limit our partitioning scheme to 2 levels, plus buckets. This limits how far we can optimize data retrieval for various query patterns, but we managed to handle most query types.
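A two-level scheme with bucketing looks like this (the column and table names are hypothetical): buckets split data within each partition into a fixed number of files instead of adding another directory level, which is exactly what Blob storage struggles with.

```sql
-- Two partition levels plus buckets; names are illustrative.
CREATE TABLE events (
  user_id STRING,
  payload STRING
)
PARTITIONED BY (dt STRING, event_type STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```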
4. Dealing with compactions
Every change to data stored in a Hive transactional table results in "delta" files. Hive streaming batches the inserts to create fewer deltas, but at 50,000 inserts per batch we still ended up with a few hundred deltas per hour. Too many deltas means too many files and folders, and as stated above, Azure Blob storage cannot handle them. Increasing the batch size increases the memory requirements, so that was not an option. To deal with this problem, we insert these batches into a temporary table located on managed-disk HDFS (which is far faster than Blob storage) and then copy them to the master table in Blob storage. This reduced the number of deltas from an average of 150 per hour to 1. Regular Hive compaction jobs take care of the remaining deltas on Azure Blob storage.
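The hourly flow above can be sketched as follows: stream into a staging table whose location is on HDFS-backed managed disks, then move each batch into the Blob-backed master table with a single insert, which lands as one delta. All table names and the location path are hypothetical:

```sql
-- Staging table on fast local HDFS (illustrative location):
--   CREATE TABLE events_staging (...) STORED AS ORC
--   LOCATION 'hdfs:///staging/events';

-- One statement per hour: a single delta appears under the master table.
INSERT INTO events_master SELECT * FROM events_staging;
TRUNCATE TABLE events_staging;

-- Periodic major compaction rewrites accumulated deltas into base files
-- (for a partitioned table, add a PARTITION (...) spec).
ALTER TABLE events_master COMPACT 'major';
```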
The shortcomings of Azure HDInsight and Blob storage forced us to come up with our own solutions, and we learned a few new things along the way. The new custom cluster is replacing 4 other clusters that were costing us about $24,000 a month.