In a common IoT scenario, millions of devices will be sending data to your back end. It is possible that a large percentage of these devices could flood your back end for whatever reason. Few years back, I experienced it first had when a bad software update created a tsunami of requests towards a relatively scaleable back end, and caused an outage which lasted a whole weekend.
One approach (which I have been dealing with in the past few months) to prevent such flood on the back end is to create a so called "shock absorber" using a queue like message delivery system.
EventHub as Shock Absorber
In theory it may make sense. In many cases, you are just receiving telemetry from devices and just need to acknowledge it back to the device. Device doesn't care what you do with the message. So you put messages on an EventHub, and send acknowledgement back to the device. On the other side of EventHub, you have a bunch of services which will pick up messages from EventHub and process them with their own pace, without being overloaded.
What is wrong with that?
In some scenarios, this may very well be the answer, for example when you do not care in what order you process messages from the same device. If order of processing is important (which I cannot think of an example where it isn't) then you are in trouble. Here is why:
When you acknowledge receipt of message 1 to a device, it will send you message 2, and so on. For sake of simplicity, let's assume these messages are put on the EventHub in correct order, mixed with messages from other devices.
To make sure these messages are processed with the same order they are received, it is essential to make sure they are processed by the same EventHub consumer node. If node A picks up message 1 and before it is done processing it, node B finishes processing message 2, the order is changed.
To prevent change of order, you will have to rely on EventHub partitions, so all messages from the same device will end up in the same partition. Each consumer (worker node) can read messages from multiple partitions, but multiple consumers cannot lease (read from) the same partition. This introduces scalability limitation, because EventHub is limited to 32 partition and it means you cannot have mode than 32 worker nodes to process messages.
To clarify, my claims here are not just based on the theory. This "shock absorber" architecture is limiting our ability to scale, and we are dealing with consequences right now. We are running 32 large VMs and we could be out of capacity soon if we didn't make big architectural changes.
Wrong type of shock absorber
Every engineer knows (I hope!) internet connection could be lost, specially wireless. That is why internet protocols are designed for resiliency against latency and loss of connection. If a protocol does not expect some type of acknowledgment from receiver, it means that piece of data was not important and nobody cares if it's lost. Communication protocols rely on acknowledgement or other types of signaling to guarantee delivery, and they retry if there is a failure, so why not rely on the protocol?
A shock absorber is introduced to prevent overload of back end when there is extra unpredicted load. (If that load was predicted, back end would have been designed to not be overloaded). One common example is when back end is down for whatever reason and after it comes back up , will need to respond to more devices because they are retrying to send data from the time it was down.
If we are not able to temporarily increase the processing capacity to deal with the extra load, the other option is to just wait it out. Devices have logic and internal memory to keep the data and retry later. If the back end is designed correctly, at peak time of day it still should have 20% to 30% extra capacity, and that extra capacity should gradually take care of the extra load.
EventHub limitations is not the only problem
Even if your traffic volume is small and limitation on maximum number of worker nodes is not a concern, to handle correct order of processing of requests per client, your back end code and logic will be unnecessarily over-complicated. In a high traffic system, you will need to process messages in parallel to achieve highest throughput possible, but still you would want requests from the same device processed sequentially.
In another scenario, if you need to process a message and send an appropriate response back to the device, an acknowledgement cannot be sent back without processing the message anyways.
What I learned from my previous experiences in high traffic environments dealing with devices sending data (up to a billion requests per day, 15000 requests per second during peak time), we successfully handled flood situations without a shock absorber. Clients (up to 60 million of them per day) would detect a non-responding back end and would retry, sometimes with a back-off algorithm.
There are processes which need to complete before sending response back to device, such as state update, triggers, and looking up response values. These operations can normally be completed relatively fast, before replying to device.
Then there are more time consuming operation, such as sending SMS or Email notifications, which do not need to be completed before a response is sent back to the device. For these types of processes, you can use a queue to hand off the work to anther process. Heck, even EventHub may be enough for such a purpose.
Maybe I sit down and put together a high traffic IoT back end architecture as my next post.