Using Storm for real-time distributed computations has become a widely adopted approach, and today one can easily find more than a few posts on Storm’s architecture, internals, and what have you (e.g., Storm wiki, Understanding the parallelism of a storm topology, Understanding storm internal message buffers, etc.).
So you read all these posts and got yourself a running Storm cluster. You even wrote a topology that does something you need, and managed to get it deployed. “How cool is this?”, you think to yourself. “Extremely cool”, you reply to yourself, sipping your morning coffee. The next step would probably be writing some sort of validation procedure, to make sure your distributed Storm computation does what you think it does, and does it well. Here at Outbrain we have these validation processes running hourly, making sure our realtime layer data is consistent with our batch layer data – which we consider to be the source of truth.
It was when the validation of a newly written computation started failing that we embarked on a great journey to the land of “How does one go about debugging a distributed Storm computation?” True story. The validation process was reporting intermittent inconsistencies, intermittent being the operative word here: it was not as if the new topology was completely and utterly messed up. Rather, it was failing to produce correct results for some of the input, all the time (by correct results I mean results that match our source of truth).