Yes, it’s been a long time since we last updated this blog – shame on us!
A lot has happened since our last blog post, which came while we were dealing with the effects of Hurricane Sandy. In the end, our team handled it bravely and effectively, with no downtime and no business impact. Still, a storm is a storm, and we did have to do an emergency evacuation from our old New York data center and move to a new one.
More has happened since, and today I want to focus on one major aspect of our life in the last year. We made some cultural decisions that changed the way we treat our work – yes, the DevOps movement has its influence here. When we faced the decision of “NOC or NOT”, we adopted the theme of “You build it, you run it!”.
Instead of hiring 10 students and attempting to train them on the “moving target” of a continuously changing production setup, we decided to hire 2 engineers and concentrate our effort on building a strong monitoring system that allows engineers to take ownership of monitoring their own systems.
Now, Outbrain is indeed a high-scale system. Building a monitoring system that enables more than 1,000 machines and more than 100 services to report metrics every minute is quite a challenge. We chose the stack of Logstash, RabbitMQ and Graphite for that mission. In addition, we developed an open source project called Graphitus, which enables us to build dashboards from Graphite metrics. Since adopting it, teams have built more than 100 dashboards that they use daily. We also developed Dashanty, which enables each team to build an operational dashboard for itself.
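For illustration, reporting a metric to Graphite can be as simple as writing one line to Carbon’s plaintext port. This is just a sketch – the host, port and metric name below are placeholders, not our actual setup:

```python
import socket
import time

# Hypothetical Carbon endpoint -- substitute your own Graphite setup.
CARBON_HOST = "graphite.example.com"
CARBON_PORT = 2003  # Carbon's plaintext protocol port

def format_metric(path, value, timestamp):
    # Carbon's plaintext protocol: one "<path> <value> <timestamp>\n" per metric.
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, timestamp=None):
    """Send a single datapoint to Carbon over its plaintext protocol."""
    line = format_metric(path, value, int(timestamp or time.time()))
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# The wire format alone (sending requires a reachable Carbon instance):
print(format_metric("services.recommender.requests_per_min", 1234, 1380000000))
```

In practice most services report through a shared client or collector rather than raw sockets, but the wire format stays this simple.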
On the alerting front we stayed with Nagios but improved its data sources. Instead of Nagios polling metrics by itself, we developed a Nagios/Graphite plugin: Nagios queries Graphite for the latest metrics and, according to thresholds, sends appropriate alerts to the relevant people. On top of that, the team developed an application called RedAlert that enables each team and engineer to configure alerts on the services they own, and to define when an alert is critical and when it should be pushed to them. This data goes into Nagios, which starts monitoring the metric in Graphite and fires an alert if something goes wrong. “Push” alerts are routed to PagerDuty, which locates the relevant engineer and emails, texts or calls them as needed.
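Roughly, the plugin’s job looks like the sketch below. The Graphite URL, metric name and thresholds are made up; Graphite’s render API (`format=json`) and the Nagios plugin exit-code convention are the real mechanisms involved:

```python
import json
import urllib.request

# Hypothetical Graphite instance -- substitute your own.
GRAPHITE = "http://graphite.example.com"

# Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
def classify(value, warn, crit):
    """Turn a metric value into a Nagios-style (exit_code, label) pair."""
    if value is None:
        return 3, "UNKNOWN"
    if value >= crit:
        return 2, "CRITICAL"
    if value >= warn:
        return 1, "WARNING"
    return 0, "OK"

def latest_value(target, window="-5min"):
    """Ask Graphite's render API for the most recent non-null datapoint."""
    url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, target, window)
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    points = [v for v, ts in series[0]["datapoints"] if v is not None]
    return points[-1] if points else None

# The threshold logic alone, without touching the network:
print(classify(12.0, warn=5, crit=10))  # a value past 'crit' is CRITICAL
```

A real check script would call `latest_value()` for the configured target, print the message, and `sys.exit()` with the code so Nagios can act on it.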
That’s the technical part. What matters more is the cultural side that this technology supports:
We truly believe in “End to End Ownership”; “You build it, you run it!” is one way to say that. In an environment where everybody can (and should) change production at any moment, having someone else watch the systems simply doesn’t work. We are also very keen on MTTR (Mean Time To Recover): we don’t promise our business people a 100% fault-free environment, but we do promise fast recovery. With these two themes in front of us, we concluded that alerts should reach the owning engineers as fast as possible, with as few mediators on the way as possible. So we came up with the following:
- We put a baseline of monitoring systems in place to support the procedure – and we continuously improve it.
- Engineers/teams are owners of services (ours is very much an SOA architecture); let’s use the term “Owner”. We try to eliminate services without clear owners.
- Owners push metrics into Graphite using calls in code or other collectors.
- Owners define alerts on these metrics using the RedAlert system.
- Each team defines an “on call” schedule on PagerDuty. The “on call” engineer is the point of contact for any alerting service under the team’s ownership.
- Ops are the owners of the infrastructure (servers/network/software infra) – they also have an “Ops on shift” engineer, awake 24/7 (we use the team’s distribution between NY and IL for that).
- Non-push alerts that do not require immediate action are collected during off hours and handled during working hours.
- Push alerts are routed via PagerDuty the following way: the Ops engineer on shift gets them first, and if they can address them or correlate them with an infrastructure issue, they acknowledge them. If Ops on shift doesn’t know what to do with an alert, PagerDuty routes it on to the engineer on call.
- Usually the next thing that happens is that both of them jump on HipChat and start tackling the issue together to shorten MTTR and resolve it.
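RedAlert’s actual configuration format is internal to Outbrain, but the workflow in the list above can be sketched like this – every field name, metric and value here is illustrative only:

```python
# Illustrative alert definition -- not RedAlert's real schema.
ALERT = {
    "metric": "services.recommender.error_rate",  # Graphite metric to watch
    "critical_above": 10,             # threshold at which the alert fires
    "push": True,                     # page someone vs. batch for working hours
    "owner_team": "recommendations",  # team whose on-call gets escalations
}

def route(alert, ops_can_handle):
    """Model the escalation path: non-push alerts wait for working hours;
    push alerts go to Ops on shift first, then to the owning team's
    on-call engineer via PagerDuty."""
    if not alert["push"]:
        return "queued-for-working-hours"
    if ops_can_handle:
        return "acknowledged-by-ops"
    return "escalated-to-%s-on-call" % alert["owner_team"]

print(route(ALERT, ops_can_handle=False))
# -> escalated-to-recommendations-on-call
```

The point of the design is visible even in this toy version: the owner is always the last stop, so nothing gets dropped between teams.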
The biggest benefit of this method is an increased sense of ownership for everyone on the team. The virtual wall between Ops and Dev (which was fairly low at Outbrain to begin with) was completely removed. Everybody is more “production sensitive”.
A few things that helped us through it:
- Our team. As management we encouraged and formalized it, but the motivation came from the team. It is rare to see engineers who want (not to say push hard) to take more ownership of their products and to really “own” them. I feel lucky that we have such a team. It made our decisions much simpler.
- Being so tech-ish and pushing our monitoring capabilities to the edge instead of going for the easy, labor-intensive, half-assed solution (AKA a NOC).
- A 2-week “Quality Time” in which all of engineering was devoted to improving MTTR and building everything necessary to support this procedure. All credit to Erez Mazor for running this effort.