Finding a needle in a Storm-stack

Using Storm for real time distributed computations has become a widely adopted approach, and today one can easily find more than a few posts on Storm’s architecture, internals, and what have you (e.g., Storm wiki, Understanding the parallelism of a storm topology, Understanding storm internal message buffers, etc).

So you read all these posts and and got yourself a running Storm cluster. You even wrote a topology that does something you need, and managed to get it deployed. “How cool is this?”, you think to yourself. “Extremely cool”, you reply to yourself sipping the morning coffee. The next step would probably be writing some sort of a validation procedure, to make sure your distributed Storm computation does what you think it does, and does it well. Here at Outbrain we have these validation processes running hourly, making sure our realtime layer data is consistent with our batch layer data – which we consider to be the source of truth.


It was when the validation of a newly written computation started failing, that we embarked on a great journey to the land of “How does one go about debugging a distributed Storm computation?”, true story. The validation process was reporting intermittent inconsistencies when, intermittent being the operative word here, since it was not like the new topology was completely and utterly messed up, rather, it was failing to produce correct results for some of the input, all the time (by correct results I mean such that match our source of truth).

Read more >

UPDATE #2: Outbrain Security Breach

Earlier today, Outbrain was the victim of a hacking attack by the Syrian Electronic Army. Below is a description of how the attack unfolded to help others protect against similar attempts. Updates will continue to be posted to this blog.

On the evening of August 14th, a phishing email was sent to all employees at Outbrain purporting to be from Outbrain’s CEO. It led to a page asking Outbrain employees to input their credentials to see the information. Once an employee had revealed their information, the hackers were able to infiltrate our email systems and identify other credentials for accessing some of our internal systems.

At 10:23am EST SEA took responsibility for hack of, changing a setting through Outbrain’s admin console to label Outbrain recommendations as “Hacked by SEA.

At 10:34am Outbrain internal staff became aware of the breach.

By 10:40am Outbrain network operations began investigating and decided to shut down all serving systems, degrade gracefully and block all external access to the system.

By 11:03am Outbrain finished turning off its service from all sites where we operate.

We are continuing to review all systems before re-initiating service.

UPDATE #1: Outbrain Security Breach

We are aware that Outbrain was hacked earlier today and we took down service as soon as it was apparent.

The breach now seems to be secured and the hackers blocked out, but we are keeping the service down for a little longer until we can be sure it’s safe to turn it back on securely. Please stay tuned here or to our Twitter feed for updates.

Devops Shmevops… we call it Ownership.

Yes, it’s been a long time since we last updated this blog – Shame on us!

A lot has happened since our last blog post, which came while we dealing the effects of Hurricane Sandy.  In the end, our team handled it bravely and effectively, with no downtime and no business impact. However, a storm is still a storm, and did have to do an emergency evacuation from our old New York data center and move to a new one.

More things have happened since and today I want to focus on one major aspect of our life in the last year. We have made some cultural decisions that somehow changed the way we treat our work. Yes, the Devops movement has its influence here. When we stood in front of the decision of “NOC or NOT”,  Basically, we adopted the theme of “You build it, You run it!”.

Instead of hiring 10 students, attempting to train them on the “moving target” of a continuously changing production setup , we decided to  hire 2 engineers and concentrate effort on building strong monitoring system that will allow engineers to take ownership on monitoring their systems

Now, Outbrain is indeed a high scale system. Building a monitoring system that enables more then 1000 machines and more then 100 services to report metrics every minute is quite a challenge. We chose the stack of Logstash, RabbitMQ and Graphite for that mission. In addition, we developed an open source project called Graphitus which enables us to build dashboards from graphite metrics. Since adopting it we have more then 100 dashboards the teams are using daily. We also developed Dashanty which enables each team to develop an operational dashboard for itself.

On the alerting front, we stayed with Nagios but improved it’s data sources. Instead of Nagios polling metrics by itself, we developed a Nagios/Graphite plugin where Nagios querys Graphite for the latest metrics and according to thresholds shoots appropriate alerts to relevant people. On top of that, the team developed an application called RedAlert that enable each and every team/engineer to configure their own alerts on their own owned services, configure when alerts are critical and when such alert should be pushed to them. This data goes into Nagios that start monitoring the metric in Graphite and will fire an alert if something goes wrong. “Push” alerts are configured to go to PagerDuty that will be able to locate the relevant engineer, email, text or call him as needed.

Now that’s on the technical part. What is more important to make it happen is the cultural side that this technology supports:

We truly believe in  “End to End Ownership”. “You build it, You run it!” is one way to say that. In an environment where everybody can (and should) change production at any moment , putting someone else to watch the systems makes it impossible. We were also very keen on MTTR (Mean Time To Recover). We don’t promise our business people 100% fault-free environment, but we do promise fast recovery time. When we put these two themes in front of us, we came to the conclusion it is best that alerts will be directed to owner engineers as fast as we can, with fewer mediators on the way. So, we came up with the following:

  • We put a baseline of monitoring systems to support the procedure – and we continuously improve it.
  • Engineers/teams are owners of services (very SOA architecture). Lets use the term “Owner”. We try to eliminate services without clear owners.
  • Owners push metrics into graphite using calls on code or other collectors.
  • Owners define alerts on these metrics using RedAlert system.
  • Each team defined “on call schedule” on PagerDuty. “On call” engineer is the point of contact for any alerting service under the team ownership.
  • Ops are owners for the infrastructure (Servers/Network/software infra) – they also have “Ops on shift” – awake 24/7 (we use the team distribution between NY and IL for that).
  • Non push alerts that does not require immediate action are gathered along nonworking hours and treated during working hours.
  • Push Alerts are routed via PagerDuty the following way: Ops on shift get them and if he can address them or correlate them with infrastructure issue – he acknowledges them. In case Ops on Shift doesn’t know what to do with it, Pager duty continues and rout the alerts to the engineer on call.
  • Usually the next thing that will happen is that both of them will jump on the HipChat and start tackling the issue to shorten MTTR and resolve it.

The biggest benefit of this method is increased the sense of “ownership” for everyone in the team. The virtual wall between Ops and Dev (which was initially somehow low in Outbrain) was completely removed. Everybody is more “Production sensitive”.

Few things that helped us through it:

  1. Our team. As management we encouraged it and formalized it but the motivation came from the team. It is very rare to see engineers that want (not to say hardly push) to take more ownership on their products and to really “Own them”. I feel lucky that we have such team. It made our decisions much simpler.
  2. Being so tech-ish and pushing our monitoring capabilities to such edges instead of going to the easy, labor-intensive, half-ass solution (AKA NOC).
  3. A 2 week “Quality Time” of all engineering that was devoted to improving MTTR and building all necessary to support this procedure. – All Credits to Erez Mazor for running this week.
This post will be followed by more specific posts about the systems we developed and will be written by the actual people that build them.

Hurricane Sandy – Outbrain Service updates

Hi all!

As Hurricane Sandy is about to hit the east coast US, and as Outbrain’s main Datacenter is located in downtown Manhattan, we are taking measures to make as little service interruption as possible for our partners and customers. Outbrain is normally serving from 3 data centers and in case of NY data center loss, we will supply the service from one the other data centers. On this page, below – we will update on any service interruption and ETAs for problem-solving. We assume all will go well and we will not have to update but… just in case 🙂

[UPDATE – Nov 3rd 3:45 pm EST] – At this time Utility power is back to all our datacenters and HQ office. It is now time to restore the service from NY and get the office back to work. This will take some time but systems will gradually be put back up over the next week or so. There should be no effect on users, publishers or clients.

Our HQ will also start working gradually depending on the availability of public transportation.

We are here closing this reporting post – if you see any issues, please report to or your rep.

I hope the storm of the century will be the last one for the next century (at least).

[UPDATE – Nov 1st 9:30 am EST] – Our HQ, located on 13th between 5th and 6th in downtown New York City is still without power and therefore closed. Thankfully, our NY-based team is safe and in dry locations, and will continue to try and work as best they can. We highly appreciate the concern and best wishes we received from our partners and clients across the globe; thank you!

We are doing our best to continue to provide the best in class service, one we hope you’ve come to expect from us. As an update, our datacenter in NY is still without power and we expect it to be down for a few more days. We will continue to serve from our other datacenters located in Chicago and Los Angeles. To reiterate, our service did not go down, and we are currently still serving across our client’s sites. As of this morning, we recovered and updated all our reporting capabilities, so we should be back to 100%.

If you are experiencing any difficulties or seeing different, please reach out to your respective contacts. We’ll also continue to operate under emergency mode until Monday, you can reach us 24/7 at (am = Account Management).

[UPDATE – Oct 31st 6:46 am EST] – Serving still holds strong from our LA and Chicago data centers and we are not aware of any disruption to our service. We are working hard to recover our dashboard reporting capabilities, but it will probably take a couple more days before we’re able to get back to normal mode. Sorry for any inconvenience caused by this. Send us a note to if you have any request, and one of us from around the world will respond as soon as possible.

[UPDATE – 6:51 pm EST]  – Again, not much to update – All is stable with both LA and Chicago datacenters. It’s the end of the day here in Israel and we are trying to get some rest. Our teammates in the US are keeping an eye on the system and will alert us if there is anything wrong. Good night.

[UPDATE – 3:35 am EST] – Actually not much to update about the service. All is pretty much stable. we are safely serving from LA and Chicago. most back-end services are running in LA Datacenter and our tech team in Israel and NY are monitoring and handling issues as they raise. Our Datacenter vendors in NY are working with FDNY to pump the water from the flooded generator room so it will take a while to recover this datacenter 🙂

[UPDATE – 10:50 am EST] – The clients dashboard is back up.

[UPDATE – 10 am EST] – The client’s dashboard on our site is periodically down – we are handling the issues there and will update soon.

[UPDATE – 5 am EST] Our NY Datacenter went down. Our service is fully operational and we are serving through our Chicago and LA Datacenters. If you’re accessing your Outbrain dashboard you may experience some delays in data freshness. We are working to resolve this issue and will continue to update.

[UPDATE – 2am EST] – Our NY Datacenter went completely off – We are fully serving from our Chicago and LA Datacenters. External reports on our site are still down but we are working to fail overall services from the LA Datacenter. – we will follow with updates.

[Update – 12:50 am EST] – power just went all off in our NY Datacenter and the provider has evacuated the facility – we are taking our measures to move all functionality to other datacenters.

[UPDATE]  – at 9 pm EST]  commercial power went down in our NY Datacenter. Provider failed over to the generator and we continue to serve smoothly from this Datacenter. We continue to monitor the service closely and ready to take actions if needed.

Slides – Cassandra for Sysadmins

At Outbrain, we like things that are awesome.

Cassandra is awesome.

Ergo, we like Cassandra.

We’ve had it in production for a few years now.

I won’t delve into why the developers like it, but as a Sysadmin on-call in the evenings, I can tell you straight out I’m glad it has my back.

We have MySQL deployed pretty heavily, and it is fantastic at what it does.  However, MySQL has a bit of an administrative overhead compared to a lot of the new alternative data stores out there, especially when making MySQL work in a large geographically distributed environment.

If you can model your data in Cassandra, are educated about the trade-offs, and have an undying wish not to have to worry too deeply about managing replication and sharding, it is a no-brainer.

I did a presentation on Cassandra (with Jake Luciani from Datastax) to the NYC Chapter of the League of Professional System Administrators (LOPSA) from the standpoint of an Admin.

Us Sysadmins fear change because it is our butt on the line if there is an outage.  With executives anxiously pacing behind us and revenue flushing down the drain, we’re the last line of defense if there is an issue and we’re the ones who will be torn away from families in the evenings to handle an outage.

So, yeah… we’re a conservative lot 🙂

That being said, change and progress can be good, especially when it frees you up.  Cassandra is resilient, fault-graceful and elastic. Once you understand how so, you’ll be slightly less surly. Your developers might not even recognize you!

These slides are for the SysAdmin, noble fellow, to assuage his fears and get him started with Cassandra.

Cassandra for Sysadmins

View more presentations from Nathan Milford

Visualizing Our Deployment Pipeline

When large numbers start piling up, in order to make sense of them,  they need to be visualized.
I still work as a consultant at Outbrain about one day a week, and most of the time I’m in charge of the deployment system last described here. The challenges that are encountered when we develop the system are good challenges, and every day we have too many deployments to be easily followed, so I decided to visualize them.
On an average day, we usually have a dozen or two deployments (to production, not including test clusters) so I figured why don’t I use my google-visualization-fo0 and draw some nice graphs. Here are the results and explanations follow.
Before I begin, just to put things in context, Outbrain had been practicing  Continuous Deployment for a while (6 months or so) and although there are a few systems that helped us get there, one of the main pillars was a relatively new tool written by the fine folks at LinkedIn (and in particular Yan— Thanks Yan!), so just wanted to give a fair shout out to them and thank Yan for the nice tool, API and ongoing awesome support. If you’re looking for a deployment tool do give glu a try, it’s pretty awesome! Without glu and it’s API all the nice graphs and the rest of the system would not have seen the light of day.

The Annotated Timeline
This graph may seem intimidating at first, so don’t be scared and let’s dive right into it… BTW, you may click on the image to enlarge it.

First, let’s zoom into the right-hand side of the graph. This graph uses Google’s annotated timeline graph which is really cool for showing how things change over time and correlate them to events, which is what I do here — the events are the deployments and the x-axis is the time while they is the version of the deployed module.
On the right-hand side you see a list of deployment events —  for example, the one at the top has “ERROR www @tom…” and the one next is “BehavioralEngine @yatirb…” etc. This list can be filtered so if you type a name of one of the developers such as @tom or @yatirb you see only the deployments made by him (of course all deployments are made by devs, not by ops, hey, we’re devopsy, remember?).
If you type into the filter box only www you see all the deployments for the www component, which by no surprise is just our website.
If you type ERROR you see all deployments that had errors (and yes, this happens too, not a big deal).
The nice thing about this graph from is first that while you filter the elements on the graph that are filtered out disappear, so, for example, let’s see only deployments to www (click on the image to enlarge):
You’d notice that not only the right-hand side list is shrunk and contains only deployments to www, but also the left-hand side graph now only has the appropriate markers. The rest of the lines are still there but only the markers for the www line are on the graph right now.
Now let’s have a look at the graph. One of the coolest things is that you can zoom into a specific timespan using the controls at the lower part of the graph. (click to enlarge)

In this graph, the x-axis shows the time (date and time of day) and the y-axis shows the svn revision number. Each colored line represents a single module (so we have one line for www and one line for the BehavioralEngine etc).

What you would usually see is for each line (representing a module) a monotonically increasing value over time, a line from the bottom left corner towards the top right corner, however, in relatively rare cases where a developer wants to deploy an older version of his module, then you clearly see it by the line suddenly dropping down a bit instead of climbing up; this is really nice, helps find unusual events.

The Histogram
In the next graph, you see an overview of deployments per day.

This is more of a holistic view of how things went the last couple of days, it just shows how many deployments took place each day (counts production clusters only) and colors the successful ones in green and the failed ones in red.

This graph is an executive summary that can tell the story of – in case there are too many reds (or there are reds at all), then someone needs to take that seriously and figure out what needs to be fixed (usually that someone is me…) – or in case the bars aren’t high enough, then someone needs to kick developer’s buts and get them deploying something already…

Like many other graphs from Google’s library (this one’s a Stacked Column Chart, BTW), it shows nice tooltips when hovering over any of the columns with their x values (the date) and their y value (number of successful/failed deployments)

Versions DNA Mapping
The following graph shows the current variety of versions that we have in our production systems for each and every module. It was attributed as a DNA mapping by one of our developers b/c of the similarity in how they look but that’s how far this similarity goes…

The x-axis lists the different modules that we have (names were intentionally left out, but you can imagine having www and other folks there). The y-axis shows the svn versions of them in production. It uses glu’s live model as reported by glu’s agents to zookeeper.

Let’s zoom in a bit:

What this diagram tells us is that the module www has versions starting from 41268 up to 41463 in production. This is normal as we don’t necessarily deploy everything to all servers at once, but this graph helps us easily find hosts that are left behind for too long, so for example, if one of the modules had not been deployed in a while then you’d see it falling behind low on the graph. Similarly, if a module has a large variability inversions in production, chances are that you want to close that gap pretty soon. The following graph illustrates both cases:

To implement this graph I used a crippled version of the Candle Stick Chart, which is normally used for showing stock values; it’s not ideal for this use case but it’s the closest I could find.

That’s all, three charts is enough for now and there are other news regarding our evolving deployment system, but they are not as visual; if you have any questions or suggestions for other types of graphs that could be useful don’t be shy to comment or tweet (@rantav).

Leader Election with Zookeeper

Recently we had to implement an active-passive redundancy of a singleton service in our production environment where the general rule is always have “more than one of anything”. The main motivation is to alleviate the need to manually monitor and manage these services, whose presence is crucial to the overall health of the site.

This means that we sometimes have a service installed on several machines for redundancy, but only one of the is active at any given moment. If the active services goes down for some reason, another service rises to do its work. This is actually called leader election. One of the most prominent open source implementation facilitating the process of leader election is Zookeeper. So what is Zookeeper?

Originally developed by Yahoo research, Zookeepr acts as a service providing reliable distributed coordination. It is highly concurrent, very fast and suitable mainly for read-heavy access patterns. Reads can be done against any node of a Zookeeper cluster while writes a quorum-based. To reach a quorum, Zookeeper utilizes an atomic broadcast protocol. So how does it work?

Connectivity, State and Sessions

Zookeeper maintains an active connection with all its clients using a heartbeat mechanism. Furthermore, Zookeeper keeps a session for each active client that is connected to it. When a client is disconnected from Zookeeper for more than a specified timeout the session expires. This means that Zookeeper has a pretty good picture of all the animals in its zoo.

Data Model

The Zookeeper data model consists of a hierarchy of nodes, called ZNodes. ZNodes can hold a relatively small (efficiency is key here) amount of data, they are versioned and timestamped . There are several properties a ZNode can have that make them particularly useful for different use cases. Each node in Zookeeper can have the persistent, ephemeral and sequential flags. These determine the naming of the node and its behavior with respect to the client session.

  • The persistent node is basically a managed data bin
  • The ephemeral node exists for the lifetime of its client session
  • The sequential node, when created, gets a unique number (sequence) suffixed to its name

The latter two provide the means to implementing a variety of distribution tasks such as locks, queues, barriers, transactions, elections and other synchronization related tasks.

Here’s what an election path looks like in the Solr Cloud Admin Console:

Service Election Path in Solr Clould Zookeeper Admin page


Zookeeper allows its clients to watch for different events in its node hierarchy. This way clients can get notified of different changes in the distributed state of affairs and act accordingly. These watches are one timers and should be persisted again by the client after notification. The client is also responsible of handling session expiration which means that ephemeral nodes should be re-persisted after an expiration.

Client Implementation

Zookeeper requires a lot of boilerplate code, mostly around connectivity and for the majority of the time you will be doing the same things over and over. Luckily Stefan Groschupf and Patrick Hunt wrote a client abstraction called ZkClient. I published a maven artifact for this on OSS so it’s available to our build system. The library also provides a persistent event notification mechanism in the form of listeners.

The next thing to do was to cook up a Spring factory bean for ZkClient and a template style class to act as an abstraction layer to Zookeeper operations. This ties in nicely into the Spring container which we use extensively:

The ZooKeeperClientStatsCollector is a listener implementation which collects stats about session connects/disconnects, exported to JMX as an MBean.

Now that we have a working data access layer we can start with the good stuff.

Leader Election

The Zookeeper documentation describes in general terms how leader election is to be performed. The general idea is that all participants of the election process create an ephemeral-sequential node on the same election path. The node with the smallest sequence number is the leader. Each “follower” node listens to the node with the next lower sequence number to prevent a herding effect when the leader goes away. In effect, this creates a linked list of nodes. When a node’s local leader dies it goes to election either find a smaller node or becoming the leader if it has the lowest sequence number.

The following image describes a scenario with 3 clients participating in the election process:

Leader Election with Zookeeper

Each client participating in this process has to:

  1. Create an ephemeral-sequential node to participate under the election path
  2. Find its leader and follow (watch) it
  3. Upon leader removal go to election and find a new leader, or become the leader if no leader is to be found
  4. Upon session expiration check the election state and go to election if needed

One thing to consider here is the nature of the work being done by the leader. Make sure its state can be preserved if its leadership is revoked. Leader loss could be caused by any number of reasons including initiated restarts due to maintenance and releases. It could also be brought about by network partitioning.

Designing services for graceful recovery is a requirement for distributed systems not leader election.

Spring helps here because interception can be used to suppress method invocations of various services based on leadership status. Below is an example of an interception based leadership control:

The service myService is the one controlled by leader election, all its method are going to be suppressed or invoked based on leadership status.

Another implementation uses a quartz scheduler instance as its target:

This implementation puts a quartz scheduler on standby mode when leadership is revoked and resumes it when it’s granted (notice it will not actually stop running tasks, this will be allowed their natural completion, so in effect you may have a scheduled task running on two services due to partitioning scenarios. This means that whatever is scheduled has to be aware of another service possibly doing the same work. This problem can be easily solved with a Zookeeper barrier implementation, more on that in another post.

But there’s more than leader election you could do with Zookeeper

If you wish to run your data center the democratic way, where important decisions are made in coordination with other stakeholders, Zookeeper certainly helps.

Leader Election with Spring is on GitHub, Source shown in this post can be found on Gist.

Zookeeper documentation and wiki.

Happy Zookeeping!

Feature Flags Made Easy

I recently participated in the ILTechTalk week. Most of the talks discussed issues like Scalability, Software Quality, Company Culture, and Continuous Deployment (CD). Since the talks were hosted at Outbrain, we got many direct questions about our concrete implementations. Some of the questions and statements claimed that Feature Flags complicate your code. What bothered most participants was that committing code directly to trunk requires the addition of feature flags in some cases and that it may make their codebase more complex.

While in some cases, feature flags may make the code slightly more complicated, it shouldn’t be so in most cases. The main idea I’m presenting here is that conditional logic can be easily replaced with polymorphic code. In fact, conditional logic can always be replaced by polymorphism.

Enough with the abstract talk…

Suppose we have an application that contains some imaginary feature, and we want to introduce a feature flag. Below is a code snippet that developers normally come up with:

While this is a legitimate implementation in some cases, it does complicate your code base by increasing the cyclomatic complexity of your code. In some cases, the test for activation of the feature may recur in any place in the code, so this approach can quickly turn into a maintenance nightmare.

Luckily, implementing a feature flag using polymorphism is pretty easy. First, let’s define an interface for the imaginary feature and two implementations (old and new):

Now, let’s use the feature in our application, selecting the implementation at runtime:

Here, we initialized the imaginary feature member by reflection, using a class name specified as a system property. The createImaginaryFeature() method above is usually abstracted into a factory but kept as is here for brevity. But we’re still not done. Most of the readers would probably say that the introduction of a factory and reflection makes the code less readable and less maintainable. I have to agree — and apart from that, adding dependencies to the concrete implementations will complicate the code even more. Luckily, I have a secret weapon at my disposal. It is called IoC, (or DI). When using an IoC container such as Spring or Guice, your code can be made extremely flexible, and implementing feature flags becomes a walk in the park.

Below is a rewrite of the PolymorphicApplication using Spring dependency injection:

The spring code above defines an application and 2 imaginary feature implementations. By default, the application is initialized with the oldImaginaryFeature, but this behavior can be overridden by specifying a -DimaginaryFeature.implementation.bean=newImaginaryFeature command line argument. Only a single feature implementation will be initialized by Spring, and the implementations may have dependencies.

Bottom line is: with a bit of extra preparation and correct design decisions, feature flags shouldn’t be a burden on your code base. By extra preparation, I mean extracting interfaces for your domain objects, using an IoC container, etc, which is something we should be doing in most cases anyway.

Eran Harel is a Senior Software Developer at Outbrain.

Under the Hood of Our Algorithmic Engine – How We Serve Content Recommendations

Outbrain Algorithms Team

Let me tell you a little about how we actually give content recommendations here at Outbrain. This will be only a short introduction. We might elaborate on some of the below issues in future posts.

Our main goal is to serve good content recommendations to readers on the Internet. The typical situation is a user reading a content page. We want to recommend content for further reading, which is a “good” recommendation.

Read more >