Ori Lahav

ScyllaDB POC – (not so :) live blogging – Update #3.

Scylla attacking Olysseus's ship
Scylla attacking Olysseus’s ship

It has been a long time (more than 4 months) since we last updated.

You can read the previous update here.

It is not that we abandoned the POC, we actually continued to invest time and effort on it since there is a good progress. It is just that we did not yet ended it and got the proofs that we wanted. While there was a lot of progress in both Outbrain system side and ScyllaDB side on those 4 months, there is one things that is holding us back from showing trying to prove the main point of this POC. Our current bottleneck is the network. The network on the datacenter where we are running the tests on is 1Gbps ethernet network. We found out that although Scylla is not loaded and works with good latencies we are saturating the NICs. We did some improvements along the way to still show that Scylla is behaving better than C* but if we want to show that we can significantly reduce the number of nodes in the cluster, we need to upgrade to 10Gbps ethernet.

This upgrade will come shortly.

This is where we currently stand. However – a lot was done in those 4 months and there is a lot of learnings I want to share. The rest of the post is the way Doron and the Scylla guys describes what happened. It looks more like captain’s log but it tells the story pretty well.


  • 23/6 – We created special app server cluster to call Scylla, and delegated all calls both to C* and Scylla cluster. We wanted to do that so there will be less coupling between the C* path and the Scylla path and less mutual interruptions that will interfere our conclusions. The app servers for Scylla were configured not to use cache, so entire load (1.5M-3M RPM) went directly to Scylla. C* stayed behind cache and actually handled ~10% of the load. This went smoothly.
  • In the following ~3 weeks we tried to load the snapshot again from C* and stumbled with some difficulties, some related to bugs in Scylla, some to networking limits (1Gpbs). During this time we had to stop the writes to Scylla for few days, so the data was not sync again. Some actions we have done to resolve
    1. We investigated bursts of usage we had and decreased them in some use-cases (both for C* and Scylla). They caused the network usage to be very high for few seconds, sometimes for a few tens of milliseconds. This also helped C* a lot. The tool is now open source.
    2. We added client-server compression (LZ4). It was supported by Scylla, but client needed to configure it.
    3. Scylla added server-server compression during this period.
    4. Changed the “multi” calls back to N parallel single calls (instead of one IN request) – it better utilize the network.
    5. Scylla client was (mistakably) using latency aware over the token aware. This caused app to go to the “wrong” node a lot – causing more traffic within Scylla nodes. Removing the latency-aware helped reducing the server-server network usage and the overall latency.
  • 14/7 – with all the above fixes (and more from Scylla) we were able to load the data and stabilize the cluster with the entire load.
  • Until 29/7 I see many spikes in the latency. We are  not sure what we did to fix it… but on 29/7 the spikes stopped and the latency is stable until today.
  • During this period we have seen 1-5 errors from Scylla per minute. Those errors were explained by trying to reach partitions coming from very old C* version. It was verified by logging the partitions we fail for in the app server side. Scylla fixed that on 29/7.
  • 7/8-15/8 – we have changed the consistency level of both Scylla and C* to local one (to test C*) – this caused a slight decrease in the (already) low latencies.
  • Up to 19/8 we have seen occasional momentarily errors coming from Scylla (few hundreds every few hours). This has not happened since 19/8.. I don’t think we can explain why.
  • Current latencies – Scylla holds over 2M RPM with latency of 2 ms (99%ile) for single requests and 40-50 ms (99%ile) for multi requests of ~100 partitions in avg per request. Latencies are very stable with no apparent spikes. All latencies are measured from the app servers.

Next steps on Scylla:

  • Load the data again from C* to sync.
  • Repeat the data consistency check to verify the results from C* and Scylla are still the same.
  • Run repairs to see cluster can hold while running heavy tasks in the background..
  • Try to understand the errors I mentioned above if they repeat.
  • Create the cluster in Outbrain’s new Sacramento datacenter that have 10Gbps network. with minimum nodes (3?) and try the same load there.


  • 7/8 – we changed consistency level to local-one and tried to remove cache from Cassandra. The test was successful and Cassandra handled the full load with latency increasing from 5 ms (99%ile for single requests) to 10-15ms in the peak hours.
  • 15/8 – we changed back to local-quorum (we do not like to have local-one for this cluster… we can explain in more details why) and set the cache back.
  • 21/8 – we removed the cache again, this time with local-quorum. Cassandra handled it, but single requests latency increased to 30-40 ms for the 99%ile in the peak hours. In addition, we have started timeouts from Cassandra (timeout is 1 second) – up to 500 per minute, in the peak hours.

Next steps on C*:

  • Run repairs to see cluster can hold while running heavy tasks in the BG.
  • Try compression (like we do with Scylla).
  • Try some additional tweaks by the C* expert.
  • In case errors continue, will have to set cache back.

Current status comparison – Aug 20th:

The following table shows comparison under load of 2m RPM in peak hours.

Latencies are in 99%ile.

Cassandra ScyllaDB
Single call latency 30-40 ms (spikes to 70) 2 ms
Multi call latency 150-200 ms (spikes to 600) 40-50 ms
Errors (note: these are query times exceeding 1 second, not necessarily database failures) Up to 150 a minute every few minutes timeouts per minute, with some higher spikes every few days few hundreds every few days

Below are the graphs showing the differences.

Latency and errors graphs showing both C* and Scylla getting requests without cache (1M-2M RPM):

Latency comparison of single requests

Latency comparison of single requests

Latency comparison of multi requests

Latency comparison of multi requests

Errors (timeouts for queries > 1 second)

Errors (timeouts for queries > 1 second)


There are very good signs that Scylla DB does make a difference in throughput but due to the network bottleneck, we could not verify it. We will update as soon as we have results on a faster network. Scylla guys are working on solution for slower networks too.

Hope to update soon.

ScyllaDB POC – live blogging #2


Hi again.

As we are now few weeks into this POC and we gave you first glimpse into it in the first post. We would like to continue and update on what are the latest developments.

However, before we come to the update. There are 2 things that we needed tell you about the setup that did not get into the first post just for length reasons.

The first thing to tell is the data structure we use as Doron explains it:

The Data

Our data is held using a single key as partition key, and additional 2 columns as clustering keys, in a manner of key/value structure:

  • A – partition key (number).
  • B – Clustering key 1 (text). This is the value type (i.e. “name”, “age” etc…).
  • C – Clustering key 2 (text).
  • D – The data (text). This is the value itself (i.e. “David”, “19”, etc…).

When storing the data we store either all data for a partition key, or partial data.

When reading we always read all data for the partition, meaning:

select * from cf_name where A=XXX;

When we need multiple partitions we just fire the above query in parallel and wait for the slowest result. We have come to understand that this approach is the fastest one.

As this meant to be the fastest, we need to understand that such reads latency is measured always by the slowest key to read. If you fire tens of reads into the cluster and If, by the chance, you bump into a slow response of one of the nodes (GC can be a good example for such case) your all read attempt is delayed.

This where we thought Scylla can help us improve the latency of the system.

The second thing we wanted to tell you about is what made this POC so easily done in Outbrain and this is our Dual DAO mechanism that Doron explains below.

Dual DAO implementation for Data Store migration/test

We have come to see that in many cases we need a dual DAO implementation to replace the regular DAO implementation.

The main use-case is data migration from one cluster to the other:

  1. Start dual writes to both clusters.
  2. Migrate the data – by streaming it into the cluster (for example).
  3. Start dual read – read from the new cluster and fallback to the old.
  4. Remove the old cluster once there are no fallbacks (if the migration was smooth, there should not be any).

The dual DAO we have written holds both instances of the old and new DAO, and also a phase – a number between 1 and 5. In addition, the dual DAO implements the same interface like the regular DAO, so it is pretty easy to inject it in and out when done.

The phases:

  1. Write to old cluster and read from old cluster only.
  2. Write to both cluster and read from old cluster only.
  3. Write to both cluster and read from new cluster, fallback to old cluster if missing.
  4. Write to new cluster and read from new cluster, fallback to old cluster if missing.
  5. Write to new cluster and read from new cluster only.

The idea here is to support gradual and smooth transition between the clusters. We have come to understand that in some cases transition done only by the dual DAO can be very slow (we can move to 5 only after no fallback is done), so we had 2 ways to accelerate it:

  1. Read-repair – when in phase 3 or 4, and we fallback to the old cluster – write the data we fetch to the new cluster. This approach usually fits to use cases of heavy reads.
  2. Stream the data in manually by Data guys – this has proven to be the more effective and quick way – and this is what we have done in this case.

Dual DAO for data store test:

The ScyllaDB POC introduces a slightly different use-case. We do not want to migrate to Scylla (at least not right now) – but add it to our system in order to test it.

In order to match those demands, we have added a new flavor to our dual DAO:

  1. The write to the new DAO is done in the background – logging its results in case of failures – but does not fail the write process. When working in dual DAO for migration we want to fail if either the new or old DAO fail. When working in test-mode we do not. At first we did not implement it like this – and we had a production issue when ScyllaDB upgrade (to a pre GA version, from 0.17 to 0.19)  caused the save queries to fail (on 9/3/16). Due to those issues we have changed the dual write to not fail upon the new DAO failure.
  2. The read from the new DAO is never used. On phases 3-5 we do read from the new DAO, and place a handler in the end to time the fetch and compare the results to the old DAO. However, the actual response gets back from the old DAO, when it is done. This is not bullet-proof but reduces the chances of production issues due to issues in the test cluster.

Mid April Update:

Last time we have reported was when ScyllaBD was loaded with partial and ongoing data while the C* cluster was loaded with the full historic data. Scylla was performing much better as we showed.

The next step we had to do was to load the full data into ScyllaDB cluster. This was done by Scylla streaming process that read the data from the C* cluster and wrote it to the ScyllaDB cluster. This process uncovered a bug with it that failed to extract the data from the newest version of C* that we’ve used.

Shlomi’s explanation is as follows:

Sstables upgraded from cassandra 2.0 to cassandra 2.1 can contain range tombstone with a different format from tombstones written in cassandra 2.1 – we had to add support for the additional format

The Scylla team figured it out and fixed it pretty quickly.More info about this issue can be found in the project GitHub.

After all the data was filled correctly into the ScyllaDB cluster, We’ve seen degrade in the ScyllaDB cluster which made it perform slightly worse than the C* (Surprise!!!!). A faulty configuration we’ve found in the C* “Speculative retries” mechanism and fixed it, actually made the C* latency about 50% better than the ScyllaDB. We need to remember we are measuring latency at its 99th percentile. In our case – as Doron mentioned above it’s while using multiple reads which is even harder.

ScyllaBD guys were as surprised as we are, They took a look into the system and found a bug with their cross DC read-repair, they fixed the issue and installed a new version.

Here is the description of the issue as explained by Tzach from Scylla: (reference to issue – https://github.com/scylladb/scylla/issues/1165)

After investigation, we found out the Scylla latency issue was caused by a subtle bug in the read-repair, causing unnecessary, synchronize, cross DC query.

When creating a table with read-repair chance (10% in this case) in Scylla or Cassandra, 10% of the reads send background queries from coordinator to ALL replications, including cross DC. Normally, the coordinator waits for responses only from a subset of the replicas, based on Consistency Level, Local Quorum in our case, before responding to the client.

However, when the first replica responses, does not match, coordinator will wait for ALL responses before sending response to the client. This is where Scylla bug was hidden.

Turnout, response digest was wrongly calculated on some cases, which cause the coordinator to believe local data is not in sync, and wait for remote DC nodes to response.

This issue is now fix and backport to Scylla 1.0.1

So, we upgraded to Scylla 1.0.1 and things look better. We can now say that C* and Scylla are in same range of latency but still C* is better. Or as Doron phrased it in this project google group:

Performance definitely improved starting 13/4 19:00 (IL time). It is clear.

I would say C* and Scylla are pretty much the same now… Scylla spikes a bit lower…

That did not satisfy the Scylla guys they went back to the lab and came back with the following resolution as described by Shlomi:

In order to test the root cause for the latency spikes we decided to try poll-mode:

Scylla supports two modes of operations – the default, event triggered mode and another extreme poll-mode. In the later, scylla constantly polls the event loop without no idle time at all. The cpu waits for work, always consuming 100% per core. Scylla was initially designed with this mode only and later switched to a more sane mode of operation with calls epoll_wait. The later requires complex synchronization before blocking on the OS syscall to prevent races between the cores going in/out sleep.

We recently discovered that the default mode suffers from increased latency and thus fixed it upstream, this fix did not propagate to 1.0.x branch yet and thus we recommend to use poll-mode here.

The journey didn’t stop here since another test execution in our lab revealed that hardware Hyper Threads (HT) may increase latencies during poll mode.

Yesterday, following our tests with and without HT, I have updated scylla configuration to use poll-mode and running only on 12 physical cores  (allowing the kernel to schedule the other processes on the HT).

Near future investigation will look into why poll-mode with HT is an issue (Avi suspects it’s L1 cache eviction). Once the default, event-based code will be back ported to the release, we’ll switch to it as well.

This is how the graphs look today:

This is how the graphs look today:

As you can see in the graphs above: (Upper graph is C*/ Lower graph is Scylla)

  • Data was loaded into Scylla in 4/1 (lower graph).
  • On 4/4 Doron did the fix in C* configuration and its latency improved dramatically.
  • On 4/13 Scylla 1.0.1 was installed and did an improvement.
  • On 4/20 Shlomi did the HyperThreading poll-mode config change.
  • The current status is that Scylla base is nicely on the 50ms spiking to above 90ms whereas C* is at 60-65 spiking to 120ms in some cases. It worth to remember that those measurements are 99th percentile taken from within Outbrain’s client service which is Java and by itself suffers from unstable performance.

There is a lot to learn from such POC. The guys from Scylla are super responsive and don’t hesitate to take learnings from every unexpected results.

Our next steps are:

  1. We still see some level of results inconsistencies between the 2 systems that we want to verify where they come from and fix.
  2. Move to throughput test by decreasing the number of machines in the Scylla cluster.

That’s all for now. More to come.

Next update is here.

We are testing ScyllaDB – live blogging #1

The background

Screen Shot 2016-03-15 at 12.42.47 AMIn the last month, we have started, in Outbrain, to test ScyllaDB. I will tell you in a minute what ScyllaDb is and how we came to test it but I think what is most important is that ScyllaDB is a new database at its early stages and still before its first GA (coming soon). It is not an easy decision to be among the firsts to try such a young project that not many have used before (up until now there are about 2 other production installations) but as they say, someone have to be the first one… Both ScyllaDB and Outbrain are very happy to openly share how the test goes, what are the hurdles what works and what not.

How it all began:

I know the guys from Scylla for quite some time, we have met through the first iteration of the company (Cloudius-systems) and we’ve met at the early stages of ScyllaDB too. Dor and Avi, the founders of ScyllaDB, wanted to consult if as heavy users of Cassandra, we will be happy for the solution they are going to write. I said, “Yes,  definitely”  and I remember saying, “If you will give me Cassandra functionality and operability at the speed and throughput of Redis, You got me.”

Time went by and about 6 months ago they came back and said they are ready to start integrations with live production environments.

This is the time to tell you what ScyllaDB is.

The easiest description is “Cassandra on steroids”. That’s right but in order to do that, the guys in Scylla basically had to write all Cassandra server from scratch, meaning:

  • Keep all Cassandra interface perfectly the same so client applications will not have to change.
  • Write it all over in C++, and by that overcome the issues that JVM brings with it, mostly no GC that was hurting the high percentiles of latency.
  • Write it all in Asynchronous programming model that enable the server to run in very high throughput.
  • Shard per core approach – on top of the cluster sharding, Scylla uses shard-per-core which allows it to run lockless and scale up with the number of cores
  • Scylla uses its own cache and does not rely on the operating system cache. It saves data copy and does not slow down due to page faults

I must say that was intriguing my mind as if you are looking at OpenSource NoSQL data systems that picked up, there is one camp of  C++, High performance but, yet,  simple functionality (memcached or redis) and the heavy functionality but JVM based camp (Spark, Hadoop, Cassandra). However if you can combine the good of both worlds – it sounds great.

Where does that meet Outbrain?

Outbrain is a heavy user of Cassandra. We have few hundreds of Cassandra machines running in 20 clusters over 3 datacenters. They store 1-2 terabytes of data each. Some of the clusters are being hit on user’s query time and unexpected latency is an issue. As data, traffic and complexity grew up with outbrain it became more and more complex to maintain the cassandra clusters and keep them up to reasonable performance. It always required more and more hardware to support the growth as well as the performance.

The promise of getting stable latency, 5-10x more throughput (much less machines)without the cost of re-writing our code made a lot of sense and we decide to give it a shot.

One thing was not yet in the product that we needed deeply was Cross DC clusters. The Cassandra feature of eventual consistency across different clusters in different Data Center is key to how Outbrain operates and it was very important for us. It took the guys from ScyllaDB a couple of months to finish that feature, test and verify all works and we were ready to go.

ScyllaDB team is located in Herzliya which is very close to our office in Netanya and they were very happy to come and start the test.

The team working on this test is:

Doron Friedland – Backend engineer at Outbrain’s App Services team.

Evgeny Rachlenko – from Outbrain’s Data Operations team.

Tzach Liyatan – ScyllaDB Product manager.

Shlomi Livne – ScyllaDB VP of R&D.

The first step was to allocate the right cluster and functionality we want to run the test on. After a short consideration we chose to run this comparison test on the cluster that holds all our Documents store. It holds all information about all active documents in Outbrain’s system. We are talking about few millions of documents where each one of them have hundreds of different features represented as Cassandra columns. This store is being updated all the time and being accessed in every user request (few million requests every minute). Cassandra started struggling with this load and we started applying many solutions and optimizations in order to keep the load. We also enlarged the cluster so we can keep it up.

One more thing that we did in order to overcome the Cassandra performance issues was to add a level of application cache that consumes few more machines

by itself.

One can say, that’s why you chose a scalable solution like Cassandra so you can grow it as you wish. But when the number of servers start to rise and have significant cost, you want to look at other solutions. This is where ScyllaDB came into play.

The next step was to install a cluster, similar in size to the production cluster.

Evgeny describes below the process of installing the cluster:

Well, the  installation impressed me in the two aspects.

Configuration part was pretty same to Cassandra with few changes in parameters.

Scylla simply ignoring GC, or HEAP_SIZE parameters  and use configuration as extension of cassandra.yaml file.

Our Cassandra’s clusters  running with many components integrated into outbrain ecosystem.  Shlomi with Tzach has defined properly  the most important graphs and alerts. Services such as consul, collectd, prometheus with graphana  also has been integrated as part of POC. Most integration test passed without my intervention except light changes in the Scylla chef’s cookbook.

Tzach is describing what it looked like from their side:

Scylla installation, done by Evgeny, was using a clone of Cassandra Chef recipes, with a few minor changes. Nodetool and cqlsh was used for sanity test of the new cluster.

As part of this process, Scylla metric was directed to OutBrain existing Prometheus/ Grafana monitoring system. Once traffic was directed to the system, the application and ScylladDB metrics was all in one dashboard, for easy comparison.

Doron is describing the application level steps of the test:

    1. Create dual DAO to work with ScyllaDB in parallel to our Cassandra main storage (see elaboration on the dual DAO implementation below).
    2. Start dual writes to both clusters (in production).
    3. Start dual read (in production) to read from ScyllaDB in addition to the Cassandra store (see the test-dual DAO elaboration below).
    4. Not done yet: migrate the entire data from Cassandra to ScyllaDB by streaming the data into the ScyllaDB cluster (similar to migration between Cassandra clusters).
    5. Not done yet: Measure the test-reads from ScyllaDB and compare both the latency and the data itself – to the data taken from Cassandra.
    6. Not done yet: In case the latency from ScyllaDB is better, try to reduce the number of nodes to test the throughput.

Performance metrics:

Here are some very initial measurement results:

You can clearly see below that ScyllaDB is performing much better and in a much more stable performance.

One current disclaimer here is that Scylla still does not have all the historic data and just using data of the last week.

It’s not visible from the graph but ScyllaDB is not loaded and thus spends most of the time idling, the more loaded it will become, latency will reduce (until a limit of course).

We need to wait and see the following weeks measurements. Follow our next posts.

Read latency (99 percentile) of single entry: *First week – data not migrated (2.3-7.3):





Read latency (99 percentile) of multi entries (* see comment below): *First week – data not migrated (2.3-7.3):





* The read of multiple partitions keys is done by firing single partitions key requests in parallel, and waiting for the slowest one. We have learned that this use-case extremes evert latency issues we have in the high percentiles.

That’s where we are now. The test is moving on and we will update with new findings as we progress.

In the next post-Doron will describe the Data model and our Dual DAO which is the way we run such tests. Shlomi and Tzach will describe the Data transfer and upgrade events we had while doing it.

Stay tuned.

Next update is here.

Monitoring a Wild Beast

by Marco Supino and Ori Lahav

Yeah — I know, monitoring is a “must have” tool for every web application/operation functionality. If you have clients or partners that are dependant on your system, you don’t want to hurt their business (or your business) and react in time to issues. At Outbrain, we acknowledge that it is a tech system we are running on and tech systems are bound to fail. All you need is to catch the failure soon enough, understand the reason, react and fix. On DevOps terminology, it is called TTD (time to detect) and TTR (time to recover).  To accomplish that, you need a good system that will tell the story and wake you up if something is wrong long before it affects the business.

This is the main reason why we invested a lot in a highly capable monitoring system. With it, we are doing Continuous Deployment and a superb monitoring system is an integral part of the Immune System that allows us to react fast to flaws in the continuous stream of system changes.

Note: Most of the stuff we have used to build it are open-source parts and projects that we gathered for our use.

A monitoring system is usually worthless if no one is looking at it. The common practice here is to have a NOC that is staffed 24/7. At Outbrain, we took another approach where instead of seating low-level engineers in front of the screen all the time and alerting high-level engineers when something happens, we built a system which is smart enough to alert the high-level engineers in the shift and tell them the story. We assume this alert can catch them wherever they are (at the playground with the kids, at the supermarket, etc…) and they can react to it if needed.

The only thing an Ops engineer in shift needs next to him is his smartphone and his notebook. Most of the time in order to understand the issue — he needs no more than a smartphone Gmail mailbox that looks like this:
Gmail mailbox

Gmail Labels are used to color alerts according to the alert type — making the best use of the short space available on the Android smartphone.

Nagios alerts are also sent via Jabber… to handle situations where Mail is down. The Nagios alert that comes with it includes the details about the alert and if it is a trend graph that is going over a threshold it also has the graph that shows the trend and tells the story.



And…. yes — we use Nagios, one of the oldest and most robust monitoring systems. Its biggest advantage is that it doesn’t know to do ANYTHING. That system just says “script for me anything you want me to monitor” and by that gives you full freedom to monitor everything you wish. We did great things with Nagios from monitoring the most basic vitals of the machines up to each and every JMX metric that our services expose and on into the bandwidth between our data centers.

One of the things that Nagios does not do well enough is supplying an aggregated view of the system state. For that, we have put a “Nagios dashboard” application that puts the state in front of the engineer looking at it.

Our dashboard is based on NagLite3 with some local modifications to differently handle some situations and our naming scheme.

The dashboard is fine but sometimes the engineer is not in front of a computer and has only his smartphone. How about chatting with the monitoring system and asking it “wassup?”
We have created a custom Jabber agent, based on the Net::Jabber::Bot Perl module, and the Nagios::Status module to gather information from the running Nagios instance.
The options we support are “wassup”,”ack”,”resched”,”dis/ena” etc.
Alerts sent from Nagios have a uniq ID added to the subject, for example:
Subject: mysql17.ladc1/Mysql Health Slave Check CRITICAL [ID#:367768]
The event number can be used in the Jabber interface to acknowledge/resched that particular EventID.
The great part of using Jabber is that it’s available on our smartphones (android based), so it’s easy to communicate with Nagios from everywhere (where Internet is available).

As I said above, working with Continuous Deployments raises a lot of challenges regarding faults and the ability to investigate them, identifying the root cause and fixing it. At Outbrain, every engineer can change the production environment at any time. Deployments are usually very small and fast and the new functionality introduced is very isolated and can be found in the SVN commit note. Regular process is for the developer to commit the code with the proper note and flags inside it. Then the TeamCity builds it and runs all tests. If tests went well there is a GLU feeder that catches the SVN message hook and starts the deployment process.

As part of the deployment it hits 2 APIs:
One for Yammer where this deployment is logged.


And the second one is to the Nagios where it is registered and from now on it will be shown as a vertical line on each of the Nagios graphs. So, in case there is a problem that will be visible in the Nagios graphs, we will be able to attribute it to the deployment in proximity.
The graphs are exposing 2 pieces of data: the revision number that can be searched in Yammer and the name of the engineer responsible for it so we can refer to him to ask questions. Note: this functionality was inspired by Etsy engineering — Track Every Release.
Etsy engineering


GLU Deployments also disable/enable notifications in Nagios for the host currently being deployed to, using NRDP — again, heavily modified to match our requirements, in order to avoid false alerts while nodes are restarted because of a version roll-out.

Another thing that sometimes help to analyze issues or at least distinguish between true and false alarm is to see what were the values at the same day last week (WOW -WeekOverWeek).
In some of our Nagios graphs (where it is relevant), there is a red line showing how this metric behaved at the same time, last week.

Nagios graphs


Scaling Nagios:
Our current Nagios implementation handles around 8.5k services and 500 hosts (physical/logical), and can grow much larger.

We run it across 3 DC’s around the US,  and our Nagios runs in a distributed mode. A Nagios node in each DC handles checks of the services in its “domain” (DNS domain in our case), and sends outputs to a central Nagios. The central Nagios is responsible for sending alerts and creating graphs. In case the “remote” Nagios nodes can’t contact the central Nagios, they will enable notifications and start sending alerts on their own, until the central comes back. Some cross-DC checks are employed, but we try not to use them on regular checks, only for sanity checks.
In case a “remote” Nagios is not sending alerts on time, the Central Nagios will start polling the services for that “remote” agent that’s not working using Nagios’s “Freshness checking” option.
This image from the Nagios Site shows a bit of the architecture, but because of limits in NSCA daemon, we used NRD which has some great features and works much better then NSCA.

Some more tricks to make Nagios faster:
1. Hold all the /etc of Nagios in RamDisk (measures needs to be taken to be able to restore it if the machine crashes).
2. Hold the Nagios status file (status.dat) in RamDisk.
NOTE: tmpfs can be swapped, so we chose the Linux RamDisk and increased the RamDisk size to 128mb to hold the storage we need.
3. Use NagiosEmbbededPerl where possible, and try to make it possible… the more the better. (assuming Perl is your favorite Nagios-plugin language).
4. Use multiple Nagios instances, on different machines, and as close to the monitored service as possible.
** The average service check latency here is <0.5sec on all nodes running Nagios.

Graphs are based on NagiosGraph, again, with many custom modifications, like the “Week over Week” line, and the Deployment vertical marks.
NagiosGraph is based on RRD, and multiple Perl cgi’s to make it run/view, we update ~8.5k graphs in every 5 minutes cycle.
RRD Files can be used in other applications —  for example, we use it to build a Network Weather-Map, based on PHP WeatherMap, to create a viewable image of the links between our DC’s and the network load and latency between them, graphs and WoW info.

RRD Files


These are just few examples of monitoring improvements we have put to the system to take it to the level of comfort that will ensure that we catch problems before our customers and partners and can react fast to solve them.
If you have more questions suggestions or comments – please do not hesitate to write a comment below.

Acknowledgements/Applications used:
Nagios – The Industry Standard In IT Infrastructure Monitoring
NagiosGraph – Data collection and graphing for Nagios
Nagios Notifications – based on Frank4dd.com scripts, with changes.
NRD – NSCA Replcament.
NRDP – Nagios-Remote-Data-Processor
PHP WeatherMap – Network WeatherMap
NagLite3 – Nagios status monitor for a NOC or operations room.
Yammer – The Enterprise Social Network
GLU – Deployment and monitoring automation platform – by LinkedIn
and many more great OSS projects…

LEGO Bricks – Our Data Center Architecture

Some of you might ask, “why is he telling us about data center architecture? Don’t Cloud Services solve this already?” and some of you that already know me and what my opinions are on the subject will not be surprised. Yes, I’m not a fan of the Cloud Services and that is another discussion, however, there are some advantages for using Cloud Services that giving them up by establishing a datacenter felt somehow wrong for us.

Here are 2 of them:

1. Grow As You Go – When you build a datacenter you take on commitments for space (racks or cages) and high profile network gear that are investments you have to pay for in advance or before you really need to use them. This is not an issue for a Cloud-based setup because as you grow you spin up more instances.

2. Disaster Recovery Headroom – With a datacenter-based setup, in order to properly handle disaster recovery you need to double your setup so you can always move all your traffic to the other datacenter in case of disaster, which means doubling the hardware you buy. In the Cloud, this is also a non-issue.

These 2 arguments are very much correct, however even taking those into consideration, our setup is much more efficient in cost than any Cloud offering. The logic behind it is what I want to share here.

Traditionally, when a company’s business grows, a single rack or maybe 2 are not sufficient and you have the need to allocate adjacent racks space in a co-located data center. This makes your recurring expenses grow since you actually pay for reserved space that you don’t really use. It’s a big waste of your $$$. Once we managed to set more than one location for our service we found out that it will be much cheaper to build multiple small data centers with a small space footprint than committing to a large space that we will not use most of the time. Adjacent space of at least 4 racks is much easier to find in most co-location facilities.  More than that, our co-location provider agreed to give us 2 active racks with first right of refusal for the adjacent 2 racks so we actually pay for those we use.

This architecture also simplified much of our network gear requirements. Assuming each “LEGO Brick” is small, it needs to handle only a portion of the traffic and not all of it. This does not require high profile network gear and very cheap Linux machines are sufficient for handling most of the network roles including load balancing, etc.

We continued this approach for choosing the intra-LEGO Brick switching gear. Here we decided to use Brocade stackable switching technology. In general, it means that you can put a switch per cabinet and wire all the machines to it. When you add another cabinet you simply connect them in a chain that looks and acts like a single switch. You can grow such a stack up to 8 switches. At Outbrain, we try to eliminate single points of failure, so we have 2 stacks and machines are connected to both of them. Again, the stacking technology gave us the ability to not pay for network gear before we actually need it.

But what about Disaster Recovery (DR) headroom? (We decided to implement more than one location for disaster recovery as soon as we started generating revenue for our partners.)  As I said, this is a valid argument. When we had 2 datacenters, 50% of our computing power was dedicated to DR and not used in normal time. This was not ideal and we needed to improve that. Actually, the LEGO bricks helped here once again. This week we opened our 3rd datacenter in Chicago. The math is simple, by adding another location our headroom dropped to only 33% which is a lot of $$$ savings when your business grows. When we add the 4th it will drop to 25%, etc.

I guess now you understand the logic and we can mention some fun info about the DC implementation itself:

  • Datacenters communicate via a dedicated link, powered by our co-location vendor.
  • We use a Global DNS service to balance traffic between the datacenters.
  • In our newer datacenters, the power billing is a pay-per-use — no flat fees which again enable us to not pay for power we don’t use. It also motivates us to power off unneeded hardware and save power costs while saving the planet 🙂
  • Power is 208V which is more efficient than the regular 110v.
  • All servers are connected to a KVM to enable remote access to BIOS config if needed — much easier to manage from Israel and in general.
  • We have a lot of Dell C6100s in our datacenters so each node there is also connected to an IPMI network in order to remotely restart each node without rebooting all 4 nodes in that chassis.
  • You can read more about assembling these C6100s in Nathan’s detailed post.

I guess your question is “what does it take to manage this in terms of labor?” That answer is… not too much.


The Outbrain Operations team is a group of 4 Ops engineers. Most of the time they are not doing much related to the physical infrastructure, but like other ops teams, most of the time they handle the regular tasks of configuring infrastructure softwares (we use all of them from open source like MySQL, Cassandra, Hadoop, Hive, ActiveMQ, etc…), monitoring, code and system deployment (we heavily use Chef) etc.

In general, Operations’ role in the company is to keep the serving fast, reliable and (very important) cost-efficient.  This is the main reason why we invest time, knowledge and innovation in architecting our datacenters wisely.

I guess one of the next posts will be about our new Chicago data center and the concept of the “Dataless Datacenter.”