How Well Do You Know Your Servers?

It’s a sunny day in Secaucus, New Jersey. The Outbrain Facilities team heads home on a high note. They’ve just racked, stacked, and inventoried over 100 new servers for the Delivery team. As they go to sleep later that night, the Delivery team begins their day in Netanya, Israel. They provision these servers for their new Kubernetes cluster, which the Developers are eager to migrate their services to.

Fast forward… microservices are migrated from bare metal to Docker containers running on our shiny new fleet of Kubernetes servers. The developers notice performance issues, but the behavior is not consistent, so they reach out to the engineers in Delivery. Delivery narrows it down to a few problematic servers that are running at ~50% performance. They all run the same services as the well-performing servers. Their CPU throttling configurations are correctly aligned. They all run the exact same Chef recipes. With all other options exhausted, they turn to the hardware:

As they dig deeper, they realize that the problematic hosts don’t stand out at all. All servers in the cluster are the same model, have the same specs, and look 100% healthy by every metric we collect. They are the latest and greatest cloud platform servers from Dell: the PowerEdge C6320. With not much left to go on, they finally find a single difference: the IPMI system event logs show that all of the problematic hosts are running on only one power supply (instead of two, for redundancy).

#ipmitool sel elist
8 | 01/11/2018 | 16:20:52 | Power Supply PSU 2 Status | Power Supply AC lost | Asserted

Down the Stack, We Go

Enter the Facilities team again: a 2-man team managing thousands of servers across 3 data centers. Of all the issues that could arise, these brand new servers were the least of our worries. It’s a not-so-sunny day in New Jersey when we wake up to an email listing a handful of servers running on one power supply. Luckily, the hardware was fine; they were all just cases of a loose power cable.

Redundancy is fundamental when building out any production environment. A server has 2 power supplies for these exact situations. If one power strip (PDU) goes down, if a power supply fails, or a power cable is knocked loose, a server will still be up and running. So from the Operating System’s point of view, running on one power supply should not matter. But lo and behold, after plugging in those loose cables, the performance immediately goes back to 100% on all of those servers.

We contact Dell Support, which assumes it’s the CPU Frequency Scaling Governor and asks us to change the power management profile to Performance. In a nutshell, the Operating System has control of CPU throttling, and they want us to grant that control to the BIOS to rule out OS issues (Dell does not support Ubuntu 16.04, which is what we currently use).

After making these changes, the issue persists:

The command used in every test is: stress-ng --all 0 --class cpu --timeout 5m --metrics-brief

Power Profile: PerfPerWatt(OS)
PSUs: 2
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [152332] cpu              48697    300.19    320.79      0.06       162.22       151.77

Power Profile: PerfPerWatt(OS)
PSUs: 1
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [160954] cpu              21954    300.49    315.72      0.28        73.06        69.47

Power Profile: Performance
PSUs: 2
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [3687] cpu              48182    300.18    319.02      0.09       160.51       150.99

Power Profile: Performance
PSUs: 1
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [11623] cpu              21372    300.45    306.56      0.14        71.13        69.68

The number to pay attention to in these tests is the final one on each line (bogo ops/s). When we unplug one power supply, the number of operations the CPU can handle drops dramatically. Dell involves a senior engineer, who confirms that this is actually expected behavior and references a Dell whitepaper about Power Capping on the C6320 servers. After briefly reviewing the paper, we assume we can just check the iDRAC web interface for power capping:

We see that no power cap policy is set, so things still aren’t adding up. It’s time to really dig into this whitepaper and (with some help from Dell Support) we manage to find a root cause and a solution. Although we received these servers with power capping disabled, Emergency Power Capping was enabled.

What is the Emergency Power Capping Policy? Glad you asked! When a C6320 chassis is running on one power supply, it throttles the CPU no matter what the power consumption is. This field is completely hidden in the iDRAC web interface, so how can you tell if the policy is enabled? Another great question! It is enabled if the 8th byte of the following ipmitool output is “01”:

#ipmitool raw 0x30 0xC5
00 00 00 02 00 00 00 01 00

*You can use ipmitool raw 0x30 0xC5 to get the current policy from any node within the chassis.


Let us save you time and your sanity by explaining what you are looking at and how you can automate your way out of this hidden setting.

That 8th byte means that the chassis should throttle the CPU when running on one power supply, via PROCHOT. The other options for this byte are to emergency throttle via NM (Node Manager), or to turn off emergency throttling. PROCHOT is short for “processor hot,” a processor technology that throttles the processor when certain conditions are met. More information on PROCHOT can be found in Intel’s whitepaper on page 7 (written for the Xeon 5500 series, but still relevant). NM is short for Intel’s Node Manager. More information on NM can be found here.
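If you want to audit this across a fleet, the check is easy to script. Here is a minimal Groovy sketch (we actually do the equivalent from a Chef recipe); it simply runs the raw command above and inspects the 8th byte:

// Illustrative sketch: detect whether emergency throttling (PROCHOT on PSU loss)
// is enabled by reading the 8th byte of the chassis power capping policy.
def output = ['ipmitool', 'raw', '0x30', '0xC5'].execute().text.trim()
def bytes = output.split(/\s+/)   // e.g. [00, 00, 00, 02, 00, 00, 00, 01, 00]

if (bytes.size() >= 8 && bytes[7] == '01') {
    println 'Emergency throttling via PROCHOT is ENABLED on this chassis'
} else {
    println "Emergency throttling byte is ${bytes.size() >= 8 ? bytes[7] : 'unknown'}"
}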

Our testing shows that the CPU does NOT throttle on one PSU when using NM, but we can’t find a clear answer on the logic behind NM, so we decide to avoid it completely and turn off emergency throttling. Instead, we set the Chassis Power Capping Value to 1300W. Each of the power supplies we use is rated at 1400W, so this limit covers the case where the system is running on one power supply: the chassis will still throttle before it draws more than a single supply can deliver, avoiding a power failure or overheating.

We use a Chef recipe to set our desired policy with this command:

ipmitool raw 0x30 0x17 0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00

The first part of the command sets the chassis power capping policy:
#ipmitool raw 0x30 0x17 

These 9 bytes, which are explained in the Dell whitepaper, roughly translate to: 
“Enable chassis power capping with a value of 1300W, and disable emergency throttling”. 
0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00

The 1st byte enables chassis power capping: 0x01
The 2nd and 3rd bytes indicate the power limit value (low byte first): 0x14 0x05
The 8th byte disables emergency throttling: 0x00

Here is a quick reference for other power limit values:
0x2c 0x01 = 300W
0xf4 0x01 = 500W
0x20 0x03 = 800W
0x14 0x05 = 1300W

*This command can be run from any blade within the chassis, and we tested it on Ubuntu 16.04 and CentOS 7.
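If you need a cap other than the values in the table above, note that the two power-limit bytes are just the wattage encoded low byte first. Here is a minimal Groovy sketch that builds the full raw command for an arbitrary cap; every other byte is kept exactly as in the command above (this mirrors what our Chef recipe renders, but it is only an illustration):

// Build the raw command for "enable chassis power capping at <watts>,
// emergency throttling disabled", using the byte layout described above.
def powerCapCommand(int watts) {
    def lo = String.format('0x%02x', watts & 0xFF)        // low byte first
    def hi = String.format('0x%02x', (watts >> 8) & 0xFF) // high byte second
    return "ipmitool raw 0x30 0x17 0x01 ${lo} ${hi} 0x02 0x02 0x00 0x00 0x00 0x00"
}

assert powerCapCommand(1300) == 'ipmitool raw 0x30 0x17 0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00'
println powerCapCommand(800)   // -> ... 0x20 0x03 ...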

Our policy only caps power at the chassis level. Dell Support strongly discourages enabling capping at both the chassis and sled/blade level, as this could lead to inconsistent results.

Let’s Improve Together

This obscure hardware issue sent ripples all the way up the stack, and thanks to the close communication between our various development and operations teams, we were able to find the root cause and solution. We hope this information saves other engineering teams time and effort in their future buildouts.

With thousands of servers to manage and a lean Facilities team, hardware automation and visibility are crucial. Have similar challenges? Have a hardware-related war story? We’d love to hear about it – leave a reply below!

X tips [x>5] for building a bulletproof deployment pipeline with Jenkins

Continuous delivery is a methodology where each commit can potentially get into production in a timely manner.

Jenkins Pipeline is one of the tools out there that automates the delivery process to make it short, robust, and without human intervention as much as possible.

We have recently done such an integration on our team at Outbrain, so here are some tips and advice from our humble experience.

X. You should have done this ages ago (so do it today)

Don’t wait till you have all the building blocks in place. Start with a partial pipeline and add all the automated steps you already have in place. It will give you the motivation to add more automation and improve the visibility of the process.

The Pipeline set of plugins in Jenkins is about a year old in its current form, so it is mature and well documented. Definitely ready to use.

X. Validate artifacts and source code consistency across pipeline

I read this tip in a TeamCity pipeline post, but it is relevant for Jenkins as well. Make sure that the same version of sources and artifacts is used across all stages. Otherwise, a commit might be pushed while the pipeline is executing, and you might end up deploying an untested version. One way to do this is shown in the sketch below.
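Here is a minimal Jenkinsfile sketch of the idea (stage names and the build command are placeholders, not our actual pipeline): record the commit SHA once at checkout, and make every later stage work against that exact SHA.

def gitSha

node {
    stage('Checkout') {
        checkout scm
        // Record the exact commit this pipeline run will use everywhere.
        gitSha = sh(returnStdout: true, script: 'git rev-parse HEAD').trim()
    }
    stage('Build') {
        // Even if new commits land on the branch meanwhile, we stay on the recorded SHA.
        sh "git checkout ${gitSha}"
        sh './build.sh'   // placeholder for the real build/package step
    }
}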

X. Use commit hook with message regexp

Well, if I try to generalise this tip, I would say: ask for as little human intervention as possible (when it is not required). A good place to start is a commit hook. It works like this: when a developer pushes code with a specific commit message (in our case #d2p, for “deploy to production”), the pipeline is automatically triggered.

Here is a code sample from Jenkinsfile (the pipeline configuration file):

gitCommitMessage = sh(returnStdout: true, script: 'git log -1 --pretty=%B').trim()
deployToProd = (gitCommitMessage =~ /#d2p/ || params.DEPLOY_TAG == "#d2p") //we also allow '#d2p' when triggering manually

X. Try the Blue Ocean view

The Blue Ocean set of plugins was in the release-candidate stage at the time of writing (it is now GA). It is stable enough and has a very good UI, especially for parallel stages, so I would recommend using it. In addition, it works side by side with the old UI.

All is green
When something goes wrong

X. Ask for user authorization on sensitive operations

If you are still not sure that your monitoring system is robust enough, start by automating the pipeline, and ask for developer authorization before the actual deploy to production.

Here is a code sample from Jenkinsfile:

timeout(time: 5, unit: 'HOURS') {
  input message: 'Deploy to production?', ok: 'Deploy!'
}

X. Integrate slack or other notifications

Slack is awesome and has a well-documented API as well. Sending notifications on pipeline triggering and progress helps communicate the work between team members. Right now we use it to send start, completed, and failure notifications. We plan to integrate the approval input above with a Slack bot so we can approve deployments directly from Slack.

A slack notification
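For reference, here is a minimal sketch using the Slack Notification plugin’s slackSend step (the channel and messages are placeholders, not our actual configuration):

try {
    slackSend channel: '#deployments', color: 'good', message: "Started: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
    // ... pipeline stages run here ...
    slackSend channel: '#deployments', color: 'good', message: "Completed: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
} catch (err) {
    slackSend channel: '#deployments', color: 'danger', message: "Failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
    throw err
}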

X. Make the pipeline fast (parallelize it)

Keeping the pipeline turnaround time short helps keep work efficient and fun. Set a target for the total turnaround time; ours is less than 10 minutes. One of the easiest ways to keep it fast is to run independent stages in parallel. For example, we run the deployment to a test machine in parallel to the integration tests, and the deployment to a canary machine in parallel to our black-box tests.

Here is a code sample from Jenkinsfile:

stage("Testing: phase a") {
    parallel 'JUnit': {
        stage("junit") {
            sh '...'
    }, 'Deploy to simulator': {
        stage("Deploy to simulator") {
            sh '...'
stage("Testing: phase b") {
    parallel 'Simulator tests': {
        stage("Simulator tests") {
            sh '...'
    }, 'Canary server': {
        stage("Deploy to canary") {
            sh '...'
        stage("Tests on canary") {
            sh '...'

X more tips in the great post below:

Enjoy Piping!

P.S. – The original post was published on my personal blog

I am going to give a talk at the Jenkins User Conference based on this blog post.

You are welcome to join for more details!

Code Retreat @ Outbrain

Some people say writing code is kind of an art.
I don’t think so.
Well, maybe it is true if you are writing an ASCII-art script or you are a Brainfuck programmer. I think that in most cases writing code is engineering. Writing a program that does something is like a car that someone engineered to take you from A to B. When a programmer writes code it should do something: data crunching, task automation or driving a car. Something. For me, art is a non-productive effort of some kind. Writing a program is not such a case. Or maybe it is more like a martial-art (or marshaling-art :-)) where you fight your code to do something.

So — what’s the problem?

Most of the programs I know that evolve over time need a quality which is not an art, but the ability to be maintainable. That is usually reflected in a high level of readability.

Readability is difficult.

First, when you write some piece of code, there is a mental mode you are in, and a mental model you have about the code and how it should be. When you or someone else reads the code, they don’t have that model in mind, and usually they read only a fragment of the entire code base.
In addition, there are various languages and styles of coding. When you read something that was written by someone else, with a different style or in a different language, it is like reading a novel someone wrote in a different dialect or language.
That is why, when you write code, you should be thoughtful toward the “future you” reading the code, by making the code more readable. I am not going to talk here about how to do it, design patterns or best practices for writing your code, but I would just say that practice and experience are important aspects of code readability and maintainability.
As a programmer, when I retrospect on what I did last week at work, I estimate that about 50% of the time or less was coding. Among the other things were writing this blog post, meetings (which I try to eliminate as much as possible) and all sorts of work and personal stuff that happens during the day. When writing code, the main goals are quality, answering the requirements, and doing both in a timely manner.
From a personal perspective, one of my main goals is improving my skill set as a programmer. Many times, I find that goal in conflict with the goals above, which are dictated by business needs.

Practice and more practice

There are a few techniques that address this by practicing on classroom tasks. Among them are TDD katas and code-retreat days. Their agenda mainly says: “let’s take a ‘classroom’ problem, and try to solve it over and over again, in various techniques, constraints, languages and methodologies, in order to improve our skill set and increase our set of tools, rather than answering business needs”.

Code Retreat @ Outbrain — What do we do there?

So, at Outbrain we are doing code-retreat sessions. Well, we call it a code retreat because we write code and it is a classroom task (and a buzzy name), but it is not exactly the religious Corey-Haines-full-Saturday Code Retreat. It’s an hour-and-a-half session, every two weeks, in which we practice writing code. Anyone who wants to code is invited – not only the experts – and the goals are: improve your skills, have fun, meet developers from other teams in Outbrain that you usually don’t work with (mixing with others) and learn new stuff.
We have been doing it for a couple of months now. Up until now, all sessions consisted of about fifteen minutes of presentation/introduction to the topic, and the rest was coding.
In all sessions, the task was Conway’s Game of Life. The topics that we covered were:

  • Cowboy programming — this was the first session we did. The Game of Life was presented and each coder could choose how to implement it however they wished. The main goal was an introduction to the Game of Life so that in the next sessions we could concentrate on the style itself. I believe an essential part of improving the skills is the fact that we solve the same problem repeatedly.
  • The next session was about Test-Driven Development. We watched Uncle Bob’s short example of TDD, and had a fertile discussion while coding about some of the principles, such as: don’t write code if you don’t have a failing test.

After that, we did a couple of pair-programming sessions. In those sessions, one of the challenges was matching pairs. Sometimes we do it by lottery and sometimes people group together by their own choice, but it is also dictated by the programming languages that the developers choose to use: Java, Kotlin, Python or JavaScript. We plan to do language introduction sessions in the future.
In addition, we talked about other styles we might practice in the future: mob programming and pair switches.
These days we are having functional programming sessions; the first one was with the constraint of “all-immutable” (no loops, no mutable variables, etc.) and it will be followed by more advanced constructs of functional programming.
All in all I think we are having a lot of fun, a lot of retreat and a little coding. Among the main challenges are keeping people on board as they have a lot of other tasks to do, keeping it interesting and setting the right format (pairs/single/language). The sessions are internal for Outbrain employees right now, but tweet me (@ohadshai) in case you are around and would like to join.
And we also have cool stickers:
P.S. — all the materials and examples are here.

The original post was published on my personal blog.

From complex monolith to scalable workflow

One of the core functionalities in Outbrain’s solution is our crawling system.

The crawler fetches web pages (e.g. articles) and indexes them in our database.

The crawling process can be divided into several high-level steps:

  1. Fetch – download the page
  2. Resolve – identify the page context (e.g. what is the domain?)
  3. Extract – try to extract features out of the HTML – like title, image, description, content etc.
  4. NLP – run NLP algorithms using the extracted features, in order to classify and categorize the page.
  5. Save – store to the DB.

The old implementation

The crawler module was one of the first modules that the first Outbrainers had to implement, back in Outbrain’s small-start-up days 6-7 years ago.

In January 2015 we decided that it was time to sunset the old crawlers and rewrite everything from scratch. The main motivations for this decision were:

  1. The crawlers were implemented as a long monolith, without clear steps (OOP as opposed to FP).
  2. The crawlers were implemented as a library that was used in many different services, according to the use case. Each change forced us to build and check all services.
  3. The technologies were no longer up to date (Jetty vs. Netty, sync vs. async development, etc.).

The new implementation

When designing the new architecture of our crawlers, we tried to stick to the following ideas:

  1. Simple step-by-step (workflow) architecture, simplifying the flow as much as possible.
  2. Split complex logic to micro-services for easier scale and debug.
  3. Use async flows with queues when possible, to control bursts better.
  4. Use newer technologies like Netty and Kafka.

Our first decision was to split the main flow into 3 services. Since the crawler flow, like many others, is basically an ETL (Extract, Transform & Load), we split it along those lines:

  1. Extract – fetch the HTML and resolve the domain.
  2. Transform – take features out of the HTML (title, images, content…) + run NLP algorithms.
  3. Load – save to the DB.

The implementation of those services is based on “workflow” ideas. We created an interface for implementing a step, and each service contains several steps, each step doing a single, simple calculation. For example, some of the steps in the “Transform” service are:

  • TitleExtraction
  • DescriptionExtraction
  • ImageExtraction
  • ContentExtraction
  • Categorization

In addition, we implemented a class called Router, which is injected with all the steps it needs to run and is in charge of running them one after the other, reporting errors and skipping unnecessary steps (for example, there is no need to run categorization when content extraction has failed).

Furthermore, any logic that was a bit complex was extracted out of those services into a dedicated micro-service. For example, the fetch part (downloading the page from the web) was extracted to a different micro-service. This let us encapsulate fallback logic (between different HTTP clients) and some other related logic we had outside of the main flow. It is also very helpful when we want to debug – we just make an API call to that service to get the same result the main flow gets.

We modeled each piece of data we extracted out of the page as a feature, so each page would eventually be translated into a list of features:

  • URL
  • Title
  • Description
  • Image
  • Author
  • Publish Date
  • Categories

The data flow in those services was very simple. Each step received all the features that were created before it ran, and added (if needed) one or more features to the output. That way the feature list (starting with only the URL) got “inflated” going over all the steps, reaching the “Load” part with all the features we need to save.
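To make the pattern concrete, here is a minimal Groovy sketch of the Step/Router idea. It is illustrative only: the class and step names are invented, and the real services also handle error reporting, metrics, step-skipping rules and much more.

// Each Step receives the features gathered so far and may add new ones.
interface Step {
    String name()
    Map<String, Object> run(Map<String, Object> features)
}

class TitleExtraction implements Step {
    String name() { 'TitleExtraction' }
    Map<String, Object> run(Map<String, Object> features) {
        def m = (features['html'] as String) =~ /<title>(.*?)<\/title>/
        return m.find() ? [title: m.group(1)] : [:]
    }
}

// The Router is injected with the steps it needs to run and runs them in order,
// accumulating features and reporting (here: just printing) failures.
class Router {
    List<Step> steps

    Map<String, Object> crawl(Map<String, Object> features) {
        for (step in steps) {
            try {
                features = features + step.run(features)
            } catch (Exception e) {
                println "Step ${step.name()} failed: ${e.message}"
            }
        }
        return features
    }
}

// Usage: the feature map starts small and gets "inflated" step by step.
def router = new Router(steps: [new TitleExtraction()])
println router.crawl([url: 'http://example.com', html: '<title>Example</title>'])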




The migration

One of the most painful parts of such rewrites is the migration. Since this is very important core functionality in Outbrain, we could not just change it and cross our fingers that everything would be OK. In addition, it took several months to build this new flow, and we wanted to test as we went – in production – and not wait until we were done.

The main concept of the migration was to create the new flow side by side with the old flow, having them both run at the same time in production, allowing us to test the new flow without harming production.

The main steps of the migration were:

  1. Create the new services and start implementing some of the functionality. Do not save anything in the end.
  2. Start calls from the old flow to the new one, asynchronously, passing all the features that the old flow calculated.
  3. Each step in the new flow that we implement can compare its results to the old-flow results, and report when the results are different.
  4. Implement the “save” part – but do it only for a small part of the pages – control it by a setting.
  5. Evaluate the new flow using the comparison between the old-flow and new-flow results.
  6. Gradually enable the new-flow for more and more pages – monitoring the effect in production.
  7. Once feeling comfortable enough, remove the old-flow and run everything only in the new-flow.

The approach can be described as “TDD” in production. We created a skeleton for the new flow and started streaming the crawls into it, while it actually did almost nothing. We then started writing the functionality, each piece tested in production and compared to the old flow. Once all the steps were done and tested, we replaced the old flow with the new one.

Where we are now

As of December 20th 2016 we are running only the new flow for 100% of the traffic.

The changes we already see:

  1. Throughput/Scale: we have increased the throughput (#crawls per minute) from 2K to ~15K.
  2. Simplicity: time to add new features or solve bugs decreased dramatically. There is no good KPI here but most bugs are solved in 1-2 hours, including production integration.
  3. Fewer production issues: it is easier for QA to understand and even debug the flow (using calls to the micro-services), so some issues are eliminated before even getting to developers.
  4. Burst handling: thanks to the queue architecture we endure bursts much better. It also allows simple recovery after one of the services has been down (for maintenance, for example).
  5. Better auditing: thanks to the workflow architecture, it was very easy to add audit messages using our ELK infrastructure (Elasticsearch, Logstash, Kibana). The crawling flow today reports the outcome (and issues) of every step it performs, allowing us, QA, and even the field teams to understand a crawl in detail without needing a developer.

Automating your workflow

During development, there are many occasions where we have to do things that are not directly related to the feature we are working on, or things that are repetitive and recurring.
In the time span of a feature development this can often take as much time to do as the actual development.

For instance, updating your local dev micro-services environment before testing your code. This task on its own, which usually includes updating your local repo version, building and starting several services, and often debugging and fixing issues caused by others, can take hours, many times just to test a simple procedure.

We are developers; we spend every day automating and improving other people’s workflows, yet we often spend so many hours doing the same time-consuming tasks over and over again.
So why not build the tools we need to automate our own workflows?

In our team we decided to build a few tools to help out with some extra irritating tasks we were constantly complaining about to each other.

The first one was simple: creating a slush sub-generator. For those of you who don’t know, slush is a scaffolding tool, like Yeoman but for gulp. We used this to create our Angular components.
Each time we needed to make a new component we had to create a new folder with three files.


Each file of course has its own internal structure of boilerplate code, and each component had to be registered in the app module and the main LESS file.

This was obviously extremely annoying to redo each time, so we automated it. Now each time you run “ob-genie” from the terminal, you are asked the name of your component and what module to register it with, and the rest happens on its own. We did this for services and directives too.

Other than saving a lot of time and frustration, this had an interesting side effect – people on the team were creating more components than before! This was good because it resulted in better separation of code and better readability. It seems that many times the developers were simply too lazy to create a new component and just chucked it all in together. By the way, Angular CLI has added a similar capability; I guess great minds think alike.

Another case we took on in our team was to rid ourselves of the painstaking task of setting up the local environment. This I must say was a real pain point. Updating the repo, building and running the services we needed each time could take hours, assuming everything went well.
There have been times where I spent days on this just to test the simplest of procedures.
Often I admit, I simply pushed my code to a test environment and debugged it there.
So we decided to build a proxy server to channel all local requests to the test environment.

For this we used node-proxy, a very easy to configure proxy. However, this was still not an easy task, since each company has very specific configuration issues we had to work with.
One thing that was missing was proper routing capabilities. Since you want some requests to go local and some remote, we added a routing check before each request.



We passed as an option the routing table with a regex for each path, making it easy to configure which requests to proxy out, and which in.



Another hurdle was working with HTTPS, since our remote environments work over HTTPS.
In order to handle this we needed to create an SSL certificate for our proxy and set the requestCert parameter in our proxy server to false, so that it doesn’t get validated.

The end configuration should look something like this.



With this you should be able to run locally and route all needed calls to the test environment when working on localhost:2109.

So to conclude, be lazy, make your work easier, and use the skills you have to automate your workflows as much as possible.

How to take innovation into production

Outbrain’s Hackathon

The Outbrain Hackathon, which is held twice a year, is a 24-hour event in which employees and friends are invited to build and present an original product or innovation.

The Hackathon is a mini festival held at all of Outbrain’s offices around the globe: the offices are open for 24 hours and meals and beers are served all day long.

The winning team is rewarded with a worthy present and the opportunity to turn the idea into a working feature/product.

A few weeks before the event, people start to raise ideas and team up.

At the beginning of the event, a representative from each team has 5 minutes to present his/her idea in front of the whole company.

At the end of the event, each representative presents a demo of the software his/her team developed.

Immediately afterwards, a vote is conducted and the winners are declared.

In this post I will share my experience from the Hackathon which my team and I won.

About Our hack: Let’s do some innovation

One of our legacy services is called the “Editorial Reviewer”.

It is a user interface for approving/rejecting newly created promoted content, based on Outbrain’s content guidelines.

This was an old service from Outbrain’s early days. It was slow and was using old technologies and frameworks. We decided to rewrite it with a fresh breeze of technologies and make it function and look awesome.

Let’s get to work…

Prior to the Hackathon we did some research about what we could remove or improve in the current service, and whether there were new features or user requests we could add during the makeover.

We then divided the work among the team members and chose the technologies and frameworks based on our needs and desires.

One of our main goals was to improve the performance of the old tool.

Switching from a multi-page web application architecture to a single-page web application made a real difference, but it wasn’t enough.

The real challenge was to speed up the database access calls that the service makes.

We analyzed the current queries and found out they could be dramatically improved.

After a few hours, we already had a working demo. It looked a bit childish but it was already performing much better!!

We decided to get some advice from our masters and took it up with our UX designer who came up with a really cool sketch which we were excited to implement.

After a long night of hacking and tons of coffee we finally had an impressive working demo that we could present to the company.

The feedback we got was awesome! The teams at Outbrain that were using the old tool were super excited and couldn’t wait to start using our new hack.

As part of the prize, we got the time to develop it into a full-blown product.

2 months later, we got the chance to invest more time in our idea.

We added more tests, monitors and dashboards, did some fine-tuning and, at the end, came up with a really cool and sexy single-page application that was much faster, more comfortable and more reliable than the old tool.


The atmosphere was great. All participants worked around the clock and did their best to kick ass!

The challenge of working on a project that we chose and the fact we were striving to make it happen regardless of the tight time frame was amazing and so was the final outcome.

Is Cassandra really visible? Meet Cassibility…

You love Cassandra, but do you really know what’s going on inside your clusters?

This blog post describes how we managed to shed some light on our Cassandra clusters, add visibility and share it with the open source community.

Outbrain has been using Cassandra for several years now. Like all companies, we started small, but over time our usage grew, complexity increased, and we ended up having over 15 Cassandra clusters with hundreds of nodes, spread across multiple data centers.

We are working in the popular micro services model, where each group has its own Cassandra cluster to ensure resources and problems are isolated. But one thing remains common – we need to have good visibility in order to understand what is going on.

We already use Prometheus for metrics collection, Grafana for visualising its data and PagerDuty as our alerting system. However, we still needed detailed enough visibility into the Cassandra internals to ensure we could react to any issues encountered before they became a problem, and to make appropriate and informed performance tunings. I have to admit that when you don’t encounter a lot of problems, you tend to believe that what you have is actually sufficient, but when you suddenly have some nasty production issues, and we had our fair share, it becomes very challenging to debug them efficiently, in real time, sometimes in the middle of the night.

Let’s say, as a simple example, that you realized that the application is behaving slower because the latency in Cassandra increased. You would like to understand what happened, and you start thinking that it could be due to a variety of causes – maybe it’s a system issue, like a hardware problem or a long GC. Maybe it’s an applicative issue, like an increase in the number of requests due to a new feature or an application bug, and if so you would like to point the developer to the specific scenario that caused it; it would help if you could tell him that it is happening in a specific keyspace or column family. If you’re also using row cache, for example, you would wonder whether the application is not using the cache well enough – for example, the new feature is using a new table which is not in the cache, so the hit rate will be low. And maybe it’s not related to any of the above and is actually happening due to a repair or read-repair process, or a massive amount of compactions that accumulated. It would be great if you could see all of this in just a few dashboards, where digging into each of these speculations could be done in just a few clicks, right? Well, that’s what Cassibility gives you.

Take a look at the following screenshots, and see how you can get an overview of the situation and pinpoint the latency issue to a change in the number of requests or connections, then quickly move to a system dashboard to isolate the loaded resource:

* Please note, the dashboards don’t correspond to the specific problem described, this is just an example of the different graphs



Then, if you’d like to see whether it’s related to specific column families, to the cache or to repairs, there are dedicated dashboards for those as well.




Here is the story of how we created Cassibility.

We decided to invest time in creating better and deeper visibility, and had some interesting iterations in this project.

At first, we tried to look for an open-source package that we could use, but as surprising as it may be, even with the wide usage of Cassandra around the world, we couldn’t find one that was sufficient and detailed enough for our requirements. So we started thinking how to do it ourselves.

Iteration I

We began to dig into what Cassandra can show us. We found out that Cassandra itself exposes quite a lot of metrics – it can reach tens of thousands of metrics per node – and they can be captured easily via JMX. Since we were already using the Prometheus JMX exporter in our Prometheus implementation, it seemed like the right choice and easy enough to accomplish.

Initially we thought we should just write a script that exposes all of our metrics and automatically creates JSON files that represent each metric in a graph. We exposed additional dimensions for each metric in order to specify the name of the cluster, the data center and some other information that we could use to monitor our Cassandra nodes more accurately. We also thought of automatically adding Grafana templates to all the graphs, from which one could choose and filter which cluster to see, which datacenter, which keyspace or column family/table, and even how to see the result (as a sum, average, etc.).

This sounded very good in theory, but after thinking about it a bit more, such an abstraction was very hard to create. For instance, there are some metrics that are counters (e.g. number of requests) and some that are gauges (e.g. a latency percentile). This means that with counters you may want to calculate the rate on top of the metric itself, like when you take the number of requests and use it to calculate a throughput. With a gauge you don’t need to do anything on top of the metric.

Another example is how you would like to see the results when looking at the whole cluster. There are some metrics, which we would like to see in the node resolution and some in the datacenter or cluster resolution. If we take the throughput, it will be interesting to see what is the overall load on the cluster, so you can sum up the throughput of all nodes to see that. The same calculation is interesting at the keyspace or column family level. But if you look at latency, and you look at a specific percentile, then summing, averaging or finding maximum across all nodes actually has no meaning. Think about what it means if you take the number that represents the request latency which 99% of the requests on a specific node are lower than, and then do the maximum over all nodes in the cluster. You don’t really get the 99’th percentile of latency over the whole cluster, you get a lot of points, each representing the value of the node with the highest 99’th percentile latency in every moment. There is not much you can do with this information.

There are a lot of other examples of this problem with other metrics, but I will skip them as they require a more in-depth explanation.

The next issue was how to arrange the dashboards. This is also something that is hard to do automatically. We thought to just take the structure of the MBeans and arrange dashboards accordingly, but this is also not so good. The best example is, of course, that anyone would like to see an overview dashboard that contains different pieces from different MBeans, or a view of the load on your system resources, but there are many other examples.

Iteration II

We realized that we needed to better understand every metric in order to create a clear dashboard suite that is structured in a way that is intuitive to use while debugging problems.

When reviewing the various sources of documentation on the many metrics, we found that although there was some documentation out there, it was often basic and incomplete – typically lacking important detail such as the units in which the metric is calculated, or written in a way that doesn’t explain much on top of the metric name itself. For example, there is an MBean called ClientRequest, which includes different metrics on the external requests sent to Cassandra’s coordinator nodes. It contains metrics about latency and throughput. On the Cassandra wiki page the description is as follows:

  Latency: Latency statistics.

TotalLatency: Total latency in microseconds

That doesn’t say much. Which statistics exactly? What does the total mean in comparison to just latency? The throughput, by the way, is actually an attribute called counter within the Latency MBean of a certain scope (Read, Write, etc.), but there are no details about this in the documentation and it’s not that intuitive to understand. I’m not saying you can’t get to it with some digging and common sense, but it certainly takes time when you’re starting.
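To give a feeling of what this looks like in practice, here is a minimal Groovy sketch that reads a couple of ClientRequest read-latency attributes directly over JMX (the host is a placeholder, 7199 is Cassandra’s default JMX port, and attribute names can vary slightly between Cassandra versions):

import javax.management.ObjectName
import javax.management.remote.JMXConnectorFactory
import javax.management.remote.JMXServiceURL

def url = new JMXServiceURL('service:jmx:rmi:///jndi/rmi://cassandra-node:7199/jmxrmi')
def connector = JMXConnectorFactory.connect(url)
try {
    def mbeans = connector.getMBeanServerConnection()
    def readLatency = new ObjectName('org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency')

    // 'Count' is an ever-increasing counter (you calculate a rate on top of it),
    // while the percentile is a gauge-like value you read as-is.
    println "Read requests so far:  ${mbeans.getAttribute(readLatency, 'Count')}"
    println "Read latency p99 (us): ${mbeans.getAttribute(readLatency, '99thPercentile')}"
} finally {
    connector.close()
}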

Since we couldn’t find one place with a good, full set of documentation, we started digging ourselves, comparing values to see if they made sense, and worked with a consultant named Johnny Miller, who has worked a lot with Cassandra and is very familiar with its internals and metrics.

We improved our overall understanding at the same time as building and structuring the dashboards and graphs.

Before we actually started, we figured out two things:

  1. What we are doing must be something that many other companies working with Cassandra need, so our project just might as well be an open-source one, and help others too.
  2. There were a lot of different sections inside Cassandra, from overview to cache, entropy, keyspace/column family granularities and more, each of which we may want to look at separately in case we get some clues that something may be going on there. So each such section could actually be represented as a dashboard and could be worked on in parallel.

We dedicated the first day to focus on classifying the features into logical groupings with a common theme and deciding what information was required in each one.

Once we had defined that, we then started to think about the best and fastest way to implement the project and decided to have a 1 day Hackathon in Outbrain. Many people from our Operations and R&D teams joined this effort, and since we could parallelize the work to at most 10 people, that’s the number of people who participated in the end.

This day focused both on creating the dashboards and on finding solutions for all the places where we used Outbrain-specific tools to gather information (for example, we use Consul for service discovery and are able to pull information from it). We ended the day having produced 10 dashboards, with some documentation, and we were extremely happy with the result.

Iteration III

To be sure that we were actually releasing something usable, intuitive and clear, we wanted to review the dashboards, documentation and installation process. During this process, as most of us engineers know, we found out that the remaining 20% would take quite a bit of time to complete.

Since people with different levels of Cassandra knowledge participated in the Hackathon, some of the graphs were not completely accurate. Additional work was therefore needed to work out exactly which graphs should go together, what level of detail is actually helpful to look at while debugging, and how the graphs look when there are a lot of nodes or column families/tables, and to check various other edge cases. We spent several hours a week over the next few weeks, on different days, to finalize it.

We are already using Cassibility in our own production environment, and it has already helped us to expose anomalies, debug problems quickly and optimize performance.

I think that there is a big difference between having some visibility and some graphs and having a full, well organized and understandable list of dashboards that gives you the right views at the right granularity, with clear documentation. The latter is what will really save you time and effort and even help people that are relatively new to Cassandra to understand it better.

I invite you to take a look, download, and easily start using Cassibility. We will be happy to hear your feedback!

UPDATE #2: Outbrain Security Breach

Earlier today, Outbrain was the victim of a hacking attack by the Syrian Electronic Army. Below is a description of how the attack unfolded to help others protect against similar attempts. Updates will continue to be posted to this blog.

On the evening of August 14th, a phishing email was sent to all employees at Outbrain purporting to be from Outbrain’s CEO. It led to a page asking Outbrain employees to input their credentials to see the information. Once an employee had revealed their information, the hackers were able to infiltrate our email systems and identify other credentials for accessing some of our internal systems.

At 10:23am EST the SEA took responsibility for the hack, changing a setting through Outbrain’s admin console to label Outbrain recommendations as “Hacked by SEA.”

At 10:34am Outbrain internal staff became aware of the breach.

By 10:40am Outbrain network operations began investigating and decided to shut down all serving systems, degrade gracefully and block all external access to the system.

By 11:03am Outbrain finished turning off its service from all sites where we operate.

We are continuing to review all systems before re-initiating service.

UPDATE #1: Outbrain Security Breach

We are aware that Outbrain was hacked earlier today and we took down service as soon as it was apparent.

The breach now seems to be secured and the hackers blocked out, but we are keeping the service down for a little longer until we can be sure it’s safe to turn it back on securely. Please stay tuned here or to our Twitter feed for updates.

Hurricane Sandy – Outbrain Service updates

Hi all!

As Hurricane Sandy is about to hit the east coast of the US, and as Outbrain’s main datacenter is located in downtown Manhattan, we are taking measures to cause as little service interruption as possible for our partners and customers. Outbrain normally serves from 3 data centers, and in case of the loss of the NY data center, we will supply the service from one of the other data centers. On this page, below, we will update on any service interruption and ETAs for problem-solving. We assume all will go well and we will not have to update but… just in case 🙂

[UPDATE – Nov 3rd 3:45 pm EST] – At this time utility power is back at all our datacenters and the HQ office. It is now time to restore the service from NY and get the office back to work. This will take some time, but systems will gradually be put back up over the next week or so. There should be no effect on users, publishers or clients.

Our HQ will also start working gradually depending on the availability of public transportation.

We are now closing this reporting post – if you see any issues, please report them to us or to your rep.

I hope the storm of the century will be the last one for the next century (at least).

[UPDATE – Nov 1st 9:30 am EST] – Our HQ, located on 13th between 5th and 6th in downtown New York City is still without power and therefore closed. Thankfully, our NY-based team is safe and in dry locations, and will continue to try and work as best they can. We highly appreciate the concern and best wishes we received from our partners and clients across the globe; thank you!

We are doing our best to continue to provide the best in class service, one we hope you’ve come to expect from us. As an update, our datacenter in NY is still without power and we expect it to be down for a few more days. We will continue to serve from our other datacenters located in Chicago and Los Angeles. To reiterate, our service did not go down, and we are currently still serving across our client’s sites. As of this morning, we recovered and updated all our reporting capabilities, so we should be back to 100%.

If you are experiencing any difficulties or seeing anything different, please reach out to your respective contacts. We’ll also continue to operate under emergency mode until Monday; you can reach us 24/7 (am = Account Management).

[UPDATE – Oct 31st 6:46 am EST] – Serving still holds strong from our LA and Chicago data centers and we are not aware of any disruption to our service. We are working hard to recover our dashboard reporting capabilities, but it will probably take a couple more days before we’re able to get back to normal mode. Sorry for any inconvenience caused by this. Send us a note if you have any request, and one of us from around the world will respond as soon as possible.

[UPDATE – 6:51 pm EST]  – Again, not much to update – All is stable with both LA and Chicago datacenters. It’s the end of the day here in Israel and we are trying to get some rest. Our teammates in the US are keeping an eye on the system and will alert us if there is anything wrong. Good night.

[UPDATE – 3:35 am EST] – Actually, not much to update about the service. All is pretty much stable; we are safely serving from LA and Chicago. Most back-end services are running in the LA Datacenter, and our tech teams in Israel and NY are monitoring and handling issues as they arise. Our Datacenter vendors in NY are working with FDNY to pump the water from the flooded generator room, so it will take a while to recover this datacenter 🙂

[UPDATE – 10:50 am EST] – The clients dashboard is back up.

[UPDATE – 10 am EST] – The client’s dashboard on our site is periodically down – we are handling the issues there and will update soon.

[UPDATE – 5 am EST] Our NY Datacenter went down. Our service is fully operational and we are serving through our Chicago and LA Datacenters. If you’re accessing your Outbrain dashboard you may experience some delays in data freshness. We are working to resolve this issue and will continue to update.

[UPDATE – 2am EST] – Our NY Datacenter went completely off – we are fully serving from our Chicago and LA Datacenters. External reports on our site are still down, but we are working to fail over all services to run from the LA Datacenter – we will follow with updates.

[Update – 12:50 am EST] – Power just went out completely in our NY Datacenter and the provider has evacuated the facility – we are taking measures to move all functionality to other datacenters.

[UPDATE – 9 pm EST] – Commercial power went down in our NY Datacenter. The provider failed over to the generator and we continue to serve smoothly from this Datacenter. We continue to monitor the service closely and are ready to take action if needed.