July 9, 2020

Elazar Gur

Keren Ofek Granov

Faster Than a Missile: Intercepting Content Fraud

On May 21, 2020, thousands of Israeli websites were hacked by a group called “Hackers Of Savior”, probably including hackers from Turkey, The Gaza Strip, and North Africa. Not the first – and surely not the last – cyber-attack of its kind, the hackers targeted websites stored on uPress, an Israeli web hosting company.

The websites’ content was replaced by a page with a statement against Israel – “The countdown to Israel’s destruction began a long time ago” – and a disturbing video clip. The attack was timed to coincide with the Israeli national holiday “Jerusalem Day”, which celebrates the reunification of the city after the Six-Day War. It also coincided with the Iranian “Quds Day”, which was probably an additional motivation for the attack.

*The attacker’s content as shown on the hacked websites*

Beyond the visual corruption of the websites, the attackers also tried to delete all the websites’ data. Besides, the hackers prompted the browser to ask the viewer’s permission to take their photo with their computer’s webcam.

This attack was not your ordinary run-of-the-mill fraud event. It did not revolve around the usual advertiser-sourced content fraud. Rather, the source of the problem was an outside party and Outbrain was not even the target. The attack harmed not just the website publishers and end-users, but advertisers too.

Before diving in to see how Outbrain coped on the battleground during this event, let’s start at the beginning. What is “content fraud”, why is it such a challenge, and how does Outbrain protect its network and partners every day?

Content Fraud – The Major Challenge for Content Networks

Content fraud is an industry-wide problem that comes in many forms, including phishing, redirects, cloaking, malvertising, conspiracy-mongering and fake news. The internet is subsumed by an endless amount of fake content, such as fake quotes, doctored images, and deepfake video, making it increasingly difficult to navigate towards an authentic online experience. As content fraud is always evolving, it becomes much harder to detect and prevent over time.

As a discovery company, part of Outbrain’s lighthouse is to provide qualified content to its readers. While the overwhelming majority of content provided by advertisers is qualified, interesting and safe, Outbrain invests significant resources to ensure that no malicious or deceitful content makes its way onto our advertising platform.

When content fraud is detected, it comes from a specific and identifiable source that can be easily blocked and prevented from accessing Outbrain’s network. Generally, we use multiple technologies and people-powered techniques to verify that advertisers are legitimate and promoting qualified content. We look at a variety of properties, the identity of the advertiser, and the content itself. Based on the technologies Outbrain usually uses, the ability to respond is broad and fast.

In the recent event, the challenge for Outbrain was different.

Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

*Scott Webb (https://unsplash.com/photos/yekGLpc3vro*)

The cyber event of May 2020 presented our anti-fraud specialists with a different and unique set of obstacles.

Firstly, we were not battling against an entity that was part of our network. The threat did not come from one of our partners – rather, our publishers and advertisers were first in the line of fire.

Secondly, the attack was broad and not sourced from a specific website. We assumed that some websites in our network had been corrupted, however, we didn’t know which ones. Targeted and specific blocking is the most common anti-fraud solution, yet in this case, we had to find and block multiple domains that were yet unknown.

Thirdly, any action would be temporary. When we identify a fraudster on the network, they are permanently blocked and banished from the network. There is no going back. However, in this case, we knew that once the affected websites recovered, we had to immediately re-enable them and permit them back on the network.

Lastly, even though Outbrain was not the target, we had to stay on our toes, and to assume that the attack was still happening or could happen again at any time. Although we identified the characteristics when it occurred, we had to be ready to quickly detect any new method or pattern that the attackers could develop and deploy.

The response by our anti-fraud team was designed to protect all three pillars of the Outbrain ecosystem:

The first pillar – Users who need protection from threatening content.
The second pillar – Advertisers who need protection from being charged for harmful visits and damage to their brand image.
The third pillar – Publishers who were prevented from directing traffic to threatening content that would harm their reputation.

The Technology Challenge

In order to find the corrupted sites, we had to scan loads and loads of web pages. Once such a page was found, we had to disable it immediately so it would no longer be promoted on the Outbrain network. These scans had to be done quickly and repeatedly. If a cyber attack is ongoing, pages can be hacked even after we scan them, and if the pages have recovered, we need to know straight away in order to recommend them back to the network.

*“The hacked pages shared the same attributes” – One of the attributes we discovered from crawler logs*

How did we identify a hacked page? We had a few options:

Published lists of the hacked websites: In some cyber attacks, this can be a good solution, but not in this case. The attack was a rolling one, which meant not all the websites were hijacked at once. There was no single reliable source that contained all the hacked websites.
Third-party solutions: In order to maintain the integrity of the Outbrain network, the Anti-Fraud team uses not only internal solutions that were developed in-house, but also external solutions, to flag problematic content, like malware, cloakers, redirects, and other fraud types. A third-party solution can be used to classify specific characteristics as a website “under attack”, so it might have been an option in a case like this. However, making the adjustments necessary to detect this specific attack would take too long, and in this instance, the ability to react fast was super important.
Outbrain’s in-house crawler: In order to recommend the best content to readers, Outbrain crawls the promoted content in its network, analyzing characteristics and extract features that will be used in the NLP and recommendation algorithms. Our in-house crawler was identified as the ideal solution for this challenge, as it is already connected to our databases. In order to get immediate results, we only had to define the population we wanted to crawl. Moreover, with some additional configurations and quick development by the AppServices team, we could tackle all the challenges at once, automatically, and rapidly.

Crawling Our Way Out

So Outbrain’s in-house crawler was immediately mobilized for detecting which sites were hacked.

We started by analyzing several hacked sites found in multiple sources across the net. Using our crawler output logs, we discovered that the pages were identical in terms of both structure and visual content, meaning they shared the same attributes (title, text, links, keywords, etc.). By querying the crawler logs for those attributes, we were able to easily identify more hacked domains. The problem was that the picture we saw was only a partial one, since not all the pages were crawled yet.

We chose the relevant potentially-hacked domains list and set it to be crawled in cycles of 3 hours, using a configured crawler feature that was specially tailored for this problem. Once the crawler identified a new (specific) value for the attributes the attackers injected into the web pages, we could disable the hacked pages. Disabling a page meant that the hacked native ads were no longer available as recommendations on the Outbrain network, and were no longer accessible by our users.

The ongoing scans also allowed us to monitor and understand the impact of the attack on our inventory in real-time and to communicate it to our partners. When web pages were recovered from the cyber attack, we used the crawler to re-enable them on the Outbrain network. With the help of our unique tech capabilities, we were able to come full circle out of this pernicious cyber attack.

Rapid Response Saves the Day

The attack began on the morning of May 21. According to uPress’s announcement, by May 22, 02:28 (Israel Time), 90% of the websites were recovered. As a result of our actions, we were able to prevent Outbrain users from being exposed to hacked web pages containing disturbing content.

This rapid reaction not only protected our users from threatening content, but it also enabled affected publishers and advertisers to return to their usual Outbrain traffic services very quickly after websites were recovered.

What Doesn’t Kill Us, Makes Us Stronger

During the cyber rescue operation, we were in direct contact with several teams in the organization, enabling us to present a united front and a holistic solution. Along the way, we benefited from the ability to pivot in several different directions as necessary, both internally and externally. It was a great lesson about how to use various tools to solve an array of problems.

With the rapid, coordinated response from Outbrain’s teams, we overcame the “Hackers of Savior” attack, and also developed the infrastructure to manage similar cyberattacks in the future, if – and when – they occur.

May 27, 2020

Avi Avraham

Real Time Data Pipeline – More Than We Expected

When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming system, the reasons we had in mind were the obvious ones; to reduce the latency of our processing system and to make the data available in real time. But as soon as we started to work on it, we realized that there are quite a few additional good reasons to embark on this complex project.

Overview

Our data delivery pipeline is designed to deliver data from various services into our Hadoop ecosystem. There it is processed using Spark and Hive in order to produce aggregated data to be used by our algorithms and machine learning (ML) implementations, business reports, debugging, and other analysis. We call each type of this data a “datum”. We have more than 60 datums, each representing a different type of data, for example: request, click, listings, etc.

Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month, to over 1 billion users across the world. In order to support such a large scale, we have a backend system building from thousands of micro services running inside Kubernetes containers on top of an infrastructure that is spread over more than 7000 physical machines spread between 3 data centers and in public clouds (GCP & AWS).

As you can imagine, our services produce lots of data that is passed through our data delivery pipeline. On an average day, more than 40TB is moving from through it, and on peak days it can pass the 50TB.

And yes, in order to support this scale, the delivery pipeline needs to be scalable, reliable and lots of other words that end with “able”.

Pipeline Architecture

As I wrote above, we have multiple data centers (DCs), but in order to simplify the diagram and explanation, I will use only 2 DCs to present our pipeline architecture. As we go through the two architectures (legacy hourly and current RT), you will notice that the only components which remain the same are the edges. The services that produce the data, and the Hadoop where the data ends up.

Another thing worth mentioning is that in order to have full disaster recovery (DR), we have 2 independent Hadoop clusters in separated locations, each one of them gets all the data and processes the same jobs independently, in case one of them goes down, the other continues to work normally.

Hourly Files Pipeline – legacy

The data is collected by the services, and saved to files on the local file system of each service. From there, another process copies those files to our Data Processing System (DPS), which is an in house solution that collects all the data and copies it into the Hadoop. As you can see, it is a very simple architecture without a lot of components, and is quite reliable and robust since the data is saved in files which can be easily recovered in case of any malfunction.

The drawbacks of this pipeline is that the data is moved in chunks of hours and it is not accessible during the duration of the delivery time. It is only available for processing after each hour is completed, and all the data is in the Hadoop. And it also places a heavy burden on the network because the data is transferred in spikes.

Real Time Streaming Pipeline

Replacing the local file system storage, we have a Kafka cluster. And, instead of our in house DPS system we have a MirrorMaker, an aggregated Kafka cluster and a SparkStreaming cluster.

In this pipeline, the data is written directly from the service to the Kafka cluster. From this point all the data is accessible to anyone that wants to use it. Analysts, algo developers, or any other service that can find it useful. Of course it is also available to the MirrorMaker, as part of the pipeline.

The MirrorMaker job is to sync data between Kafka clusters, and in our case, to make sure that each aggregated Kafka cluster will have the entire data from all DCs.

Like before, the data in the aggregated cluster is available to all, especially to our SparkStreaming cluster which runs various jobs that consume the data and writes to the Hadoop.

So it’s clear that by implementing this pipeline, the data is available to all, and it reaches the Hadoop faster.

Obvious Benefits

Now lets go over the obvious benefits of the real time pipeline:

Data availability – since the data is passed through Kafka clusters, it is available to all implementations that need real time data. Here are 2 good examples:
- Recommendation engines can now calculate its algo models with fresher data. This led to a lift of more than 16% in our Revenue Per Mille (RPM).
- We added real time customer facing dashboards which enable them to take required actions faster according to fresh data.
Reduced the latency of the processing system – since the data reach the Hadoop faster, the hourly jobs that wait for the hourly data can start work sooner and to complete their work much closer to the end of the hour. This reduced the overall latency of the processing system and now the business reports for example, are available in a shorter time.

Additional Benefits

And those are the unexpected benefits that we gain from the real time pipeline:

Network capacity – In the hourly files pipeline, the data is moved in hourly batches and at the end of each hour. This means that the network capacity needs to support movement of all data at once. This requirement forces us to allocate the required bandwidth while it is used only for a short time during each hour, wasting expensive resources. The real time pipeline moves the same amount of data incrementally over the course of the hour.

The graphs below demonstrate the bandwidth savings that we made once we moved to the real time pipeline. For example, in the RX graph, you can see that we moved from peaks of 17 GBPs to have flat bandwidth of 7 GBPs, saving 10 GBPs.

Total RX traffic

Total TX traffic

Disaster recovery – the fact that in the hourly based pipeline the data is saved in files on the local file system of the machines that run the services have some limitations.
- Data loss – as we all know, machines have 100% chance of failing at some point. And when you have thousands of machines, the odds are against us. Each time a machine goes down, you have the risk of data loss since all the hourly files may not have been copied into the pipeline yet. In the real time pipeline, the data is written immediately to the Kafka cluster which is more resilient and the risk of the data loss is reduced.
- Late processing – if a machine has recovered from a failure, and you were lucky enough to avoid data loss, the recovered data needs to be processed and in most cases it won’t be done within the time period that the data is related to. This means that this time period will be processed again which adds extra load on the Hadoop and may result in data delays, since the Hadoop needs to process multiple time periods at the same time. Like before, the benefit of the real time pipeline in that aspect is that the data reaches the Hadoop without any delays so there is no reprocessing of any time period.

Conclusions

Having the real time pipeline, our life became much simpler. Beside the planned goals (data availability and reducing the latency), the extra goodies that we got from this change made us less sensitive to any network glitch or hardware malfunction. In the past, each one of those issues forced us to handle data delays and have the risk of data loss, and now the real time pipeline by its nature, solved all of them.

Yes, there are more components that we need to maintain and monitor, but this cost is justified if you compare it to the great results we achieved by implementing this real time system for our data delivery pipeline.

September 2, 2019

Doug Chimento

The First Rule of Distributed Tracing Is…

Distributed Tracing

Distributed Tracing is a mechanism to collect individual requests across micro-services boundaries. It also enables instrumentation of application latency (how long each request took), tracking the life-cycle of network calls (HTTP, RPC, etc) and also identify performance issues by getting visibility on bottlenecks. Plenty of articles are on the inter-webs describing various implementations of Distributed Tracing. This blog will focus on our experience rolling out tracing into our ecosystem with tales of what not to do.

What we need to solve

Outbrain has hundreds of micro-services running on top of Kubernetes. We have metrics via Prometheus and Netflix Hystrix dashboard to monitor overall health, but we lack visibility on business requests. For example, when a user clicks an ad, we want to see the exact flow. In particular, we may want to see the bid request/response to our partners for performance or fraud reasons. But we don’t have something that allows us to examine an individual request. This is what tracing solves for us; the ability to capture exact business logic for a specific request. Tracing can answer such questions:

Did I hit cache or go to a datastore? What services were used for this specific business logic request?

We implemented tracing with Jaeger to visualize and store our traces:

The First Rule

The first rule of tracing is no one should know about tracing. Tracing should be behind a magical curtain of abstraction; all without developers knowledge. Depending on your organization development practices and usage of core infrastructure (e.g. HTTP clients, RPC framework, database drivers…etc), this maybe difficult. But if the majority of your organization uses one or two HTTP clients (i.e. OkHttp, Apache HTTP) it is rather trivial instrument those clients with tracing capabilities without much developer intervention. By instrumenting only HTTP clients and servers you will get a huge benefit with tracing. Think of the 80/20 rule when you plan your tracing implementation.

What not to do

Conceptual, tracing is easy to understand. Operationally, it can turn into a mess. There is an overhead to tracing, albeit small, but still you don’t want to get OutOfMemory errors because of tracing. So don’t by start tracing everything!

Besides, if you start tracing everything what are you going to do with all that data? Depending on your request load this could be huge. Sure, you solved tracing your micro-services, but you created massive amount of data (and hardware) that probably no one is going to look at. In the beginning, keep your operational costs to a bare minimum. You could start by dumping traces into your existing logging infrastructure (i.e. put the trace Id as MDC context in all logging statements). No extra hardware (collectors) required! Like designing software, start small and increment your changes overtime.

Slow and steady wins the race

Begin your tracing quest by providing a mechanism which will “turn on” tracing manually. Suppose you have a REST API, by detecting a special request header, say “X-AdHocTrace”, will inform the service that tracing has been requested. This will enable tracing to all downstream servers as well. We use swagger at Outbrain and added the header automatically:

window.onload = function() {
 const ui = SwaggerUIBundle({
 url: "../api/swagger/apiDocs",
 requestInterceptor: function(request) {
   request.headers['X-AdHocTrace'] = "true";
   return request;
}

When developers are testing their services (locally or remotely) they automatically get all their calls traced when using swagger.

Overtime you can add more sophisticated sampling mechanisms. For instance you could sample all request that result in a 500 HTTP status code. Read more >

June 12, 2018

Daria Litvinov

Making Spark Native for Data Processing Framework

Working in an infrastructure team is different from working in a development team. The team develops tools that are used by developers’ teams inside Outbrain. It is very important to provide a convenient way for working with new technologies. Having a Spark cluster enables running distributed data processing in scale. Until now developers worked with Spark writing ad-hoc crontab jobs in production. It created a messy and unstable environment and required manual maintenance in case of failure. For that reason, our team wanted to give an infrastructural way for working with Spark.

In this blog post, I explain how we built the solution implementing Spark to be a native tool in our WorkflowEngine.

Meet WorkflowEngine

The Outbrain Data Infrastructure team enables running hundreds of ETL jobs every hour. These jobs are orchestrated by WorkflowEngine – an internal tool, which is developed by the team.

The WorkflowEngine (WFE) enables running various types of transformations from many kinds of data sources: MySQL, Cassandra, Hive, Vertica, Hadoop etc., located in two data centers and the cloud. Every transformation, or flow, implements some business logic. A flow may contain several tasks, which are executed one after another.

Flows are developed and owned by researchers, data scientists and developers in other groups at Outbrain, such as Algorithm research groups, BI and Data Science.

input Trigger

Defining flows in WFE gives users a lot of benefits. It enables users to autonomously specify dependencies between flows, so when one flow ends, another is triggered. In addition, users can define a number of retries for their flows, so if a flow fails for some reason, WFE reruns it. If all retries fail, WFE sends an email or alert about the flow failures. Logs for each running flow are shown in a convenient form in WFE UI. Furthermore, WFE collects metrics for all flows, which enables us to track the health of the system.

From the operational perspective, several instances of WFE are deployed on machines which are owned by the infrastructure team, users just define ETL flows in a separate repository.

Diving Into the Solution

In order to respond to requirements for running data analytics on Spark, our team had to allow an infrastructural way for running Spark jobs orchestrated by WFE. To achieve this, we’ve developed a new kind of WorkflowEngine task called SparkStep.

When approaching the design for SparkStep, we had several principals in mind:

Avoid producing additional load on WFE machines
Create an isolated environment for every Spark job
Allow users to specify and at the same time allow the system to limit the number of YARN resources allocated for every SPARK job
Enable using multiple Spark clusters with different versions

Taking all this into consideration, we came up with the following architecture:

Keeping the control while having fun

The Spark job definition specifies how a job will run on the Spark cluster (e.g. jarURL, className, configuration parameters). In addition, there are properties that are set internally by WFE. When WFE starts executing a SPARK job, it builds a spark-submit command using all these properties.

The spark-submit command submits the Spark job to YARN, which allocates needed resources and launches the application. After submitting the job, WFE constantly checks its status using YARN REST API. When the job is completed in YARN, WFE reports a COMPLETED status for the WFE Spark task and continues running the next task. However, if the Spark job fails in YARN, WFE reports a FAILED status for the WFE job. In this case, WFE can rerun the job, according to the defined number of retries. If the job runs in YARN for too long, and its running time exceeds the defined maximum, WFE stops the job using the YARN REST API. In this case, the job is marked as FAILED in WFE. This functionality ensures both that we have a cap on the amount of resources each flow can use and that resources are saved when a job is canceled for any reason.

The spark-submit command runs in a Docker container. The Docker exits once the job is submitted to the cluster. It enables us to create an isolated environment for every Spark job executed by WFE. In addition, using Docker container, we can easily use different Spark versions simultaneously, by creating different Docker images.

By using this architecture we achieve all our goals. Spark jobs are submitted to YARN in a “cluster” mode, so they do not create a load on WFE machines. YARN handles all resources for each job. Every Spark job runs in an isolated environment owing to the Docker container.

April 24, 2018

Shaked Bar

Taking the pain out of Data Science – part 2

This is post #2 in series of 3 posts covering our machine learning framework. We recommend to read post #1 first to understand the challenges we face and get an overview of our solution.

This part will focus on how we handle our data and make it more accessible – using an on-going data collection process.

Data Collection

The first part of any data science task, is getting a good dataset to work with. We have a lot of data, but preparing the datasets can be a very hard work – you really have to “get your hands dirty” to get the data from all the sources and tables and convert them into an easy to use dataset.

Data Collection

Challenges:

What are the main challenges in creating a good dataset to use?

Many output tables – tables that store requests for recommendations, served recommendations, user clicks, profiles for the users and documents and more.
Number of data stores – these tables are stored on a number of sources, due to their different nature. Some are Hive tables, some data is stored on MySQL, and on Cassandra as well.
Long queries – some of these tables are very big. Querying them, especially for a long date range, can take a while.
Irrelevant data – we rarely want data from our entire traffic. Usually we only want some portion of it which is relevant for the current modeling task.

Silos and partitioning:

In addition to these challenges, there are other advantages to a good data collection process.

We want to have the ability to train models on different silos – different population groups, that may behave differently and require different models.

To enable achieving this easily, we add a number of columns and partitions to our output aggregation tables – such as platform, country, language and more.

This allows us to quickly try out our models on different groups.

Output:

We decided to split our output into 2 main parts:

First, a dataset for building models: It will contain only the served recommendations we want (from specific variants of the traffic), and it should contain all of the clicked recommendations plus a sample of the non clicks, in order to have a balanced dataset for learning.

Second, a dataset that will be used for simulation of the business metrics.

Our recommendations are served in widgets, showing a small batch of recommendations.
For this use case, we take only recommendations from widgets that received at least one click.

With this dataset, we can apply our model only on these clicked widgets, and see how well we graded the clicked recommendation compared to the other non clicked recommendations.

The solution – Automatic Spark job

Our solution to solve all these challenges was to create an automatic data collection job.

The job runs hourly, triggered by our ETL engine.
An hourly Apache Spark job aggregates an hourly dataset, with the relevant data; creates the needed partitions; and creates the two outputs described above.

Using Spark was very convenient for this use case. It allowed us to create a readable flow, that queries different input sources, and holds in memory data that is common for both tables before writing the final output to Hive.

A quick note on how we monitor our Spark jobs:

It is somewhat of a challenge to understand how a Spark job behaves, other the basic error messages and checking the job’s output.

To make the job’s progress and status more visible, we send metrics from the job’s driver, using HTTP, to our monitoring server, which aggregates them.

This allows us to create simple to use dashboards and understand with ease what is going on.

In addition we send out textual logs to our monitoring server, that indexes them to an ElasticSearch cluster. We can then view the logs with ease using Kibana.

Below is a dashboard example for a demo job, collecting a portion of our data.

dashboard example for a demo job

Stay tuned for our 3rd and last part of this blog post series where we will cover our Spark-based machine learning framework, that allows us to be highly agile with our research tasks, and dynamic as well as robust in pushing our models to production.

April 16, 2018

Shaked Bar

Taking the pain out of Data Science – RecSys machine learning framework over Spark, Part 1

Overview

The role of a Data Scientist, or Machine Learning engineer, is becoming more and more valuable in the tech industry. It is the fast growing job in the U.S. according to a LinkedIn study, and was recently ranked as the best job in America by Glassdoor.

However, the life of a Data Scientist isn’t easy – the job requires good Math and Statistics knowledge, programming background and experience, and “hacking” skills, in order to get things done. This is especially true when handling huge amounts of data of different types.

We, at the personalization team at Outbrain, decided to try and take out the pain of data science, and make our life easier to allow us to perform effective research with immediate production effect.

In this post series I will describe an end-to-end machine learning framework we developed, over Apache Spark, in order to address the different challenges our Data Scientists and Algorithm Engineers face.

Outbrain’s recommendation system machine learning challenge:

Our goal is recommending stories the user is most likely to be interested in, given the user’s interests and the current context.
We need to rank the stories in our inventory by the probability that the user will click to read, and display the top stories to the user.
Our supervision is the past user actions – given a user and a document with a set of features, did the user click or not.

Our data and features:

Outbrain generates a lot of data.
We get over 550 million unique monthly users, generate over 275 billion recommendations monthly and have more than 35 million clicks a day.

The first part to computing quality recommendations, is representing well our key players: The users and the documents.

We extract semantic features from each document that has our widget installed, using an NLP engine. The engine extracts semantic features on some levels of granularity, from high level categories to very granular entities.
For example, on the ‘Westworld’ story below, we extract:

High-level categories, such as entertainment.
Lower level topics – such as TV or murder.
Entities – persons, locations or companies, that the document discusses.

We represent our readers with similar semantic features.

We create an anonymous profile for each reader, based on content they were reading.

Each profile is an aggregation of the semantic features of each document the user read before.

Predictive models:

We use a variety of models, from three main types:

Content based models –

These models assume that there is a semantic connection between the documents the user likes to read.
The model uses the user profile, to find more documents that have similar semantic features to the ones we found out the user likes.
These could be stories within the same categories, from the same sites or covering a specific person or location.

Behavioural models –

Rather than assuming the user will want to keep reading documents on similar topics, it looks for connections between user interests, to other potential subjects that go well together.
For example, it may find that users that showed interest before on retirement investing, will be interested in the future on heart disease articles.

Collaborative models-

The third type, and potentially most powerful and interesting, are collaborative models.

These models use the wisdom of the crowd in order to recommend new content, potentially without the need to semantically understand it.
The basic idea is to find out readers with similar reading patterns; Find out what they like, in addition to the current user; and recommend these items.

Algorithms in this family use algebraic dimensionality reduction methods, such as Matrix Factorization or Factorization Machines, or finding a new, latent representation of the users and items using Deep Learning neural networks.

The process of data modeling consists of many common tasks.
To make this process efficient and enable agile research and productization, we have developed a general framework on top of Spark, containing 5 independent components.

Framework overview:

The system’s key components:

Data collection
Feature engineering
Model training
Offline evaluation and simulation
Model deployment

The same system is used for both research and analysis of new algorithms, and also for running production models and updating them on a regular basis.

Next week, in part 2 of this blog post series, we will dive into the data collection flow which is a key ingredient to machine learning flows, and see how data is made more accessible using an automated Spark process.
Part 3 will cover our modeling framework, developed on top of Spark, that allows us to be highly agile with our research tasks, and dynamic as well as robust in pushing our models to production. Stay tuned!

March 21, 2018

Alex Balk

Switches, Penguins and One Bad Cable

Back in May 2017, I was scheduled to speak at the DoTC conference in Melbourne. I was really excited and looking forward to it, but fate had different plans. And lots of them. From my son going through an emergency appendicitis operation, through flight delays, and up to an emergency landing back in Tel Aviv… I ended up missing the opportunity to speak at the conference. Amazingly, something similar happened this year! Maybe 3rd time’s a charm?

The post below is the talk I’d planned to give, converted to a blog format.

August 13, 2015. Outbrain’s ops on call is just getting out of his car when his phone rings. It’s a PagerDuty alert. Some kind of latency issue in the Chicago data center. He acks it, figuring he’ll unload the groceries first and then get round to it. But then, his phone rings again. And again.

Forget the groceries. Forget the barbecue. Production is on fire.

18 hours and many tired engineers later, we’re recovering from having lost our Chicago datacenter. In the takein that follows, we trace the root cause to a single network cable that’s mistakenly connected to the wrong switch.

Hi, my name is Alex, and I lead the Core Services group at Outbrain. Our group owns everything from the floor that hosts Outbrain’s servers, to the delivery pipelines that ship Outbrain’s code. If you’re here, you’ve likely heard of Outbrain. You probably know that we’re the world’s leading Discovery platform, and that you’ll find us installed on publisher sites like CNN, The Guardian, Time Inc and the Australian news.com, where we serve their readers with premium recommendations.

But it wasn’t always this way.

You see, back when we started, life was simple: all you had to do was throw a bunch of Linux servers in a rack, plug them into a switch, write some code… and sell it. And that we did!

But then, an amazing thing happened. The code that we wrote actually worked and customers started showing up. And they did the most spectacular and terrifying thing ever – they made us grow. One server rack turned into two and then three and four. And before we knew it, we had a whole bunch of racks, full of penguins plugged into switches. It wasn’t as simple as before, but it was manageable. Business was growing, and so were we.

Fast forward a few years.

We’re running quite a few racks across 2 datacenters. We’re not huge, but we’re not a tiny startup anymore. We have actual paying customers, and we have a service to keep up and running. Internally, we’re talking about things like scale, automation, and all that stuff. And we understand that the network is going to need some work. By now, we’ve reached the conclusion that managing a lot of switches is time-consuming, error-prone, and frankly, not all that interesting. We want to focus on other things, so we break the network challenge down to 2 main topics:

Management and Availability.

Fortunately, management doesn’t look like a very big problem. Instead of managing each switch independently, we go for a something called “a stack”. In essence, it turns 8 switches into one logical unit. At full density, it lets us treat 4 racks as a single logical switch. With 80 nodes per rack, that’s 320 nodes. Quite a bit of computes power!

Four of these setups – about 1200 nodes.

Across two datacenters? 2400 nodes. Easily 10x our size.

Now that’s very impressive, but what if something goes wrong? What if one of these stacks fails? Well, if the whole thing goes down, we lose all 320 nodes. Sure, there’s built-in redundancy for the stack’s master, and losing a non-master switch is far less painful, but even then, 40 nodes going down because of one switch? That’s a lot.

So we give it some thought and come up with a simple solution. Instead of using one of these units in each rack, we’ll use two. Each node will have a connection to stack A, and another to stack B. If stack A fails, we’ll still be able to go through stack B, and vice versa. Perfect!

In order to pull that off, we have to make these two separate stacks, which are actually two separate networks, somehow connect. Our solution to that is to set up bonding on the server side, making its two separate network interfaces look like a single, logical one. On the stack side, we connect everything to one big, happy, shared backbone. With its own redundant setup, of course.

In case you’re still keeping track of the math, you might notice that we just doubled the number of stacks per datacenter. But we still gained simple management And high availability at 10x scale. All this without having to invest in expensive, proprietary management solutions. Or even having to scale the team.

And so, it is decided. We build our glorious, stack-based topology. And the land has peace for 40 years. Or… months.

Fast forward 40 months.

We’re running quite a few racks across 3 datacenters. We’re serving customers like CNN, The Guardian, Time Inc and the Australian news.com. We reach over 500 million people worldwide, serving 250 billion recommendations a month.

We’re using Chef to automate our servers, with over 300 cookbooks and 1000 roles.

We’re practicing Continuous Delivery, with over 150 releases to production a day.

We’re managing petabytes of data in Hadoop, Elasticsearch, Mysql, Cassandra.

We’re generating over 6 million metrics every minute, have thousands of alerts and dozens of dashboards.

Infrastructure as Code is our religion. And as for our glorious network setup? it’s completely, fully, 100% … manual.

No, really. It’s the darkest, scariest part of our infrastructure.

I mean hey, don’t get me wrong, it’s working, it’s allowed us to scale to many thousands of nodes. But every change in the switches is risky because it’s done using the infamous “config management” called “copy-paste”.

The switching software stack and protocols are proprietary, especially the secret sauce that glues the stacks together. Which makes debugging issues a tiring back-and-forth with support at best, or more often just a blind hit-and-miss. The lead time to setting up a new stack is measured in weeks, with risk of creating network loops and bringing a whole datacenter down. Remember August 13th, 2015? We do.

Again, don’t get me wrong, it’s working, it’s allowed us to scale to many thousands of nodes. And it’s not like we babysit the solution on daily basis. But it’s definitely not Infrastructure as Code. And there’s no way it’s going to scale us to the next 10x.

Fast forward to June 2016.

We’re still running across 3 data centers, thousands of nodes. CNN, The Guardian, Time Inc, the Australian news.com. 500 million users. 250 billion recommendations. You get it.

But something is different.

We’re just bringing up a new datacenter, replacing the oldest of the three. And in it, we’re rolling out a new network topology. It’s called a Clos Fabric, and it’s running BGP end-to-end. It’s based on a design created by Charles Clos for analog telephony switches, back in the 50’s. And on the somewhat more recent RFCs, authored by Facebook, that bring the concept to IP networks.

In this setup, each node is connected to 2 top-of-rack switches, called leaves. And each leaf is connected to a bunch of end-of-row switches, called spines. But there’s no bonding here and no backbone. Instead, what glues this network together, is that fact that everything in it is a router. And I do mean everything – every switch, every server. They publish their IP addresses over all of their interfaces, essentially telling their neighbors, “Hi, I’m here, and you can reach me through these paths.” And since their neighbors are routers as well, they propagate that information.

Thus a map of all possible paths to all possible destinations is constructed, hop-by-hop, and held by each router in the network. Which, as I mentioned, is everyone. But it gets even better.

We’ve already mentioned that each node is connected to two leaf switches. And that each leaf is connected to a bunch of spines switches. It’s also worth mentioning that they’re not just “connected”. They’re wired the exact same way. Which means, that any path between two points in the network is the exact same distance. And what THAT means is that we can rely on something called ECMP. Which, in plain English, means “just send the packets down any available path, they’re all the same anyway”. And ECMP opens up interesting options for high availability and load distribution.

Let’s pause to consider some of the gains here:

First, this is a really simple setup. All the leaf switches are the same. And so are all of the spines. It doesn’t matter if you have one, two or thirty. And pretty much the same goes for cables. This greatly simplifies inventory, device and firmware management.

Second, it’s predictable. You know the exact amount of hops from any one node in the network to any other: It’s either two or four, no more, no less. Wiring is predictable as well. We know exactly what gets connected where, and what are the exact cable lengths, right from design phase. (spoiler alert:) We can even validate this in software.

Third, it’s dead easy to scale. When designing the fabric, you choose how many racks it’ll support, and at what oversubscription ratio. I’ll spare you the math and just say:

You want more bandwidth? Add more spines.

Support more racks? Go for spines with higher port density.

Finally, high availability is built into the solution. If a link goes down, BGP will make sure all routers are aware. And everything will still work the same way, because with our wiring scheme and ECMP, all paths are created equal. Take THAT evil bonding driver!

But it doesn’t end there. Scaling the pipes is only half the story. What about device management? The infamous copy-paste? Cable management? A single misconnected cable that could bring a whole datacenter down? What about those?

Glad you asked 🙂

After a long, thorough evaluation of multiple vendors, we chose Cumulus Networks as our switch Operating System vendor, and Dell as our switch hardware vendor. Much like you would with servers, by choosing Enterprise Redhat, Suse or Ubuntu. Or with mobile devices, by choosing Android. We chose a solution that decouples the switch OS from the hardware it’s running on. One that lets us select hardware from a list of certified vendors, like Dell, HP, Mellanox and others.

So now our switches run Cumulus Linux, allowing us use the very same tools that manage our fleet of servers, to now manage our fleet of switches. To apply the same open mindset in what was previously a closed, proprietary world.

In fact, when we designed the new datacenter, we wrote Chef cookbooks to automate provisioning and config. We wrote unit and integration tests using Chef’s toolchain and setup a CI pipeline for the code. We even simulated the entire datacenter, switches, servers and all, using Vagrant.

It worked so well, that bootstrapping the new datacenter took us just 5 days. Think about it:

the first time we ever saw a real Dell switch running Cumulus Linux was when we arrived on-site for the buildout. And yet, 99% of our code worked as expected. In 5 days, we were able to setup a LAN, VPN, server provisioning, DNS, LDAP and deal with some quirky BIOS configs. On the servers, mind you, not the switches.

We even hooked Cumulus’ built-in cabling validation to our Prometheus based monitoring system. So that right after we turned monitoring on, we got an alert. On one bad cable. Out of 3000.

Infrastructure as Code anyone?

December 28, 2017

admin

Live Tail in Kubernetes / Docker Based environment

At Outbrain we are big believers in Observability.

What is Observability, and what is the difference between Observability and Monitoring? I will leave the explanation to Baron Schwartz @xaprb:

“Monitoring tells you whether the system works. Observability lets you ask why it’s not working.”

@ Outbrain we are currently in the midst of migrating to a Kubernetes / Docker based environment.

This presented many new challenges around understanding why things don’t work.

In this post I will be sharing with you our logging implementation which is the first tool used to understand the why.

But first thing first, a short review of our current standard logging architecture:

We use a standard ELK stack for the majority of our logging needs. By standard I mean Logstash on bare metal nodes, Elasticsearch for storage and Kibana for visualizing and analytics. Apache Kafka is transport layer for all of the above.

A very simplified sketch of the system: Live Tail in Kubernetes

Of course the setup is a bit more complex in real life since Outbrain’s infrastructure is spread across thousands of servers, in multiple physical data centers and cloud providers; and there are multiple Elasticsearch clusters for different use cases.

Add to the equation that these systems are used in a self-serve model, meaning the engineers are creating and updating configurations by themselves – and you end up with a complex system which must be robust and resilient, or the users will lose trust in the system.

The move to Kubernetes presented new challenges and requirements, specifically related to the logging tools:

Support multiple Kubernetes clusters and data centers.
We don’t want to us “kubectl”, because managing keys is a pain especially in a multi cluster environment.
Provide a way to tail logs and even edit log file. This should be available on a single pod or across a service deployed in multiple pods.
Leverage existing technologies: Kafka, ELK stack and Log4j on the client side
Support all existing logging sources like multiline and Json.
Don’t forget services which don’t run in Kubernetes, yes we still need to support those.

So how did we meet all those requirements? Time to talk about our new Logging design.

The new architecture is based on a standard Kubernetes logging setup – Fluentd daemonset running on each Kubelet node, and all services are configured to send logs to stdout / err instead of a file.

The Fluentd agent is collecting the pod’s logs and adding the Kubernetes level labels to every message.

The Fluentd plugin we’re using is the kubernetes_metadata_filter.

After the messages are enriched they are stored in a Kafka topic.

A pool of Logstash agents (Running as pods in Kubernetes) are consuming and parsing messages from Kafka as needed.

Once parsed messages can be indexed into Elasticsearch or routed to another topic.

A sketch of the setup described:

And now it is time to introduce CTail.

Ctail, stands for Containers Tail, it is an Outbrain homegrown tool written in Go, and based on a server and client side components.

A CTail server-side component runs per datacenter or per Kubernetes cluster, consuming messages from a Kafka topic named “CTail” and based on the Kubernetes app label creates a stream which can be consumed via the CTail client component.

Since order is important for log messages, and since Kafka only guarantees order for messages in the same partition, we had to make sure messages are partitioned by the pod_id.

With this new setup and tooling, when Outbrain engineers want to live tail their logs, all they need to do is launch the CTail client.

Once the Ctail client starts, it will query Consul, which is what we use for service discovery, to locate all of the CTail servers; register to their streams and will perform aggregations in memory – resulting in a live stream of log entries.

Here is a sketch demonstrating the environment and an example of the CTail client output:

CTail client output

To view logs from all pods of a service called “ob1ktemplate” all you need is to run is:

# ctail-client -service ob1ktemplate -msg-only

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: uri http://localhost:8181/Ob1kTemplate/ returned status code 200
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: Getting uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate'
2017-06-13T19:16:25.531Z ob1ktemplate-test-ssages-2751568954-n1rte: uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate' returned status code 200

Or logs of a specific pod:

# ctail-client -service ob1ktemplate -msg-only -pod ob1ktemplate-test-ssages-2751568960-n1kwd

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri 
http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751568960-n1kwd: uri http://localhost:8181/Ob1kTemplate/ returned status code 200

This is how we solve this challenge.

Interested in reading more about other challenges we encountered during the migration? Either wait for our next blog, or reach out to visibility at outbrain.com.

June 15, 2017

admin

Migrating Elephants – How To Migrate Petabyte Scale Hadoop Clusters With Zero Downtime

Outbrain has been an early adopter of Hadoop and we, the team operating it, have acquired a lot of experience running it in production in terms of data ingestion, processing, monitoring, upgrading etc. This also means that we have a significant ecosystem around each cluster, with both open source and in-house systems.

A while back we decided to upgrade both the hardware and software versions of our Hadoop clusters.

“Why is that a big problem?” you might ask, so let me explain a bit about our current Hadoop architecture. We have two clusters of 300 machines in two different data centers, production and DR. Each cluster has a total dataset size of 1.5 PB with 5TB of compressed data loaded into it each day. There are ~10,000 job executions daily of about 1200 job definitions that were written by dozens of developers, data scientists and various other stakeholders within the company, spread across multiple teams around the globe. These jobs do everything from moving data into Hadoop (for ex. Sqoop or Mysql to Hive data loads), processing in Hadoop (for ex. running Hive, Scalding or Pig jobs), and pushing the results into external data stores (for ex. Vertica, Cassandra, Mysql etc.). An additional dimension of complexity originates from the dynamic nature of the system since developers, data scientists and researchers are pushing dozens of changes to how flows behave in production on a daily basis.

This system needed to be migrated to run on new hardware, using new versions of multiple components of the Hadoop ecosystem, without impacting production processes and active users. A partial list of the components and technologies that are currently being used and should be taken into consideration is HDFS, Map-Reduce, Hive, Pig, Scalding, and Sqoop. On top of that, of course, we have several more in-house services for data delivery, monitoring and retention that we have developed.

I’m sure you’ll agree that this is quite an elephant.

Storming Our Brains

We sat down with our users, and started thinking about a process to achieve this goal and quickly arrived at several guidelines that our selected process should abide by:

Both Hadoop clusters (production and DR) should always be kept fully operational
The migration process must be reversible
Both value and risk should be incremental

After scratching our heads for quite a while, we came up with these options:

In place: In place migration of the existing cluster to the new version and then rolling the hardware upgrade by gradually pushing new machines into the cluster and removing the old machines. This is the simplest approach and you should probably have a very good reason to choose a different path if you can afford the risk. However, since upgrading the system in place would expose clients to a huge change in an uncontrolled manner and is not by any means an easily reversible process we had to forego this option.
Flipping the switch: The second option is to create a new cluster on new hardware, sync the required data, stop processing on the old cluster and move it to the new one. The problem here is that we still couldn’t manage the risk, because we would be stopping all processing and moving it to the new cluster. We wouldn’t know if the new cluster can handle the load or if each flow’s code is compatible with the new component’s version. As a matter of fact, there are a lot of unknowns that made it clear we had to split the problem into smaller pieces. The difficulty with splitting in this approach is that once you move a subset of the processing from the old cluster to the new, these results will no longer be accessible on the old cluster. This means that we would have had to migrate all dependencies of that initial subset. Since we have 1200 flow definitions with marvelous and beautiful interconnections between them, the task of splitting them would not have been practical and very quickly we found that we would have to migrate all flows together.
Side by side execution: The 3rd option is to start processing on the new cluster without stopping the old cluster. This is a sort of an active-active approach, because both Hadoop clusters, new and old, will contain the processing results. This would allow us to migrate parts of the workload without risking interfering with any working pipeline in the old cluster. Sounds good, right.

First Steps

To better understand the chosen solution let’s take a look at our current architecture:

current architecture:

We have a framework that allows applications to push raw event data into multiple Hadoop clusters. For the sake of simplicity, the diagram describes only one cluster.

Once the data reaches Hadoop, processing begins to take place using a framework for orchestrating data flows we’ve developed in-house that we like to call the Workflow Engine.

Each Workflow Engine belongs to a different business group. That Workflow Engine is responsible for triggering and orchestrating the execution of all flows developed and owned by that group. Each job execution can trigger more jobs on its current Workflow Engine or trigger jobs in other business groups’ Workflow Engines. We use this partitioning mainly for management and scale reasons but during the planning of the migration, it provided us with a natural way to partition the workload, since there are very few dependencies between groups vs within each group.

Now that you have a better understanding of the existing layout you can see that the first step is to install a new Hadoop cluster with all required components of its ecosystem and begin pushing data into it.

To achieve this, we configured our dynamic data delivery pipeline system to send all events to the new cluster as well as the old, so now we have a new cluster with a fully operational data delivery pipeline:

data delivery pipeline

Side by Side

Let’s think a bit about what options we had for running a side by side processing architecture.

We could use the same set of Workflow Engines to execute their jobs on both clusters, active and new. While this method would have the upside of saving machines and lower operational costs it would potentially double the load on each machine since jobs are assigned to machines in a static manner. This is due to the fact that each Workflow Engine is assigned a business group and all jobs that belong to this group are executed from it. To isolate the current production jobs execution from the ones for the new cluster we decided to allocate independent machines for the new cluster.

Let the Processing Commence!

Now that we have a fully operational Hadoop cluster running alongside our production cluster, and we now have raw data delivered into it, you might be tempted to say: “Great! Bring up a set of Workflow Engines and let’s start side by side processing!”.

Well… not really.

Since there are so many jobs and they doing varied types of operations we can’t really assume that letting them run side by side is a good idea. For instance, if a job calculates some results and then pushes them to MySql, these results will be pushed twice. Aside from doubling the load on the databases for no good reason, it may cause in some cases corruption or inconsistencies of the data due to race conditions. In essence, every job that writes to an external datasource should be allowed to run only once.

So we’ve described two types of execution modes a WorkflowEngine can have:

Leader: Run all the jobs!

Secondary: Run all jobs except those that might have a side effect external to that Hadoop cluster (e.g. write to external database or trigger an applicative service). This will be done automatically by the framework thus preventing any effort from the development teams.

When a Workflow Engine is in secondary mode, jobs executed from it can read from any source, but write only to a specific Hadoop cluster. That way they are essentially filling it up and syncing (to a degree) with the other cluster.

Let’s Do This…

Phase 1 of the migration should look something like this:

Let's Do This...

Notice that I’ve only included a Workflow Engine for one group in the diagram for simplicity but it will look similar for all other groups.

So the idea is to bring up a new Workflow Engine and give it the role of a migration secondary. This way it will run all jobs except for those writing to external data stores, thus eliminating all side effects external to the new Hadoop cluster.

By doing so, we were able to achieve multiple goals:

Test basic software integration with the new Hadoop cluster version and all services of the ecosystem (hive, pig, scalding, etc.)
Test new cluster’s hardware and performance compared to the currently active cluster
Safely upgrade each business group’s Workflow Engine separately without impacting other groups.

Since the new cluster is running on new hardware and with a new version of Hadoop ecosystem, this is a huge milestone towards validating our new architecture. The fact the we managed to do so without risking any downtime that could have resulted from failing processing flows, wrong cluster configurations or any other potential issue was key in achieving our migration goals.

Once we were confident that all phase 1 jobs were operating properly on the new cluster we could continue to phase 2 in which a migration leader becomes secondary and the secondary becomes a leader. Like this:

new cluster

In this phase all jobs will begin running from the new Workflow Engine impacting all production systems, while the old Workflow Engine will only run jobs that create data to the old cluster. This method actually offers a fairly easy way to rollback to the old cluster in case of any serious failure (even after a few days or weeks) since all intermediate data will continue to be available on the old cluster.

The Overall Plan

The overall process is to push all Workflow Engines to phase 1 and then test and stabilize the system. We were able to run 70% (!) of our jobs in this phase. That’s 70% of our code, 70% of our integrations and APIs and at least 70% of the problems you would experience in a real live move. We were able to fix issues, analyze system performance and validate results. Only once everything seems to be working properly we can start pushing the groups to phase 2 one by one into a tested, stable new cluster.

Once again we benefit from the incremental nature of the process. Each business group can be pushed into phase 2 independently of other groups thus reducing risk and increasing our ability to debug and analyze issues. Additionally, each business group can start leveraging the new cluster’s capabilities (e.g. features from newer version, or improved performance) immediately after they have moved to phase 2 and not after we have migrated every one of the ~1200 jobs to run on the new cluster. One pain point that can’t be ignored is that inter-group dependencies can make this a significantly more complicated feat as you need to bring into consideration the state of multiple groups when migrating.

What Did We Achieve?

Incremental Migration – Due to the fact that we had an active-active migration that we could apply on each business group, we benefited in terms of mitigating risk and gaining value from the new system gradually.
Reversible process- since we kept all old workflow engines (that executed their jobs on the old Hadoop cluster) in a state of secondary execution mode, all intermediate data was still being processed and was available in case we needed to revert groups independently from each other.
Minimal impact on users – Since we defined an automated transition of jobs between secondary and leader modes users, didn’t need to duplicate any of their jobs.

What Now?

We have completed the upgrade and migration of our main cluster and have already started the migration of our DR cluster.

There are a lot more details and concerns to bring into account when migrating a production system at this scale. However, the basic abstractions we’ve introduced here, and the capabilities we’ve infused our systems with have equipped us with the tools to migrate elephants.

For more information about this project, you can check out the video from Strata 2017 London where I discussed it in more detail.

June 13, 2017

admin

Failure Testing for your private cloud – Introducing GomJabbar

TL;DR Chaos Drills can contribute a lot to your services resilience, and it’s actually quite a fun activity. We’ve built a tool called GomJabbar to help you run those drills.

Failure Testing for your private cloud

Here at Outbrain we manage quite a large scale deployment of hundreds of services/modules, and thousands of hosts. We practice CI/CD, and implemented quite a sound infrastructure, which we believe is scalable, performant, and resilient. We do however experience many production issues on a daily basis, just like any other large scale organization. You simply can’t ensure a 100% fault-free system. Servers will crash, run out of disk space, and lose connectivity to the network. Software will experience bugs, and erroneous conditions. Our job as software engineers is to anticipate these conditions, and design our code to handle them gracefully.

For quite a long time we were looking into ways of improving our resilience, and validate our assumptions, using a tool like Netflix’s Chaos Monkey. We also wanted to make sure our alerting system actually triggers when things go wrong. The main problem we were facing is that Chaos Monkey is a tool that was designed to work with cloud infrastructure, while we maintain our own private cloud.

The main motivation for developing such a tool is that failures have the tendency of occurring when you’re least prepared, and in the least desirable time, e.g. Friday nights, when you’re out having a pint with your buddies. Now, to be honest with ourselves, when things fail during inconvenient times, we don’t always roll our sleeves and dive in to look for the root cause. Many times the incident will end after a service restart, and once the alerts clear we forget about it.

Wouldn’t it be great if we could have “chaos drills”, where we could practice handling failures, test and validate our assumptions, and learn how to improve our infrastructure?

Chaos Drills at Outbrain

We built GomJabbar exactly for the reasons specified above. Once a week, at a well known time, mid day, we randomly select a few targets where we trigger failures. At this point, the system should either auto-detect the failures, and auto-heal, or bypass them. In some cases alerts should be triggered to let teams know that a manual intervention is required.

After each chaos drill we conduct a quick take-in session for each of the triggered failures, and ask ourselves the following questions:

Did the system handle the failure case correctly?
Was our alerting strategy effective?
Did the team have the knowledge to handle, and troubleshoot the failure?
Was the issue investigated thoroughly?

These take-ins lead to super valuable inputs, which we probably wouldn’t collect any other way.

How did we kick this off?

Before we started running the chaos drills, there were a lot of concerns about the value of such drills, and the time it will require. Well, since eliminating our fear from production is one of the key goals of this activity, we had to take care of that first.

"I must not fear.
 Fear is the mind-killer.
 Fear is the little-death that brings total obliteration.
 I will face my fear.
 I will permit it to pass over me and through me.
 And when it has gone past I will turn the inner eye to see its path.
 Where the fear has gone there will be nothing. Only I will remain."

(Litany Against Fear - Frank Herbert - Dune)

So we started a series of chats with the teams, in order to understand what was bothering them, and found ways to mitigate it. So here goes:

There’s an obvious need to avoid unnecessary damage.
- We’ve created filters to ensure only approved targets get to participate in the drills.
  This has a side effect of pre-marking areas in the code we need to take care of.
- We currently schedule drills via statuspage.io, so teams know when to be ready, and if the time is inappropriate,
  we reschedule.
- When we introduce a new kind of fault, we let everybody know, and explain what should they prepare for in advance.
- We started out from minor faults like graceful shutdowns, continued to graceless shutdowns,
  and moved on to more interesting testing like faulty network emulation.
We’ve measured the time teams spent on these drills, and it turned out to be negligible.
Most of the time was spent on preparations. For example, ensuring we have proper alerting,
and correct resilience features in the clients.
This is actually something you need to do anyway. At the end of the day, we’ve heard no complaints about interruptions, nor time waste.
We’ve made sure teams, and engineers on the call were not left on their own. We wanted everybody to learn
from this drill, and when they weren’t sure how to proceed, we jumped in to help. It’s important
to make everyone feel safe about this drill, and remind everybody that we only want to learn and improve.

All that said, it’s important to remember that we basically simulate failures that occur on a daily basis. It’s only that when we do that in a controlled manner, it’s easier to observe where are our blind spots, what knowledge are we lacking, and what we need to improve.

Our roadmap – What next?

Up until now, this drill was executed in a semi-automatic procedure. The next level is to let the teams run this drill on a fixed interval, at a well known time.
Add new kinds of failures, like disk space issues, power failures, etc.
So far, we were only brave enough to run this on applicative nodes, and there’s no reason to stop there. Data-stores, load-balancers, network switches, and the like are also on our radar in the near future.
Multi-target failure injection. For example, inject a failure to a percentage of the instances of some module in a random cluster. Yes, even a full cluster outage should be tested at some point, in case you were asking yourself.

The GomJabbar Internals

GomJabbar is basically an integration between a discovery system, a (fault) command execution scheduler, and your desired configuration. The configuration contains mostly the target filtering rules, and fault commands.

The fault commands are completely up to you. Out of the box we provide the following example commands, (but you can really write your own script to do what suits your platform, needs, and architecture):

Graceful shutdowns of service instances.
Graceless shutdowns of service instances.
Faulty Network Emulation (high latency, and packet-loss).

Upon startup, GomJabbar drills down via the discovery system, fetches the clusters, modules, and their instances, and passes each via the filters provided in the configuration files. This process is also performed periodically. We currently support discovery via consul, but adding other methods of discovery is quite trivial.

When a users wishes to trigger faults, GomJabbar selects a random target, and returns it to the user, along with a token that identifies this target. The user can then trigger one of the configured fault commands, or scripts, on the random target. At this point GomJabbar uses the configured CommandExecutor in order to execute the remote commands on the target hosts.

GomJabbar also maintains a audit log of all executions, which allows you to revert quickly in the face of a real production issue, or an unexpected catastrophe cause by this tool.

What have we learned so far?

If you’ve read so far, you may be asking yourself what’s in it for me? What kind of lessons can I learn from these drills?

We’ve actually found and fixed many issues by running these drills, and here’s what we can share:

We had broken monitoring and alerting around the detection of the integrity of our production environment. We wanted to make sure that everything that runs in our data-centers is managed, and at a well known (version, health, etc). We’ve found that we didn’t compute the difference between the desired state, and the actual state properly, due to reliance on bogus data-sources. This sort of bug attacked us from two sides: once when we triggered graceful shutdowns, and once for graceless shutdowns.
We’ve found services that had no owner, became obsolete and were basically running unattended in production. The horror.
During the faulty network emulations, we’ve found that we had clients that didn’t implement proper resilience features, and caused cascading failures in the consumers several layers up our service stack. We’ve also noticed that in some cases, the high latency also cascaded. This was fixed by adding proper timeouts, double-dispatch, and circuit-breakers.
We’ve also found that these drills motivated developers to improve their knowledge about the metrics we expose, logs, and the troubleshooting tools we provide.

Conclusion

We’ve found the chaos drills to be an incredibly useful technique, which helps us improve our resilience and integrity, while helping everybody learn about how things work. We’re by no means anywhere near perfection. We’re actually pretty sure we’ll find many many more issues we need to take care of. We’re hoping this exciting new tool will help us move to the next level, and we hope you find it useful too 😉

Category: System Architecture

Content Fraud – The Major Challenge for Content Networks

Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

The Technology Challenge

Crawling Our Way Out

Rapid Response Saves the Day

What Doesn’t Kill Us, Makes Us Stronger

Overview

Pipeline Architecture

Hourly Files Pipeline – legacy

Real Time Streaming Pipeline

Obvious Benefits

Additional Benefits

Conclusions

Distributed Tracing

What we need to solve

The First Rule

What not to do

Slow and steady wins the race

Meet WorkflowEngine

Diving Into the Solution

Keeping the control while having fun

Data Collection

Challenges:

Silos and partitioning:

Output:

The solution – Automatic Spark job

Overview

Outbrain’s recommendation system machine learning challenge:

Our data and features:

Predictive models:

Content based models –

Behavioural models –

Collaborative models-

Framework overview:

@ Outbrain we are currently in the midst of migrating to a Kubernetes / Docker based environment.

So how did we meet all those requirements? Time to talk about our new Logging design.

And now it is time to introduce CTail.

Storming Our Brains

First Steps

Side by Side

Let the Processing Commence!

Let’s Do This…

The Overall Plan

What Did We Achieve?

What Now?

Chaos Drills at Outbrain

How did we kick this off?

Our roadmap – What next?

The GomJabbar Internals

What have we learned so far?

Conclusion

Search

עברית

Categories

Archive

RSS