November 15, 2020

Ariel Kipervasser

Taming the bear (metal)

Introduction

In this blog post, I am going to share with you the process behind Namaste – an automated tool we built, in order to manage our bare metal machine requests, as part of our BMaaS approach (that’s Bare Metal as a Service).

Who are we?

Before we dive into the new tool we built, let me introduce our team.

We are a team of 4 DevOps engineers based in Israel, managing an on premise environment of 7000 bare metal servers across 3 Data Centers in the US.

But there is a little bit more to it. When we say 7000 servers, it includes:

about 20 different server models
more than 100 different disk models
…memory sticks
…CPUs
…raid cards
..NICs
…PSUs
…and a combination of all of the above

So there are a lot of hardware configuration options and a lot of reusing of the servers.

Since we are located in Israel and our Data Centers are located in the US, in each Data Center, there is a support team available 24/7 to handle everything on the physical side. They take the role of our “Remote Hands and Eyes” with any onsite related work. I will address them as “Remote Hands” in this blog post.

What we do?

We divide our work into 3 main types of tasks:

Planned changes – Our planned changes are managed in Jira and are mainly hardware related tickets opened by our customers – mostly engineers in the Cloud Platform group. We usually divide these tickets into 3 types of tasks:

hardware upgrades
service requests
machine requests (which we’ll discuss in more details later)

These can be upgrading the resources of a server, working on a suspected hardware issue or building a new cluster altogether.

Unplanned changes – Everything which has to be done Now. During working hours it can be a task managed in Jira as a blocking/critical ticket. During off hours it can be a PagerDuty alert.

Internal projects – Projects which our team initiates and decides to manage. These are usually the tasks which our users don’t know or really care about, but those tasks are more interesting and make our users’ lives and our lives much better. In essence, our internal projects prevent unplanned changes and make the planned changes easier to perform. The new tool I’ll discuss falls under this category.

If you’d like to learn more about who we are and how we do things, I strongly encourage you to check out this great blog post by my colleague, Adib Daw.

The challenge

Looking back, a year ago our tasks were distributed like this:

Planned changes – 75%

Unplanned changes – 15%

Internal projects – 10%

This means 90% of the work was manual labor and 10% were projects geared towards automation. The meaning was both relatively slow delivery, much room for human errors and frustration due to the highly repetitive nature of most of the work. Simply put, it just didn’t scale.

Our solution to this problem was to formulate a vision, and present it to our customers to get their buyin. Our customers agreed to take a hit on the delivery times of their requests, so we could focus on the “internal projects” category, with the intention of investing in automation, and thus the velocity and robustness of our execution.

We decided to push towards building as many automated processes as possible, in order to minimise the amount of time spent on planned changes and unplanned changes, since many of these tasks appeared to lend themselves quite well to automation.

The “before” picture

When we began our journey towards automation, the process of choosing the best suited servers for the machine request took us a lot of time.

It looked like this:

Check available hardware and verify we have enough free servers to fulfill the request
Choose the best suited hardware by carefully (and manually) ensuring hardware constraints are met (suitable server form factor, support for required number of drives, RAID adapter, etc)
Reserve the servers in a temporary allocation pool so no one else will use them for something else as they’re being worked on
Allocate the relevant parts in our inventory system
Submit the target configuration into the Remote Hands management system we’d built, so that the onsite technician would have the required information for the task (server setup and required parts)
Open a ticket with the relevant information (server locations, serial number, etc) to the onsite Remote Hands team, to begin work on the hardware configuration changes

All of this had to be done before any actual hands-on work on the hardware even started.

In addition, we also had to deal with failures in the flow:

configuration mistakes
hardware mismatches
human errors
faulty parts

The “after” picture

The process we had worked, but it took a lot of time, sometimes as much as hours per request, depending on how many servers were requested and what changes were needed to their config.

While we had a set of tools that helped bring us to this stage – much improved over the previous process we had, which could take days for large requests – it would keep us afloat but still required manual labor which seemed entirely unnecessary.

Our vision for this process was to simplify it, to look as follows:

Submit a request for the final server config, detailing number of servers per datacenter and their final spec
Drink coffee while Namaste handles everything in the background, including opening a ticket for Remote Hands
Get notified that the request was fulfilled

This simplified process would have code deal with the metadata matching, allocation and reservation, and leave us to handle fulfilment rejects, where human intervention is really necessary.

Behind the scenes

The flow of matching machines per a machine request:

Accept user input – We enter server requirements such as quantity, location and spec
Perform machine matching – An engine we built returns all the servers matching the requirements, some of which may be a partial match (missing a drive, less RAM, etc)
Prioritise – Prioritise from the matched server list, based on:
- Rack awareness
- Server form factor
- Drive form factor
- Number of cores closest to requested spec
- Hardware changes required to bring machines to the requested spec
Reserve machines – Reserve the prioritised servers in a temporary pool to avoid double allocation
Perform parts matching – Decide which parts to use to bring the reserved servers to the requested spec, wherever needed
Reserve parts – Reserve the selected parts in our inventory to avoid double allocation
Set machines to target spec – Mark the servers with the needed changes, so when the onsite support team begins working on fulfilling the work order, they will connect an iPad to each server which will tell them what changes need to be done.
Open a ticket – Open a ticket with all the relevant information to the onsite support team for fulfilment.

Return on investment

The most obvious gain in developing this mechanism is time saving. With the steps of matching hardware and reserving parts being the main points of friction that were removed. Time saving also shortens the wait for hardware request fulfillment, so both the team and the customers win.

Automated matching also reduces human errors, which has impact on time saving as well as on customer experience. In addition, it further allows us to use the best suitable hardware, which improves efficiency by assigning the best matching resource to fulfill the request. Simply put, we no longer assign the “first matching” or “random” servers. Instead, we select the “most fitting” ones, as the matching algorithm is now explicit and repeatable.

A “side effect” of the improved matching process is that a whole class of issues is avoided completely. Those are issues around server <-> part incompatibility, maximum server capacity and generally everything Remote Hands might come across after the request had already been dispatched to them. These kinds of issues are the most expensive time-wise, as they require human to human communication across different time zones. We knew this would help, but the actual impact was astounding.

Finally, the user experience dramatically improved with the introduction of this system:

Users request hardware get their requests fulfilled quicker
Remote Hands come across less issues in the field
Our team deals with much fewer rejects and manual labor

As you can see, there are 3 types of customers here, and all come out on top.

A short summary

A year ago we embarked on a journey to build our BMaaS solution, where Namaste is just one part of the full picture. While this is still a work in progress and we expect it to continue evolving indefinitely, now is a good opportunity to look back on where we started, specifically on what drove us down this path.

At the beginning of this post, I mentioned what our work distribution looked like before we decided to change things around. To remind you, it was something along:

Planned changes – 75%

Unplanned changes – 15%

Internal projects – 10%

A year into the process, the nature of our work has changed dramatically, and the work distribution is now more along:

Planned changes – 15%

Unplanned changes – <0.1%

Internal projects – 85%

It’s important to note that the actual volume of changes hasn’t decreased. In fact, we now handle more hardware fulfillment requests than ever before. But the actual mindful attention these changes require from us has dropped significantly. Instead, we focus mostly on building the system that gets the work done for all of us, shifting us from “moving hardware” to writing software.

What’s next

APIs – by making our BMaaS platform’s APIs open to our users, we expect new and unforeseen use cases to surface, where our users build their own solutions on top of the platform
Web UI – much of what we do is currently accessible via command-line tools. While this works, a web UI could improve the user experience even further and allow building various views which the system doesn’t support as of yet
Events & notifications – our team still serves as a communication pipeline in various parts of the process, as we’re validating different ideas and ensuring everything works as expected. We plan to introduce events and notifications that would remove us entirely from the flow and allow users to act upon certain events, in-person or via their own code
Coffee – maybe move to some decaf. Or tea.

July 9, 2020

Elazar Gur

Keren Ofek Granov

Faster Than a Missile: Intercepting Content Fraud

On May 21, 2020, thousands of Israeli websites were hacked by a group called “Hackers Of Savior”, probably including hackers from Turkey, The Gaza Strip, and North Africa. Not the first – and surely not the last – cyber-attack of its kind, the hackers targeted websites stored on uPress, an Israeli web hosting company.

The websites’ content was replaced by a page with a statement against Israel – “The countdown to Israel’s destruction began a long time ago” – and a disturbing video clip. The attack was timed to coincide with the Israeli national holiday “Jerusalem Day”, which celebrates the reunification of the city after the Six-Day War. It also coincided with the Iranian “Quds Day”, which was probably an additional motivation for the attack.

*The attacker’s content as shown on the hacked websites*

Beyond the visual corruption of the websites, the attackers also tried to delete all the websites’ data. Besides, the hackers prompted the browser to ask the viewer’s permission to take their photo with their computer’s webcam.

This attack was not your ordinary run-of-the-mill fraud event. It did not revolve around the usual advertiser-sourced content fraud. Rather, the source of the problem was an outside party and Outbrain was not even the target. The attack harmed not just the website publishers and end-users, but advertisers too.

Before diving in to see how Outbrain coped on the battleground during this event, let’s start at the beginning. What is “content fraud”, why is it such a challenge, and how does Outbrain protect its network and partners every day?

Content Fraud – The Major Challenge for Content Networks

Content fraud is an industry-wide problem that comes in many forms, including phishing, redirects, cloaking, malvertising, conspiracy-mongering and fake news. The internet is subsumed by an endless amount of fake content, such as fake quotes, doctored images, and deepfake video, making it increasingly difficult to navigate towards an authentic online experience. As content fraud is always evolving, it becomes much harder to detect and prevent over time.

As a discovery company, part of Outbrain’s lighthouse is to provide qualified content to its readers. While the overwhelming majority of content provided by advertisers is qualified, interesting and safe, Outbrain invests significant resources to ensure that no malicious or deceitful content makes its way onto our advertising platform.

When content fraud is detected, it comes from a specific and identifiable source that can be easily blocked and prevented from accessing Outbrain’s network. Generally, we use multiple technologies and people-powered techniques to verify that advertisers are legitimate and promoting qualified content. We look at a variety of properties, the identity of the advertiser, and the content itself. Based on the technologies Outbrain usually uses, the ability to respond is broad and fast.

In the recent event, the challenge for Outbrain was different.

Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

*Scott Webb (https://unsplash.com/photos/yekGLpc3vro*)

The cyber event of May 2020 presented our anti-fraud specialists with a different and unique set of obstacles.

Firstly, we were not battling against an entity that was part of our network. The threat did not come from one of our partners – rather, our publishers and advertisers were first in the line of fire.

Secondly, the attack was broad and not sourced from a specific website. We assumed that some websites in our network had been corrupted, however, we didn’t know which ones. Targeted and specific blocking is the most common anti-fraud solution, yet in this case, we had to find and block multiple domains that were yet unknown.

Thirdly, any action would be temporary. When we identify a fraudster on the network, they are permanently blocked and banished from the network. There is no going back. However, in this case, we knew that once the affected websites recovered, we had to immediately re-enable them and permit them back on the network.

Lastly, even though Outbrain was not the target, we had to stay on our toes, and to assume that the attack was still happening or could happen again at any time. Although we identified the characteristics when it occurred, we had to be ready to quickly detect any new method or pattern that the attackers could develop and deploy.

The response by our anti-fraud team was designed to protect all three pillars of the Outbrain ecosystem:

The first pillar – Users who need protection from threatening content.
The second pillar – Advertisers who need protection from being charged for harmful visits and damage to their brand image.
The third pillar – Publishers who were prevented from directing traffic to threatening content that would harm their reputation.

The Technology Challenge

In order to find the corrupted sites, we had to scan loads and loads of web pages. Once such a page was found, we had to disable it immediately so it would no longer be promoted on the Outbrain network. These scans had to be done quickly and repeatedly. If a cyber attack is ongoing, pages can be hacked even after we scan them, and if the pages have recovered, we need to know straight away in order to recommend them back to the network.

*“The hacked pages shared the same attributes” – One of the attributes we discovered from crawler logs*

How did we identify a hacked page? We had a few options:

Published lists of the hacked websites: In some cyber attacks, this can be a good solution, but not in this case. The attack was a rolling one, which meant not all the websites were hijacked at once. There was no single reliable source that contained all the hacked websites.
Third-party solutions: In order to maintain the integrity of the Outbrain network, the Anti-Fraud team uses not only internal solutions that were developed in-house, but also external solutions, to flag problematic content, like malware, cloakers, redirects, and other fraud types. A third-party solution can be used to classify specific characteristics as a website “under attack”, so it might have been an option in a case like this. However, making the adjustments necessary to detect this specific attack would take too long, and in this instance, the ability to react fast was super important.
Outbrain’s in-house crawler: In order to recommend the best content to readers, Outbrain crawls the promoted content in its network, analyzing characteristics and extract features that will be used in the NLP and recommendation algorithms. Our in-house crawler was identified as the ideal solution for this challenge, as it is already connected to our databases. In order to get immediate results, we only had to define the population we wanted to crawl. Moreover, with some additional configurations and quick development by the AppServices team, we could tackle all the challenges at once, automatically, and rapidly.

Crawling Our Way Out

So Outbrain’s in-house crawler was immediately mobilized for detecting which sites were hacked.

We started by analyzing several hacked sites found in multiple sources across the net. Using our crawler output logs, we discovered that the pages were identical in terms of both structure and visual content, meaning they shared the same attributes (title, text, links, keywords, etc.). By querying the crawler logs for those attributes, we were able to easily identify more hacked domains. The problem was that the picture we saw was only a partial one, since not all the pages were crawled yet.

We chose the relevant potentially-hacked domains list and set it to be crawled in cycles of 3 hours, using a configured crawler feature that was specially tailored for this problem. Once the crawler identified a new (specific) value for the attributes the attackers injected into the web pages, we could disable the hacked pages. Disabling a page meant that the hacked native ads were no longer available as recommendations on the Outbrain network, and were no longer accessible by our users.

The ongoing scans also allowed us to monitor and understand the impact of the attack on our inventory in real-time and to communicate it to our partners. When web pages were recovered from the cyber attack, we used the crawler to re-enable them on the Outbrain network. With the help of our unique tech capabilities, we were able to come full circle out of this pernicious cyber attack.

Rapid Response Saves the Day

The attack began on the morning of May 21. According to uPress’s announcement, by May 22, 02:28 (Israel Time), 90% of the websites were recovered. As a result of our actions, we were able to prevent Outbrain users from being exposed to hacked web pages containing disturbing content.

This rapid reaction not only protected our users from threatening content, but it also enabled affected publishers and advertisers to return to their usual Outbrain traffic services very quickly after websites were recovered.

What Doesn’t Kill Us, Makes Us Stronger

During the cyber rescue operation, we were in direct contact with several teams in the organization, enabling us to present a united front and a holistic solution. Along the way, we benefited from the ability to pivot in several different directions as necessary, both internally and externally. It was a great lesson about how to use various tools to solve an array of problems.

With the rapid, coordinated response from Outbrain’s teams, we overcame the “Hackers of Savior” attack, and also developed the infrastructure to manage similar cyberattacks in the future, if – and when – they occur.

May 27, 2020

Avi Avraham

Real Time Data Pipeline – More Than We Expected

When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming system, the reasons we had in mind were the obvious ones; to reduce the latency of our processing system and to make the data available in real time. But as soon as we started to work on it, we realized that there are quite a few additional good reasons to embark on this complex project.

Overview

Our data delivery pipeline is designed to deliver data from various services into our Hadoop ecosystem. There it is processed using Spark and Hive in order to produce aggregated data to be used by our algorithms and machine learning (ML) implementations, business reports, debugging, and other analysis. We call each type of this data a “datum”. We have more than 60 datums, each representing a different type of data, for example: request, click, listings, etc.

Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month, to over 1 billion users across the world. In order to support such a large scale, we have a backend system building from thousands of micro services running inside Kubernetes containers on top of an infrastructure that is spread over more than 7000 physical machines spread between 3 data centers and in public clouds (GCP & AWS).

As you can imagine, our services produce lots of data that is passed through our data delivery pipeline. On an average day, more than 40TB is moving from through it, and on peak days it can pass the 50TB.

And yes, in order to support this scale, the delivery pipeline needs to be scalable, reliable and lots of other words that end with “able”.

Pipeline Architecture

As I wrote above, we have multiple data centers (DCs), but in order to simplify the diagram and explanation, I will use only 2 DCs to present our pipeline architecture. As we go through the two architectures (legacy hourly and current RT), you will notice that the only components which remain the same are the edges. The services that produce the data, and the Hadoop where the data ends up.

Another thing worth mentioning is that in order to have full disaster recovery (DR), we have 2 independent Hadoop clusters in separated locations, each one of them gets all the data and processes the same jobs independently, in case one of them goes down, the other continues to work normally.

Hourly Files Pipeline – legacy

The data is collected by the services, and saved to files on the local file system of each service. From there, another process copies those files to our Data Processing System (DPS), which is an in house solution that collects all the data and copies it into the Hadoop. As you can see, it is a very simple architecture without a lot of components, and is quite reliable and robust since the data is saved in files which can be easily recovered in case of any malfunction.

The drawbacks of this pipeline is that the data is moved in chunks of hours and it is not accessible during the duration of the delivery time. It is only available for processing after each hour is completed, and all the data is in the Hadoop. And it also places a heavy burden on the network because the data is transferred in spikes.

Real Time Streaming Pipeline

Replacing the local file system storage, we have a Kafka cluster. And, instead of our in house DPS system we have a MirrorMaker, an aggregated Kafka cluster and a SparkStreaming cluster.

In this pipeline, the data is written directly from the service to the Kafka cluster. From this point all the data is accessible to anyone that wants to use it. Analysts, algo developers, or any other service that can find it useful. Of course it is also available to the MirrorMaker, as part of the pipeline.

The MirrorMaker job is to sync data between Kafka clusters, and in our case, to make sure that each aggregated Kafka cluster will have the entire data from all DCs.

Like before, the data in the aggregated cluster is available to all, especially to our SparkStreaming cluster which runs various jobs that consume the data and writes to the Hadoop.

So it’s clear that by implementing this pipeline, the data is available to all, and it reaches the Hadoop faster.

Obvious Benefits

Now lets go over the obvious benefits of the real time pipeline:

Data availability – since the data is passed through Kafka clusters, it is available to all implementations that need real time data. Here are 2 good examples:
- Recommendation engines can now calculate its algo models with fresher data. This led to a lift of more than 16% in our Revenue Per Mille (RPM).
- We added real time customer facing dashboards which enable them to take required actions faster according to fresh data.
Reduced the latency of the processing system – since the data reach the Hadoop faster, the hourly jobs that wait for the hourly data can start work sooner and to complete their work much closer to the end of the hour. This reduced the overall latency of the processing system and now the business reports for example, are available in a shorter time.

Additional Benefits

And those are the unexpected benefits that we gain from the real time pipeline:

Network capacity – In the hourly files pipeline, the data is moved in hourly batches and at the end of each hour. This means that the network capacity needs to support movement of all data at once. This requirement forces us to allocate the required bandwidth while it is used only for a short time during each hour, wasting expensive resources. The real time pipeline moves the same amount of data incrementally over the course of the hour.

The graphs below demonstrate the bandwidth savings that we made once we moved to the real time pipeline. For example, in the RX graph, you can see that we moved from peaks of 17 GBPs to have flat bandwidth of 7 GBPs, saving 10 GBPs.

Total RX traffic

Total TX traffic

Disaster recovery – the fact that in the hourly based pipeline the data is saved in files on the local file system of the machines that run the services have some limitations.
- Data loss – as we all know, machines have 100% chance of failing at some point. And when you have thousands of machines, the odds are against us. Each time a machine goes down, you have the risk of data loss since all the hourly files may not have been copied into the pipeline yet. In the real time pipeline, the data is written immediately to the Kafka cluster which is more resilient and the risk of the data loss is reduced.
- Late processing – if a machine has recovered from a failure, and you were lucky enough to avoid data loss, the recovered data needs to be processed and in most cases it won’t be done within the time period that the data is related to. This means that this time period will be processed again which adds extra load on the Hadoop and may result in data delays, since the Hadoop needs to process multiple time periods at the same time. Like before, the benefit of the real time pipeline in that aspect is that the data reaches the Hadoop without any delays so there is no reprocessing of any time period.

Conclusions

Having the real time pipeline, our life became much simpler. Beside the planned goals (data availability and reducing the latency), the extra goodies that we got from this change made us less sensitive to any network glitch or hardware malfunction. In the past, each one of those issues forced us to handle data delays and have the risk of data loss, and now the real time pipeline by its nature, solved all of them.

Yes, there are more components that we need to maintain and monitor, but this cost is justified if you compare it to the great results we achieved by implementing this real time system for our data delivery pipeline.

May 17, 2020

Adib Daw

4 Engineers, 7000 Servers, And One Global Pandemic

Inside one of our data centers (tiny COVIDs by CDC on Unsplash)

If this title did not send a slight shiver down your spine, then you should either move on to the next post, or drop by our careers page – we’d love to have a chat.

This post will touch the core of what we do – building up through abstraction. It will demonstrate how design principles, like separation of concerns and inversion of control, manifest in the macro, physical realm, and how software helps bind it all together into a unified, ever developing, more reliable system.

While we’re only touching the tip of the iceberg, we intend to dive into the juicy details in future posts.

Who we are

We are a team of 4 penguins who enjoy writing code and tinkering with hardware. In our spare time, we are in charge of deploying, maintaining and operating a fleet of over 7000 physical servers running Linux, spread across 3 different DCs located in the US.

We also happen to do this 6,762 miles away, from the comfort of our very own cubicle, a short drive from the nearest beach resort overlooking the Mediterranean.

License to use: Creative Commons Zero – CC0.

Challenges of Scale

While it may make sense for a start-up to choose to start with hosting their infrastructure in a public cloud due to the relatively small initial investments, we at Outbrain choose to host our own servers. We do so because the ongoing costs of public cloud infrastructure far surpass the costs of running our own hardware in collocated data centers once a certain scale is reached, and it offers an unparalleled degree of control and fault recovery.

As we grow in scale, challenges are always a short distance away, and they usually come in droves. Managing a server’s life cycle must evolve to contain the rapid increase in the number of servers. Spreadsheet-native methods of managing server parts across data centers become, very quickly, cumbersome. Detecting, troubleshooting and resolving failures, while maintaining reasonable SLAs, becomes an act of juggling vastly diverse hardware arrays, different loads, timing of updates and the rest of the lovely things no one wants to care about.

Master your Domains

To solve many of these challenges, we’ve broken down a server’s life cycle at Outbrain into its primary components, and we called them domains. For instance, one domain encompasses hardware demand, another the logistics surrounding inventory lifecycle, while a third is communication with the staff onsite. There is another one concerning hardware observability, but we won’t go into all of them at this time. The purpose of this is to examine and define the domains, so they can be abstracted away through code. Once a working abstraction is developed, it is translated into a manual process that is deployed, tested and improved, as a preparation for a coded algorithm. Finally, a domain is set up to integrate with other domains via APIs, forming a coherent, dynamic, and ever evolving hardware life cycle system that is deployable, testable and observable. Like all of our other production systems.

Adopting this approach has allowed us to tackle many challenges the right way – by building tools and automations.

The Demand Domain

While emails and spreadsheets were an acceptable way to handle demand in the early days, they were no longer sustainable as the number of servers and volume of incoming hardware requests reached a certain point. In order to better organize and prioritize incoming requests in an environment of rapid expansion, we had to rely on a ticketing system that was:

Could be customised to present only relevant fields (simple)
Exposed APIs (extendable)
Familiar to the team (sane)
Integrated with our existing workflows (unified)

Since we used JIRA to manage our sprints and internal tasks, we decided to create another project that would help our clients submit tickets and monitor their progress. Relying on JIRA for both incoming requests and internal task management, allowed us to set up a consolidated Kanban board that gave us a unified view. Our internal customers, on the other hand, had a view that presented only hardware request tickets, without exposing “less relevant details” of additional tasks within the team (such as enhancing tools, bug fixing and writing this blog post).

(JIRA Kanban board)

As a bonus, the fact that queues and priorities were now visible to all, gave visibility into “where in line” each request stood, what came before it, what stage it was in, and allowed owners to shift priorities within their own requests without having to talk to us. As simple as a “drag and drop”. It has also allowed us to estimate and continuously evaluate our SLAs according to the types of request, based on metrics generated by JIRA.

The Inventory Lifecycle Domain

You could only imagine the complexity of managing the parts that go in each server. To make things worse, many parts (memory, disk) can find themselves traveling back and forth from inventory to different servers and back, according to the requirements. Lastly, when they fail, they are either decommissioned and swapped, or returned to the vendor for RMA. All of this, of course, needs to be communicated to the colocation staff onsite who do the actual physical labor. To tackle these challenges we have created an internal tool called Floppy. Its job is to:

Abstract away the complexities of adding/removing parts whenever a change is introduced to a server
Manage communication with the onsite staff including all required information – via email / iPad (more on that later)
Update the inventory once work has been completed and verified

The inventory, in turn, is visualized through a set of Grafana dashboards, which is what we use for graphing all our other metrics. So we essentially use the same tool for inventory visualization as we do for all other production needs.

(Disk inventory management Grafana dashboard)

When a server is still under warranty, we invoke a different tool we’ve built, called Dispatcher. It’s jobs are to:

Collects the system logs
Generate a report in the preferred vendor’s format
Open a claim with the vendor via API
Return a claim ID for tracking purposes

Once the claim is approved (typically within a business day), a replacement part is shipped to the relevant datacenter, and is handled by the onsite staff.

(Jenkins console output)

The Communication Domain

In order to support the rapid increases in capacity, we’ve had to adapt the way we work with the onsite data center technicians. When at first growing in scale meant buying new servers, following a large consolidation project (powered by a move to Kubernetes), it became something entirely different. Our growth transformed from “placing racks” to “repurposing servers”. So instead of adding capacity, we started opening up servers and replacing their parts. To successfully make this paradigm shift, we had to stop thinking about Outbrain’s colocation providers as vendors, and begin seeing them as our clients. Adopting this approach meant that we would need to design and build the right toolbox to help make the work of data center technicians:

Simple
Autonomous
Efficient
Reliable

All the while abstracting away the operational differences between our different colocation providers, and the seniority of the technicians at each location. We had to remove ourselves from the picture, and let them communicate with the server without our intervention and without making assumptions involving the workload, work hours, equipment at hand, and other factors.

To tackle this we had set up iPad rigs in each data center. Once plugged into a server, the following would happen:

The mechanism validates that this is indeed a server that needs to be worked on
The application running on the server is shutdown (where needed)
A set of work instructions is posted to a Slack channel explaining the required steps
Once work is completed, the tool validates the final state is correct
And if needed, restarts the application

In addition, we have also introduced a Slack bot meant to service the technician, with an ever expanding set of capabilities to make their work smoother and our lives easier. By doing so, we’ve turned most of the process of repurposing and servicing servers into an asynchronous one, removing ourselves from the loop.

(iPad rig at one of our data centers)

The Hardware Observability Domain

Scaling our datacenter infrastructure reliably requires good visibility into every component of the chain, for instance:

Hardware fault detection
Server states (active, allocatable, zombie, etc)
Power consumption
Firmware levels
Analytics on top of it all

Metric based decisions allow us to decide on how, where and when to procure hardware, at times before we even get the demand. It also allows us to better distribute resources such as power, by identifying workloads which are more demanding. Thus, we can make informed decisions as to server placement before a server is racked and plugged into the power, throughout its maintenance cycles, and up to its eventual decommissioning.

(Rack power utilization dashboard in Grafana)

And then came COVID-19…

At Outbrain, we build technologies that empower media companies and publishers across the open web by helping visitors discover relevant content, products and services that may be of interest to them. Therefore, our infrastructure is designed to contain the traffic generated when major news events unfold.

The media coverage of the events surrounding COVID-19 coupled with the increase in visitors traffic has meant that we had to rapidly increase our capacity to cope with these volumes. To top it off, we had to do this facing a global crisis, where supply chains were disrupted, and a major part of the workforce was confined to their homes.

But, as we described, our model already assumes that:

The hardware in our data centers, for the most part, is not physically accessible to us
We rely on remote hands for almost all of the physical work
Remote hands work is done asynchronous, autonomously and in large volume
We meet hardware demand by method of “building from parts” vs “buying ready kits”
We keep inventory to be able to build, not just fix

So the global limitations preventing many companies from accessing their physical data centers had little effect on us. And as far as parts and servers go – yes, we scrambled to secure hardware just like everyone else, but it was to ensure that we don’t run out of hardware as we backfill our reserves rather than to meet the already established demand.

In summary, I hope that this glimpse into our world of datacenter operations shows one can apply the same principles of good code design to the physical domains of datacenter management, and live to tell.

Want to hear more about the way we run our data centers? Stay tuned for the next part!

March 11, 2020

Dafna Frank Szarfman

I was a developer. Now I’m a Product Manager. Why and how.

Photo by davisco on Unsplash

I love writing code. I love everything about it. Coding is like solving puzzles. You get a problem and you need to think how to solve it. Think up the best solution, optimize it, debug it, squash bugs. I love writing code so much that I spent 20 years working as a developer. I even got used to waking up at night with a deep understanding that I just found a bug in my code. I love it! And despite having had several opportunities during my career to shift to other positions, I always decided to stick to coding. In a sense, I was afraid of not getting my hands dirty, of stepping away from the keyboard.

Product Hunt

When I joined Outbrain as a backend software developer in the Platform group, I took ownership of a backend service called Dyploma. Dyploma is a deployment gateway built on top of Kubernetes, enabling developers to build artifacts and deploy them to multiple data centers and environments with the click of a button. It makes our developers’ lives super easy.

Dyploma is a product with 200 users. A live product which is being used constantly, with feature requests, bugs and direct impact on the production services of Outbrain. With so many users, requirements and impact, we realized we wanted a product oriented approach to managing and evolving it. But we had no prior experience managing products, nor any product managers in the group. Our solution to this challenge was to introduce the Product Leads.

Product leads were an attempt to get the best of both worlds – keep the technical people doing the tech stuff they love and excel at, but also allow them to take on Product Manager responsibilities part-time. Dyploma wasn’t the only product in the Platform group where we wanted to implement a product-oriented approach, we had others as well. So when Platform leadership invited people to volunteer for these Product Lead roles, due to my deep sense of involvement with the product I was developing, I decided to take the challenge head on.

I took it upon myself to be the product lead for Dyploma, in parallel to being its backend developer. I met with the users, heard their wishes and complaints, mapped the different flows they needed, the new features they wanted, prioritized them… then switched hats to “backend developer” to do the implementations. It was fun, exciting, difficult, satisfying – it was everything I loved about writing code, but… not necessarily writing code.

Since I started doing something I was not familiar with, I decided I should start learning about what it takes to be a product manager. I have worked with product managers but to actually be one is different, and it had (and still has) its learning curve.

The Lean Startup

I learned that there is a constant loop of Learn->Build->Measure (Lean Startup anyone?).

For us, the “Learn” stage included creating a focus group, meeting the customers, documenting their requests and feedback.

One of the big advantages of working on internal products is that you can meet your customers all the time (should be handled with care, so it does not become a pain….). I would walk in the corridors and people would jump at me and start bombarding me with ideas and problems that they had. Corridor chats are nice but we needed more formalized communication channels to make communication scale, so I created focus groups and held periodic meetings with them. I used the documentation of these meetings when prioritizing features.

I used users stories to help me focus on what was really important, on the Minimum Viable Product (MVP). I then sat with a focus group of developers and we reviewed these user stories for feedback and validation. A slack channel for more ad-hoc documentation (and a paper trail), and boy, was it busy…

Next, the ‘Build’ stage. I figured this would be easy, since I was also the developer. But working with the UI team and seeing how the backend and frontend code work together toward a common goal, required going beyond just writing code. It required synchronization and team work.

Once built, came the ‘Measure’ stage. At first, this didn’t seem important – “if you build it, they will come” felt intuitively right. But that wasn’t the case. After implementing one specific, large feature, we found that user adoption was there… but we didn’t get the value we wanted. Partially because we didn’t define the value clearly enough, or had KPIs in place to measure it. In fact, once we implemented the KPIs, it seemed we were moving in the opposite direction!

Our learnings here lead us to implement a “KPIs first” approach, which wasn’t simple or necessarily intuitive. Instead of working on the feature, we would first work on the metrics, define the measurements and only then move onto building the features. This allowed us to “focus on the target” at all times and change direction quickly if needed. Another set of KPIs would focus on feature impact (for example, # of affected users) and implementation effort (high level estimation is mostly good enough). These reflect the feature’s perceived value and cost, so we could not only measure whether the feature moves the needle, but also understand how to prioritize the different features and manage our resources.

Measuring KPIs was true for customer satisfaction as well – we started gathering customer feedback, and were happy to see that our customers played along. They felt they were being heard and given a mandate to influence product direction.

To close the loop, following measurements, we returned to the ‘Learn’ stage, plotting the next steps in the product’s evolution.

Present day

Fast forward a year and a half – the Product Leads initiative was a partial success. Unfortunately, the number of Product Leads that kept the addon role over time wasn’t high. It was… one. But wherever we had a Product Lead, we got what we asked for :-). We had good user interaction, a product roadmap and a regular release cadence centered around it. This was the proof of the hypothesis we had in mind. As an infrastructure group we build products! And every product, no matter if it is internal or external, should be managed as one.

From this point on, we decided to invest more in this direction. I moved on to a role of full time Product Manager in the Platform group. While writing these lines, I manage 5 different products, developed by multiple teams, with more to come!

Looking back, here are some of my insights on what made this adventure a successful one.

Personal Skills – Like in many cases, a great human interface makes everything work better. The ability to build good relationships with different types of customers, make them feel they can approach you anytime and give feedback. Building these relationships will also enable you to facilitate better coordination and collaboration between the different teams.
Being a good Listener – As any good Product Manager, you need to collect ideas, listen to problems and needs, get feedback – and for all of that to be done well, you need to be a good listener. This usually involves keeping quiet while others talk 🙂
Being a good Investigator – Sometimes your customers are “too nice”. It might be that they don’t feel comfortable passing criticism, or they simply did not put the effort to think about it. They might not even realize they feel uncomfortable with some aspects of your product. To overcome this obstacle, you need to learn how to ask the hard questions. Look for the criticism, even if it doesn’t feel great, and focus on it. Dig deep, listen and look for hints.
Methodology – Working with different people, getting requirements all around, and choosing the right things to focus on – you must have a system if you want to get those right. When a VP has a feature request – is it more important than the feature request of the single developer? The system is more than just the priorities, it is also the KPIs, making sound decisions and holding conversations based on data and not on beliefs. The lighthouse should always be the value you bring to the product, and not necessarily the requestor’s role, how often they ask for it, or other “psychological bias factors”.

This past year was quite a ride – I learned a lot, gained exciting new experiences, failed and succeeded countless times. There’s still so much for me to learn, but one thing I already know – I love being a Product Manager!

January 15, 2020

Dori Shmuel

Upgrading your datacenter network with the push of a button

Intro

In this blog post, I am going to share with you the story of how we automated switch upgrades, to perform them at scale. First, however, I recommend reading one of our previous blog posts, to get a better understanding of our network environment: Switches, Penguins and One Bad Cable

To upgrade?

It all started with a PagerDuty call – a failed switch in one of our datacenters. One support ticket later, and our vendor says it’s time for a firmware upgrade. On all of our switches. We flinched. We’ve done this many times before and somehow, there were always edge cases, weird states, and other loose ends. Upgrading a single switch? it was a hassle we were used to, which thankfully didn’t happen too often. Frankly, not a big deal. But upgrading a fleet of 200? uh…

Just to get you up to speed, our network environment is based on Clos fabric, or more specifically, a leaf spine topology. BGP is the dynamic routing protocol that glues it all together.

We’re using Cumulus Linux as the operating system of our switches, Chef to manage their configuration and Jenkins to automate provisioning. In case you haven’t read the article in the intro and you’re wondering what I mean, here’s a link to it again.

So let’s jump straight to it. How do we upgrade switch firmwares at Outbrain? Simple. We don’t. We re-provision them instead. In case you’re not sure what I mean, I’ll clarify.

Switch Re-provisioning Process

Cumulus Linux is a Linux distribution designed for switches. As such, you can manage it using standard Linux tools, and you can also install it using standard Linux methodologies, such as using PXE boot. Except in the case of switches, PXE boot is replaced with the Open Network Install Environment (ONIE).

Once a switch is instructed to re-provision, on the next boot the following things will happen:

The switch requests an IP address via DHCP
The DHCP server acks and responds with DHCP option 114 for the location of the installation image
The switch uses ONIE to download the Cumulus Linux disk image, installs, and reboots
Success! You are now running Cumulus Linux

By the end of this process, we have an updated firmware (OS) version running on the switch without any special configuration. Just like a brand new switch.

“But what about the config?” you might ask. Excellent question! In a word, ZTP. In three words, “Zero Touch Provisioning”:

The switch boots to a fresh Cumulus Linux version (see step 4 of ONIE above)
The management interface (eth0) is configured by default for DHCP
eth0 sends a DHCP request
The DHCP server returns a URL for a ZTP script
The ZTP script installs the Chef client, configs and runs it
Chef-client runs our homemade run_list

Once the Chef client run completes successfully, the switch is fully configured with all the relevant configuration tailored for it. BGP sessions are established and the switch will be back in production.

Now that you have a grasp of the process, here’s what it actually takes to re-provision a switch:

“onie-select” selects ONIE mode to use during the next boot
“-i” means “Install Boot”: at the next reboot, load ONIE in ‘install’ mode. This installs a new operating system, overwriting the current one
“-f” means “Force the operation”: assumes “yes” to any questions the script might ask
“&& reboot” will reboot the switch only if the previous onie-select command succeeded

15 minutes later, and the switch is back in the game, running an updated firmware and its own tailored config. Pretty cool, right? Now we just have to do that to 74 switches, which is what we have in a datacenter.

15 minutes times 74 is…

… a lot. Especially if you add some validations pre and post “re-provisioning”, which bring you to 30 minutes per switch. Multiplied by 74? 2220 minutes, which is 37 hours. That’s a lot of time to be babysitting switches. Not to mention failures, edge cases, missing a switch or two…

And that’s just one datacenter. Multiplied by 3, that’s 6660 minutes, which is 111 hours. So if you spend 8 hours every day doing just this, one switch at a time, with nothing failing at all, it would take you about 2 weeks of repetitive work to do the full upgrade.

There must be a better way. A smarter way. A faster way. A way that is:

Automated
Robust
Hands-free
Auditable

A better way

3 days later and we have something to show for. Allow me to introduce you to our “Switch Upgrade Automation” process (notice that “has a catchy name” wasn’t part of the requirements).

The “switch upgrade automation” process is a Ruby-based program running inside a docker container. It is built as a Jenkins job which is executed periodically. The program covers all the necessary validations to pick a valid candidate switch, re-provision it and make sure the switch is ready to be put back into production.

Diving into the process, each Jenkins build will spawn a docker container, getting the following arguments (there’s actually more, but we’ve chosen the main ones):

Datacenter name
Switch type: leaf or spine
Switch OS version to upgrade from
The 1st or 2nd switch in the rack

Automated and Robust

We first check if there is already a running build, as we don’t want two switches re-provisioning at the same time. We took the conservative approach and decided to run the process serially. Also, we don’t want to move on to the next switch if the previous one failed, as there may be a problem with the process, environment or switch itself.

Next, we build a dynamic Prometheus query that returns a list of all switches running the OS version we’re upgrading from. Using this list we choose a random switch and trigger yet another Prometheus query. The additional query allows us to retrieve the following information about the candidate switch:

Global health status – a metric built out of many local tests running on the switch, including hardware sensors, ztp_exit_status, services status, and PTM
BGP peer status – number of established BGP peers
Number of Active links

Using these values we can then build a data structure containing the switch name, the metric values, and the validation phase: the pre-upgrade status. Remember this structure, we’ll come back to it later.

switch_list.each do |switch|
  puts "Candidate switch for upgrade: #{switch}"

  switch_data_structure = {
    switch => {
      'pre' => {},
      'post' => {}
    }
  }

  switch_data = fetch_switch_data(switch_data_structure, switch, switch_type, 'pre')
  ...

We now have the pre-upgrade status of the candidate switch however, this is not enough. We also need to validate the functionality of the candidate’s sibling switch (the other switch in the rack) before taking any action. We run the same Prometheus query again to retrieve the metrics we’d mentioned above, but this time for the sibling switch. The name of the sibling switch and its metric values are inserted into the pre-upgrade status.

  sibling_switches = get_sibling_switches(switch, switch_type, switch_list)

  sibling_switches.each do |sibling|
    sibling_data_structure = {
      sibling => {
        'pre' => {},
        'post' => {}
      }
    }
    sibling_data = fetch_switch_data(sibling_data_structure, sibling, switch_type, 'pre')
    switch_data.merge!(sibling_data)
  end
  ...

Compare the candidate switch with its sibling: if they match, or the upgrade candidate has fewer active links / BGP neighbors than its sibling (amongst other metrics), we’re good to go – we know we won’t lose connectivity to any of the machines in the rack. This gives us the confidence to run the re-provisioning process while there are active workloads in the rack.

Cross-checking switch metrics with sibling switch: leaf-r32-p1-pod1 <> leaf-r32-p2-pod1

leaf-r32-p1-pod1: ----> {"bgp_peer_status"=>"32","interface_link_status"=>"37","health_global_status"=>"0"}
leaf-r32-p2-pod1: ----> {"bgp_peer_status"=>"32","interface_link_status"=>"37","health_global_status"=>"0"}
Status: Pre upgrade validation passed!

Sending changelog for switch leaf-r32-p1-pod1
Sending leaf-r32-p1-pod1 to Re-provisioning, command is '/root/run_switch_provision.sh -F'

And if validation fails? the process would skip this candidate switch and elect a new candidate instead.

Hands-free and auditable

Just before rebooting our candidate switch, we silence alerts, so as to not disrupt our on-call while the process runs in the background. Also, to make sure we know “what happened, when and why”, we use our in-house changelog mechanism to fire a “machinelog” event for future auditing. What are these changelog and machinelog things? They’re another blog post waiting to be written.

With all pre-provisioning validations, alert masking and audit events out of the way, it’s showtime! We trigger the Switch Re-provisioning Process for our elected candidate switch and send it for a reboot.

Once rebooted, we monitor whether the switch accepts connections on TCP port 9100, which is the Prometheus node exporter port. Why? Because we’re using Chef to install the Prometheus node exporter, which gives us a good indication that Chef (which is the final provisioning step) completed successfully. In addition, node exporter is in charge of exposing switch metrics, on which we rely for post-provisioning validation. Once we manage to establish a connection, we know the switch is up again, running the new OS.

Last but not least, we need to validate that our switch is in the same state as it was prior to re-provisioning. We want to make sure we haven’t lost any links, peers, etc. How do we do that?

Remember the pre-upgrade status we saved when we started the validation process? It’s time to get the post-upgrade status and compare the two. Basically, it means running the same query as before, building the data structure and running a diff. If all is well, the process completed successfully and we can move on to the next candidate. If not, we abort and notify a team member that something went wrong.

Retrieving switch metrics: leaf-r32-p1-pod1 Upgrade_phase: post
Prometheus query: count(fabric_switch_bgp_peer_status{instance=~"leaf-r32-p1-pod1"}==0) - Prometheus response: 2
Prometheus query: count(fabric_switch_bgp_peer_status{instance=~"leaf-r32-p2-pod1"}==0) - Prometheus response: 2

Prometheus query: count(fabric_switch_interface_link_status{instance=~"leaf-r32-p1-pod1"}==2) - Prometheus response: 7
Prometheus query: count(fabric_switch_interface_link_status{instance=~"leaf-r32-p2-pod1"}==2) - Prometheus response: 7

Prometheus query: fabric_switch_health_global_status{instance=~"leaf-r32-p1-pod1"}==0 - Prometheus response: 0
Prometheus query: fabric_switch_health_global_status{instance=~"leaf-r32-p2-pod1"}==0 - Prometheus response: 0

Data for switch: leaf-r32-p1-pod1.nydc1.outbrain.com was validated SUCCESSFULY after upgrade

So what just happened?

If you’ve read carefully so far, you might notice that we haven’t gone into time savings in this whole process. We chose not to parallelise re-provisioning at this time, to make sure our blast radius, in case of failure, is limited to a single rack. However, we removed the human factor from the process, which means it can run the full 2 weeks in the background and let us know when things are done (or not working as expected). This, in turn, frees us to deal with the more important aspects of running a production system at scale.

November 5, 2019

Alex Balk

Why we’re not using Kubernetes (kind of)

Intro

Kubernetes is the best thing since sliced bread. Everyone is talking about its internal parts, how to use it, best practices and the latest and greatest supporting tools for it.

This isn’t a story about how great Kubernetes is (yes, okay, it’s great). It’s about our journey into the realm of large scale deployments, and why we’re not using Kubernetes. Okay, we are, but we hide it.

This blog post was put together by Shahaf Sages, Dafna Frank and Alex Balk.

In the beginning

A long time ago in the early 2000s, in a startup far-far away, there was a lone Java developer. The developer had a mission. It was an ambitious mission, one that was not for the faint of heart. His mission was to write the most beautiful Java code ever written, and make some sweet startup money along the way.

Actually, this developer wasn’t alone. And frankly, his code wasn’t that good, but it worked, and he, and his fellow “lone Java developers” needed to get this code to production. And so they did, using the tools they knew best: copy & paste… but in the Linux variant, called “scp”. And the sweet startup money flowed. Or at least trickled a bit.

It’s good, but is it good enough?

This worked well for a while, but not very well or for very long. Fairly quickly, problems started showing up. Problems that needed more code to be written, uploaded and managed on production machines. Which too started to add up. Version control, which was so common in the process of writing code, was desperately needed in the process of deployment. And rollbacks, because the beautiful code was sometimes moody. And some way to keep tab of these (also moody) machines where the code was running. Because sometimes they had an annoying tendency to die or just stop responding.

And so it was decided by the great powers of infrastructure development that be. There will be no more “scp”ing, as it was declared to be manual and thus the root of all evil. Instead, a system shall be born – a deployment system, with artifact version control, build management, rollbacks, progress bars, a model to account for all machines and services, and access control to grant permissions to the lucky few who shall unleash their code onto production. And it was dubbed GluFeeder, for it was based on the Open Source Glu framework from LinkedIn which was state of the art at the time. And it was GOOD.

It works so well, why touch it?

Until, 8 years later, it wasn’t all that good anymore. But that took a long time, and everyone got used to it and it hid away many problems. So if it works don’t touch it, right?

Maybe not. Problems were abundant:

The small (not so much) startup ran on physical machines, and each service had an entire machine for its own, which meant quite a bit of waste
The small (not so much) startup was moving towards microservices because they’re cool and scalable and async and shiny, so said waste was about to get out of hand because there were now a lot of services
The moody physical machines were still moody and whenever they decided they didn’t want to work anymore, all the services running on them just died in nasty, nasty ways
The model for describing “what service runs where” was handled in one big yaml file and all the developers had their sticky fingers touching it directly, with no validation
Adding more service instances was really really REALLY painful because the new physical machines that were ordered took a long, long time to arrive at the (not so small) startup’s datacenters

And so it was decided by the great powers of infrastructure development that be. There will be no more GluFeeding because it worked but wasn’t “10x scale”. Instead, a system shall be born – a deployment system, with:

Support for containers, because the Java developers were making friends with JavaScript and Python and Go developers
A per-service model stored in a real database with all the metadata goodness that describes how to build, run and operate micro, macro or mega services
A resource management system, the best Open Source had to offer, to manage the moody server resources and ensure code was running even if servers went away – the almighty Kubernetes
Orchestration of many, many resource managers (Kubernetes clusters), across many, many datacenters, or at least 3
A well defined contract between the services and the environment they run in, so that everyone gets their metrics, logs, environment variables and properties in the format and flavor they prefer, without dealing with the gory, gruesome details

A poll was run and the people voted. The name that was chosen was Dyploma (DYnamic dePLOyment MAnagement system), which only shows that democracy doesn’t always work very well.

One small step for dev, one giant leap for devops

In the course of a year and a half, the infrastructure developers and the Java early adopter developers met every week to present, discuss and test what was being developed. A Python CLI was chosen as the initial interface for the system, as it was quick to develop and required little UX skills. Little by little the features were added, tweaked and tuned. And little by little confidence was built in the new system and the great Kubernetes beast which it controlled through the scriptures of the fabric8 java client. Until the Python CLI was no longer enough and a Vue.js Web UI was added instead.

Great care was taken in the design to ensure only metadata was kept within Dyploma, so as not to contaminate it with duplicate state of the Kubernetes beasts. And so Dyploma was lean on data and mostly just passed orders to Kubernetes, Prometheus, Consul, Jenkins, TeamCity, Bitbucket and anyone else it could boss around. The system would let the developer:

Describe the service’s endpoints
Provider special build parameters
Set runtime information, such as environment, cluster, number of instances
Build, deploy, scale up/down and disable with one click
View what’s running, where, how much, why and who gave the order
And just plain hide away all of the underlying systems details because the developers were lazy and spoiled and we LOVE them that way

No, really, let’s unfold that last statement

The developers didn’t need to know anything about making the underlying infrastructure work. They just got it all for free. Gift wrapped with a nice web UI. Once they had their service defined, all they had to do was decide how much, which version and where, and pay the bill (okay, not yet, but it’s coming). There were no yaml files, no Helm charts, no configMaps or anyone called Jason. There was a single place to view the runtime status of a service, its history of changes, its logs, graphs and a single place to control it all, which even had batch operations but that’s just showing off.

And so simplicity was restored, velocity was increased, stability was a welcome side-effect and much cost was saved through better resource utilisation and reuse of aging moody machine hardware. And it was GOOD.

Until, one day, a lone Java developer had an idea.

“Why don’t we ditch Dyploma and use Kubernetes instead?”, he said. “I’ve read that it’s the best thing since sliced bread.” And the infrastructure developers just stared.

Show me the money

Every Kubernetes blog post that respects itself shows off some numbers and yaml files.

We have no yaml files to show you, but we do respect ourselves, so here are some numbers:

GluFeeder

Machines managed: 2000

Services running on the managed machines: 2500

Unique types of services: 150

Dyploma

Machines managed: 1600

Services running on the managed machines: 7500

Unique types of services: 400

Kubernetes deployments: 2300

EPILOGUE

Since you’ve read up to here, we’ll assume that you’re interested in getting some insights around building a deployment system (vs yet more Kubernetes tips & tricks), so we’ll give you some of our inputs:

We wanted flexibility AND simplicity. This isn’t cheap. If you want developers to “just write code”, you’ll have to have other developers “just write infra”.
Deployment systems are built for users. Bring the users onboard for the ride if you’re building one.
Use the terminology of the system you’re relying on. If it’s called a pod, call it a pod. Don’t call it a FLDSMDFR. We called a “deployment” a “service”. Don’t.
Kubernetes is complicated. So is your runtime context (at least when you’re big). The challenge is in using the former to contain the latter, while simplifying it for the users. This means that the user should be able to say the absolute minimum and get sane defaults, but also be able to override everything in the runtime without having to speak any Kubernetes. Simple, right?
Protect yourself. People will make mistakes. They will put the SVN version number in the replicas field. And you will cry.

Looking forward, this is what we’re working on these days:

Horizontal autoscaling. Because after you’re done migrating, you start optimising. And you want it simple enough for devs to “click here”.
Deployment A/B testing. All the levers are there, but you have to build them into a usable tool.
Requests, limits and “let me set that for you”. Because they don’t necessarily mean what you think they mean.
Jobs. Because crons are running wild and it doesn’t hurt us now, but only because we’re not looking.
Exposing cost to owners. Because nothing is free, not even your own bare metal.
Open Source. Because the world needs this.

March 4, 2019

Gerardo Laracuente

Migrating Servers in Our Sleep

The Cloud is an Illusion

Cloud service providers have enabled innovations in many areas of our society from the way we watch movies to the way we share media with our family and friends. Companies can focus on amazing products without having to worry about managing data centers, servers, networking equipment, and all of the complexities therein.

But the cloud is an illusion, just an abstraction. Behind every cloud are data centers full of countless racks of servers, routers, firewalls, and engineers who design, deploy, and manage them. There is physical infrastructure that powers the technology we enjoy everyday and, at Outbrain, we run the majority of our workloads on bare metal infrastructure. We are, for that matter, our own cloud provider. This post is a peak behind the curtain of how we do this, at scale.

It was a sunny day in California…

In the summer of 2017, our new West Coast data center went live, along with our first fully automated 10G network deployment. It was a great success, which is why we wanted to also roll it out in our other data centers. The new mission was to migrate every server from our legacy 1G network to our shiny new 10G network. Essentially, this meant installing 10G network cards and running new twinax cabling to a good number of thousands of servers. No biggie… except:

The vast majority of those servers run production workloads
Many of them run stateful applications such as data stores (mysql, elasticsearch, etc.)
Many of these stateful applications can only tolerate a limited amount of downtime before they start shuffling (a lot of) data around
Servers need to be migrated around the clock, with no downtime to any cluster
The application and networking engineers are in Israel
The physical server engineers are in NYC
There is a 7 hour time difference between NYC and Israel (the workweek also only overlaps for 4 days of the week)
The on-site technicians who work on our servers cannot log into them for reboots, health checks, etc.

How many Engineers does it take to…?

Let’s take a look at everything that goes into a manual migration of one server, so we can better understand the task at hand. We will narrow the focus to one specific case: migrating a mysql node to the new network.

The people involved:

Gerry – Data Center Server Engineer (NYC)
Yuval – Data Storage Engineer (Israel)
Adi – Networking Engineer (Israel)
Mike – On-site Remote Hands technician (non Outbrain employee)

The process:

Yuval removes the mysql server from the cluster and makes sure that the cluster is still healthy.
Gerry properly powers down the server, and sends the location details to the remote hands team. Based on the location of the server within the rack, he also lets them know which switch ports to use.
Mike opens up the server, installs the 10G network card, and runs redundant twinax cabling to the 10G switches. He powers it back up, connected to both the legacy and the 10G networks.
Adi prepares the server to join the 10G network and reboots it for the final changes to take effect. After the reboot, he checks that the server is indeed part of the new network and that both twinax connections are up and stable.
Yuval adds the server back to the mysql cluster and makes sure everything looks healthy.
Mike removes the old RJ-45 cables and waits for Yuval and Gerry to prepare and shut down the next server (to avoid multi-node shutdown in the same cluster).

* This is a simplified explanation of the process, and assumes everything goes as smooth as possible. It involves 3 Outbrain engineers to be available across a major time zone difference + on-site remote hands. That’s 4 engineers to migrate a single node.

And now that we’re done, there’s only… a few thousand servers left to migrate…running on various types of hardware, with different versions of Linux, different applications and operating under different availability restrictions.

Which begs the question – Will it Scale?

Putting Remote Hands in the Driver Seat

After a few iterations, this is what the process looks like from Outbrain’s perspective:

Gerry executes a Rundeck job with one mouse click, and then emails Mike a list of server locations.
Gerry goes to sleep.

This is the process from the perspective of a remote hands technician at the data center:

Mike receives a list of server locations, and plugs an iPad into the first server on the list.
The iPad is running Slack and the chat room starts displaying new messages. It lets Mike know that the server is attempting to safely stop mysql and power itself down.
The server shuts itself down, but right before it goes down, it sends a message to the Slack channel explaining all the next steps, including which switch ports to connect to server to.
Mike installs the 10G network card, finishes up the cabling, and powers the server back up.
The server runs the network preparation scripts, reboots itself, checks the network status, starts up mysql, runs health checks, and lets Mike know that it’s time to remove the old cabling and move on to the next server.

Meanwhile, Gerry and Yuval’s teams are getting updates via email every time a server begins and succeeds the migration process. They can monitor the Slack channel during the process, or even look back at all of the migrations in a Kibana dashboard. If anything ever goes wrong, the iPad communicates that Mike should stop and contact Outbrain. The iPad won’t take any wrong actions if it is plugged into the wrong server or reseated at any point.

How We Built It

When the iPad is plugged into the server, it is recognized by a udev rule. This triggers a wrapper script that contains all of the migration logic. Here’s a deeper look at the individual steps and components:

Rundeck is used to put the desired servers into maintenance mode. It does this by touching empty files onto the servers that indicate that they are in maintenance mode, and they are in the first stage of the network migration process.

Chef contains all of the necessary scripts to run the migration. This includes the udev rules and the wrapper, networking, and application specific scripts. The Chef recipe chooses the proper pre and post migration scripts based on the role of the server.

UDEV Rules are very powerful, and we use them define what happens when the iPad is plugged into a server:

SUBSYSTEM=="usb",
ACTION=="add", 
ENV{ID_SERIAL}=="Apple_Inc._iPad_averylonguniqueserialnumber",
RUN+="/path/to/wrapper_script.sh"

This rule roughly translates to: “When a usb device with this serial number in plugged into the server, run the following wrapper script”

A custom startup script is what triggers the wrapper script when the server boots back up after a shutdown or reboot to perform the next migration step.

The wrapper script holds all of the logic and functionality of the process.

It only runs if a maintenance file exists on this server.
It checks for a state file, and depending on what it finds, understands which part of the migration process to run next.
It sends the event logs (for the Kibana dashboard), emails (to the proper teams who manage this server), and Slack messages (via webhooks).
It handles the reboots and shutdowns.
When it runs an application specific pre/post migration script, it’s expecting to get an exit status of 0 (or else it will let everyone involved know that something went wrong).
When it runs the network preparation script (which is worthy of its own blog post), it can make a few decisions based on the exit status. It can rings the alarms, move on to the next step, or even let the remote hands technician know that one of the twinax cables seems loose and should be fully seated.
It cleans up by removing all migration related files, including itself.

More time to build more automation =]

Now we can set a bunch of servers to maintenance mode, hand off the list of server locations to remote hands, and continue our daily work while they are migrated.

And since this works so well using the current approach, work is underway to generalise the concept and make it available to many other types of physical maintenance.

So that our cloud can continue to build itself… while we sleep.

November 25, 2018

Avi Youkhananov

CodinGame Story One – The key for creativity and happiness in developers life

Photo by Juan Gomez on Unsplash

“Keep a developer learning and they’ll be happy working in a windowless basement eating stale food pushed through a slot in the door. And they’ll never ask for a raise.” — Rob Walling (https://robwalling.com/2006/10/31/nine-things-developers-want-more-than-money/)

The past decade has produced substantial research verifying what may come as no surprise: developers want to have fun. While we also need our salaries, salaries alone will not incentivize us developers who, in most cases, entered a field to do what we love: engage in problem-solving. We like competition. We like winning. We like getting prizes for winning. To be productive, we need job satisfaction. And job satisfaction can be achieved only if we get to have fun using the skills we were hired to use.

We wanted to keep the backend developers challenged and entertained.
That’s why Guy Kobrinsky and I created our own version of Haggling, whose basic idea we adapted from Hola, a negotiation game.

The Negotiation Game:

Haggling consists of rounds of negotiations between pairs of players. Each pair’s goal is to maximize score in the following manner:

Let’s say there are a sunglasses, two tickets, and three cups on the table. Both players have to agree on how to split these objects between them. To one, the sunglasses may be worth $4, a ball $2, and the tickets are worthless. The opponent might value the same objects differently; while the total worth of all the objects is the same for both players, their valuation kept secret

Both players take turns making offers to each other about how to split the goods. A proposed split must distribute all objects between partners such that no items are left on the table. On each turn, one can either accept an offer or make a counter-offer. If after 9 offers an agreement is reached, every player receives the amount that its portion of the goods is worth, according to the assigned values. If there is still no agreement after the last turn, both players receive no points.

The Object of the Game:

Write code to obtain a collection of items with the highest value by negotiating items with an opponent player.

User Experience:

We wanted it to be as easy as possible for players to submit, play and test their code.
Therefore, we decided to keep player code simple – not relying on any third-party libraries.
To do this, we built a simple web application for testing and submitting code, supplying a placeholder with the method “accept” – the code that needs to be implemented by the different participants. The “accept” method describes a single iteration within the negotiation, in which each player must decide if they will accept the offer given to them (by returning null or the received offer) – or return a counter offer.

To assist in verifying the players’ strategy, we added a testing feature allowing players to run their code vs some random player. Developers were able to play around with it, re-implementing the code before actual submission.

Java Code Example:

[gist id= 8e870dad5baeec79cbda4be5f56617f6 file=HagglingCode.java]

Test Your Code and Submit Online:

Tournament And Scoreboard:

Practice tournaments ran continuously for two weeks, taking all submitted players into account and allowing developers to see their rank. During this time, competitors were able edit their code. So there was plenty of time to learn and improve.

We also provided analytics for every player. Developers were able to analyze and improve their strategy.

At the end of the two weeks, we declared a code freeze and the real tournament took place. Players’ final score was determined only from the results of the real tournament, not the practice tournaments.

Game Execution And Score:

We executed the game tournament using multiple agents – each of the agents was reported to Kibana:

The Back-Stage:

Where did we store players’ code?
We decided to store all players’ code in S3 of AWS to avoid revealing the code to other players.

What languages were supported?

We started with Java only, but players expressed interest in using Scala and Kotlin as well. So we gave these developers free rein to add support for those languages, which we then reviewed before integrating into the base code. Ultimately, developers were able to play in all three languages.

What was the scale of Haggling?

In the final tournament, 91 players competed in 164 million rounds in which 1.14 billion “accepts” were called. The tournament was executed on 45 servers, having 360 cores and using 225G of memory.

The greatest advantage of our approach was our decision to use Kubernetes, enabling us to add more nodes, as well as tune their cores and memory requirements. Needless to say, it was no problem to get rid of all these machines when the game period ended.

How did the tournament progress?

The tournament was tense, and we saw a lot of interaction with the game over the two weeks.
The player in the winning position changed every day, and the final winner was not apparent until very near the end (and even then we were surprised!).
We saw a variety of single-player strategies with sophisticated calculations and different approaches to gameplay.
Moreover, in contrast to the original game, we allowed gangs: groups of players belonging to a single team that can “help” each other to win.

So how do you win at haggling?

The winning strategy was collaborative – the winning team created two types of players: the “Overlord” which played to win, and several “Minions” whose job was to give points to the Overlord while blocking other players. The Overlord and Minions recognized each other using a triple handshake protocol, based on mathematical calculations of the game parameters. Beyond this, the team employed a human psychological strategy – hiding the strength of the Overlord by ensuring that for the majority of the development period the Overlord went no higher than third place. They populated the game with “sleeper cells” – players with basic strategies ready to turn into minions at the right moment. The upheaval occurred in the final hour of the game when all sleepers were converted to minions.

The graph shows the number of commits in the last hour before the code freeze:

Hats Off to the Hacker: who got the better of us?

During the two weeks, we noticed multiple hacking attempts. The hacker’s intent was not to crash the game, but rather to prove that it is possible and make a lesson of it.
Although it was not our initial intent, we decided to make hacking part of the challenge and to reward the hacker for demonstrated skills and creativity.

On the morning of November 7th, we arrived at the office and were faced with the following graph of the outcomes:

The game had been hacked! As can be seen in the graph, one player was achieving an impossible success rate. What we discovered was the following: the read-only hash map that we provided as method argument to players was written in Kotlin; but, when players converted the map to play in either Java or Scala, the resulting conversion rendered a mutable hash map, and this is how one of the players was able to modify the hash map. We had failed to validate the preferences, ensuring that the hashmap values that players turned in used the same values as the original.

In conclusion, This is exactly the sort of sandbox experience, however, that makes us better, safer, and smarter developers. We embraced the challenge.

Want to play with us? Join Outbrain and challenge yourself.

September 6, 2018

Amir Matar

Increase Your Velocity with a Safe Automatic Deployment

At Outbrain we work at a fast pace trying to combine the challenges of developing new features fast, while also maintaining our systems so that they can cope with the constant growth of traffic. We deliver many changes on a daily basis to our production and testing environments so our velocity is much affected by our DevOps tools. One of the tools we use the most is the deployment tool since every new artifact must be deployed to simulation and staging environments and pass its test before it can be deployed to production. The simulation environment is used for running E2E integration tests. These tests simulate real use cases and they involve all relevant services. The staging environment is actually a single production machine (AKA a canary machine) which receives a small portion of the traffic in production. It allows us to make sure the new version is working properly in the production environment before we deploy it to the rest of the production servers. In this session, you’ll find out how we increased velocity with a safe automatic deployment of high scale services.

Our deployment flow

The illustration above depicts the flow each code change must pass until it arrives in production.

A developer commits code changes and triggers a “build & deploy” action that creates an artifact for the requested service and deploys it to our simulators servers. Once an hour, a build in TeamCity runs the simulation tests of our services.

If the developer doesn’t want to wait for the periodic run, they need to run the simulation tests manually. Once the build passes, the developer is allowed to deploy the artifact to the staging server. At this point, we verify that the staging server behaves properly by reviewing various metrics of the server, and by checking the logs of that server.

For instance, we verify that the response time hasn’t increased and that there are no errors in the log. Once all these steps are completed, the new version is deployed to all production servers. This whole process can take 30-45 minutes.

As one can see, this process has a lot of problems:

It requires many interventions of the developer.
The developer either spends time waiting for actions to complete in order to trigger the next ones or they suffer from context switches which slow them down.
The verification of the version in staging is done manually hence

It’s time-consuming.
There is no certainty that all the necessary tests are made.
It’s hard to share knowledge among team members of what the expected result of each test is.

The new automatic pipeline

Recently we have introduced a pipeline in Jenkins that automates this whole process. The pipeline allows a developer to send code changes to any environment (including production) simply by committing them into the source control while ensuring that these changes don’t break anything.

The illustration below shows all stages of our new pipeline

Aside from automating the whole process, which was relatively easy, we had to find a way to automate the manual tests of our staging environment. As mentioned, our staging servers serve real requests coming from our users.

Some of our services handle around 2M requests per minute so any bad version can affect our customers, our users, and us very quickly. Therefore we would like to be able to identify bad versions as soon as possible. To tackle this issue, our pipeline starts running health tests on our staging servers 5 minutes after the server goes up since sometimes it takes time for the servers to warm up.

The tests which are executed by TeamCity, pull a list of metrics of the staging server from our Prometheus server and verify that they meet the criteria we defined. For example, we check that the average response time is below a certain number of milliseconds. If one of these tests fail, the pipeline fails. At that point, the developer who triggered the pipeline receives a notification e-mail so that they can look into it and take the decision whether the new version is bad and revert it, or maybe the tests need some more fine-tuning and the version is okay to deploy to the rest of the servers.

The pipeline ends when the new version is deployed to production but this doesn’t necessarily mean that the version is 100% okay, although the chances that the version is not okay at this stage are low.

For the purpose of ensuring our production servers function properly, many periodic tests constantly monitor the servers and trigger alerts in case of a failure and allow us to react fast and keep our services available.

What we gained

The automated deployment process ensures the quality of our deliveries and that they don’t break our production servers.
Reduction of time developers spends on DevOps tasks.
The decision whether a version in staging is okay is more accurate as it is based on comparable metrics and not on a subjective decision of the developer.
The developer doesn’t need to remember which metrics to check for each service in order to tell whether a service functions properly.

Category: High Scale

Introduction

Who are we?

What we do?

The challenge

The “before” picture

The “after” picture

Behind the scenes

Return on investment

A short summary

What’s next

Content Fraud – The Major Challenge for Content Networks

Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

The Technology Challenge

Crawling Our Way Out

Rapid Response Saves the Day

What Doesn’t Kill Us, Makes Us Stronger

Overview

Pipeline Architecture

Hourly Files Pipeline – legacy

Real Time Streaming Pipeline

Obvious Benefits

Additional Benefits

Conclusions

Who we are

Challenges of Scale

Master your Domains

The Demand Domain

The Inventory Lifecycle Domain

The Communication Domain

The Hardware Observability Domain

And then came COVID-19…

Product Hunt

The Lean Startup

Present day

Intro

To upgrade?

Switch Re-provisioning Process

15 minutes times 74 is…

A better way

Automated and Robust

Hands-free and auditable

So what just happened?

Intro

In the beginning

It’s good, but is it good enough?

It works so well, why touch it?

One small step for dev, one giant leap for devops

No, really, let’s unfold that last statement

Show me the money

EPILOGUE

The Cloud is an Illusion

It was a sunny day in California…

How many Engineers does it take to…?

The people involved:

The process:

Putting Remote Hands in the Driver Seat

How We Built It

More time to build more automation =]

The Negotiation Game:

The Object of the Game:

User Experience:

Java Code Example:

Test Your Code and Submit Online:

Tournament And Scoreboard:

Game Execution And Score:

The Back-Stage:

How did the tournament progress?

So how do you win at haggling?

Hats Off to the Hacker: who got the better of us?

Our deployment flow

As one can see, this process has a lot of problems:

The new automatic pipeline

What we gained

Search

עברית

Categories

Archive

RSS