May 17, 2020

Adib Daw

4 Engineers, 7000 Servers, And One Global Pandemic

Inside one of our data centers (tiny COVIDs by CDC on Unsplash)

If this title did not send a slight shiver down your spine, then you should either move on to the next post, or drop by our careers page – we’d love to have a chat.

This post will touch the core of what we do – building up through abstraction. It will demonstrate how design principles, like separation of concerns and inversion of control, manifest in the macro, physical realm, and how software helps bind it all together into a unified, ever developing, more reliable system.

While we’re only touching the tip of the iceberg, we intend to dive into the juicy details in future posts.

Who we are

We are a team of 4 penguins who enjoy writing code and tinkering with hardware. In our spare time, we are in charge of deploying, maintaining and operating a fleet of over 7000 physical servers running Linux, spread across 3 different DCs located in the US.

We also happen to do this 6,762 miles away, from the comfort of our very own cubicle, a short drive from the nearest beach resort overlooking the Mediterranean.

License to use: Creative Commons Zero – CC0.

Challenges of Scale

While it may make sense for a start-up to choose to start with hosting their infrastructure in a public cloud due to the relatively small initial investments, we at Outbrain choose to host our own servers. We do so because the ongoing costs of public cloud infrastructure far surpass the costs of running our own hardware in collocated data centers once a certain scale is reached, and it offers an unparalleled degree of control and fault recovery.

As we grow in scale, challenges are always a short distance away, and they usually come in droves. Managing a server’s life cycle must evolve to contain the rapid increase in the number of servers. Spreadsheet-native methods of managing server parts across data centers become, very quickly, cumbersome. Detecting, troubleshooting and resolving failures, while maintaining reasonable SLAs, becomes an act of juggling vastly diverse hardware arrays, different loads, timing of updates and the rest of the lovely things no one wants to care about.

Master your Domains

To solve many of these challenges, we’ve broken down a server’s life cycle at Outbrain into its primary components, and we called them domains. For instance, one domain encompasses hardware demand, another the logistics surrounding inventory lifecycle, while a third is communication with the staff onsite. There is another one concerning hardware observability, but we won’t go into all of them at this time. The purpose of this is to examine and define the domains, so they can be abstracted away through code. Once a working abstraction is developed, it is translated into a manual process that is deployed, tested and improved, as a preparation for a coded algorithm. Finally, a domain is set up to integrate with other domains via APIs, forming a coherent, dynamic, and ever evolving hardware life cycle system that is deployable, testable and observable. Like all of our other production systems.

Adopting this approach has allowed us to tackle many challenges the right way – by building tools and automations.

The Demand Domain

While emails and spreadsheets were an acceptable way to handle demand in the early days, they were no longer sustainable as the number of servers and volume of incoming hardware requests reached a certain point. In order to better organize and prioritize incoming requests in an environment of rapid expansion, we had to rely on a ticketing system that was:

Could be customised to present only relevant fields (simple)
Exposed APIs (extendable)
Familiar to the team (sane)
Integrated with our existing workflows (unified)

Since we used JIRA to manage our sprints and internal tasks, we decided to create another project that would help our clients submit tickets and monitor their progress. Relying on JIRA for both incoming requests and internal task management, allowed us to set up a consolidated Kanban board that gave us a unified view. Our internal customers, on the other hand, had a view that presented only hardware request tickets, without exposing “less relevant details” of additional tasks within the team (such as enhancing tools, bug fixing and writing this blog post).

(JIRA Kanban board)

As a bonus, the fact that queues and priorities were now visible to all, gave visibility into “where in line” each request stood, what came before it, what stage it was in, and allowed owners to shift priorities within their own requests without having to talk to us. As simple as a “drag and drop”. It has also allowed us to estimate and continuously evaluate our SLAs according to the types of request, based on metrics generated by JIRA.

The Inventory Lifecycle Domain

You could only imagine the complexity of managing the parts that go in each server. To make things worse, many parts (memory, disk) can find themselves traveling back and forth from inventory to different servers and back, according to the requirements. Lastly, when they fail, they are either decommissioned and swapped, or returned to the vendor for RMA. All of this, of course, needs to be communicated to the colocation staff onsite who do the actual physical labor. To tackle these challenges we have created an internal tool called Floppy. Its job is to:

Abstract away the complexities of adding/removing parts whenever a change is introduced to a server
Manage communication with the onsite staff including all required information – via email / iPad (more on that later)
Update the inventory once work has been completed and verified

The inventory, in turn, is visualized through a set of Grafana dashboards, which is what we use for graphing all our other metrics. So we essentially use the same tool for inventory visualization as we do for all other production needs.

(Disk inventory management Grafana dashboard)

When a server is still under warranty, we invoke a different tool we’ve built, called Dispatcher. It’s jobs are to:

Collects the system logs
Generate a report in the preferred vendor’s format
Open a claim with the vendor via API
Return a claim ID for tracking purposes

Once the claim is approved (typically within a business day), a replacement part is shipped to the relevant datacenter, and is handled by the onsite staff.

(Jenkins console output)

The Communication Domain

In order to support the rapid increases in capacity, we’ve had to adapt the way we work with the onsite data center technicians. When at first growing in scale meant buying new servers, following a large consolidation project (powered by a move to Kubernetes), it became something entirely different. Our growth transformed from “placing racks” to “repurposing servers”. So instead of adding capacity, we started opening up servers and replacing their parts. To successfully make this paradigm shift, we had to stop thinking about Outbrain’s colocation providers as vendors, and begin seeing them as our clients. Adopting this approach meant that we would need to design and build the right toolbox to help make the work of data center technicians:

Simple
Autonomous
Efficient
Reliable

All the while abstracting away the operational differences between our different colocation providers, and the seniority of the technicians at each location. We had to remove ourselves from the picture, and let them communicate with the server without our intervention and without making assumptions involving the workload, work hours, equipment at hand, and other factors.

To tackle this we had set up iPad rigs in each data center. Once plugged into a server, the following would happen:

The mechanism validates that this is indeed a server that needs to be worked on
The application running on the server is shutdown (where needed)
A set of work instructions is posted to a Slack channel explaining the required steps
Once work is completed, the tool validates the final state is correct
And if needed, restarts the application

In addition, we have also introduced a Slack bot meant to service the technician, with an ever expanding set of capabilities to make their work smoother and our lives easier. By doing so, we’ve turned most of the process of repurposing and servicing servers into an asynchronous one, removing ourselves from the loop.

(iPad rig at one of our data centers)

The Hardware Observability Domain

Scaling our datacenter infrastructure reliably requires good visibility into every component of the chain, for instance:

Hardware fault detection
Server states (active, allocatable, zombie, etc)
Power consumption
Firmware levels
Analytics on top of it all

Metric based decisions allow us to decide on how, where and when to procure hardware, at times before we even get the demand. It also allows us to better distribute resources such as power, by identifying workloads which are more demanding. Thus, we can make informed decisions as to server placement before a server is racked and plugged into the power, throughout its maintenance cycles, and up to its eventual decommissioning.

(Rack power utilization dashboard in Grafana)

And then came COVID-19…

At Outbrain, we build technologies that empower media companies and publishers across the open web by helping visitors discover relevant content, products and services that may be of interest to them. Therefore, our infrastructure is designed to contain the traffic generated when major news events unfold.

The media coverage of the events surrounding COVID-19 coupled with the increase in visitors traffic has meant that we had to rapidly increase our capacity to cope with these volumes. To top it off, we had to do this facing a global crisis, where supply chains were disrupted, and a major part of the workforce was confined to their homes.

But, as we described, our model already assumes that:

The hardware in our data centers, for the most part, is not physically accessible to us
We rely on remote hands for almost all of the physical work
Remote hands work is done asynchronous, autonomously and in large volume
We meet hardware demand by method of “building from parts” vs “buying ready kits”
We keep inventory to be able to build, not just fix

So the global limitations preventing many companies from accessing their physical data centers had little effect on us. And as far as parts and servers go – yes, we scrambled to secure hardware just like everyone else, but it was to ensure that we don’t run out of hardware as we backfill our reserves rather than to meet the already established demand.

In summary, I hope that this glimpse into our world of datacenter operations shows one can apply the same principles of good code design to the physical domains of datacenter management, and live to tell.

Want to hear more about the way we run our data centers? Stay tuned for the next part!

September 2, 2019

Doug Chimento

The First Rule of Distributed Tracing Is…

Distributed Tracing

Distributed Tracing is a mechanism to collect individual requests across micro-services boundaries. It also enables instrumentation of application latency (how long each request took), tracking the life-cycle of network calls (HTTP, RPC, etc) and also identify performance issues by getting visibility on bottlenecks. Plenty of articles are on the inter-webs describing various implementations of Distributed Tracing. This blog will focus on our experience rolling out tracing into our ecosystem with tales of what not to do.

What we need to solve

Outbrain has hundreds of micro-services running on top of Kubernetes. We have metrics via Prometheus and Netflix Hystrix dashboard to monitor overall health, but we lack visibility on business requests. For example, when a user clicks an ad, we want to see the exact flow. In particular, we may want to see the bid request/response to our partners for performance or fraud reasons. But we don’t have something that allows us to examine an individual request. This is what tracing solves for us; the ability to capture exact business logic for a specific request. Tracing can answer such questions:

Did I hit cache or go to a datastore? What services were used for this specific business logic request?

We implemented tracing with Jaeger to visualize and store our traces:

The First Rule

The first rule of tracing is no one should know about tracing. Tracing should be behind a magical curtain of abstraction; all without developers knowledge. Depending on your organization development practices and usage of core infrastructure (e.g. HTTP clients, RPC framework, database drivers…etc), this maybe difficult. But if the majority of your organization uses one or two HTTP clients (i.e. OkHttp, Apache HTTP) it is rather trivial instrument those clients with tracing capabilities without much developer intervention. By instrumenting only HTTP clients and servers you will get a huge benefit with tracing. Think of the 80/20 rule when you plan your tracing implementation.

What not to do

Conceptual, tracing is easy to understand. Operationally, it can turn into a mess. There is an overhead to tracing, albeit small, but still you don’t want to get OutOfMemory errors because of tracing. So don’t by start tracing everything!

Besides, if you start tracing everything what are you going to do with all that data? Depending on your request load this could be huge. Sure, you solved tracing your micro-services, but you created massive amount of data (and hardware) that probably no one is going to look at. In the beginning, keep your operational costs to a bare minimum. You could start by dumping traces into your existing logging infrastructure (i.e. put the trace Id as MDC context in all logging statements). No extra hardware (collectors) required! Like designing software, start small and increment your changes overtime.

Slow and steady wins the race

Begin your tracing quest by providing a mechanism which will “turn on” tracing manually. Suppose you have a REST API, by detecting a special request header, say “X-AdHocTrace”, will inform the service that tracing has been requested. This will enable tracing to all downstream servers as well. We use swagger at Outbrain and added the header automatically:

window.onload = function() {
 const ui = SwaggerUIBundle({
 url: "../api/swagger/apiDocs",
 requestInterceptor: function(request) {
   request.headers['X-AdHocTrace'] = "true";
   return request;
}

When developers are testing their services (locally or remotely) they automatically get all their calls traced when using swagger.

Overtime you can add more sophisticated sampling mechanisms. For instance you could sample all request that result in a 500 HTTP status code. Read more >

March 4, 2019

Gerardo Laracuente

Migrating Servers in Our Sleep

The Cloud is an Illusion

Cloud service providers have enabled innovations in many areas of our society from the way we watch movies to the way we share media with our family and friends. Companies can focus on amazing products without having to worry about managing data centers, servers, networking equipment, and all of the complexities therein.

But the cloud is an illusion, just an abstraction. Behind every cloud are data centers full of countless racks of servers, routers, firewalls, and engineers who design, deploy, and manage them. There is physical infrastructure that powers the technology we enjoy everyday and, at Outbrain, we run the majority of our workloads on bare metal infrastructure. We are, for that matter, our own cloud provider. This post is a peak behind the curtain of how we do this, at scale.

It was a sunny day in California…

In the summer of 2017, our new West Coast data center went live, along with our first fully automated 10G network deployment. It was a great success, which is why we wanted to also roll it out in our other data centers. The new mission was to migrate every server from our legacy 1G network to our shiny new 10G network. Essentially, this meant installing 10G network cards and running new twinax cabling to a good number of thousands of servers. No biggie… except:

The vast majority of those servers run production workloads
Many of them run stateful applications such as data stores (mysql, elasticsearch, etc.)
Many of these stateful applications can only tolerate a limited amount of downtime before they start shuffling (a lot of) data around
Servers need to be migrated around the clock, with no downtime to any cluster
The application and networking engineers are in Israel
The physical server engineers are in NYC
There is a 7 hour time difference between NYC and Israel (the workweek also only overlaps for 4 days of the week)
The on-site technicians who work on our servers cannot log into them for reboots, health checks, etc.

How many Engineers does it take to…?

Let’s take a look at everything that goes into a manual migration of one server, so we can better understand the task at hand. We will narrow the focus to one specific case: migrating a mysql node to the new network.

The people involved:

Gerry – Data Center Server Engineer (NYC)
Yuval – Data Storage Engineer (Israel)
Adi – Networking Engineer (Israel)
Mike – On-site Remote Hands technician (non Outbrain employee)

The process:

Yuval removes the mysql server from the cluster and makes sure that the cluster is still healthy.
Gerry properly powers down the server, and sends the location details to the remote hands team. Based on the location of the server within the rack, he also lets them know which switch ports to use.
Mike opens up the server, installs the 10G network card, and runs redundant twinax cabling to the 10G switches. He powers it back up, connected to both the legacy and the 10G networks.
Adi prepares the server to join the 10G network and reboots it for the final changes to take effect. After the reboot, he checks that the server is indeed part of the new network and that both twinax connections are up and stable.
Yuval adds the server back to the mysql cluster and makes sure everything looks healthy.
Mike removes the old RJ-45 cables and waits for Yuval and Gerry to prepare and shut down the next server (to avoid multi-node shutdown in the same cluster).

* This is a simplified explanation of the process, and assumes everything goes as smooth as possible. It involves 3 Outbrain engineers to be available across a major time zone difference + on-site remote hands. That’s 4 engineers to migrate a single node.

And now that we’re done, there’s only… a few thousand servers left to migrate…running on various types of hardware, with different versions of Linux, different applications and operating under different availability restrictions.

Which begs the question – Will it Scale?

Putting Remote Hands in the Driver Seat

After a few iterations, this is what the process looks like from Outbrain’s perspective:

Gerry executes a Rundeck job with one mouse click, and then emails Mike a list of server locations.
Gerry goes to sleep.

This is the process from the perspective of a remote hands technician at the data center:

Mike receives a list of server locations, and plugs an iPad into the first server on the list.
The iPad is running Slack and the chat room starts displaying new messages. It lets Mike know that the server is attempting to safely stop mysql and power itself down.
The server shuts itself down, but right before it goes down, it sends a message to the Slack channel explaining all the next steps, including which switch ports to connect to server to.
Mike installs the 10G network card, finishes up the cabling, and powers the server back up.
The server runs the network preparation scripts, reboots itself, checks the network status, starts up mysql, runs health checks, and lets Mike know that it’s time to remove the old cabling and move on to the next server.

Meanwhile, Gerry and Yuval’s teams are getting updates via email every time a server begins and succeeds the migration process. They can monitor the Slack channel during the process, or even look back at all of the migrations in a Kibana dashboard. If anything ever goes wrong, the iPad communicates that Mike should stop and contact Outbrain. The iPad won’t take any wrong actions if it is plugged into the wrong server or reseated at any point.

How We Built It

When the iPad is plugged into the server, it is recognized by a udev rule. This triggers a wrapper script that contains all of the migration logic. Here’s a deeper look at the individual steps and components:

Rundeck is used to put the desired servers into maintenance mode. It does this by touching empty files onto the servers that indicate that they are in maintenance mode, and they are in the first stage of the network migration process.

Chef contains all of the necessary scripts to run the migration. This includes the udev rules and the wrapper, networking, and application specific scripts. The Chef recipe chooses the proper pre and post migration scripts based on the role of the server.

UDEV Rules are very powerful, and we use them define what happens when the iPad is plugged into a server:

SUBSYSTEM=="usb",
ACTION=="add", 
ENV{ID_SERIAL}=="Apple_Inc._iPad_averylonguniqueserialnumber",
RUN+="/path/to/wrapper_script.sh"

This rule roughly translates to: “When a usb device with this serial number in plugged into the server, run the following wrapper script”

A custom startup script is what triggers the wrapper script when the server boots back up after a shutdown or reboot to perform the next migration step.

The wrapper script holds all of the logic and functionality of the process.

It only runs if a maintenance file exists on this server.
It checks for a state file, and depending on what it finds, understands which part of the migration process to run next.
It sends the event logs (for the Kibana dashboard), emails (to the proper teams who manage this server), and Slack messages (via webhooks).
It handles the reboots and shutdowns.
When it runs an application specific pre/post migration script, it’s expecting to get an exit status of 0 (or else it will let everyone involved know that something went wrong).
When it runs the network preparation script (which is worthy of its own blog post), it can make a few decisions based on the exit status. It can rings the alarms, move on to the next step, or even let the remote hands technician know that one of the twinax cables seems loose and should be fully seated.
It cleans up by removing all migration related files, including itself.

More time to build more automation =]

Now we can set a bunch of servers to maintenance mode, hand off the list of server locations to remote hands, and continue our daily work while they are migrated.

And since this works so well using the current approach, work is underway to generalise the concept and make it available to many other types of physical maintenance.

So that our cloud can continue to build itself… while we sleep.

September 14, 2018

Maor Frankel

Micro Front Ends - Doing it Angular Style Part 1

1st published on medium

What is it and why do I need it?

Let’s start with the why part, when the days of single page applications started most applications were considerably small and managed by a single FE team, all was well…

With time, applications have gotten larger and larger, and so have the teams managing them. No need to say much about the problems of having large code bases and large teams…

The term Micro Front Ends has been thrown around a lot lately, offering a concept similar to Microservices where we can split a monolithic Front End application to micro applications that can be loaded into a single container application running on the users’ browser.

Each application can have its own code base, be managed by a business-oriented team which can test and deploy their micro-app independently.

(taken from https://micro-frontends.org/)

While the concept itself sounds promising, actual implementations of it are lacking. Especially ones that can apply to an existing large application.

Alternatives

One possible solution is using the good old Iframe, which offers many advantages as far as encapsulation and independence but is an old technology and suffers from significant scale issues.

Other than Iframes, the term Web component has also been floating around for a while.

Web components are a solution where you can create custom DOM elements that can run independently and provide separation and CSS encapsulation, while this sounds like the right direction, Web components are far from an actual solution. They are more of a concept and the features behind it, such as the shadow DOM still lack full browser support.

Our solution

At Outbrain we have been facing the issues that most veteran SPAs are facing, we have a huge FE application, with a large team to manage it, and it’s getting rough.

Seeing that there are no outstanding solutions for MFE in the wild we decided to try to find a solution of our own, one that is quick to implement on our current eco system.

We defined several key points that we saw as necessary for any solution to apply as an MFE.

Standalone mode

Each micro-app should be capable of running completely in standalone mode, so each team in charge of a given app should be able to run it independently from other applications.

This means each application should be hosted on a separate codebase and be able to run locally on a developer’s computer, and in dev and tests environment.

Deployment

It should be possible to deploy each service independently to any environment including production in order to allow freedom for the owner team to work without interfering with other teams, If a bug fix needs to be deployed to production on the weekend no other team should have to be involved.

Testing

Running tests independently on each micro-app, that way a bug in one app is easily identified and doesn’t reflect on other apps. That said It is necessary to have some integration tests to check the interfaces between apps and make sure they are not broken, these tests should be monitored by all teams.

One to many

We want to be able to use each micro-app multiple times, a micro-app should not care where it runs, only be aware of its input and outputs.

Runtime separation and encapsulation

It is critical that each app is sandboxed in the runtime environment, so apps don’t interfere with each other, this includes CSS encapsulation, JS namespacing and HTML separation.

Common resources sharing

Since we don’t want to have to load large modules like Angular, lodash and CSS styles multiple times in the app it is important for micro apps to be able to share common resource between them, that said, we also want to be able to allow them to encapsulate resources that are only relevant to them, or encapsulate resources that have different versions across apps, for instance app A uses lodash 3 and app B wants to migrate to lodash 4. We don’t want to wait for all apps to migrate before it can upgrade.

Communication

There needs to be a decoupled way for the apps to communicate with each other without actually knowing about each other, only via predefined interfaces and API’s.

Backwards compatibility

Since we are not going to rewrite our huge code base, we need something we can plugin to our current system and gradually separate parts that can be managed by other teams.

Web Components

Finally, Any solution we go for should be aligned as much as possible with the web components concept, even though it is currently just that, a concept, it seems that that is the way the future is heading, and if any solution pop up in the future, aligning ourselves with this will help us adopt future solution.

Part 2

In the next part, I will go into details about how we achieved this and lived to write about it.
The ingredients for the next part include Angular, Webpack and a couple of tasty loaders.
Part 2

December 28, 2017

admin

Live Tail in Kubernetes / Docker Based environment

At Outbrain we are big believers in Observability.

What is Observability, and what is the difference between Observability and Monitoring? I will leave the explanation to Baron Schwartz @xaprb:

“Monitoring tells you whether the system works. Observability lets you ask why it’s not working.”

@ Outbrain we are currently in the midst of migrating to a Kubernetes / Docker based environment.

This presented many new challenges around understanding why things don’t work.

In this post I will be sharing with you our logging implementation which is the first tool used to understand the why.

But first thing first, a short review of our current standard logging architecture:

We use a standard ELK stack for the majority of our logging needs. By standard I mean Logstash on bare metal nodes, Elasticsearch for storage and Kibana for visualizing and analytics. Apache Kafka is transport layer for all of the above.

A very simplified sketch of the system: Live Tail in Kubernetes

Of course the setup is a bit more complex in real life since Outbrain’s infrastructure is spread across thousands of servers, in multiple physical data centers and cloud providers; and there are multiple Elasticsearch clusters for different use cases.

Add to the equation that these systems are used in a self-serve model, meaning the engineers are creating and updating configurations by themselves – and you end up with a complex system which must be robust and resilient, or the users will lose trust in the system.

The move to Kubernetes presented new challenges and requirements, specifically related to the logging tools:

Support multiple Kubernetes clusters and data centers.
We don’t want to us “kubectl”, because managing keys is a pain especially in a multi cluster environment.
Provide a way to tail logs and even edit log file. This should be available on a single pod or across a service deployed in multiple pods.
Leverage existing technologies: Kafka, ELK stack and Log4j on the client side
Support all existing logging sources like multiline and Json.
Don’t forget services which don’t run in Kubernetes, yes we still need to support those.

So how did we meet all those requirements? Time to talk about our new Logging design.

The new architecture is based on a standard Kubernetes logging setup – Fluentd daemonset running on each Kubelet node, and all services are configured to send logs to stdout / err instead of a file.

The Fluentd agent is collecting the pod’s logs and adding the Kubernetes level labels to every message.

The Fluentd plugin we’re using is the kubernetes_metadata_filter.

After the messages are enriched they are stored in a Kafka topic.

A pool of Logstash agents (Running as pods in Kubernetes) are consuming and parsing messages from Kafka as needed.

Once parsed messages can be indexed into Elasticsearch or routed to another topic.

A sketch of the setup described:

And now it is time to introduce CTail.

Ctail, stands for Containers Tail, it is an Outbrain homegrown tool written in Go, and based on a server and client side components.

A CTail server-side component runs per datacenter or per Kubernetes cluster, consuming messages from a Kafka topic named “CTail” and based on the Kubernetes app label creates a stream which can be consumed via the CTail client component.

Since order is important for log messages, and since Kafka only guarantees order for messages in the same partition, we had to make sure messages are partitioned by the pod_id.

With this new setup and tooling, when Outbrain engineers want to live tail their logs, all they need to do is launch the CTail client.

Once the Ctail client starts, it will query Consul, which is what we use for service discovery, to locate all of the CTail servers; register to their streams and will perform aggregations in memory – resulting in a live stream of log entries.

Here is a sketch demonstrating the environment and an example of the CTail client output:

CTail client output

To view logs from all pods of a service called “ob1ktemplate” all you need is to run is:

# ctail-client -service ob1ktemplate -msg-only

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: uri http://localhost:8181/Ob1kTemplate/ returned status code 200
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: Getting uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate'
2017-06-13T19:16:25.531Z ob1ktemplate-test-ssages-2751568954-n1rte: uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate' returned status code 200

Or logs of a specific pod:

# ctail-client -service ob1ktemplate -msg-only -pod ob1ktemplate-test-ssages-2751568960-n1kwd

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri 
http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751568960-n1kwd: uri http://localhost:8181/Ob1kTemplate/ returned status code 200

This is how we solve this challenge.

Interested in reading more about other challenges we encountered during the migration? Either wait for our next blog, or reach out to visibility at outbrain.com.

November 8, 2017

Avi Youkhananov

Keep bugs out of production

Production bugs are painful and can severely impact a dev team’s velocity. My team at Outbrain has succeeded in implementing a work process that enables us to send new features to production free of bugs, a process that incorporates automated functions with team discipline.

Why should I even care?

Bugs happen all the time – and they will be found locally or in production. But the main difference between preventing and finding the bug in a pre-production environment is the cost: according to IBM’s research, fixing a bug in production can cost X5 times more than discovering it in pre-production environments (during the design, local development, or test phase).

Let’s describe one of the scenarios happen once a bug reaches production:

A customer finds the bug and alerts customer service.
The bug is logged by the production team.
The developer gets the description of the bug, opens the spec, and spends time reading it over.
The developer then will spend time recreating the bug.
The developer must then reacquaint him/herself with the code to debug it.
Next, the fix must undergo tests.
The fix is then built and deployed in other environments.
Finally, the fix goes through QA testing (requiring QA resources).

How to stop bugs from reaching production

To catch and fix bugs at the most time-and-cost efficient stage, we follow these steps, adhering to the several core principles:

How to stop bugs from reaching production

Stage 1 – Local Environment and CI

Step 1: Design well. Keep it simple.

Create the design before coding: try to divide difficult problems into smaller parts/steps/modules that you can tackle one by one, thinking of objects with well-defined responsibilities. Share the plans with your teammates at design-review meetings. Good design is a key to reducing bugs and improving code quality.

Step 2: Start Coding

The code should be readable and simple. Design and development principles are your best friends. Use SOLID, DRY, YAGNI, KISS and Polymorphism to implement your code.
Unit tests are part of the development process. We use them to test individual code units and ensure that the unit is logically correct.
Unit tests are written and executed by developers. Most of the time we use JUnit as our testing framework.

Step 3: Use code analysis tools

To help ensure and maintain the quality of our code, we use several automated code-analysis tools:
FindBugs – A static code analysis tool that detects possible bugs in Java programs, helping us to improve the correctness of our code.
Checkstyle – Checkstyle is a development tool to help programmers write Java code that adheres to a coding standard. It automates the process of checking Java code.

Step 4: Perform code reviews

We all know that code reviews are important. There are many best practices online (see 7 Ways to Up-Level Your Code Review Skills, Best Practices for Peer Code Review, and Effective Code Reviews), so let’s focus on the tools we use. All of our code commits are populated to ReviewBoard, and developers can review the committed code, see at any point in time the latest developments, and share input.
For the more crucial teams, we have a build that makes sure every commit has passed a code review – in the case that a review has not be done, the build will alert the team that there was an unreviewed change.
Regardless of whether you are performing a post-commit, a pull request, or a pre-commit review, you should always aim to check and review what’s being inserted into your codebase.

Step 5: CI

This is where all code is being integrated. We use TeamCity to enforce our code standards and correctness by running unit tests, FindBugs validations Checkstyle rules and other types of policies.

Stage 2 – Testing Environment

Step 1: Run integration tests

Check if the system as a whole work. Integration testing is also done by developers, but rather than testing individual components, it aims to test across components. A system consists of many separate components like code, database, web servers, etc.
Integration tests are able to spot issues like wiring of components, network access, database issues, etc. We use Jenkins and TeamCity to run CI tests.

Step 2: Run functional tests

Check that each feature is implemented correctly by comparing the results for a given input with the specification. Typically, this is not done at the development level.
Test cases are written based on the specification, and the actual results are compared with the expected results. We run functional tests using Selenium and Protractor for UI testing and Junit for API testing.

Stage 3 – Staging Environment

This environment is often referred to as a pre-production sandbox, a system testing area, or simply a staging area. Its purpose is to provide an environment that simulates your actual production environment as closely as possible so you can test your application in conjunction with other applications.
Move a small percentage of real production requests to the staging environment where QA tests the features.

Stage 4 – Production Environment

Step 1: Deploy gradually

Deployment is a process that delivers our code into production machines. If some errors occurred during deployment, our Continuous Delivery system will pause the deployment, preventing the problematic version to reach all the machines, and allow us to roll back quickly.

Step 2: Incorporate feature flags

All our new components are released with feature flags, which basically serve to control the full lifecycle of our features. Feature flags allow us to manage components and compartmentalize risk.

Step 3: Release gradually

There are two ways to make our release gradual:

We test new features on a small set of users before releasing to everyone.
Open the feature initially to, say, 10% of our customers, then 30%, then 50%, and then 100%.

Both methods allow us to monitor and track problematic scenarios in our systems.

Step 4: Monitor and Alerts

We use the ELK stack consisting of Elasticsearch, Logstash, and Kibana to manage our logs and events data.
For Time Series Data we use Prometheus as the metric storage and alerting engine.
Each developer can set up his own metrics and build grafana dashboards.
Setting the alerts is also part of the developer’s work and it is his responsibility to tune the threshold for triggering the PagerDuty alert.
PagerDuty is an automated call, texting, and email service, which escalates notifications between responsible parties to ensure the issues are addressed by the right people at the right time.

stop bugs
All in All,
Don’t let the bugs fly out of control.

June 13, 2017

admin

Failure Testing for your private cloud – Introducing GomJabbar

TL;DR Chaos Drills can contribute a lot to your services resilience, and it’s actually quite a fun activity. We’ve built a tool called GomJabbar to help you run those drills.

Failure Testing for your private cloud

Here at Outbrain we manage quite a large scale deployment of hundreds of services/modules, and thousands of hosts. We practice CI/CD, and implemented quite a sound infrastructure, which we believe is scalable, performant, and resilient. We do however experience many production issues on a daily basis, just like any other large scale organization. You simply can’t ensure a 100% fault-free system. Servers will crash, run out of disk space, and lose connectivity to the network. Software will experience bugs, and erroneous conditions. Our job as software engineers is to anticipate these conditions, and design our code to handle them gracefully.

For quite a long time we were looking into ways of improving our resilience, and validate our assumptions, using a tool like Netflix’s Chaos Monkey. We also wanted to make sure our alerting system actually triggers when things go wrong. The main problem we were facing is that Chaos Monkey is a tool that was designed to work with cloud infrastructure, while we maintain our own private cloud.

The main motivation for developing such a tool is that failures have the tendency of occurring when you’re least prepared, and in the least desirable time, e.g. Friday nights, when you’re out having a pint with your buddies. Now, to be honest with ourselves, when things fail during inconvenient times, we don’t always roll our sleeves and dive in to look for the root cause. Many times the incident will end after a service restart, and once the alerts clear we forget about it.

Wouldn’t it be great if we could have “chaos drills”, where we could practice handling failures, test and validate our assumptions, and learn how to improve our infrastructure?

Chaos Drills at Outbrain

We built GomJabbar exactly for the reasons specified above. Once a week, at a well known time, mid day, we randomly select a few targets where we trigger failures. At this point, the system should either auto-detect the failures, and auto-heal, or bypass them. In some cases alerts should be triggered to let teams know that a manual intervention is required.

After each chaos drill we conduct a quick take-in session for each of the triggered failures, and ask ourselves the following questions:

Did the system handle the failure case correctly?
Was our alerting strategy effective?
Did the team have the knowledge to handle, and troubleshoot the failure?
Was the issue investigated thoroughly?

These take-ins lead to super valuable inputs, which we probably wouldn’t collect any other way.

How did we kick this off?

Before we started running the chaos drills, there were a lot of concerns about the value of such drills, and the time it will require. Well, since eliminating our fear from production is one of the key goals of this activity, we had to take care of that first.

"I must not fear.
 Fear is the mind-killer.
 Fear is the little-death that brings total obliteration.
 I will face my fear.
 I will permit it to pass over me and through me.
 And when it has gone past I will turn the inner eye to see its path.
 Where the fear has gone there will be nothing. Only I will remain."

(Litany Against Fear - Frank Herbert - Dune)

So we started a series of chats with the teams, in order to understand what was bothering them, and found ways to mitigate it. So here goes:

There’s an obvious need to avoid unnecessary damage.
- We’ve created filters to ensure only approved targets get to participate in the drills.
  This has a side effect of pre-marking areas in the code we need to take care of.
- We currently schedule drills via statuspage.io, so teams know when to be ready, and if the time is inappropriate,
  we reschedule.
- When we introduce a new kind of fault, we let everybody know, and explain what should they prepare for in advance.
- We started out from minor faults like graceful shutdowns, continued to graceless shutdowns,
  and moved on to more interesting testing like faulty network emulation.
We’ve measured the time teams spent on these drills, and it turned out to be negligible.
Most of the time was spent on preparations. For example, ensuring we have proper alerting,
and correct resilience features in the clients.
This is actually something you need to do anyway. At the end of the day, we’ve heard no complaints about interruptions, nor time waste.
We’ve made sure teams, and engineers on the call were not left on their own. We wanted everybody to learn
from this drill, and when they weren’t sure how to proceed, we jumped in to help. It’s important
to make everyone feel safe about this drill, and remind everybody that we only want to learn and improve.

All that said, it’s important to remember that we basically simulate failures that occur on a daily basis. It’s only that when we do that in a controlled manner, it’s easier to observe where are our blind spots, what knowledge are we lacking, and what we need to improve.

Our roadmap – What next?

Up until now, this drill was executed in a semi-automatic procedure. The next level is to let the teams run this drill on a fixed interval, at a well known time.
Add new kinds of failures, like disk space issues, power failures, etc.
So far, we were only brave enough to run this on applicative nodes, and there’s no reason to stop there. Data-stores, load-balancers, network switches, and the like are also on our radar in the near future.
Multi-target failure injection. For example, inject a failure to a percentage of the instances of some module in a random cluster. Yes, even a full cluster outage should be tested at some point, in case you were asking yourself.

The GomJabbar Internals

GomJabbar is basically an integration between a discovery system, a (fault) command execution scheduler, and your desired configuration. The configuration contains mostly the target filtering rules, and fault commands.

The fault commands are completely up to you. Out of the box we provide the following example commands, (but you can really write your own script to do what suits your platform, needs, and architecture):

Graceful shutdowns of service instances.
Graceless shutdowns of service instances.
Faulty Network Emulation (high latency, and packet-loss).

Upon startup, GomJabbar drills down via the discovery system, fetches the clusters, modules, and their instances, and passes each via the filters provided in the configuration files. This process is also performed periodically. We currently support discovery via consul, but adding other methods of discovery is quite trivial.

When a users wishes to trigger faults, GomJabbar selects a random target, and returns it to the user, along with a token that identifies this target. The user can then trigger one of the configured fault commands, or scripts, on the random target. At this point GomJabbar uses the configured CommandExecutor in order to execute the remote commands on the target hosts.

GomJabbar also maintains a audit log of all executions, which allows you to revert quickly in the face of a real production issue, or an unexpected catastrophe cause by this tool.

What have we learned so far?

If you’ve read so far, you may be asking yourself what’s in it for me? What kind of lessons can I learn from these drills?

We’ve actually found and fixed many issues by running these drills, and here’s what we can share:

We had broken monitoring and alerting around the detection of the integrity of our production environment. We wanted to make sure that everything that runs in our data-centers is managed, and at a well known (version, health, etc). We’ve found that we didn’t compute the difference between the desired state, and the actual state properly, due to reliance on bogus data-sources. This sort of bug attacked us from two sides: once when we triggered graceful shutdowns, and once for graceless shutdowns.
We’ve found services that had no owner, became obsolete and were basically running unattended in production. The horror.
During the faulty network emulations, we’ve found that we had clients that didn’t implement proper resilience features, and caused cascading failures in the consumers several layers up our service stack. We’ve also noticed that in some cases, the high latency also cascaded. This was fixed by adding proper timeouts, double-dispatch, and circuit-breakers.
We’ve also found that these drills motivated developers to improve their knowledge about the metrics we expose, logs, and the troubleshooting tools we provide.

Conclusion

We’ve found the chaos drills to be an incredibly useful technique, which helps us improve our resilience and integrity, while helping everybody learn about how things work. We’re by no means anywhere near perfection. We’re actually pretty sure we’ll find many many more issues we need to take care of. We’re hoping this exciting new tool will help us move to the next level, and we hope you find it useful too 😉

January 9, 2017

Roy Bass

Kibana for Funnel Analysis

How we use Kibana (4) for user-acquisition funnel analysis

Outbrain has recently launched a direct-to-consumer (D2C) initiative. Our first product is a chatbot. As with every D2C product, acquiring users is important. Therefore, optimizing the acquisition channel is also important. The basis of our optimization is analysis.

Our Solution (General Architecture)

Our acquisition funnel spans on 2 platforms (2 web pages and a chatbot). Passing many parameter between platforms can be a challenge, so we chose a more stateful, server-based model. Client requests for a new session Id, together with basic data like IP and User agent. Server stores a session (we use Cassandra in this case) with processed fields like Platform, OS, Country, Referral, User Id. At a later stage the client reports a funnel event for a session Id. The server writes all known fields for the session into 2 storages:

ElasticSearch for quick & recent analytics (Using the standard ELK stack)
Hadoop for long term storage and offline reports

A few example fields stored per event

User Id – An unique & anonymous identifier for a user
Session Id – The session Id is the only parameter passed between funnel steps
Event Type – The specific step in the funnel – serve, view, click
User Agent – Broken down to Platform and OS
Location – based on IP
Referral fields – Information on the context in which the funnel is excercised
A/B Tests variants – The A/B Test variant Ids that are included in the session

Goal of the Analysis: Display most important metrics quickly

Kibana plugin #1: Displaying percent metric

Kibana has several ways of displaying a fraction, but none excel in displaying small numbers. (Pie can be used to visualize fractions, but small). We developed a Kibana plugin for displaying a single metric, in percent format.

funnel step

We use this visualization for displaying the conversion rate of the most interesting part of our funnel.

Kibana plugin #2: Displaying the funnel

We couldn’t find a good way for displaying a funnel so we developed a visualization plugin (honestly, we were eager to develop this, so we did not scan the entire internet..)

Based on the great D3 Funnel by Jake Zatecky, this is a Kibana plugin that displays buckets of events in funnel format. It’s customizable and open-source. Feel free to use it…

Putting it all together

Displaying your most important metrics and the full funnel is nice. Comparing variant A with variant B is very nice. We’ve setup our dashboard to show similar key metrics on 2 versions of the funnel. We always try to run at least 1 A/B test and this dashboard shows us realtime results of our tests.

variant

Cherry on top

A timeline is awesome. If you’re not using it, I suggest trying it.

Viewing your most important metrics over time is very useful, especially when you’re making changes fast. Here’s an example:

Summary

We track a user’s activity by sending events to the server. The server writes these events to ES and Hadoop. We developed 2 Kibana plugins to visualize the most important metrics of our user-acquisition funnel. We can filter the funnel by Platform, Country, OS, Time, Referral, or any other fields we bothered to save. In addition, we always filter by A/B Test variants and compare 2 specific variants.

Category: How To

Who we are

Challenges of Scale

Master your Domains

The Demand Domain

The Inventory Lifecycle Domain

The Communication Domain

The Hardware Observability Domain

And then came COVID-19…

Distributed Tracing

What we need to solve

The First Rule

What not to do

Slow and steady wins the race

The Cloud is an Illusion

It was a sunny day in California…

How many Engineers does it take to…?

The people involved:

The process:

Putting Remote Hands in the Driver Seat

How We Built It

More time to build more automation =]

What is it and why do I need it?

Alternatives

Our solution

Standalone mode

Deployment

Testing

One to many

Runtime separation and encapsulation

Common resources sharing

Communication

Backwards compatibility

Web Components

Part 2

@ Outbrain we are currently in the midst of migrating to a Kubernetes / Docker based environment.

So how did we meet all those requirements? Time to talk about our new Logging design.

And now it is time to introduce CTail.

Why should I even care?

How to stop bugs from reaching production

Stage 1 – Local Environment and CI

Step 1: Design well. Keep it simple.

Step 2: Start Coding

Step 3: Use code analysis tools

Step 4: Perform code reviews

Step 5: CI

Stage 2 – Testing Environment

Step 1: Run integration tests

Step 2: Run functional tests

Stage 3 – Staging Environment

Stage 4 – Production Environment

Step 1: Deploy gradually

Step 2: Incorporate feature flags

Step 3: Release gradually

Step 4: Monitor and Alerts

Chaos Drills at Outbrain

How did we kick this off?

Our roadmap – What next?

The GomJabbar Internals

What have we learned so far?

Conclusion

How we use Kibana (4) for user-acquisition funnel analysis

Our Solution (General Architecture)

Goal of the Analysis: Display most important metrics quickly

Kibana plugin #1: Displaying percent metric

Kibana plugin #2: Displaying the funnel

Putting it all together

Cherry on top

Summary

Search

עברית

Categories

Archive

RSS