November 15, 2020

Ariel Kipervasser

Taming the bear (metal)

Introduction

In this blog post, I am going to share with you the process behind Namaste – an automated tool we built, in order to manage our bare metal machine requests, as part of our BMaaS approach (that’s Bare Metal as a Service).

Who are we?

Before we dive into the new tool we built, let me introduce our team.

We are a team of 4 DevOps engineers based in Israel, managing an on premise environment of 7000 bare metal servers across 3 Data Centers in the US.

But there is a little bit more to it. When we say 7000 servers, it includes:

about 20 different server models
more than 100 different disk models
…memory sticks
…CPUs
…raid cards
..NICs
…PSUs
…and a combination of all of the above

So there are a lot of hardware configuration options and a lot of reusing of the servers.

Since we are located in Israel and our Data Centers are located in the US, in each Data Center, there is a support team available 24/7 to handle everything on the physical side. They take the role of our “Remote Hands and Eyes” with any onsite related work. I will address them as “Remote Hands” in this blog post.

What we do?

We divide our work into 3 main types of tasks:

Planned changes – Our planned changes are managed in Jira and are mainly hardware related tickets opened by our customers – mostly engineers in the Cloud Platform group. We usually divide these tickets into 3 types of tasks:

hardware upgrades
service requests
machine requests (which we’ll discuss in more details later)

These can be upgrading the resources of a server, working on a suspected hardware issue or building a new cluster altogether.

Unplanned changes – Everything which has to be done Now. During working hours it can be a task managed in Jira as a blocking/critical ticket. During off hours it can be a PagerDuty alert.

Internal projects – Projects which our team initiates and decides to manage. These are usually the tasks which our users don’t know or really care about, but those tasks are more interesting and make our users’ lives and our lives much better. In essence, our internal projects prevent unplanned changes and make the planned changes easier to perform. The new tool I’ll discuss falls under this category.

If you’d like to learn more about who we are and how we do things, I strongly encourage you to check out this great blog post by my colleague, Adib Daw.

The challenge

Looking back, a year ago our tasks were distributed like this:

Planned changes – 75%

Unplanned changes – 15%

Internal projects – 10%

This means 90% of the work was manual labor and 10% were projects geared towards automation. The meaning was both relatively slow delivery, much room for human errors and frustration due to the highly repetitive nature of most of the work. Simply put, it just didn’t scale.

Our solution to this problem was to formulate a vision, and present it to our customers to get their buyin. Our customers agreed to take a hit on the delivery times of their requests, so we could focus on the “internal projects” category, with the intention of investing in automation, and thus the velocity and robustness of our execution.

We decided to push towards building as many automated processes as possible, in order to minimise the amount of time spent on planned changes and unplanned changes, since many of these tasks appeared to lend themselves quite well to automation.

The “before” picture

When we began our journey towards automation, the process of choosing the best suited servers for the machine request took us a lot of time.

It looked like this:

Check available hardware and verify we have enough free servers to fulfill the request
Choose the best suited hardware by carefully (and manually) ensuring hardware constraints are met (suitable server form factor, support for required number of drives, RAID adapter, etc)
Reserve the servers in a temporary allocation pool so no one else will use them for something else as they’re being worked on
Allocate the relevant parts in our inventory system
Submit the target configuration into the Remote Hands management system we’d built, so that the onsite technician would have the required information for the task (server setup and required parts)
Open a ticket with the relevant information (server locations, serial number, etc) to the onsite Remote Hands team, to begin work on the hardware configuration changes

All of this had to be done before any actual hands-on work on the hardware even started.

In addition, we also had to deal with failures in the flow:

configuration mistakes
hardware mismatches
human errors
faulty parts

The “after” picture

The process we had worked, but it took a lot of time, sometimes as much as hours per request, depending on how many servers were requested and what changes were needed to their config.

While we had a set of tools that helped bring us to this stage – much improved over the previous process we had, which could take days for large requests – it would keep us afloat but still required manual labor which seemed entirely unnecessary.

Our vision for this process was to simplify it, to look as follows:

Submit a request for the final server config, detailing number of servers per datacenter and their final spec
Drink coffee while Namaste handles everything in the background, including opening a ticket for Remote Hands
Get notified that the request was fulfilled

This simplified process would have code deal with the metadata matching, allocation and reservation, and leave us to handle fulfilment rejects, where human intervention is really necessary.

Behind the scenes

The flow of matching machines per a machine request:

Accept user input – We enter server requirements such as quantity, location and spec
Perform machine matching – An engine we built returns all the servers matching the requirements, some of which may be a partial match (missing a drive, less RAM, etc)
Prioritise – Prioritise from the matched server list, based on:
- Rack awareness
- Server form factor
- Drive form factor
- Number of cores closest to requested spec
- Hardware changes required to bring machines to the requested spec
Reserve machines – Reserve the prioritised servers in a temporary pool to avoid double allocation
Perform parts matching – Decide which parts to use to bring the reserved servers to the requested spec, wherever needed
Reserve parts – Reserve the selected parts in our inventory to avoid double allocation
Set machines to target spec – Mark the servers with the needed changes, so when the onsite support team begins working on fulfilling the work order, they will connect an iPad to each server which will tell them what changes need to be done.
Open a ticket – Open a ticket with all the relevant information to the onsite support team for fulfilment.

Return on investment

The most obvious gain in developing this mechanism is time saving. With the steps of matching hardware and reserving parts being the main points of friction that were removed. Time saving also shortens the wait for hardware request fulfillment, so both the team and the customers win.

Automated matching also reduces human errors, which has impact on time saving as well as on customer experience. In addition, it further allows us to use the best suitable hardware, which improves efficiency by assigning the best matching resource to fulfill the request. Simply put, we no longer assign the “first matching” or “random” servers. Instead, we select the “most fitting” ones, as the matching algorithm is now explicit and repeatable.

A “side effect” of the improved matching process is that a whole class of issues is avoided completely. Those are issues around server <-> part incompatibility, maximum server capacity and generally everything Remote Hands might come across after the request had already been dispatched to them. These kinds of issues are the most expensive time-wise, as they require human to human communication across different time zones. We knew this would help, but the actual impact was astounding.

Finally, the user experience dramatically improved with the introduction of this system:

Users request hardware get their requests fulfilled quicker
Remote Hands come across less issues in the field
Our team deals with much fewer rejects and manual labor

As you can see, there are 3 types of customers here, and all come out on top.

A short summary

A year ago we embarked on a journey to build our BMaaS solution, where Namaste is just one part of the full picture. While this is still a work in progress and we expect it to continue evolving indefinitely, now is a good opportunity to look back on where we started, specifically on what drove us down this path.

At the beginning of this post, I mentioned what our work distribution looked like before we decided to change things around. To remind you, it was something along:

Planned changes – 75%

Unplanned changes – 15%

Internal projects – 10%

A year into the process, the nature of our work has changed dramatically, and the work distribution is now more along:

Planned changes – 15%

Unplanned changes – <0.1%

Internal projects – 85%

It’s important to note that the actual volume of changes hasn’t decreased. In fact, we now handle more hardware fulfillment requests than ever before. But the actual mindful attention these changes require from us has dropped significantly. Instead, we focus mostly on building the system that gets the work done for all of us, shifting us from “moving hardware” to writing software.

What’s next

APIs – by making our BMaaS platform’s APIs open to our users, we expect new and unforeseen use cases to surface, where our users build their own solutions on top of the platform
Web UI – much of what we do is currently accessible via command-line tools. While this works, a web UI could improve the user experience even further and allow building various views which the system doesn’t support as of yet
Events & notifications – our team still serves as a communication pipeline in various parts of the process, as we’re validating different ideas and ensuring everything works as expected. We plan to introduce events and notifications that would remove us entirely from the flow and allow users to act upon certain events, in-person or via their own code
Coffee – maybe move to some decaf. Or tea.

May 17, 2020

Adib Daw

4 Engineers, 7000 Servers, And One Global Pandemic

Inside one of our data centers (tiny COVIDs by CDC on Unsplash)

If this title did not send a slight shiver down your spine, then you should either move on to the next post, or drop by our careers page – we’d love to have a chat.

This post will touch the core of what we do – building up through abstraction. It will demonstrate how design principles, like separation of concerns and inversion of control, manifest in the macro, physical realm, and how software helps bind it all together into a unified, ever developing, more reliable system.

While we’re only touching the tip of the iceberg, we intend to dive into the juicy details in future posts.

Who we are

We are a team of 4 penguins who enjoy writing code and tinkering with hardware. In our spare time, we are in charge of deploying, maintaining and operating a fleet of over 7000 physical servers running Linux, spread across 3 different DCs located in the US.

We also happen to do this 6,762 miles away, from the comfort of our very own cubicle, a short drive from the nearest beach resort overlooking the Mediterranean.

License to use: Creative Commons Zero – CC0.

Challenges of Scale

While it may make sense for a start-up to choose to start with hosting their infrastructure in a public cloud due to the relatively small initial investments, we at Outbrain choose to host our own servers. We do so because the ongoing costs of public cloud infrastructure far surpass the costs of running our own hardware in collocated data centers once a certain scale is reached, and it offers an unparalleled degree of control and fault recovery.

As we grow in scale, challenges are always a short distance away, and they usually come in droves. Managing a server’s life cycle must evolve to contain the rapid increase in the number of servers. Spreadsheet-native methods of managing server parts across data centers become, very quickly, cumbersome. Detecting, troubleshooting and resolving failures, while maintaining reasonable SLAs, becomes an act of juggling vastly diverse hardware arrays, different loads, timing of updates and the rest of the lovely things no one wants to care about.

Master your Domains

To solve many of these challenges, we’ve broken down a server’s life cycle at Outbrain into its primary components, and we called them domains. For instance, one domain encompasses hardware demand, another the logistics surrounding inventory lifecycle, while a third is communication with the staff onsite. There is another one concerning hardware observability, but we won’t go into all of them at this time. The purpose of this is to examine and define the domains, so they can be abstracted away through code. Once a working abstraction is developed, it is translated into a manual process that is deployed, tested and improved, as a preparation for a coded algorithm. Finally, a domain is set up to integrate with other domains via APIs, forming a coherent, dynamic, and ever evolving hardware life cycle system that is deployable, testable and observable. Like all of our other production systems.

Adopting this approach has allowed us to tackle many challenges the right way – by building tools and automations.

The Demand Domain

While emails and spreadsheets were an acceptable way to handle demand in the early days, they were no longer sustainable as the number of servers and volume of incoming hardware requests reached a certain point. In order to better organize and prioritize incoming requests in an environment of rapid expansion, we had to rely on a ticketing system that was:

Could be customised to present only relevant fields (simple)
Exposed APIs (extendable)
Familiar to the team (sane)
Integrated with our existing workflows (unified)

Since we used JIRA to manage our sprints and internal tasks, we decided to create another project that would help our clients submit tickets and monitor their progress. Relying on JIRA for both incoming requests and internal task management, allowed us to set up a consolidated Kanban board that gave us a unified view. Our internal customers, on the other hand, had a view that presented only hardware request tickets, without exposing “less relevant details” of additional tasks within the team (such as enhancing tools, bug fixing and writing this blog post).

(JIRA Kanban board)

As a bonus, the fact that queues and priorities were now visible to all, gave visibility into “where in line” each request stood, what came before it, what stage it was in, and allowed owners to shift priorities within their own requests without having to talk to us. As simple as a “drag and drop”. It has also allowed us to estimate and continuously evaluate our SLAs according to the types of request, based on metrics generated by JIRA.

The Inventory Lifecycle Domain

You could only imagine the complexity of managing the parts that go in each server. To make things worse, many parts (memory, disk) can find themselves traveling back and forth from inventory to different servers and back, according to the requirements. Lastly, when they fail, they are either decommissioned and swapped, or returned to the vendor for RMA. All of this, of course, needs to be communicated to the colocation staff onsite who do the actual physical labor. To tackle these challenges we have created an internal tool called Floppy. Its job is to:

Abstract away the complexities of adding/removing parts whenever a change is introduced to a server
Manage communication with the onsite staff including all required information – via email / iPad (more on that later)
Update the inventory once work has been completed and verified

The inventory, in turn, is visualized through a set of Grafana dashboards, which is what we use for graphing all our other metrics. So we essentially use the same tool for inventory visualization as we do for all other production needs.

(Disk inventory management Grafana dashboard)

When a server is still under warranty, we invoke a different tool we’ve built, called Dispatcher. It’s jobs are to:

Collects the system logs
Generate a report in the preferred vendor’s format
Open a claim with the vendor via API
Return a claim ID for tracking purposes

Once the claim is approved (typically within a business day), a replacement part is shipped to the relevant datacenter, and is handled by the onsite staff.

(Jenkins console output)

The Communication Domain

In order to support the rapid increases in capacity, we’ve had to adapt the way we work with the onsite data center technicians. When at first growing in scale meant buying new servers, following a large consolidation project (powered by a move to Kubernetes), it became something entirely different. Our growth transformed from “placing racks” to “repurposing servers”. So instead of adding capacity, we started opening up servers and replacing their parts. To successfully make this paradigm shift, we had to stop thinking about Outbrain’s colocation providers as vendors, and begin seeing them as our clients. Adopting this approach meant that we would need to design and build the right toolbox to help make the work of data center technicians:

Simple
Autonomous
Efficient
Reliable

All the while abstracting away the operational differences between our different colocation providers, and the seniority of the technicians at each location. We had to remove ourselves from the picture, and let them communicate with the server without our intervention and without making assumptions involving the workload, work hours, equipment at hand, and other factors.

To tackle this we had set up iPad rigs in each data center. Once plugged into a server, the following would happen:

The mechanism validates that this is indeed a server that needs to be worked on
The application running on the server is shutdown (where needed)
A set of work instructions is posted to a Slack channel explaining the required steps
Once work is completed, the tool validates the final state is correct
And if needed, restarts the application

In addition, we have also introduced a Slack bot meant to service the technician, with an ever expanding set of capabilities to make their work smoother and our lives easier. By doing so, we’ve turned most of the process of repurposing and servicing servers into an asynchronous one, removing ourselves from the loop.

(iPad rig at one of our data centers)

The Hardware Observability Domain

Scaling our datacenter infrastructure reliably requires good visibility into every component of the chain, for instance:

Hardware fault detection
Server states (active, allocatable, zombie, etc)
Power consumption
Firmware levels
Analytics on top of it all

Metric based decisions allow us to decide on how, where and when to procure hardware, at times before we even get the demand. It also allows us to better distribute resources such as power, by identifying workloads which are more demanding. Thus, we can make informed decisions as to server placement before a server is racked and plugged into the power, throughout its maintenance cycles, and up to its eventual decommissioning.

(Rack power utilization dashboard in Grafana)

And then came COVID-19…

At Outbrain, we build technologies that empower media companies and publishers across the open web by helping visitors discover relevant content, products and services that may be of interest to them. Therefore, our infrastructure is designed to contain the traffic generated when major news events unfold.

The media coverage of the events surrounding COVID-19 coupled with the increase in visitors traffic has meant that we had to rapidly increase our capacity to cope with these volumes. To top it off, we had to do this facing a global crisis, where supply chains were disrupted, and a major part of the workforce was confined to their homes.

But, as we described, our model already assumes that:

The hardware in our data centers, for the most part, is not physically accessible to us
We rely on remote hands for almost all of the physical work
Remote hands work is done asynchronous, autonomously and in large volume
We meet hardware demand by method of “building from parts” vs “buying ready kits”
We keep inventory to be able to build, not just fix

So the global limitations preventing many companies from accessing their physical data centers had little effect on us. And as far as parts and servers go – yes, we scrambled to secure hardware just like everyone else, but it was to ensure that we don’t run out of hardware as we backfill our reserves rather than to meet the already established demand.

In summary, I hope that this glimpse into our world of datacenter operations shows one can apply the same principles of good code design to the physical domains of datacenter management, and live to tell.

Want to hear more about the way we run our data centers? Stay tuned for the next part!

January 15, 2020

Dori Shmuel

Upgrading your datacenter network with the push of a button

Intro

In this blog post, I am going to share with you the story of how we automated switch upgrades, to perform them at scale. First, however, I recommend reading one of our previous blog posts, to get a better understanding of our network environment: Switches, Penguins and One Bad Cable

To upgrade?

It all started with a PagerDuty call – a failed switch in one of our datacenters. One support ticket later, and our vendor says it’s time for a firmware upgrade. On all of our switches. We flinched. We’ve done this many times before and somehow, there were always edge cases, weird states, and other loose ends. Upgrading a single switch? it was a hassle we were used to, which thankfully didn’t happen too often. Frankly, not a big deal. But upgrading a fleet of 200? uh…

Just to get you up to speed, our network environment is based on Clos fabric, or more specifically, a leaf spine topology. BGP is the dynamic routing protocol that glues it all together.

We’re using Cumulus Linux as the operating system of our switches, Chef to manage their configuration and Jenkins to automate provisioning. In case you haven’t read the article in the intro and you’re wondering what I mean, here’s a link to it again.

So let’s jump straight to it. How do we upgrade switch firmwares at Outbrain? Simple. We don’t. We re-provision them instead. In case you’re not sure what I mean, I’ll clarify.

Switch Re-provisioning Process

Cumulus Linux is a Linux distribution designed for switches. As such, you can manage it using standard Linux tools, and you can also install it using standard Linux methodologies, such as using PXE boot. Except in the case of switches, PXE boot is replaced with the Open Network Install Environment (ONIE).

Once a switch is instructed to re-provision, on the next boot the following things will happen:

The switch requests an IP address via DHCP
The DHCP server acks and responds with DHCP option 114 for the location of the installation image
The switch uses ONIE to download the Cumulus Linux disk image, installs, and reboots
Success! You are now running Cumulus Linux

By the end of this process, we have an updated firmware (OS) version running on the switch without any special configuration. Just like a brand new switch.

“But what about the config?” you might ask. Excellent question! In a word, ZTP. In three words, “Zero Touch Provisioning”:

The switch boots to a fresh Cumulus Linux version (see step 4 of ONIE above)
The management interface (eth0) is configured by default for DHCP
eth0 sends a DHCP request
The DHCP server returns a URL for a ZTP script
The ZTP script installs the Chef client, configs and runs it
Chef-client runs our homemade run_list

Once the Chef client run completes successfully, the switch is fully configured with all the relevant configuration tailored for it. BGP sessions are established and the switch will be back in production.

Now that you have a grasp of the process, here’s what it actually takes to re-provision a switch:

“onie-select” selects ONIE mode to use during the next boot
“-i” means “Install Boot”: at the next reboot, load ONIE in ‘install’ mode. This installs a new operating system, overwriting the current one
“-f” means “Force the operation”: assumes “yes” to any questions the script might ask
“&& reboot” will reboot the switch only if the previous onie-select command succeeded

15 minutes later, and the switch is back in the game, running an updated firmware and its own tailored config. Pretty cool, right? Now we just have to do that to 74 switches, which is what we have in a datacenter.

15 minutes times 74 is…

… a lot. Especially if you add some validations pre and post “re-provisioning”, which bring you to 30 minutes per switch. Multiplied by 74? 2220 minutes, which is 37 hours. That’s a lot of time to be babysitting switches. Not to mention failures, edge cases, missing a switch or two…

And that’s just one datacenter. Multiplied by 3, that’s 6660 minutes, which is 111 hours. So if you spend 8 hours every day doing just this, one switch at a time, with nothing failing at all, it would take you about 2 weeks of repetitive work to do the full upgrade.

There must be a better way. A smarter way. A faster way. A way that is:

Automated
Robust
Hands-free
Auditable

A better way

3 days later and we have something to show for. Allow me to introduce you to our “Switch Upgrade Automation” process (notice that “has a catchy name” wasn’t part of the requirements).

The “switch upgrade automation” process is a Ruby-based program running inside a docker container. It is built as a Jenkins job which is executed periodically. The program covers all the necessary validations to pick a valid candidate switch, re-provision it and make sure the switch is ready to be put back into production.

Diving into the process, each Jenkins build will spawn a docker container, getting the following arguments (there’s actually more, but we’ve chosen the main ones):

Datacenter name
Switch type: leaf or spine
Switch OS version to upgrade from
The 1st or 2nd switch in the rack

Automated and Robust

We first check if there is already a running build, as we don’t want two switches re-provisioning at the same time. We took the conservative approach and decided to run the process serially. Also, we don’t want to move on to the next switch if the previous one failed, as there may be a problem with the process, environment or switch itself.

Next, we build a dynamic Prometheus query that returns a list of all switches running the OS version we’re upgrading from. Using this list we choose a random switch and trigger yet another Prometheus query. The additional query allows us to retrieve the following information about the candidate switch:

Global health status – a metric built out of many local tests running on the switch, including hardware sensors, ztp_exit_status, services status, and PTM
BGP peer status – number of established BGP peers
Number of Active links

Using these values we can then build a data structure containing the switch name, the metric values, and the validation phase: the pre-upgrade status. Remember this structure, we’ll come back to it later.

switch_list.each do |switch|
  puts "Candidate switch for upgrade: #{switch}"

  switch_data_structure = {
    switch => {
      'pre' => {},
      'post' => {}
    }
  }

  switch_data = fetch_switch_data(switch_data_structure, switch, switch_type, 'pre')
  ...

We now have the pre-upgrade status of the candidate switch however, this is not enough. We also need to validate the functionality of the candidate’s sibling switch (the other switch in the rack) before taking any action. We run the same Prometheus query again to retrieve the metrics we’d mentioned above, but this time for the sibling switch. The name of the sibling switch and its metric values are inserted into the pre-upgrade status.

  sibling_switches = get_sibling_switches(switch, switch_type, switch_list)

  sibling_switches.each do |sibling|
    sibling_data_structure = {
      sibling => {
        'pre' => {},
        'post' => {}
      }
    }
    sibling_data = fetch_switch_data(sibling_data_structure, sibling, switch_type, 'pre')
    switch_data.merge!(sibling_data)
  end
  ...

Compare the candidate switch with its sibling: if they match, or the upgrade candidate has fewer active links / BGP neighbors than its sibling (amongst other metrics), we’re good to go – we know we won’t lose connectivity to any of the machines in the rack. This gives us the confidence to run the re-provisioning process while there are active workloads in the rack.

Cross-checking switch metrics with sibling switch: leaf-r32-p1-pod1 <> leaf-r32-p2-pod1

leaf-r32-p1-pod1: ----> {"bgp_peer_status"=>"32","interface_link_status"=>"37","health_global_status"=>"0"}
leaf-r32-p2-pod1: ----> {"bgp_peer_status"=>"32","interface_link_status"=>"37","health_global_status"=>"0"}
Status: Pre upgrade validation passed!

Sending changelog for switch leaf-r32-p1-pod1
Sending leaf-r32-p1-pod1 to Re-provisioning, command is '/root/run_switch_provision.sh -F'

And if validation fails? the process would skip this candidate switch and elect a new candidate instead.

Hands-free and auditable

Just before rebooting our candidate switch, we silence alerts, so as to not disrupt our on-call while the process runs in the background. Also, to make sure we know “what happened, when and why”, we use our in-house changelog mechanism to fire a “machinelog” event for future auditing. What are these changelog and machinelog things? They’re another blog post waiting to be written.

With all pre-provisioning validations, alert masking and audit events out of the way, it’s showtime! We trigger the Switch Re-provisioning Process for our elected candidate switch and send it for a reboot.

Once rebooted, we monitor whether the switch accepts connections on TCP port 9100, which is the Prometheus node exporter port. Why? Because we’re using Chef to install the Prometheus node exporter, which gives us a good indication that Chef (which is the final provisioning step) completed successfully. In addition, node exporter is in charge of exposing switch metrics, on which we rely for post-provisioning validation. Once we manage to establish a connection, we know the switch is up again, running the new OS.

Last but not least, we need to validate that our switch is in the same state as it was prior to re-provisioning. We want to make sure we haven’t lost any links, peers, etc. How do we do that?

Remember the pre-upgrade status we saved when we started the validation process? It’s time to get the post-upgrade status and compare the two. Basically, it means running the same query as before, building the data structure and running a diff. If all is well, the process completed successfully and we can move on to the next candidate. If not, we abort and notify a team member that something went wrong.

Retrieving switch metrics: leaf-r32-p1-pod1 Upgrade_phase: post
Prometheus query: count(fabric_switch_bgp_peer_status{instance=~"leaf-r32-p1-pod1"}==0) - Prometheus response: 2
Prometheus query: count(fabric_switch_bgp_peer_status{instance=~"leaf-r32-p2-pod1"}==0) - Prometheus response: 2

Prometheus query: count(fabric_switch_interface_link_status{instance=~"leaf-r32-p1-pod1"}==2) - Prometheus response: 7
Prometheus query: count(fabric_switch_interface_link_status{instance=~"leaf-r32-p2-pod1"}==2) - Prometheus response: 7

Prometheus query: fabric_switch_health_global_status{instance=~"leaf-r32-p1-pod1"}==0 - Prometheus response: 0
Prometheus query: fabric_switch_health_global_status{instance=~"leaf-r32-p2-pod1"}==0 - Prometheus response: 0

Data for switch: leaf-r32-p1-pod1.nydc1.outbrain.com was validated SUCCESSFULY after upgrade

So what just happened?

If you’ve read carefully so far, you might notice that we haven’t gone into time savings in this whole process. We chose not to parallelise re-provisioning at this time, to make sure our blast radius, in case of failure, is limited to a single rack. However, we removed the human factor from the process, which means it can run the full 2 weeks in the background and let us know when things are done (or not working as expected). This, in turn, frees us to deal with the more important aspects of running a production system at scale.

November 5, 2019

Alex Balk

Why we’re not using Kubernetes (kind of)

Intro

Kubernetes is the best thing since sliced bread. Everyone is talking about its internal parts, how to use it, best practices and the latest and greatest supporting tools for it.

This isn’t a story about how great Kubernetes is (yes, okay, it’s great). It’s about our journey into the realm of large scale deployments, and why we’re not using Kubernetes. Okay, we are, but we hide it.

This blog post was put together by Shahaf Sages, Dafna Frank and Alex Balk.

In the beginning

A long time ago in the early 2000s, in a startup far-far away, there was a lone Java developer. The developer had a mission. It was an ambitious mission, one that was not for the faint of heart. His mission was to write the most beautiful Java code ever written, and make some sweet startup money along the way.

Actually, this developer wasn’t alone. And frankly, his code wasn’t that good, but it worked, and he, and his fellow “lone Java developers” needed to get this code to production. And so they did, using the tools they knew best: copy & paste… but in the Linux variant, called “scp”. And the sweet startup money flowed. Or at least trickled a bit.

It’s good, but is it good enough?

This worked well for a while, but not very well or for very long. Fairly quickly, problems started showing up. Problems that needed more code to be written, uploaded and managed on production machines. Which too started to add up. Version control, which was so common in the process of writing code, was desperately needed in the process of deployment. And rollbacks, because the beautiful code was sometimes moody. And some way to keep tab of these (also moody) machines where the code was running. Because sometimes they had an annoying tendency to die or just stop responding.

And so it was decided by the great powers of infrastructure development that be. There will be no more “scp”ing, as it was declared to be manual and thus the root of all evil. Instead, a system shall be born – a deployment system, with artifact version control, build management, rollbacks, progress bars, a model to account for all machines and services, and access control to grant permissions to the lucky few who shall unleash their code onto production. And it was dubbed GluFeeder, for it was based on the Open Source Glu framework from LinkedIn which was state of the art at the time. And it was GOOD.

It works so well, why touch it?

Until, 8 years later, it wasn’t all that good anymore. But that took a long time, and everyone got used to it and it hid away many problems. So if it works don’t touch it, right?

Maybe not. Problems were abundant:

The small (not so much) startup ran on physical machines, and each service had an entire machine for its own, which meant quite a bit of waste
The small (not so much) startup was moving towards microservices because they’re cool and scalable and async and shiny, so said waste was about to get out of hand because there were now a lot of services
The moody physical machines were still moody and whenever they decided they didn’t want to work anymore, all the services running on them just died in nasty, nasty ways
The model for describing “what service runs where” was handled in one big yaml file and all the developers had their sticky fingers touching it directly, with no validation
Adding more service instances was really really REALLY painful because the new physical machines that were ordered took a long, long time to arrive at the (not so small) startup’s datacenters

And so it was decided by the great powers of infrastructure development that be. There will be no more GluFeeding because it worked but wasn’t “10x scale”. Instead, a system shall be born – a deployment system, with:

Support for containers, because the Java developers were making friends with JavaScript and Python and Go developers
A per-service model stored in a real database with all the metadata goodness that describes how to build, run and operate micro, macro or mega services
A resource management system, the best Open Source had to offer, to manage the moody server resources and ensure code was running even if servers went away – the almighty Kubernetes
Orchestration of many, many resource managers (Kubernetes clusters), across many, many datacenters, or at least 3
A well defined contract between the services and the environment they run in, so that everyone gets their metrics, logs, environment variables and properties in the format and flavor they prefer, without dealing with the gory, gruesome details

A poll was run and the people voted. The name that was chosen was Dyploma (DYnamic dePLOyment MAnagement system), which only shows that democracy doesn’t always work very well.

One small step for dev, one giant leap for devops

In the course of a year and a half, the infrastructure developers and the Java early adopter developers met every week to present, discuss and test what was being developed. A Python CLI was chosen as the initial interface for the system, as it was quick to develop and required little UX skills. Little by little the features were added, tweaked and tuned. And little by little confidence was built in the new system and the great Kubernetes beast which it controlled through the scriptures of the fabric8 java client. Until the Python CLI was no longer enough and a Vue.js Web UI was added instead.

Great care was taken in the design to ensure only metadata was kept within Dyploma, so as not to contaminate it with duplicate state of the Kubernetes beasts. And so Dyploma was lean on data and mostly just passed orders to Kubernetes, Prometheus, Consul, Jenkins, TeamCity, Bitbucket and anyone else it could boss around. The system would let the developer:

Describe the service’s endpoints
Provider special build parameters
Set runtime information, such as environment, cluster, number of instances
Build, deploy, scale up/down and disable with one click
View what’s running, where, how much, why and who gave the order
And just plain hide away all of the underlying systems details because the developers were lazy and spoiled and we LOVE them that way

No, really, let’s unfold that last statement

The developers didn’t need to know anything about making the underlying infrastructure work. They just got it all for free. Gift wrapped with a nice web UI. Once they had their service defined, all they had to do was decide how much, which version and where, and pay the bill (okay, not yet, but it’s coming). There were no yaml files, no Helm charts, no configMaps or anyone called Jason. There was a single place to view the runtime status of a service, its history of changes, its logs, graphs and a single place to control it all, which even had batch operations but that’s just showing off.

And so simplicity was restored, velocity was increased, stability was a welcome side-effect and much cost was saved through better resource utilisation and reuse of aging moody machine hardware. And it was GOOD.

Until, one day, a lone Java developer had an idea.

“Why don’t we ditch Dyploma and use Kubernetes instead?”, he said. “I’ve read that it’s the best thing since sliced bread.” And the infrastructure developers just stared.

Show me the money

Every Kubernetes blog post that respects itself shows off some numbers and yaml files.

We have no yaml files to show you, but we do respect ourselves, so here are some numbers:

GluFeeder

Machines managed: 2000

Services running on the managed machines: 2500

Unique types of services: 150

Dyploma

Machines managed: 1600

Services running on the managed machines: 7500

Unique types of services: 400

Kubernetes deployments: 2300

EPILOGUE

Since you’ve read up to here, we’ll assume that you’re interested in getting some insights around building a deployment system (vs yet more Kubernetes tips & tricks), so we’ll give you some of our inputs:

We wanted flexibility AND simplicity. This isn’t cheap. If you want developers to “just write code”, you’ll have to have other developers “just write infra”.
Deployment systems are built for users. Bring the users onboard for the ride if you’re building one.
Use the terminology of the system you’re relying on. If it’s called a pod, call it a pod. Don’t call it a FLDSMDFR. We called a “deployment” a “service”. Don’t.
Kubernetes is complicated. So is your runtime context (at least when you’re big). The challenge is in using the former to contain the latter, while simplifying it for the users. This means that the user should be able to say the absolute minimum and get sane defaults, but also be able to override everything in the runtime without having to speak any Kubernetes. Simple, right?
Protect yourself. People will make mistakes. They will put the SVN version number in the replicas field. And you will cry.

Looking forward, this is what we’re working on these days:

Horizontal autoscaling. Because after you’re done migrating, you start optimising. And you want it simple enough for devs to “click here”.
Deployment A/B testing. All the levers are there, but you have to build them into a usable tool.
Requests, limits and “let me set that for you”. Because they don’t necessarily mean what you think they mean.
Jobs. Because crons are running wild and it doesn’t hurt us now, but only because we’re not looking.
Exposing cost to owners. Because nothing is free, not even your own bare metal.
Open Source. Because the world needs this.

September 2, 2019

Doug Chimento

The First Rule of Distributed Tracing Is…

Distributed Tracing

Distributed Tracing is a mechanism to collect individual requests across micro-services boundaries. It also enables instrumentation of application latency (how long each request took), tracking the life-cycle of network calls (HTTP, RPC, etc) and also identify performance issues by getting visibility on bottlenecks. Plenty of articles are on the inter-webs describing various implementations of Distributed Tracing. This blog will focus on our experience rolling out tracing into our ecosystem with tales of what not to do.

What we need to solve

Outbrain has hundreds of micro-services running on top of Kubernetes. We have metrics via Prometheus and Netflix Hystrix dashboard to monitor overall health, but we lack visibility on business requests. For example, when a user clicks an ad, we want to see the exact flow. In particular, we may want to see the bid request/response to our partners for performance or fraud reasons. But we don’t have something that allows us to examine an individual request. This is what tracing solves for us; the ability to capture exact business logic for a specific request. Tracing can answer such questions:

Did I hit cache or go to a datastore? What services were used for this specific business logic request?

We implemented tracing with Jaeger to visualize and store our traces:

The First Rule

The first rule of tracing is no one should know about tracing. Tracing should be behind a magical curtain of abstraction; all without developers knowledge. Depending on your organization development practices and usage of core infrastructure (e.g. HTTP clients, RPC framework, database drivers…etc), this maybe difficult. But if the majority of your organization uses one or two HTTP clients (i.e. OkHttp, Apache HTTP) it is rather trivial instrument those clients with tracing capabilities without much developer intervention. By instrumenting only HTTP clients and servers you will get a huge benefit with tracing. Think of the 80/20 rule when you plan your tracing implementation.

What not to do

Conceptual, tracing is easy to understand. Operationally, it can turn into a mess. There is an overhead to tracing, albeit small, but still you don’t want to get OutOfMemory errors because of tracing. So don’t by start tracing everything!

Besides, if you start tracing everything what are you going to do with all that data? Depending on your request load this could be huge. Sure, you solved tracing your micro-services, but you created massive amount of data (and hardware) that probably no one is going to look at. In the beginning, keep your operational costs to a bare minimum. You could start by dumping traces into your existing logging infrastructure (i.e. put the trace Id as MDC context in all logging statements). No extra hardware (collectors) required! Like designing software, start small and increment your changes overtime.

Slow and steady wins the race

Begin your tracing quest by providing a mechanism which will “turn on” tracing manually. Suppose you have a REST API, by detecting a special request header, say “X-AdHocTrace”, will inform the service that tracing has been requested. This will enable tracing to all downstream servers as well. We use swagger at Outbrain and added the header automatically:

window.onload = function() {
 const ui = SwaggerUIBundle({
 url: "../api/swagger/apiDocs",
 requestInterceptor: function(request) {
   request.headers['X-AdHocTrace'] = "true";
   return request;
}

When developers are testing their services (locally or remotely) they automatically get all their calls traced when using swagger.

Overtime you can add more sophisticated sampling mechanisms. For instance you could sample all request that result in a 500 HTTP status code. Read more >

March 4, 2019

Gerardo Laracuente

Migrating Servers in Our Sleep

The Cloud is an Illusion

Cloud service providers have enabled innovations in many areas of our society from the way we watch movies to the way we share media with our family and friends. Companies can focus on amazing products without having to worry about managing data centers, servers, networking equipment, and all of the complexities therein.

But the cloud is an illusion, just an abstraction. Behind every cloud are data centers full of countless racks of servers, routers, firewalls, and engineers who design, deploy, and manage them. There is physical infrastructure that powers the technology we enjoy everyday and, at Outbrain, we run the majority of our workloads on bare metal infrastructure. We are, for that matter, our own cloud provider. This post is a peak behind the curtain of how we do this, at scale.

It was a sunny day in California…

In the summer of 2017, our new West Coast data center went live, along with our first fully automated 10G network deployment. It was a great success, which is why we wanted to also roll it out in our other data centers. The new mission was to migrate every server from our legacy 1G network to our shiny new 10G network. Essentially, this meant installing 10G network cards and running new twinax cabling to a good number of thousands of servers. No biggie… except:

The vast majority of those servers run production workloads
Many of them run stateful applications such as data stores (mysql, elasticsearch, etc.)
Many of these stateful applications can only tolerate a limited amount of downtime before they start shuffling (a lot of) data around
Servers need to be migrated around the clock, with no downtime to any cluster
The application and networking engineers are in Israel
The physical server engineers are in NYC
There is a 7 hour time difference between NYC and Israel (the workweek also only overlaps for 4 days of the week)
The on-site technicians who work on our servers cannot log into them for reboots, health checks, etc.

How many Engineers does it take to…?

Let’s take a look at everything that goes into a manual migration of one server, so we can better understand the task at hand. We will narrow the focus to one specific case: migrating a mysql node to the new network.

The people involved:

Gerry – Data Center Server Engineer (NYC)
Yuval – Data Storage Engineer (Israel)
Adi – Networking Engineer (Israel)
Mike – On-site Remote Hands technician (non Outbrain employee)

The process:

Yuval removes the mysql server from the cluster and makes sure that the cluster is still healthy.
Gerry properly powers down the server, and sends the location details to the remote hands team. Based on the location of the server within the rack, he also lets them know which switch ports to use.
Mike opens up the server, installs the 10G network card, and runs redundant twinax cabling to the 10G switches. He powers it back up, connected to both the legacy and the 10G networks.
Adi prepares the server to join the 10G network and reboots it for the final changes to take effect. After the reboot, he checks that the server is indeed part of the new network and that both twinax connections are up and stable.
Yuval adds the server back to the mysql cluster and makes sure everything looks healthy.
Mike removes the old RJ-45 cables and waits for Yuval and Gerry to prepare and shut down the next server (to avoid multi-node shutdown in the same cluster).

* This is a simplified explanation of the process, and assumes everything goes as smooth as possible. It involves 3 Outbrain engineers to be available across a major time zone difference + on-site remote hands. That’s 4 engineers to migrate a single node.

And now that we’re done, there’s only… a few thousand servers left to migrate…running on various types of hardware, with different versions of Linux, different applications and operating under different availability restrictions.

Which begs the question – Will it Scale?

Putting Remote Hands in the Driver Seat

After a few iterations, this is what the process looks like from Outbrain’s perspective:

Gerry executes a Rundeck job with one mouse click, and then emails Mike a list of server locations.
Gerry goes to sleep.

This is the process from the perspective of a remote hands technician at the data center:

Mike receives a list of server locations, and plugs an iPad into the first server on the list.
The iPad is running Slack and the chat room starts displaying new messages. It lets Mike know that the server is attempting to safely stop mysql and power itself down.
The server shuts itself down, but right before it goes down, it sends a message to the Slack channel explaining all the next steps, including which switch ports to connect to server to.
Mike installs the 10G network card, finishes up the cabling, and powers the server back up.
The server runs the network preparation scripts, reboots itself, checks the network status, starts up mysql, runs health checks, and lets Mike know that it’s time to remove the old cabling and move on to the next server.

Meanwhile, Gerry and Yuval’s teams are getting updates via email every time a server begins and succeeds the migration process. They can monitor the Slack channel during the process, or even look back at all of the migrations in a Kibana dashboard. If anything ever goes wrong, the iPad communicates that Mike should stop and contact Outbrain. The iPad won’t take any wrong actions if it is plugged into the wrong server or reseated at any point.

How We Built It

When the iPad is plugged into the server, it is recognized by a udev rule. This triggers a wrapper script that contains all of the migration logic. Here’s a deeper look at the individual steps and components:

Rundeck is used to put the desired servers into maintenance mode. It does this by touching empty files onto the servers that indicate that they are in maintenance mode, and they are in the first stage of the network migration process.

Chef contains all of the necessary scripts to run the migration. This includes the udev rules and the wrapper, networking, and application specific scripts. The Chef recipe chooses the proper pre and post migration scripts based on the role of the server.

UDEV Rules are very powerful, and we use them define what happens when the iPad is plugged into a server:

SUBSYSTEM=="usb",
ACTION=="add", 
ENV{ID_SERIAL}=="Apple_Inc._iPad_averylonguniqueserialnumber",
RUN+="/path/to/wrapper_script.sh"

This rule roughly translates to: “When a usb device with this serial number in plugged into the server, run the following wrapper script”

A custom startup script is what triggers the wrapper script when the server boots back up after a shutdown or reboot to perform the next migration step.

The wrapper script holds all of the logic and functionality of the process.

It only runs if a maintenance file exists on this server.
It checks for a state file, and depending on what it finds, understands which part of the migration process to run next.
It sends the event logs (for the Kibana dashboard), emails (to the proper teams who manage this server), and Slack messages (via webhooks).
It handles the reboots and shutdowns.
When it runs an application specific pre/post migration script, it’s expecting to get an exit status of 0 (or else it will let everyone involved know that something went wrong).
When it runs the network preparation script (which is worthy of its own blog post), it can make a few decisions based on the exit status. It can rings the alarms, move on to the next step, or even let the remote hands technician know that one of the twinax cables seems loose and should be fully seated.
It cleans up by removing all migration related files, including itself.

More time to build more automation =]

Now we can set a bunch of servers to maintenance mode, hand off the list of server locations to remote hands, and continue our daily work while they are migrated.

And since this works so well using the current approach, work is underway to generalise the concept and make it available to many other types of physical maintenance.

So that our cloud can continue to build itself… while we sleep.

March 21, 2018

Alex Balk

Switches, Penguins and One Bad Cable

Back in May 2017, I was scheduled to speak at the DoTC conference in Melbourne. I was really excited and looking forward to it, but fate had different plans. And lots of them. From my son going through an emergency appendicitis operation, through flight delays, and up to an emergency landing back in Tel Aviv… I ended up missing the opportunity to speak at the conference. Amazingly, something similar happened this year! Maybe 3rd time’s a charm?

The post below is the talk I’d planned to give, converted to a blog format.

August 13, 2015. Outbrain’s ops on call is just getting out of his car when his phone rings. It’s a PagerDuty alert. Some kind of latency issue in the Chicago data center. He acks it, figuring he’ll unload the groceries first and then get round to it. But then, his phone rings again. And again.

Forget the groceries. Forget the barbecue. Production is on fire.

18 hours and many tired engineers later, we’re recovering from having lost our Chicago datacenter. In the takein that follows, we trace the root cause to a single network cable that’s mistakenly connected to the wrong switch.

Hi, my name is Alex, and I lead the Core Services group at Outbrain. Our group owns everything from the floor that hosts Outbrain’s servers, to the delivery pipelines that ship Outbrain’s code. If you’re here, you’ve likely heard of Outbrain. You probably know that we’re the world’s leading Discovery platform, and that you’ll find us installed on publisher sites like CNN, The Guardian, Time Inc and the Australian news.com, where we serve their readers with premium recommendations.

But it wasn’t always this way.

You see, back when we started, life was simple: all you had to do was throw a bunch of Linux servers in a rack, plug them into a switch, write some code… and sell it. And that we did!

But then, an amazing thing happened. The code that we wrote actually worked and customers started showing up. And they did the most spectacular and terrifying thing ever – they made us grow. One server rack turned into two and then three and four. And before we knew it, we had a whole bunch of racks, full of penguins plugged into switches. It wasn’t as simple as before, but it was manageable. Business was growing, and so were we.

Fast forward a few years.

We’re running quite a few racks across 2 datacenters. We’re not huge, but we’re not a tiny startup anymore. We have actual paying customers, and we have a service to keep up and running. Internally, we’re talking about things like scale, automation, and all that stuff. And we understand that the network is going to need some work. By now, we’ve reached the conclusion that managing a lot of switches is time-consuming, error-prone, and frankly, not all that interesting. We want to focus on other things, so we break the network challenge down to 2 main topics:

Management and Availability.

Fortunately, management doesn’t look like a very big problem. Instead of managing each switch independently, we go for a something called “a stack”. In essence, it turns 8 switches into one logical unit. At full density, it lets us treat 4 racks as a single logical switch. With 80 nodes per rack, that’s 320 nodes. Quite a bit of computes power!

Four of these setups – about 1200 nodes.

Across two datacenters? 2400 nodes. Easily 10x our size.

Now that’s very impressive, but what if something goes wrong? What if one of these stacks fails? Well, if the whole thing goes down, we lose all 320 nodes. Sure, there’s built-in redundancy for the stack’s master, and losing a non-master switch is far less painful, but even then, 40 nodes going down because of one switch? That’s a lot.

So we give it some thought and come up with a simple solution. Instead of using one of these units in each rack, we’ll use two. Each node will have a connection to stack A, and another to stack B. If stack A fails, we’ll still be able to go through stack B, and vice versa. Perfect!

In order to pull that off, we have to make these two separate stacks, which are actually two separate networks, somehow connect. Our solution to that is to set up bonding on the server side, making its two separate network interfaces look like a single, logical one. On the stack side, we connect everything to one big, happy, shared backbone. With its own redundant setup, of course.

In case you’re still keeping track of the math, you might notice that we just doubled the number of stacks per datacenter. But we still gained simple management And high availability at 10x scale. All this without having to invest in expensive, proprietary management solutions. Or even having to scale the team.

And so, it is decided. We build our glorious, stack-based topology. And the land has peace for 40 years. Or… months.

Fast forward 40 months.

We’re running quite a few racks across 3 datacenters. We’re serving customers like CNN, The Guardian, Time Inc and the Australian news.com. We reach over 500 million people worldwide, serving 250 billion recommendations a month.

We’re using Chef to automate our servers, with over 300 cookbooks and 1000 roles.

We’re practicing Continuous Delivery, with over 150 releases to production a day.

We’re managing petabytes of data in Hadoop, Elasticsearch, Mysql, Cassandra.

We’re generating over 6 million metrics every minute, have thousands of alerts and dozens of dashboards.

Infrastructure as Code is our religion. And as for our glorious network setup? it’s completely, fully, 100% … manual.

No, really. It’s the darkest, scariest part of our infrastructure.

I mean hey, don’t get me wrong, it’s working, it’s allowed us to scale to many thousands of nodes. But every change in the switches is risky because it’s done using the infamous “config management” called “copy-paste”.

The switching software stack and protocols are proprietary, especially the secret sauce that glues the stacks together. Which makes debugging issues a tiring back-and-forth with support at best, or more often just a blind hit-and-miss. The lead time to setting up a new stack is measured in weeks, with risk of creating network loops and bringing a whole datacenter down. Remember August 13th, 2015? We do.

Again, don’t get me wrong, it’s working, it’s allowed us to scale to many thousands of nodes. And it’s not like we babysit the solution on daily basis. But it’s definitely not Infrastructure as Code. And there’s no way it’s going to scale us to the next 10x.

Fast forward to June 2016.

We’re still running across 3 data centers, thousands of nodes. CNN, The Guardian, Time Inc, the Australian news.com. 500 million users. 250 billion recommendations. You get it.

But something is different.

We’re just bringing up a new datacenter, replacing the oldest of the three. And in it, we’re rolling out a new network topology. It’s called a Clos Fabric, and it’s running BGP end-to-end. It’s based on a design created by Charles Clos for analog telephony switches, back in the 50’s. And on the somewhat more recent RFCs, authored by Facebook, that bring the concept to IP networks.

In this setup, each node is connected to 2 top-of-rack switches, called leaves. And each leaf is connected to a bunch of end-of-row switches, called spines. But there’s no bonding here and no backbone. Instead, what glues this network together, is that fact that everything in it is a router. And I do mean everything – every switch, every server. They publish their IP addresses over all of their interfaces, essentially telling their neighbors, “Hi, I’m here, and you can reach me through these paths.” And since their neighbors are routers as well, they propagate that information.

Thus a map of all possible paths to all possible destinations is constructed, hop-by-hop, and held by each router in the network. Which, as I mentioned, is everyone. But it gets even better.

We’ve already mentioned that each node is connected to two leaf switches. And that each leaf is connected to a bunch of spines switches. It’s also worth mentioning that they’re not just “connected”. They’re wired the exact same way. Which means, that any path between two points in the network is the exact same distance. And what THAT means is that we can rely on something called ECMP. Which, in plain English, means “just send the packets down any available path, they’re all the same anyway”. And ECMP opens up interesting options for high availability and load distribution.

Let’s pause to consider some of the gains here:

First, this is a really simple setup. All the leaf switches are the same. And so are all of the spines. It doesn’t matter if you have one, two or thirty. And pretty much the same goes for cables. This greatly simplifies inventory, device and firmware management.

Second, it’s predictable. You know the exact amount of hops from any one node in the network to any other: It’s either two or four, no more, no less. Wiring is predictable as well. We know exactly what gets connected where, and what are the exact cable lengths, right from design phase. (spoiler alert:) We can even validate this in software.

Third, it’s dead easy to scale. When designing the fabric, you choose how many racks it’ll support, and at what oversubscription ratio. I’ll spare you the math and just say:

You want more bandwidth? Add more spines.

Support more racks? Go for spines with higher port density.

Finally, high availability is built into the solution. If a link goes down, BGP will make sure all routers are aware. And everything will still work the same way, because with our wiring scheme and ECMP, all paths are created equal. Take THAT evil bonding driver!

But it doesn’t end there. Scaling the pipes is only half the story. What about device management? The infamous copy-paste? Cable management? A single misconnected cable that could bring a whole datacenter down? What about those?

Glad you asked 🙂

After a long, thorough evaluation of multiple vendors, we chose Cumulus Networks as our switch Operating System vendor, and Dell as our switch hardware vendor. Much like you would with servers, by choosing Enterprise Redhat, Suse or Ubuntu. Or with mobile devices, by choosing Android. We chose a solution that decouples the switch OS from the hardware it’s running on. One that lets us select hardware from a list of certified vendors, like Dell, HP, Mellanox and others.

So now our switches run Cumulus Linux, allowing us use the very same tools that manage our fleet of servers, to now manage our fleet of switches. To apply the same open mindset in what was previously a closed, proprietary world.

In fact, when we designed the new datacenter, we wrote Chef cookbooks to automate provisioning and config. We wrote unit and integration tests using Chef’s toolchain and setup a CI pipeline for the code. We even simulated the entire datacenter, switches, servers and all, using Vagrant.

It worked so well, that bootstrapping the new datacenter took us just 5 days. Think about it:

the first time we ever saw a real Dell switch running Cumulus Linux was when we arrived on-site for the buildout. And yet, 99% of our code worked as expected. In 5 days, we were able to setup a LAN, VPN, server provisioning, DNS, LDAP and deal with some quirky BIOS configs. On the servers, mind you, not the switches.

We even hooked Cumulus’ built-in cabling validation to our Prometheus based monitoring system. So that right after we turned monitoring on, we got an alert. On one bad cable. Out of 3000.

Infrastructure as Code anyone?

May 30, 2017

Orit Yaron

I WANT IT ALL – Go Hybrid

When I was a kid, my parents used to tell me that I can’t have my cake and eat too. Actually, that’s a lie, they never said that. Still, it is something I hear parents say quite often. And not just parents. I meet the same phrase everywhere I go. People constantly taking a firm, almost religious stance about choosing one thing over another: Mac vs PC, Android vs iOS, Chocolate vs Vanilla (obviously Chocolate!).

So I’d like to take a moment to take a different, more inclusive approach.

Forget Mac vs PC. Forget Chocolate vs Vanilla.

I don’t want to choose. I Want it all!

At Outbrain, the core of our compute infrastructure is based on bare metal servers. With a fleet of over 6000 physical nodes, spread across 3 data centers, we’ve learned over the years how to manage an efficient, tailored environment that caters to our unique needs. One of which being the processing and serving of over 250 Billion personalized recommendations a month, to over 550 Million unique users.

Still, we cannot deny that the Cloud brings forth advantages that are hard to achieve in bare metal environments. And in the spirit of inclusiveness (and maximizing value), we want to leverage these advantages to complement and extend what we’ve already built. Whether focusing on workloads that require a high level of elasticity, such as ad-hoc research projects involving large amount of data, or simply external services that can increase our productivity. We’ve come to view Cloud Solutions as supplemental to our tailored infrastructure rather than a replacement.

Over recent months, we’ve been experimenting with 3 different vectors involving the Cloud:

Elasticity

Our world revolves around publications, especially news. As such, whenever a major news event occurs, we feel immediate, potentially high impact. Users rush to publisher sites, where we are installed. They want their news, they want their recommendations, and they want them all now.

For example, when Carrie Fisher, AKA Princess Leia, passed away last December, we saw a 30% traffic increase on top of our usual peak traffic. That’s quite a spike.

Since usually we do not know when the breaking news event will be, it means that we are required to keep enough extra capacity to support such surges.

By leveraging the cloud, we can keep that additional extra capacity to bare minimum, relying instead on the inherent elasticity of the cloud, provisioning only what we need when we need it.

Doing this can improve the efficiency of our environment and cost model.

Ad-hoc Projects

A couple of months back one of researchers came up with an interesting behavioral hypothesis. For the discussion at hand, lets say that it was “people who like chocolate are more likely to raise pet gerbils.” (drop a comment with the word “gerbils” to let me know that you’ve read thus far). That sounded interesting, but raised a challenge. To validate or disprove this, we needed to analyze over 600 Terabytes of data.

We could have run it on our internal Hadoop environment, but that came with a not-so-trivial price tag. Not only did we have to provision additional capacity in our Hadoop cluster to support the workload, we anticipated the analysis to also carry impact on existing workloads running in the cluster. And all this before getting into operational aspects such as labor and lead time.

Instead, we chose to upload the data into Google’s BigQuery. This gave us both shorter lead times for the setup and very nice performance. In addition, 3 months into the project, when the analysis was completed, we simply shut down the environment and were done with it. As simple as that!

Productivity

We use Fastly for dynamic content acceleration. Given the scale we mentioned, this has the side-effect of generating about 15 Terabytes of Fastly access logs each month. For us, there’s a lot of interesting information in those logs. And so, we had 3 alternatives when deciding how to analyse them:

SaaS based log analysis vendors
An internal solution, based on the ELK stack
A cloud based solution, based on BigQuery and DataStudio

After performing a PoC and running the numbers, we found that the BigQuery option – if done right – was the most effective for us. Both in terms of cost, and amount of required effort.

There are challenges when designing and running a hybrid environment. For example, you have to make sure you have consolidated tools to manage both on-prem and Cloud resources. The predictability of your monthly cost isn’t as trivial as before (no one likes surprises there!), controls around data can demand substantial investments… but that doesn’t make the fallback to “all Vanilla” or “all Chocolate” a good one. It just means that you need to be mindful and prepared to invest in tooling, education and processes.

Summary

I’d like to revisit my parents’ advice, and try to improve on it a bit (which I’m sure they won’t mind!):

Be curious. Check out what is out there. If you like what you see – try it out. At worst, you’ll learn something new. At best, you’ll have your cake… and eat it too.

February 22, 2016

Ohad Shai

Micro Service Split

In this post, I will describe a technical methodology we used to remove a piece of functionality from a Monolith/Service and turn it into a Micro-Service. I will try to reason about some of the decisions we have made and the path we took, as well as a more detailed description of internal tools, libraries and frameworks we use at Outbrain and in our team, to shed some light on the way we work in the team. And as a bonus you might learn from our mistakes!
Let’s start with the description of the original service and what it does.
Outbrain runs the largest content discovery platform. From the web surfer’s perspective it means serving a recommended content list that might interest her, in the form of ‘You might also like’ links. Some of those impression links are sponsored. ie: when she clicks on a link, someone is paying for that click, and the revenue is shared between Outbrain and the owner of the page with the link on it. That is how Outbrain makes its revenue.

My team, among other things, is responsible for the routing of the user to the requested page after pressing the link, and for the bookkeeping and accounting that is required in order to calculate the cost of the click, who should be charged, etc.
In our case the service we are trying to split is the ‘Bookkeeper’. Its primary role is to manage the paid impression links budget. After a budget is spent, The ‘Bookkeeper’ should notify Outbrain’s backend servers to refrain from showing the impression link again. And this has to be done as fast as possible. If not, people will click on links we cannot charge because the budget was already spent. Technically, this is done by an update to a database record. However, there are other cases we might want to stop the exposure of impression links. One such an example is a request from the customer paying for the future click to disable the impression link exposure. So for such cases, we have an API endpoint that does exactly the same with the same code. That endpoint is actually part of the ‘Bookkeeper’ that is enabled by a feature toggle on specific machines. This ‘Activate-Impressionable’ endpoint as we call it, is what was decided to split out of the ‘Bookkeeper’ into a designated Micro-Service.
In order to execute the split, we have chosen a step-by-step strategy that will allow us to reduce the risk during execution and keep it as controlled and reversible as possible. From a bird’s eye view I will describe it as a three steps process: Plan, Up and Running as fast as possible and Refactor. The rest of the post describes these steps.

Plan (The who’s and why’s)

In my opinion, this is the most important step. You don’t want to split a service just in order to split. Each Micro Service introduces maintenance and management overhead, with its own set of challenges[1]. On the other hand, Microservices architecture is known for its benefits such as code maintainability (for each Micro Service), the ability to scale out and improved resilience[2].
Luckily for me, someone already did that part for me and took the decision that ‘Activate-Impressionable’ should split from the ‘Bookkeeper’. But still, Let’s name some of the key factor of our planning step.
Basically, I would say that a good separation is a logical separation with its own non-overlap RESTful endpoints and isolated code base. The logical separation should be clear. You should think what is the functionality of the new service, and how isolated it is. It is possible to analyze the code for inter-dependencies among classes and packages using tools such as lattix. At the bottom line, it is important to have a clear definition of the responsibility of the new Micro Service.
In our case, the ‘Bookkeeper’ was eventually split so that it remain the bigger component, ‘Activate-Impressionable’ was smaller and the common library was smaller than both. The exact lines of code can be seen in the table below.

component

Unfortunately, I assessed it only after the split and not in the plan step. We might say that there is too much code in common when looking at those numbers. It is something worth considering when deciding what to split. A lot of common code implies low isolation level.

Of course part of the planning is time estimation. Although I am not a big fan of “guestimates” I can tell that the task was planned for couple of weeks and took about that long.
Now that we have a plan, let’s get to work.

Up and Running – A Step by Step guide

As in every good refactor, we want to do it in small baby steps, and remain ‘green’ all the time[3]. In continuous deployment that means we can and do deploy to production as often as possible to make sure it is ‘business as usual’. In this step we want to get to a point the new service is working in parallel to the original service. At the end of this step we will point our load-balancers to the new service endpoints. In addition, the code remains mutual in this step, means we can always deploy the original fully functioning ‘Bookkeeper’. We actually do that if we feel the latest changes had any risk.
So let’s break it down into the actual phases:

Overview	Step Details
	Starting phase
	Create the new empty Micro-Service ‘Activate-Impressionable’. In outbrain we do it using scaffolding of ob1k framework. Ob1k is an open source Micro Services Framework that was developed in-house.
	Create a new empty Library dependent both by the new ‘Activate-Impressionable’ service and the ‘Bookkeeper’. Ideally, if there is a full logic separation with no mutual code between the services that library will be deleted in the cleanup phase.
	Move the relevant source code to the library. Luckily in our case, there was one directory that was clearly what we have to split out. Unluckily, that code also pulled up some more code it was dependent on and this had to be done carefully not to pull too much nor too little. The good news are that this phase is pretty safe for static typing languages such as Java, in which our service is written in. The compiler protects us here with compilation errors so the feedback loop is very short. Tip: don’t forget to move unit tests as well.
	Move common resources to the library, such as spring beans defined in xml files and our feature flags files that defined in yaml files. This is the dangerous part. We don’t have the compiler here to help, so we actually test it in production. And when I say production I mean using staging/canary/any environment with production configuration but without real impact. Luckily again, both yaml and spring beans are configured to fail fast, so if we did something wrong it will just blow out in our face and the service will refuse to go up. For this step I even ended up developing a one-liner bash script to assist with those wicked yaml files.
	Copy and edit web resources (web.xml) to define the service endpoints. In our case web.xml can’t reside in a library so it has to be copied. Remember we still want the endpoints active in the ‘Bookkeeper’ at that phase. Lesson learned: inspect all files closely. In our case log4j.xml which seems like an innocent file by its name contains designated appenders that are consumed by other production services. I didn’t notice that and didn’t move the required appender, and it was found only later in production.
Deploy	Deploy the new service to production. What we did is deploy the ‘Activate-Impressionable’ side-by-side on the same machines as the ‘Bookkeeper’, just with a different ports and context path. Definitely makes you sleep better at night.
Up-And-Running	Now is a good time to test once again if both ‘Bookkeeper’ and ‘Activate-Impressionable’ are working as expected. Generally now we are up and running with only few more things to do here.
Clients Redirect	Point clients of the service to the new endpoints (port + context path). A step that might take some time depends on the number of clients and the ability to redeploy them. In outbrain we use HA-Proxy, so reconfiguring it did most of the work, but some clients did require code modifications.
(More) Validation	Move/copy simulator tests and monitors. In our team, we heavily rely on tests we call simulator tests. These are actually black-box tests written in JUnit that runs against the service installed on a designated machine. These tests see the service as a black-box and calls its endpoints while mock/emulate other services and data in the database for the test run. So usually a test run can look like: put something in the database, trigger the endpoint, and see the result in the database or in the http response. There is also a question here whether to test ‘Activate-Impressionable’ or the ‘Bookkeeper’. Ideally you will test them both (tests are duplicated for that phase), and that is what we did.

Refactor, Disconnect & Cleanup

When we got here the new service is working and we should expect no more behavior changes from the endpoints point of view. But we still want the code to be fully split and the services to be independent from each other. In the previous step we performed the phases in a way that everything remains reversible with a simple feature toggle & deploy.

In this step, we move to a state where the ‘Bookkeeper’ will no longer host the ‘Activate-Impressionable’ functionality. Sometimes it is a good idea to have a gap from the previous step to make sure that there are no rejections and backfires that we didn’t trace in our tests and monitoring.
First thing, If was not done up until now, is deploying the ‘Bookkeeper’ without the service functionality and make sure everything is still working. And wait a little bit more…
And now we just have to push the sources and the resources from the library to the ‘Activate-Impressionable’ service. In the ideal case there is no common code, we can also delete the library. This was not how it was in our case. We still have a lot of common code we can’t separate for the time being.
Now is also the time to do resources cleanup, web.xml edit etc’.
And for the bold and OCD among us – packages rename and refactor of code with the new service naming conventions.

Conclusion

The entire process in our case took a couple of weeks. Part of the fun and advantage in such process, is the opportunity to know better an old code and its structure and functionality without the need to modify something for a new feature with its constraints. Especially when someone else wrote it originally.
In order to perform well such a process it is important to plan and remain organized and on track. In case of a context switch it is very important to keep a bookmark of where you need to return to in order to continue. In our team we even did that with a handoff of the task between developers. Extreme Programming, it is.
It is interesting to see the surprising results in terms of lines of code. Originally we thought of it as splitting a micro-service from a monolith. In retrospective, it looks to me more like splitting a service into two services. ‘Micro’ in this case is in the eye of the beholder.

January 13, 2016

Orit Yaron

DevOps – The Outbrain Way

Like many other fast-moving companies, at Outbrain we have tried several iterations in the attempt to find the most effective “DevOps” model for us. As expected with any such effort, the road has been bumpy and there have been many “lessons learned” along the way. As of today, we feel that we have had some major successes in refining this model, and would like to share some of our insights from our journey.

Why to get Dev and Ops together in the first place?

A lot has been written on this topic, and the motivations and benefits of adding the operational perspective into the development cycles has been thoroughly discussed in the industry – so we will not repeat those.

I would just say that we look at these efforts as preventive medicine, like eating well and exercise – life is better when you stay healthy. It’s not as good when you get sick and seek medical treatment to get health again.

What’s in a name?

We do not believe in the term “DevOps”, and what it represents. We try hard to avoid it – why is that?

Because we expect every Operations engineer to have development understanding and skills, and every Developer to have operational understanding of how the service he / she developes works, and we help them achieve and improve those skills – so everyone is DevOps.

We do believe there is a need to get more system and production skills and expertise closer to the development cycles – so we call it Production Engineers.

First try – Failed!

We started by assigning Operations Engineers to work with dedicated development groups – the big problem was that it was done on top of their previous responsibility in building the overall infrastructure (config management, monitoring infrastructure, network architecture etc.), which was already a full time job as it was.

This mainly led to frustration on both sides – the operations eng. who felt they have no time to do anything properly, just touching the surface all the time and spread too thin, and the developers who felt they are not getting enough bandwidth from operations and they are held back.

Conclusion – in order to succeed we need to go all in – have dedicated resources!

Round 2 – Dedicated Production Eng.

Not giving up on the concept and learning from round 1 – we decided to create a new role – “Production Engineers” (or PE for short), whom are dedicated to specific development groups.

This dedication manifest in different levels. Some of them are semi trivial aspects, like seating arrangements – having the PE sit with the development team, and sharing with them the day to day experience; And some of them are focus oriented, like joining the development team goals and actually becoming an integral part of the development team.

On the other hand, the PE needs to keep very close relationship with the Infrastructure Operational team, who continues to develop the infrastructure and tools to be used by the PEs and support the PEs with technical expertise on more complex issues require subject matter experts.

What & How model:

So how do we prevent the brain split situation of the PE? Is the PE part of the development team or the Operations team? When you have several PEs supporting different development groups – are they all independent or can we gain from knowledge transfer between them?

In order for us to have a lighthouse to help us answer all those questions and more that would evident come up, we came up with the “What & How” model:

“What” – stands for the goals, priorities and what needs to be achieved. “The what” is set by the development team management (as they know best what they need to deliver).

“How” – stands for which methods, technologies and processes should be used to achieve those goals most efficiently from operational perspective. This technical, subject matter guidance is provided by the operations side of the house.

So what is a PE @ Outbrain?

At first stage, Operations Engineer is going through an on-boarding period, during which the Eng. gains the understanding of Outbrain operational infrastructure. Once this Eng. gained enough millage he /she can become a PE, joining a development group and working with them to achieve the development goals, set the “right” way from operational perspective, properly leveraging the Outbrain infrastructure and tools.

The PE enjoys both worlds – keeping presence in the Operations group and keeping his/hers technical expertise on one hand, and on the other hand be an integral part of the development team.

From a higher level perspective – we have eliminated the frustrations points, experienced in our first round of “DevOps” implementation, and are gaining the benefit of close relationship, and better understanding of needs and tools between the different development groups and the general Operations group. By the way, we have also gained a new carrier development path for our Operations Eng. and Production Eng. that can move between those roles and enjoy different types of challenges and life styles.

Category: IT/Ops

Introduction

Who are we?

What we do?

The challenge

The “before” picture

The “after” picture

Behind the scenes

Return on investment

A short summary

What’s next

Who we are

Challenges of Scale

Master your Domains

The Demand Domain

The Inventory Lifecycle Domain

The Communication Domain

The Hardware Observability Domain

And then came COVID-19…

Intro

To upgrade?

Switch Re-provisioning Process

15 minutes times 74 is…

A better way

Automated and Robust

Hands-free and auditable

So what just happened?

Intro

In the beginning

It’s good, but is it good enough?

It works so well, why touch it?

One small step for dev, one giant leap for devops

No, really, let’s unfold that last statement

Show me the money

EPILOGUE

Distributed Tracing

What we need to solve

The First Rule

What not to do

Slow and steady wins the race

The Cloud is an Illusion

It was a sunny day in California…

How many Engineers does it take to…?

The people involved:

The process:

Putting Remote Hands in the Driver Seat

How We Built It

More time to build more automation =]

Elasticity

Ad-hoc Projects

Productivity

Summary

Plan (The who’s and why’s)

Up and Running – A Step by Step guide

Refactor, Disconnect & Cleanup

Conclusion

Why to get Dev and Ops together in the first place?

What’s in a name?

First try – Failed!

Round 2 – Dedicated Production Eng.

What & How model:

So what is a PE @ Outbrain?

Search

עברית

Categories

Archive

RSS