Switches, Penguins and One Bad Cable

Back in May 2017, I was scheduled to speak at the DoTC conference in Melbourne. I was really excited and looking forward to it, but fate had different plans. And lots of them. From my son undergoing an emergency appendectomy, through flight delays, and up to an emergency landing back in Tel Aviv… I ended up missing the opportunity to speak at the conference. Amazingly, something similar happened this year! Maybe the 3rd time’s a charm?

The post below is the talk I’d planned to give, converted to blog format.


August 13, 2015. Outbrain’s on-call ops engineer is just getting out of his car when his phone rings. It’s a PagerDuty alert. Some kind of latency issue in the Chicago datacenter. He acks it, figuring he’ll unload the groceries first and then get around to it. But then, his phone rings again. And again.

Forget the groceries. Forget the barbecue. Production is on fire.

18 hours and many tired engineers later, we’re recovering from having lost our Chicago datacenter. In the post-mortem that follows, we trace the root cause to a single network cable that was mistakenly connected to the wrong switch.

Hi, my name is Alex, and I lead the Core Services group at Outbrain. Our group owns everything from the floor that hosts Outbrain’s servers, to the delivery pipelines that ship Outbrain’s code. If you’re here, you’ve likely heard of Outbrain. You probably know that we’re the world’s leading Discovery platform, and that you’ll find us installed on publisher sites like CNN, The Guardian, Time Inc and the Australian news.com, where we serve their readers with premium recommendations.

But it wasn’t always this way.

You see, back when we started, life was simple: all you had to do was throw a bunch of Linux servers in a rack, plug them into a switch, write some code… and sell it. And that we did!

But then, an amazing thing happened. The code that we wrote actually worked, and customers started showing up. And they did the most spectacular and terrifying thing ever – they made us grow. One server rack turned into two and then three and four. And before we knew it, we had a whole bunch of racks, full of penguins plugged into switches. It wasn’t as simple as before, but it was manageable. Business was growing, and so were we.

Fast forward a few years.

We’re running quite a few racks across 2 datacenters. We’re not huge, but we’re not a tiny startup anymore. We have actual paying customers, and we have a service to keep up and running. Internally, we’re talking about things like scale, automation and all that stuff. And we understand that the network is going to need some work. By now, we’ve reached the conclusion that managing a lot of switches is time-consuming, error-prone, and frankly, not all that interesting. We want to focus on other things, so we break the network challenge down into 2 main topics:

Management and Availability.

Fortunately, management doesn’t look like a very big problem. Instead of managing each switch independently, we go for something called “a stack”. In essence, it turns 8 switches into one logical unit. At full density, it lets us treat 4 racks as a single logical switch. With 80 nodes per rack, that’s 320 nodes. Quite a bit of compute power!

Four of these setups – about 1200 nodes.

Across two datacenters? 2400 nodes. Easily 10x our size.

Now that’s very impressive, but what if something goes wrong? What if one of these stacks fails? Well, if the whole thing goes down, we lose all 320 nodes. Sure, there’s built-in redundancy for the stack’s master, and losing a non-master switch is far less painful, but even then, 40 nodes going down because of one switch? That’s a lot.

So we give it some thought and come up with a simple solution. Instead of using one of these units in each rack, we’ll use two. Each node will have a connection to stack A, and another to stack B. If stack A fails, we’ll still be able to go through stack B, and vice versa. Perfect!

In order to pull that off, we have to make these two separate stacks, which are actually two separate networks, somehow connect. Our solution to that is to set up bonding on the server side, making its two separate network interfaces look like a single, logical one. On the stack side, we connect everything to one big, happy, shared backbone. With its own redundant setup, of course.

In case you’re still keeping track of the math, you might notice that we just doubled the number of stacks per datacenter. But we still gained simple management and high availability at 10x scale. All this without having to invest in expensive, proprietary management solutions. Or even having to scale the team.

And so, it is decided. We build our glorious, stack based topology. And the land has peace for 40 years. Or… months.

Fast forward 40 months.

We’re running quite a few racks across 3 datacenters. We’re serving customers like CNN, The Guardian, Time Inc and the Australian news.com. We reach over 500 million people worldwide, serving 250 billion recommendations a month.

We’re using Chef to automate our servers, with over 300 cookbooks and 1000 roles.

We’re practicing Continuous Delivery, with over 150 releases to production a day.

We’re managing petabytes of data in Hadoop, Elasticsearch, MySQL, Cassandra.

We’re generating over 6 million metrics every minute, have thousands of alerts and dozens of dashboards.

Infrastructure as Code is our religion. And as for our glorious network setup? It’s completely, fully, 100% … manual.

No, really. It’s the darkest, scariest part of our infrastructure.

I mean hey, don’t get me wrong, it’s working, it’s allowed us to scale to many thousands of nodes. But every change in the switches is risky because it’s done using the infamous “config management” called “copy-paste”.

The switching software stack and protocols are proprietary, especially the secret sauce that glues the stacks together. Which makes debugging issues a tiring back-and-forth with support at best, or more often just a blind hit-and-miss. The lead time to setting up a new stack is measured in weeks, with risk of creating network loops and bringing a whole datacenter down. Remember August 13th, 2015? We do.

Again, don’t get me wrong, it’s working, it’s allowed us to scale to many thousands of nodes. And it’s not like we babysit the solution on a daily basis. But it’s definitely not Infrastructure as Code. And there’s no way it’s going to scale us to the next 10x.

Fast forward to June 2016.

We’re still running across 3 datacenters, thousands of nodes. CNN, The Guardian, Time Inc, the Australian news.com. 500 million users. 250 billion recommendations. You get it.

But something is different.

We’re just bringing up a new datacenter, replacing the oldest of the three. And in it, we’re rolling out a new network topology. It’s called a Clos Fabric, and it’s running BGP end-to-end. It’s based on a design created by Charles Clos for analog telephony switches, back in the ’50s. And on the somewhat more recent RFCs, authored by Facebook, that bring the concept to IP networks.

In this setup, each node is connected to 2 top-of-rack switches, called leaves. And each leaf is connected to a bunch of end-of-row switches, called spines. But there’s no bonding here, and no backbone. Instead, what glues this network together is the fact that everything in it is a router. And I do mean everything – every switch, every server. They publish their IP addresses over all of their interfaces, essentially telling their neighbours, “Hi, I’m here, and you can reach me through these paths.” And since their neighbours are routers as well, they propagate that information.

Thus a map of all possible paths to all possible destinations is constructed, hop-by-hop, and held by each router in the network. Which, as I mentioned, is everyone. But it gets even better.

We’ve already mentioned that each node is connected to two leaf switches. And that each leaf is connected to a bunch of spine switches. It’s also worth mentioning that they’re not just “connected”. They’re wired the exact same way. Which means that any path between two points in the network is the exact same distance. And what THAT means is that we can rely on something called ECMP. Which, in plain English, means “just send the packets down any available path, they’re all the same anyway”. And ECMP opens up interesting options for high availability and load distribution.
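To make ECMP a bit more concrete, here is a minimal sketch – our own illustration, not switch code, since real devices do this in hardware – of how hashing a flow’s 5-tuple picks one of several equal-cost next hops. The spine names are made up.

import hashlib

NEXT_HOPS = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical spines

def pick_path(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    # Hash the flow's 5-tuple; the same flow always maps to the same spine,
    # so its packets stay in order, while different flows spread across all paths.
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
    digest = hashlib.sha1(flow.encode()).hexdigest()
    return NEXT_HOPS[int(digest, 16) % len(NEXT_HOPS)]

print(pick_path("10.0.1.17", "10.0.4.42", 55123, 443))  # e.g. 'spine-3'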

Let’s pause to consider some of the gains here:

First, this is a really simple setup. All the leaf switches are the same. And so are all of the spines. It doesn’t matter if you have one, two or thirty. And pretty much the same goes for cables. This greatly simplifies inventory, device and firmware management.

Second, it’s predictable. You know the exact number of hops from any one node in the network to any other: it’s either two or four, no more, no less. Wiring is predictable as well. We know exactly what gets connected where, and what the exact cable lengths are, right from the design phase. (spoiler alert:) We can even validate this in software.

Third, it’s dead easy to scale. When designing the fabric, you choose how many racks it’ll support, and at what oversubscription ratio. I’ll spare you the math and just say:

You want more bandwidth? Add more spines.

Support more racks? Go for spines with higher port density.

Finally, high availability is built into the solution. If a link goes down, BGP will make sure all routers are aware. And everything will still work the same way, because with our wiring scheme and ECMP, all paths are created equal. Take THAT evil bonding driver!

But it doesn’t end there. Scaling the pipes is only half the story. What about device management? The infamous copy-paste? Cable management? A single misconnected cable that could bring a whole datacenter down? What about those?

Glad you asked 🙂

After a long, thorough evaluation of multiple vendors, we chose Cumulus Networks as our switch Operating System vendor, and Dell as our switch hardware vendor. Much like you would with servers, by choosing Red Hat Enterprise Linux, SUSE or Ubuntu. Or with mobile devices, by choosing Android. We chose a solution that decouples the switch OS from the hardware it’s running on. One that lets us select hardware from a list of certified vendors, like Dell, HP, Mellanox and others.

So now our switches run Cumulus Linux, allowing us to use the very same tools that manage our fleet of servers to now manage our fleet of switches. To apply the same open mindset in what was previously a closed, proprietary world.

In fact, when we designed the new datacenter, we wrote Chef cookbooks to automate provisioning and config. We wrote unit and integration tests using Chef’s toolchain and set up a CI pipeline for the code. We even simulated the entire datacenter, switches, servers and all, using Vagrant.

It worked so well, that bootstrapping the new datacenter took us just 5 days. Think about it:

The first time we ever saw a real Dell switch running Cumulus Linux was when we arrived on-site for the buildout. And yet, 99% of our code worked as expected. In 5 days, we were able to set up a LAN, VPN, server provisioning, DNS, LDAP and deal with some quirky BIOS configs. On the servers, mind you, not the switches.

We even hooked Cumulus’ built-in cabling validation to our Prometheus based monitoring system. So that right after we turned monitoring on, we got an alert. On one bad cable. Out of 3000.
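To give a feel for what that kind of validation looks like, here is a toy sketch – not Cumulus’ actual tooling – that compares an expected wiring plan against observed LLDP neighbours. All hostnames, ports and the get_lldp_neighbors() helper are hypothetical.

EXPECTED = {
    ("leaf01", "swp49"): ("spine01", "swp1"),
    ("leaf01", "swp50"): ("spine02", "swp1"),
}

def get_lldp_neighbors(switch):
    # Placeholder: in reality this would come from LLDP data on the switch.
    return {"swp49": ("spine01", "swp1"), "swp50": ("spine03", "swp7")}  # one bad cable

def validate(switch):
    problems = []
    observed = get_lldp_neighbors(switch)
    for (sw, port), expected_peer in EXPECTED.items():
        if sw == switch and observed.get(port) != expected_peer:
            problems.append(f"{port}: expected {expected_peer}, saw {observed.get(port)}")
    return problems

print(validate("leaf01"))  # flags the miswired swp50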

Infrastructure as Code anyone?


How Well Do You Know Your Servers?

It’s a sunny day in Secaucus, New Jersey. The Outbrain Facilities team heads home on a high note. They’ve just racked, stacked, and inventoried over 100 new servers for the Delivery team. As they go to sleep later that night, the Delivery team begins their day in Netanya, Israel. They provision these servers for their new Kubernetes cluster, which the Developers are eager to migrate their services to.

Fast forward… microservices are migrated from bare metal to Docker containers running on our shiny new fleet of Kubernetes servers. The developers notice performance issues, but the behavior is not consistent, so they reach out to the engineers in Delivery. Delivery narrows it down to a few problematic servers that are running at ~50% performance. They all run the same services as the well-performing servers. Their CPU throttling configurations are correctly aligned. They all run the exact same Chef recipes. With all other options exhausted, they turn to the hardware:



As they dig deeper, they realize that the problematic hosts don’t stand out at all. All servers in the cluster are the same model, have the same specs and look 100% healthy by all metrics that we collect. They are the latest and greatest cloud platform servers from Dell: The PowerEdge C6320. With not much left to go on, they finally find a single difference: the IPMI system event logs show that all of the problematic hosts are running on only 1 Power Supply (Instead of 2 for redundancy).

#ipmitool sel elist
8 | 01/11/2018 | 16:20:52 | Power Supply PSU 2 Status | Power Supply AC lost | Asserted

Down the Stack, We Go

Enter Facilities team again: a 2-man team managing thousands of servers across 3 data centers. Of all the issues that could arise, these brand new servers were the least of our worries. It’s a not-so-sunny day in New Jersey when we wake up to an email listing a handful of servers running on one power supply. Luckily, the hardware was fine and they were all just cases of a loose power cable.
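For the curious, here is a rough sketch of how such a report could be automated – an illustration, not our actual tooling: sweep the fleet, run ipmitool sel elist on each host, and flag anything with a “Power Supply AC lost” event. The run_on_host() helper and host names are placeholders.

import subprocess

def run_on_host(host, command):
    # Placeholder: run a command on a remote host over SSH and return its output.
    return subprocess.check_output(["ssh", host, command], text=True)

def hosts_on_one_psu(hosts):
    flagged = []
    for host in hosts:
        sel = run_on_host(host, "ipmitool sel elist")
        if "Power Supply AC lost" in sel:
            flagged.append(host)
    return flagged

print(hosts_on_one_psu(["kube-node-01", "kube-node-02"]))  # hypothetical hosts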


Redundancy is fundamental when building out any production environment. A server has 2 power supplies for these exact situations. If one power strip (PDU) goes down, if a power supply fails, or a power cable is knocked loose, a server will still be up and running. So from the Operating System’s point of view, running on one power supply should not matter. But lo and behold, after plugging in those loose cables, the performance immediately goes back to 100% on all of those servers.

We contact Dell Support, which assumes it’s the CPU Frequency Scaling Governor and asks us to change the power management profile to Performance. In a nutshell, the Operating System has control of CPU throttling, and they want us to grant that control to the BIOS to rule out OS issues (Dell does not support Ubuntu 16.04, which is what we currently use).

After making these changes, the issue persists:

The command used in every test is: stress-ng --all 0 --class cpu --timeout 5m --metrics-brief

Power Profile: PerfPerWatt(OS)
PSUs: 2
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [152332] cpu              48697    300.19    320.79      0.06       162.22       151.77

Power Profile: PerfPerWatt(OS)
PSUs: 1
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [160954] cpu              21954    300.49    315.72      0.28        73.06        69.47

Power Profile: Performance
PSUs: 2
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [3687] cpu              48182    300.18    319.02      0.09       160.51       150.99

Power Profile: Performance
PSUs: 1
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [11623] cpu              21372    300.45    306.56      0.14        71.13        69.68

The number to pay attention to in these tests is the final one on each line (bogo ops/s). When we unplug one power supply, we see the number of operations that the CPU can handle drop dramatically. Dell involves a senior engineer, who confirms that this is actually expected behavior and references a Dell whitepaper about Power Capping on the C6320 Servers. After briefly reviewing the paper, we assume that we could just check the iDrac web interface for power capping:

We see that no power cap policy is set, so things still aren’t adding up. It’s time to really dig into this whitepaper and (with some help from Dell Support) we manage to find a root cause and a solution. Although we received these servers with power capping disabled, Emergency Power Capping was enabled.

What is the Emergency Power Capping Policy? Glad you asked! When a C6320 chassis is running on one power supply, throttle the CPU no matter what the power consumption is. This is a field that is completely hidden in the iDrac web interface, so how can you tell if this policy is enabled? Another great question! It is enabled if the 8th byte of the following ipmitool output is “01”:

#ipmitool raw 0x30 0xC5
00 00 00 02 00 00 00 01 00

*You can use ipmitool raw 0x30 0xC5 to get the current policy from any node within the chassis.
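If you want to check this programmatically, here is a small helper – our own sketch, not Dell tooling – that runs the command above and looks at that 8th byte.

import subprocess

def emergency_throttling_enabled():
    out = subprocess.check_output(["ipmitool", "raw", "0x30", "0xC5"], text=True)
    data = out.split()      # e.g. ['00', '00', '00', '02', '00', '00', '00', '01', '00']
    return data[7] == "01"  # 8th byte: 01 means emergency throttling via PROCHOT

print("Emergency throttling:", "enabled" if emergency_throttling_enabled() else "disabled")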


Let us save you time and your sanity by explaining what you are looking at and how you can automate your way out of this hidden setting.

That 8th byte means that the chassis should throttle the CPU when running on one power supply, via PROCHOT. The other options for this byte are to emergency throttle via NM (Node Manager), or to turn off emergency throttling. PROCHOT is short for “processor hot”, a processor technology that throttles the processor when certain conditions are met. More information on PROCHOT can be found in Intel’s whitepaper on page 7 (written for the Xeon 5500 series, but still relevant). NM is short for Intel’s Node Manager. More information on NM can be found here.

Our testing shows that the CPU does NOT throttle on one PSU when using NM, but we can’t find a clear answer on the logic behind the NM, so we decide to avoid it completely and turn off emergency throttling. Instead, we set the Chassis Power Capping Value to 1300W. Each of the power supplies we use is 1400W, so this limit accounts for a situation where the system is running on one power supply and will throttle to avoid power failure or overheating.

We use a Chef recipe to set our desired policy with this command:

ipmitool raw 0x30 0x17 0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00

The first part of the command sets the chassis power capping policy:
#ipmitool raw 0x30 0x17 

These 9 bytes, which are explained in the Dell whitepaper, roughly translate to: 
“Enable chassis power capping with a value of 1300W, and disable emergency throttling”. 
0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00

The 1st byte enables chassis power capping: 0x01
The 2nd and 3rd bytes indicate the power limit value: 0x14 0x05
The 8th byte disables emergency throttling: 0x00

Here is a quick reference for other power limit values:
0x2c 0x01 = 300W
0xf4 0x01 = 500W
0x20 0x03 = 800W
0x14 0x05 = 1300W

*This command can be run from any blade within the chassis, and we tested it on Ubuntu 16.04 and CentOS 7.
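If you need the command for a different wattage, the two value bytes are simply the limit encoded little-endian (1300W = 0x0514 → 0x14 0x05). Here is a hedged helper that reproduces the byte layout as we read it from the whitepaper – double-check against the document before using it.

def power_cap_command(watts, emergency_throttling=False):
    low, high = watts & 0xFF, (watts >> 8) & 0xFF
    return " ".join([
        "ipmitool raw 0x30 0x17",
        "0x01",                            # 1st byte: enable chassis power capping
        f"0x{low:02x} 0x{high:02x}",       # 2nd-3rd bytes: power limit, little-endian
        "0x02 0x02 0x00 0x00",
        "0x01" if emergency_throttling else "0x00",  # 8th byte: emergency throttling
        "0x00",
    ])

print(power_cap_command(1300))
# ipmitool raw 0x30 0x17 0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00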

Our policy only caps power at the chassis level. Dell Support strongly discourages enabling capping at both the chassis and sled/blade level, as this could lead to inconsistent results.

Let’s Improve Together

This obscure hardware issue sent ripples all the way up the stack, and thanks to the close communication between our various development and operations teams, we were able to find the root cause and solution. We hope this information saves other engineering teams time and effort in their future buildouts.

With thousands of servers to manage and a lean Facilities team, hardware automation and visibility are crucial. Have similar challenges? Have a hardware-related war story? We’d love to hear about it – leave a reply below!

Structure a Vue.js App from Containers and Components

Recently we’ve begun using Vue.js as a frontend framework for one of our infrastructure projects. We’ve contracted Dr. Yoram Kornatzky to join our Delivery team and dive together, headlong, into this brave new world.

In this blog post by Yoram, and others to come, we’ll share snippets from this journey. We hope you find these beneficial, and invite you to share your own experiences and feedback.

Vue.js using Vuex for state management does not have a clear distinction between containers and components. This is in clear contrast to React using Redux for state management.

We argue that such a distinction between containers and components is beneficial for Vue.js as well.


Dyploma is a system for managing containerized applications and services on top of Kubernetes at Outbrain. Dyploma includes the concepts of:

  • artifacts
  • builds
  • deployments
  • services

Dyploma is made out of a Java Spring backend and a Python command-line tool (CLI). The command-line tool operates through API calls to the backend.

The Dyploma Web Application

To facilitate broader adoption of containers within Outbrain, we set out to develop a web application that will have the capabilities of the Dyploma CLI.

The web application will operate by fetching data from the backend and sending operations for execution in the backend. This will be done through the same REST API used by the CLI.

A Vue.js Web Application

We chose Vue.js for constructing the web application. The app was constructed using vue-cli with the webpack template.

The application has three generic screens:

  • list
  • detail
  • form

All concepts have screens from each of these types with similar structure and look and feel, but with different actions and different data.


Vuex is the standard state management approach for Vue.js.

Containers vs Components in React

Let us first recap what containers and components are in React.

A container interacts with the Redux store and contains a component. The container supplies data to the component through selectors on the store and provides the actions on the store to the component.

Components are given data and render HTML. They use the actions provided from their container to interact with the state. Such actions modify the state, resulting in the selectors fetching new data, and causing the component to be rendered again.

Vue.js with Vuex

Vue.js standard practice does not have the containers vs components distinction. While constructing the Dyploma web application we found it useful to make such a distinction for the benefits of better code structure and reusability.

Let us first describe the structure of the Dyploma web application.

Generic Components

We constructed three generic components:

  1. list
  2. detail
  3. form

These can be composed into a component tree that can have more than 3 levels.

Each of these generic screens was used, with some variations, by multiple types of data. But the look and feel could be configured through a common JSON that describes, for each type of data, the different fields.

Type Specific Actions and Getters

The getters and actions to be used for each type of data were different. We constructed our Vuex store with modules and needed to use a separate module for each type.

Distinguish Components and Containers

So we had to think about how to resolve two opposite requirements. For the benefit of reusability, we need unified generic components. But for the type-specific actions and data, we need to use separate modules. We decided up front that the whole app will be constructed as a set of single file components (SFC).

To resolve these two opposite directions, we found it useful to think of our app as consisting of two things:

  • containers – type-specific; they interact with the store
  • components – generic


We gave each component data props for the data it should render, along with a description of the structure of that data. For any changes and actions required, it emits an event.

Data is passed from a component to its constituents with v-bind, like v-bind:list="deployments".

Events are hooked up with v-on, like v-on:search="search".

Components are composed of smaller components. Events are propagated up the tree of components. This bottom-up propagation may be disturbing to some, but it is the best approach for Vue.js. In this respect, Vue.js is definitely different from React.

The component is a single file component (SFC).

Such a component is not necessarily functional.

A Container for Each Type of Data

A container knows which module of the store it deals with, and knows its actions and getters. It fetches data from the store using getters. Such data is passed to the components as props.

It listens to events coming from the components using v-on, like v-on:search="search". In response to such events, it dispatches actions.

The container does not render anything itself; this is done by the component it contains.

The container is a single file component (SFC) as well.

A Clean Separation Facilitates Reusability

This clean separation of components and containers makes it simpler to see opportunities for reusability. Come to think of it, in most web apps, the real effort in reusability is reusability of the components. The mixing of components and containers causes many components to be coupled with the store. This makes it harder to identify reusability. By distinguishing components and containers, we isolate the components from the store and see opportunities for reusability more clearly.

Easier Testing

Writing unit tests becomes easier with this separation. One can write three classes of tests:

  1. components
  2. containers
  3. store

Each becoming simpler.

We will discuss this further in a separate article.


Split your Vue.js web app into containers and components.

Live Tail in a Kubernetes / Docker Based Environment

At Outbrain we are big believers in Observability.

What is Observability, and what is the difference between Observability and Monitoring? I will leave the explanation to Baron Schwartz @xaprb:

“Monitoring tells you whether the system works. Observability lets you ask why it’s not working.”

At Outbrain we are currently in the midst of migrating to a Kubernetes / Docker based environment.

This presented many new challenges around understanding why things don’t work.

In this post I will be sharing with you our logging implementation which is the first tool used to understand the why.

But first things first, a short review of our current standard logging architecture:

We use a standard ELK stack for the majority of our logging needs. By standard I mean Logstash on bare metal nodes, Elasticsearch for storage and Kibana for visualization and analytics. Apache Kafka is the transport layer for all of the above.

A very simplified sketch of the system:

Of course the setup is a bit more complex in real life since Outbrain’s infrastructure is spread across thousands of servers, in multiple physical data centers and cloud providers; and there are multiple Elasticsearch clusters for different use cases.

Add to the equation that these systems are used in a self-serve model, meaning the engineers are creating and updating configurations by themselves – and you end up with a complex system which must be robust and resilient, or the users will lose trust in the system.

The move to Kubernetes presented new challenges and requirements, specifically related to the logging tools:

  • Support multiple Kubernetes clusters and data centers.
  • We don’t want to use “kubectl”, because managing keys is a pain, especially in a multi-cluster environment.
  • Provide a way to tail logs and even edit log files. This should be available on a single pod or across a service deployed in multiple pods.
  • Leverage existing technologies: Kafka, the ELK stack and Log4j on the client side.
  • Support all existing logging sources, like multiline and JSON.
  • Don’t forget services which don’t run in Kubernetes – yes, we still need to support those.


So how did we meet all those requirements? Time to talk about our new Logging design.

The new architecture is based on a standard Kubernetes logging setup – a Fluentd daemonset running on each Kubelet node, and all services are configured to send logs to stdout/stderr instead of a file.

The Fluentd agent collects the pods’ logs and adds the Kubernetes-level labels to every message.

The Fluentd plugin we’re using is the kubernetes_metadata_filter.

After the messages are enriched they are stored in a Kafka topic.

A pool of Logstash agents (running as pods in Kubernetes) consumes and parses messages from Kafka as needed.

Once parsed, messages can be indexed into Elasticsearch or routed to another topic.

A sketch of the setup described:


And now it is time to introduce CTail.

CTail stands for Containers Tail. It is an Outbrain homegrown tool written in Go, based on server-side and client-side components.

A CTail server-side component runs per datacenter or per Kubernetes cluster, consuming messages from a Kafka topic named “CTail” and, based on the Kubernetes app label, creating a stream which can be consumed via the CTail client component.

Since order is important for log messages, and since Kafka only guarantees order for messages in the same partition, we had to make sure messages are partitioned by the pod_id.
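For illustration only (this is not the actual Fluentd/CTail code): keying each Kafka message by pod_id is what pins all of a pod’s messages to one partition and preserves their order. A minimal sketch with the kafka-python client – the broker address and payload shape are assumptions:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],                      # hypothetical broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_log(pod_id, message):
    # Same key -> same partition -> per-pod ordering is preserved.
    producer.send("CTail", key=pod_id, value={"pod_id": pod_id, "msg": message})

publish_log("ob1ktemplate-test-ssages-2751568960-n1kwd", "Running 5 self tests now...")
producer.flush()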

With this new setup and tooling, when Outbrain engineers want to live tail their logs, all they need to do is launch the CTail client.

Once the CTail client starts, it will query Consul (which is what we use for service discovery) to locate all of the CTail servers, register to their streams, and perform aggregations in memory – resulting in a live stream of log entries.

Here is a sketch demonstrating the environment and an example of the CTail client output:



To view logs from all pods of a service called “ob1ktemplate”, all you need to run is:

# ctail-client -service ob1ktemplate -msg-only

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: uri http://localhost:8181/Ob1kTemplate/ returned status code 200
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: Getting uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate'
2017-06-13T19:16:25.531Z ob1ktemplate-test-ssages-2751568954-n1rte: uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate' returned status code 200

Or logs of a specific pod:

# ctail-client -service ob1ktemplate -msg-only -pod ob1ktemplate-test-ssages-2751568960-n1kwd

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri 
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751568960-n1kwd: uri http://localhost:8181/Ob1kTemplate/ returned status code 200


This is how we solve this challenge.

Interested in reading more about other challenges we encountered during the migration? Either wait for our next blog, or reach out to visibility at outbrain.com.


Keep bugs out of production

Production bugs are painful and can severely impact a dev team’s velocity. My team at Outbrain has succeeded in implementing a work process that enables us to send new features to production free of bugs, a process that incorporates automated functions with team discipline.

Why should I even care?

Bugs happen all the time – and they will be found locally or in production. But the main difference between preventing and finding the bug in a pre-production environment is the cost: according to IBM’s research, fixing a bug in production can cost 5 times more than discovering it in pre-production environments (during the design, local development, or test phase).

Let’s describe one of the scenarios that can happen once a bug reaches production:

  • A customer finds the bug and alerts customer service.
  • The bug is logged by the production team.
  • The developer gets the description of the bug, opens the spec, and spends time reading it over.
  • The developer then will spend time recreating the bug.
  • The developer must then reacquaint him/herself with the code to debug it.
  • Next, the fix must undergo tests.
  • The fix is then built and deployed in other environments.
  • Finally, the fix goes through QA testing (requiring QA resources).

How to stop bugs from reaching production

To catch and fix bugs at the most time-and-cost-efficient stage, we follow these steps, adhering to several core principles:


Stage 1 – Local Environment and CI

Step 1: Design well. Keep it simple.

Create the design before coding: try to divide difficult problems into smaller parts/steps/modules that you can tackle one by one, thinking of objects with well-defined responsibilities. Share the plans with your teammates at design-review meetings. Good design is a key to reducing bugs and improving code quality.

Step 2: Start Coding

Code should be readable and simple. Design and development principles are your best friends. Use SOLID, DRY, YAGNI, KISS and Polymorphism to implement your code.
Unit tests are part of the development process. We use them to test individual code units and ensure that the unit is logically correct.
Unit tests are written and executed by developers. Most of the time we use JUnit as our testing framework.

Step 3: Use code analysis tools

To help ensure and maintain the quality of our code, we use several automated code-analysis tools:
FindBugs – A static code analysis tool that detects possible bugs in Java programs, helping us to improve the correctness of our code.
Checkstyle – A development tool that helps programmers write Java code that adheres to a coding standard. It automates the process of checking Java code.

Step 4: Perform code reviews

We all know that code reviews are important. There are many best practices online (see 7 Ways to Up Level Your Code Review Skills, Best Practices for Peer Code Review, and Effective Code Reviews), so let’s focus on the tools we use. All of our code commits are populated to ReviewBoard, and developers can review the committed code, see at any point in time the latest developments, and share input.
For the more crucial teams, we have a build that makes sure every commit has passed a code review – in the case that a review has not been done, the build will alert the team that there was an unreviewed change.
Regardless of whether you are performing a post-commit, a pull request, or a pre-commit review, you should always aim to check and review what’s being inserted into your codebase.

Step 5: CI

This is where all code is being integrated. We use TeamCity to enforce our code standards and correctness by running unit tests, FindBugs validations, Checkstyle rules and other types of policies.

Stage 2 – Testing Environment

Step 1: Run integration tests

Check if the system as a whole works. Integration testing is also done by developers, but rather than testing individual components, it aims to test across components. A system consists of many separate components like code, database, web servers, etc.
Integration tests are able to spot issues like wiring of components, network access, database issues, etc. We use Jenkins and TeamCity to run CI tests.

Step 2: Run functional tests

Check that each feature is implemented correctly by comparing the results for a given input with the specification. Typically, this is not done at the development level.
Test cases are written based on the specification, and the actual results are compared with the expected results. We run functional tests using Selenium and Protractor for UI testing and JUnit for API testing.
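As a flavour of what such a test looks like, here is a minimal functional-test sketch using Selenium’s Python bindings – the URL and expected title are made up, and our real tests live in their own framework:

from selenium import webdriver

def test_homepage_title():
    driver = webdriver.Chrome()
    try:
        driver.get("http://staging.example.com/")  # hypothetical staging URL
        # Compare the actual result with the expected result from the spec.
        assert "Dashboard" in driver.title
    finally:
        driver.quit()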

Stage 3 – Staging Environment

This environment is often referred to as a pre-production sandbox, a system testing area, or simply a staging area. Its purpose is to provide an environment that simulates your actual production environment as closely as possible so you can test your application in conjunction with other applications.
Move a small percentage of real production requests to the staging environment where QA tests the features.

Stage 4 – Production Environment

Step 1: Deploy gradually

Deployment is a process that delivers our code to the production machines. If errors occur during deployment, our Continuous Delivery system will pause the deployment, preventing the problematic version from reaching all the machines, and allow us to roll back quickly.

Step 2: Incorporate feature flags

All our new components are released with feature flags, which basically serve to control the full lifecycle of our features. Feature flags allow us to manage components and compartmentalize risk.

Step 3: Release gradually

There are two ways to make our release gradual:

  1. We test new features on a small set of users before releasing to everyone.
  2. We open the feature initially to, say, 10% of our customers, then 30%, then 50%, and then 100%.

Both methods allow us to monitor and track problematic scenarios in our systems.
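Here is a sketch of how a deterministic percentage rollout can work – the flag name and config source are hypothetical, not our actual feature-flag service: hash the customer id into one of 100 buckets and compare it against the current rollout percentage. Raising the percentage only ever adds customers, so the rollout is stable and repeatable.

import hashlib

ROLLOUT_PERCENT = {"new-recommendations-widget": 30}  # currently open to 30% of customers

def is_enabled(flag, customer_id):
    # Same flag + customer always lands in the same bucket, so results are deterministic.
    bucket = int(hashlib.sha1(f"{flag}:{customer_id}".encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)

print(is_enabled("new-recommendations-widget", "customer-1234"))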

Step 4: Monitoring and Alerts

We use the ELK stack consisting of Elasticsearch, Logstash, and Kibana to manage our logs and events data.
For time series data, we use Prometheus as the metric storage and alerting engine.
Each developer can set up their own metrics and build Grafana dashboards.
Setting the alerts is also part of the developer’s work, and it is their responsibility to tune the threshold that triggers a PagerDuty alert.
PagerDuty is an automated call, texting, and email service, which escalates notifications between responsible parties to ensure the issues are addressed by the right people at the right time.
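For instance, a developer might instrument a service with the Prometheus Java simpleclient (the metric names below are made up) and then build Grafana dashboards and alert thresholds on top of these series:

    import io.prometheus.client.Counter;
    import io.prometheus.client.Histogram;

    // Illustrative instrumentation; metric names are hypothetical.
    class RequestMetrics {
        static final Counter REQUESTS = Counter.build()
                .name("recommendations_requests_total")
                .help("Total recommendation requests served.")
                .register();

        static final Histogram LATENCY = Histogram.build()
                .name("recommendations_latency_seconds")
                .help("Recommendation request latency in seconds.")
                .register();

        static void recordRequest(Runnable handler) {
            Histogram.Timer timer = LATENCY.startTimer();
            try {
                handler.run();
                REQUESTS.inc();
            } finally {
                timer.observeDuration();  // records the request duration into the histogram
            }
        }
    }

A Prometheus alerting rule can then page (via PagerDuty) when, say, the latency crosses the threshold the developer chose.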

All in all, don’t let the bugs fly out of control.

Migrating Elephants – How To Migrate Petabyte Scale Hadoop Clusters With Zero Downtime

Outbrain has been an early adopter of Hadoop and we, the team operating it, have acquired a lot of experience running it in production in terms of data ingestion, processing, monitoring, upgrading etc. This also means that we have a significant ecosystem around each cluster, with both open source and in-house systems.

A while back we decided to upgrade both the hardware and software versions of our Hadoop clusters.

“Why is that a big problem?” you might ask, so let me explain a bit about our current Hadoop architecture. We have two clusters of 300 machines in two different data centers, production and DR. Each cluster has a total dataset size of 1.5 PB, with 5 TB of compressed data loaded into it each day. There are ~10,000 job executions daily of about 1,200 job definitions, written by dozens of developers, data scientists and various other stakeholders within the company, spread across multiple teams around the globe. These jobs do everything from moving data into Hadoop (e.g. Sqoop or MySQL-to-Hive data loads), to processing within Hadoop (e.g. running Hive, Scalding or Pig jobs), to pushing the results into external data stores (e.g. Vertica, Cassandra, MySQL). An additional dimension of complexity originates from the dynamic nature of the system, since developers, data scientists and researchers push dozens of changes to how flows behave in production on a daily basis.

This system needed to be migrated to run on new hardware, using new versions of multiple components of the Hadoop ecosystem, without impacting production processes and active users. A partial list of the components and technologies currently in use, all of which had to be taken into consideration, includes HDFS, Map-Reduce, Hive, Pig, Scalding and Sqoop. On top of that, of course, we have several more in-house services for data delivery, monitoring and retention that we have developed.

I’m sure you’ll agree that this is quite an elephant.

Storming Our Brains

We sat down with our users, and started thinking about a process to achieve this goal and quickly arrived at several guidelines that our selected process should abide by:

  1. Both Hadoop clusters (production and DR) should always be kept fully operational
  2. The migration process must be reversible
  3. Both value and risk should be incremental

After scratching our heads for quite a while, we came up with these options:

  1. In place: Migrate the existing cluster in place to the new version, and then roll out the hardware upgrade by gradually pushing new machines into the cluster and removing old ones. This is the simplest approach, and you should probably have a very good reason to choose a different path if you can afford the risk. However, since upgrading the system in place would expose clients to a huge change in an uncontrolled manner, and is not by any means an easily reversible process, we had to forgo this option.
  2. Flipping the switch: The second option is to create a new cluster on new hardware, sync the required data, stop processing on the old cluster and move it to the new one. The problem here is that we still couldn’t manage the risk, because we would be stopping all processing and moving it to the new cluster at once. We wouldn’t know if the new cluster could handle the load, or if each flow’s code was compatible with the new component versions. As a matter of fact, there were a lot of unknowns that made it clear we had to split the problem into smaller pieces. The difficulty with splitting in this approach is that once you move a subset of the processing from the old cluster to the new one, those results will no longer be accessible on the old cluster. This means that we would have had to migrate all dependencies of that initial subset. Since we have 1200 flow definitions with marvelous and beautiful interconnections between them, the task of splitting them would not have been practical, and very quickly we found that we would have to migrate all flows together.
  3. Side by side execution: The third option is to start processing on the new cluster without stopping the old one. This is a sort of active-active approach, because both Hadoop clusters, new and old, will contain the processing results. This would allow us to migrate parts of the workload without risking interference with any working pipeline in the old cluster. Sounds good, right?


First Steps

To better understand the chosen solution let’s take a look at our current architecture:

[Diagram: our current architecture]

We have a framework that allows applications to push raw event data into multiple Hadoop clusters. For the sake of simplicity the diagram describes only one cluster.

Once the data reaches Hadoop, processing begins to take place using a framework for orchestrating data flows we’ve developed in house that we like to call the Workflow Engine.

Each Workflow Engine belongs to a different business group. That Workflow Engine is responsible for triggering and orchestrating the execution of all flows developed and owned by that group. Each job execution can trigger more jobs on its current Workflow Engine or trigger jobs in other business groups’ Workflow Engines. We use this partitioning mainly for management and scale reasons, but during the planning of the migration it provided us with a natural way to partition the workload, since there are far fewer dependencies between groups than within each group.


Now that you have a better understanding of the existing layout you can see that the first step is to install a new Hadoop cluster with all required components of its ecosystem and begin pushing data into it.

To achieve this, we configured our dynamic data delivery pipeline system to send all events to the new cluster as well as the old, so now we have a new cluster with a fully operational data delivery pipeline:

[Diagram: data delivery pipeline]


Side by Side

Let’s think a bit about what options we had for running a side by side processing architecture.

We could use the same set of Workflow Engines to execute their jobs on both clusters, active and new. While this method would have the upside of saving machines and lowering operational costs, it would potentially double the load on each machine, since jobs are assigned to machines in a static manner. This is due to the fact that each Workflow Engine is assigned a business group, and all jobs that belong to this group are executed from it. To isolate the current production job executions from those targeting the new cluster, we decided to allocate independent machines for the new cluster.

Let the Processing Commence!

Now that we have a fully operational Hadoop cluster running alongside our production cluster, with raw data delivered into it, you might be tempted to say: “Great! Bring up a set of Workflow Engines and let’s start side by side processing!”

Well… not really.

Since there are so many jobs, doing varied types of operations, we can’t really assume that letting them all run side by side is a good idea. For instance, if a job calculates some results and then pushes them to MySQL, these results will be pushed twice. Aside from doubling the load on the databases for no good reason, in some cases it may cause data corruption or inconsistencies due to race conditions. In essence, every job that writes to an external datasource should be allowed to run only once.

So we defined two execution modes a Workflow Engine can have:

Leader: Run all the jobs!

Secondary: Run all jobs except those that might have a side effect external to that Hadoop cluster (e.g. writing to an external database or triggering an applicative service). This is done automatically by the framework, requiring no effort from the development teams.

When a Workflow Engine is in secondary mode, jobs executed from it can read from any source, but write only to a specific Hadoop cluster. That way they are essentially filling it up and keeping it (to a degree) in sync with the other cluster.
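Conceptually, the decision boils down to the engine’s mode and whether the job has side effects outside the cluster. The sketch below is a hypothetical illustration, not our actual Workflow Engine code:

    // Hypothetical sketch of the leader/secondary decision; Job and its side-effect flag are illustrative.
    enum ExecutionMode { LEADER, SECONDARY }

    class Job {
        final String name;
        final boolean hasExternalSideEffects;  // e.g. writes to MySQL/Vertica/Cassandra, or triggers a service

        Job(String name, boolean hasExternalSideEffects) {
            this.name = name;
            this.hasExternalSideEffects = hasExternalSideEffects;
        }
    }

    class WorkflowEngine {
        private final ExecutionMode mode;

        WorkflowEngine(ExecutionMode mode) {
            this.mode = mode;
        }

        /** A leader runs everything; a secondary skips anything that touches the outside world. */
        boolean shouldRun(Job job) {
            return mode == ExecutionMode.LEADER || !job.hasExternalSideEffects;
        }
    }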

Let’s Do This…

Phase 1 of the migration should look something like this:

[Diagram: phase 1 of the migration]


Notice that I’ve only included a Workflow Engine for one group in the diagram for simplicity but it will look similar for all other groups.

So the idea is to bring up a new Workflow Engine and give it the role of a migration secondary. This way it will run all jobs except for those writing to external data stores, thus eliminating all side effects external to the new Hadoop cluster.

By doing so, we were able to achieve multiple goals:

  1. Test basic software integration with the new Hadoop cluster version and all services of the ecosystem (hive, pig, scalding, etc.)
  2. Test new cluster’s hardware and performance compared to the currently active cluster
  3. Safely upgrade each business group’s Workflow Engine separately without impacting other groups.


Since the new cluster is running on new hardware and with a new version of the Hadoop ecosystem, this is a huge milestone towards validating our new architecture. The fact that we managed to do so without risking any downtime that could have resulted from failing processing flows, wrong cluster configurations or any other potential issue was key in achieving our migration goals.


Once we were confident that all phase 1 jobs were operating properly on the new cluster, we could continue to phase 2, in which the migration leader becomes the secondary and the secondary becomes the leader. Like this:

[Diagram: phase 2 of the migration]


In this phase, all jobs begin running from the new Workflow Engine, impacting all production systems, while the old Workflow Engine only runs jobs that write data to the old cluster. This method actually offers a fairly easy way to roll back to the old cluster in case of any serious failure (even after a few days or weeks), since all intermediate data continues to be available on the old cluster.

The Overall Plan

The overall process is to push all Workflow Engines to phase 1, then test and stabilize the system. We were able to run 70% (!) of our jobs in this phase. That’s 70% of our code, 70% of our integrations and APIs, and at least 70% of the problems you would experience in a real live move. We were able to fix issues, analyze system performance and validate results. Only once everything seemed to be working properly did we start pushing the groups to phase 2, one by one, into a tested, stable new cluster.

Once again we benefit from the incremental nature of the process. Each business group can be pushed into phase 2 independently of other groups, thus reducing risk and increasing our ability to debug and analyze issues. Additionally, each business group can start leveraging the new cluster’s capabilities (e.g. features from the newer version, or improved performance) immediately after it has moved to phase 2, and not after we have migrated every one of the ~1200 jobs to run on the new cluster. One pain point that can’t be ignored is that inter-group dependencies can make this a significantly more complicated feat, as you need to take into consideration the state of multiple groups when migrating.

What Did We Achieve?

  1. Incremental migration – Because we had an active-active migration that we could apply per business group, we benefited by mitigating risk and gaining value from the new system gradually.
  2. Reversible process – Since we kept all the old Workflow Engines (which executed their jobs on the old Hadoop cluster) in secondary execution mode, all intermediate data was still being processed and available in case we needed to revert groups independently of each other.
  3. Minimal impact on users – Since we defined an automated transition of jobs between secondary and leader modes, users didn’t need to duplicate any of their jobs.

What Now?

We have completed the upgrade and migration of our main cluster and have already started the migration of our DR cluster.

There are a lot more details and concerns to bring into account when migrating a production system at this scale. However, the basic abstractions we’ve introduced here, and the capabilities we’ve infused our systems with have equipped us with the tools to migrate elephants.

For more information about this project you can check out the video from Strata 2017 London where I discussed it in more detail.

Failure Testing for your private cloud – Introducing GomJabbar

TL;DR: Chaos drills can contribute a lot to your services’ resilience, and they’re actually quite a fun activity. We’ve built a tool called GomJabbar to help you run those drills.


Here at Outbrain we manage quite a large-scale deployment of hundreds of services/modules and thousands of hosts. We practice CI/CD, and have implemented quite a sound infrastructure, which we believe is scalable, performant, and resilient. We do, however, experience many production issues on a daily basis, just like any other large-scale organization. You simply can’t ensure a 100% fault-free system. Servers will crash, run out of disk space, and lose connectivity to the network. Software will experience bugs and erroneous conditions. Our job as software engineers is to anticipate these conditions and design our code to handle them gracefully.

For quite a long time we had been looking into ways of improving our resilience and validating our assumptions, using a tool like Netflix’s Chaos Monkey. We also wanted to make sure our alerting system actually triggers when things go wrong. The main problem we faced is that Chaos Monkey was designed to work with public cloud infrastructure, while we maintain our own private cloud.

The main motivation for developing such a tool is that failures have the tendency of occurring when you’re least prepared, and at the least desirable time, e.g. Friday night, when you’re out having a pint with your buddies. Now, to be honest with ourselves, when things fail at inconvenient times, we don’t always roll up our sleeves and dive in to look for the root cause. Many times the incident ends after a service restart, and once the alerts clear we forget about it.

Wouldn’t it be great if we could have “chaos drills”, where we could practice handling failures, test and validate our assumptions, and learn how to improve our infrastructure?

Chaos Drills at Outbrain

We built GomJabbar exactly for the reasons specified above. Once a week, at a well-known time in the middle of the day, we randomly select a few targets and trigger failures on them. At this point, the system should either auto-detect the failures and auto-heal, or bypass them. In some cases, alerts should be triggered to let teams know that manual intervention is required.

After each chaos drill we conduct a quick take-in session for each of the triggered failures, and ask ourselves the following questions:

  1. Did the system handle the failure case correctly?
  2. Was our alerting strategy effective?
  3. Did the team have the knowledge to handle, and troubleshoot the failure?
  4. Was the issue investigated thoroughly?

These take-ins lead to super valuable inputs, which we probably wouldn’t collect any other way.

How did we kick this off?

Before we started running the chaos drills, there were a lot of concerns about the value of such drills and the time they would require. Well, since eliminating our fear of production is one of the key goals of this activity, we had to take care of that first.

"I must not fear.
 Fear is the mind-killer.
 Fear is the little-death that brings total obliteration.
 I will face my fear.
 I will permit it to pass over me and through me.
 And when it has gone past I will turn the inner eye to see its path.
 Where the fear has gone there will be nothing. Only I will remain."

(Litany Against Fear - Frank Herbert - Dune)

We started a series of chats with the teams in order to understand what was bothering them, and found ways to mitigate their concerns. Here goes:

  • There’s an obvious need to avoid unnecessary damage.
    • We’ve created filters to ensure only approved targets get to participate in the drills.
      This has a side effect of pre-marking areas in the code we need to take care of.
    • We currently schedule drills via statuspage.io, so teams know when to be ready, and if the time is inappropriate,
      we reschedule.
    • When we introduce a new kind of fault, we let everybody know and explain what they should prepare for in advance.
  • We started out with minor faults like graceful shutdowns, continued to graceless shutdowns,
    and moved on to more interesting tests like faulty network emulation.
  • We’ve measured the time teams spent on these drills, and it turned out to be negligible.
    Most of the time was spent on preparations. For example, ensuring we have proper alerting,
    and correct resilience features in the clients.
    This is actually something you need to do anyway. At the end of the day, we heard no complaints about interruptions or wasted time.
  • We’ve made sure teams and engineers on call were not left on their own. We wanted everybody to learn
    from this drill, and when they weren’t sure how to proceed, we jumped in to help. It’s important
    to make everyone feel safe about this drill, and to remind everybody that we only want to learn and improve.

All that said, it’s important to remember that we basically simulate failures that occur on a daily basis anyway. It’s just that when we do it in a controlled manner, it’s easier to observe where our blind spots are, what knowledge we are lacking, and what we need to improve.

Our roadmap – What next?

  • Up until now, this drill has been executed semi-automatically. The next level is to let the teams run it on a fixed interval, at a well-known time.
  • Add new kinds of failures, like disk space issues, power failures, etc.
  • So far, we have only been brave enough to run this on applicative nodes, but there’s no reason to stop there. Data stores, load balancers, network switches, and the like are also on our radar for the near future.
  • Multi-target failure injection. For example, inject a failure to a percentage of the instances of some module in a random cluster. Yes, even a full cluster outage should be tested at some point, in case you were asking yourself.

The GomJabbar Internals

GomJabbar is basically an integration between a discovery system, a (fault) command execution scheduler, and your desired configuration. The configuration mostly contains the target filtering rules and the fault commands.

The fault commands are completely up to you. Out of the box we provide the following example commands (but you can really write your own scripts to do whatever suits your platform, needs, and architecture):

  • Graceful shutdowns of service instances.
  • Graceless shutdowns of service instances.
  • Faulty Network Emulation (high latency, and packet-loss).

Upon startup, GomJabbar drills down via the discovery system, fetches the clusters, modules, and their instances, and passes each through the filters provided in the configuration files. This process is also performed periodically. We currently support discovery via Consul, but adding other discovery methods is quite trivial.

When a user wishes to trigger faults, GomJabbar selects a random target and returns it to the user, along with a token that identifies that target. The user can then trigger one of the configured fault commands, or scripts, on the selected target. At this point, GomJabbar uses the configured CommandExecutor to execute the remote commands on the target hosts.
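To make the flow concrete, here is a deliberately simplified sketch of the selection-and-execution loop. The types and method names are illustrative stand-ins, not GomJabbar’s actual API:

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    // Illustrative sketch only; Target, Discovery and CommandExecutor are hypothetical stand-ins.
    class FaultInjectionSketch {
        record Target(String cluster, String module, String host) {}

        interface Discovery { List<Target> fetchTargets(); }                    // e.g. backed by Consul
        interface CommandExecutor { void run(Target target, String command); }  // e.g. remote command execution

        private final Discovery discovery;
        private final CommandExecutor executor;
        private final Predicate<Target> approvedTargets;  // the configuration's target filtering rules

        FaultInjectionSketch(Discovery discovery, CommandExecutor executor, Predicate<Target> approvedTargets) {
            this.discovery = discovery;
            this.executor = executor;
            this.approvedTargets = approvedTargets;
        }

        /** Pick a random approved target (assumes at least one candidate passes the filters). */
        Target pickRandomTarget() {
            List<Target> candidates = discovery.fetchTargets().stream()
                    .filter(approvedTargets)
                    .collect(Collectors.toList());
            return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
        }

        /** Trigger one of the configured fault commands (e.g. a graceless shutdown) on the target. */
        void triggerFault(Target target, String faultCommand) {
            executor.run(target, faultCommand);
        }
    }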

GomJabbar also maintains an audit log of all executions, which allows you to revert quickly in the face of a real production issue, or an unexpected catastrophe caused by the tool.

What have we learned so far?

If you’ve read this far, you may be asking yourself: what’s in it for me? What kind of lessons can I learn from these drills?

We’ve actually found and fixed many issues by running these drills, and here’s what we can share:

  1. We had broken monitoring and alerting around detecting the integrity of our production environment. We wanted to make sure that everything that runs in our data centers is managed and in a well-known state (version, health, etc.). We found that we didn’t properly compute the difference between the desired state and the actual state, due to reliance on bogus data sources. This sort of bug attacked us from two sides: once when we triggered graceful shutdowns, and once for graceless ones.
  2. We’ve found services that had no owner, became obsolete and were basically running unattended in production. The horror.
  3. During the faulty network emulations, we found that we had clients that didn’t implement proper resilience features, which caused cascading failures in consumers several layers up our service stack. We also noticed that in some cases the high latency cascaded as well. This was fixed by adding proper timeouts, double-dispatch, and circuit-breakers (a minimal sketch of that pattern follows this list).
  4. We’ve also found that these drills motivated developers to improve their knowledge about the metrics we expose, logs, and the troubleshooting tools we provide.
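For reference, here is a minimal, generic sketch of the timeout-plus-circuit-breaker idea from point 3. It is not the actual client code, and a production system would normally use a hardened library implementation rather than rolling its own:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    // Naive, illustrative circuit breaker wrapping a remote call with a hard timeout.
    // State updates are not thread-safe here; a real implementation must handle concurrency.
    class NaiveCircuitBreaker {
        private final ExecutorService pool = Executors.newCachedThreadPool();
        private final int failureThreshold;
        private final Duration openDuration;
        private int consecutiveFailures = 0;
        private Instant openUntil = Instant.MIN;

        NaiveCircuitBreaker(int failureThreshold, Duration openDuration) {
            this.failureThreshold = failureThreshold;
            this.openDuration = openDuration;
        }

        <T> T call(Callable<T> remoteCall, Duration timeout, T fallback) {
            if (Instant.now().isBefore(openUntil)) {
                return fallback;  // circuit is open: fail fast instead of piling up slow calls
            }
            Future<T> future = pool.submit(remoteCall);
            try {
                T result = future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);  // hard timeout
                consecutiveFailures = 0;
                return result;
            } catch (Exception e) {
                future.cancel(true);
                if (++consecutiveFailures >= failureThreshold) {
                    openUntil = Instant.now().plus(openDuration);  // trip the breaker for a cool-down period
                }
                return fallback;
            }
        }
    }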


We’ve found the chaos drills to be an incredibly useful technique, which helps us improve our resilience and integrity, while helping everybody learn about how things work. We’re by no means anywhere near perfection. We’re actually pretty sure we’ll find many many more issues we need to take care of. We’re hoping this exciting new tool will help us move to the next level, and we hope you find it useful too 😉

I WANT IT ALL – Go Hybrid

When I was a kid, my parents used to tell me that I can’t have my cake and eat it too. Actually, that’s a lie, they never said that. Still, it is something I hear parents say quite often. And not just parents. I meet the same phrase everywhere I go. People constantly take a firm, almost religious stance about choosing one thing over another: Mac vs PC, Android vs iOS, Chocolate vs Vanilla (obviously Chocolate!).

So I’d like to take a moment to take a different, more inclusive approach.

Forget Mac vs PC. Forget Chocolate vs Vanilla.

I don’t want to choose. I Want it all!

At Outbrain, the core of our compute infrastructure is based on bare metal servers. With a fleet of over 6,000 physical nodes spread across 3 data centers, we’ve learned over the years how to manage an efficient, tailored environment that caters to our unique needs – one of them being the processing and serving of over 250 billion personalized recommendations a month to over 550 million unique users.

Still, we cannot deny that the Cloud brings forth advantages that are hard to achieve in bare metal environments. And in the spirit of inclusiveness (and maximizing value), we want to leverage these advantages to complement and extend what we’ve already built. Whether it’s workloads that require a high level of elasticity, such as ad-hoc research projects involving large amounts of data, or simply external services that can increase our productivity, we’ve come to view Cloud solutions as supplemental to our tailored infrastructure rather than a replacement.

Over recent months, we’ve been experimenting with 3 different vectors involving the Cloud:


Elastic Capacity

Our world revolves around publications, especially news. As such, whenever a major news event occurs, we feel an immediate, potentially high impact. Users rush to publisher sites, where we are installed. They want their news, they want their recommendations, and they want them all now.

For example, when Carrie Fisher, AKA Princess Leia, passed away last December, we saw a 30% traffic increase on top of our usual peak traffic. That’s quite a spike.

Since we usually do not know when a breaking news event will occur, we are required to keep enough extra capacity to support such surges.

By leveraging the cloud, we can keep that extra capacity to a bare minimum, relying instead on the inherent elasticity of the cloud and provisioning only what we need, when we need it.

Doing this can improve the efficiency of our environment and cost model.

Ad-hoc Projects

A couple of months back, one of our researchers came up with an interesting behavioral hypothesis. For the discussion at hand, let’s say that it was “people who like chocolate are more likely to raise pet gerbils” (drop a comment with the word “gerbils” to let me know that you’ve read this far). That sounded interesting, but raised a challenge. To validate or disprove it, we needed to analyze over 600 terabytes of data.

We could have run it on our internal Hadoop environment, but that came with a not-so-trivial price tag. Not only would we have had to provision additional capacity in our Hadoop cluster to support the workload, we also anticipated that the analysis would impact existing workloads running in the cluster. And all this before getting into operational aspects such as labor and lead time.

Instead, we chose to upload the data into Google’s BigQuery. This gave us both shorter lead times for the setup and very nice performance. In addition, 3 months into the project, when the analysis was completed, we simply shut down the environment and were done with it. As simple as that!


Log Analysis

We use Fastly for dynamic content acceleration. Given the scale we mentioned, this has the side effect of generating about 15 terabytes of Fastly access logs each month. For us, there’s a lot of interesting information in those logs. And so, we had 3 alternatives when deciding how to analyze them:

  • SaaS-based log analysis vendors
  • An internal solution, based on the ELK stack
  • A cloud-based solution, based on BigQuery and DataStudio

After performing a PoC and running the numbers, we found that the BigQuery option – if done right – was the most effective for us, both in terms of cost and the amount of effort required.

There are challenges when designing and running a hybrid environment. For example, you have to make sure you have consolidated tools to manage both on-prem and Cloud resources. The predictability of your monthly cost isn’t as trivial as before (no one likes surprises there!), and controls around data can demand substantial investments… but none of that makes the fallback to “all Vanilla” or “all Chocolate” a good one. It just means that you need to be mindful and prepared to invest in tooling, education and processes.


I’d like to revisit my parents’ advice, and try to improve on it a bit (which I’m sure they won’t mind!):

Be curious. Check out what is out there. If you like what you see – try it out. At worst, you’ll learn something new. At best, you’ll have your cake… and eat it too.