January 15, 2020

Dori Shmuel

Upgrading your datacenter network with the push of a button

Intro

In this blog post, I am going to share with you the story of how we automated switch upgrades, to perform them at scale. First, however, I recommend reading one of our previous blog posts, to get a better understanding of our network environment: Switches, Penguins and One Bad Cable

To upgrade?

It all started with a PagerDuty call – a failed switch in one of our datacenters. One support ticket later, and our vendor says it’s time for a firmware upgrade. On all of our switches. We flinched. We’ve done this many times before and somehow, there were always edge cases, weird states, and other loose ends. Upgrading a single switch? It was a hassle we were used to, which thankfully didn’t happen too often. Frankly, not a big deal. But upgrading a fleet of 200? Uh…

Just to get you up to speed, our network environment is based on a Clos fabric, or more specifically, a leaf-spine topology. BGP is the dynamic routing protocol that glues it all together.

We’re using Cumulus Linux as the operating system of our switches, Chef to manage their configuration and Jenkins to automate provisioning. In case you haven’t read the article in the intro and you’re wondering what I mean, here’s a link to it again.

So let’s jump straight to it. How do we upgrade switch firmware at Outbrain? Simple. We don’t. We re-provision the switches instead. In case you’re not sure what I mean, I’ll clarify.

Switch Re-provisioning Process

Cumulus Linux is a Linux distribution designed for switches. As such, you can manage it using standard Linux tools, and you can also install it using standard Linux methodologies, such as PXE boot. Except that, in the case of switches, PXE boot is replaced with the Open Network Install Environment (ONIE).

Once a switch is instructed to re-provision, on the next boot the following things will happen:

1. The switch requests an IP address via DHCP
2. The DHCP server acks and responds with DHCP option 114 for the location of the installation image
3. The switch uses ONIE to download the Cumulus Linux disk image, installs, and reboots
4. Success! You are now running Cumulus Linux

By the end of this process, we have an updated firmware (OS) version running on the switch without any special configuration. Just like a brand new switch.

“But what about the config?” you might ask. Excellent question! In a word, ZTP. In three words, “Zero Touch Provisioning”:

1. The switch boots to a fresh Cumulus Linux version (see step 4 of ONIE above)
2. The management interface (eth0) is configured by default for DHCP
3. eth0 sends a DHCP request
4. The DHCP server returns a URL for a ZTP script
5. The ZTP script installs the Chef client, configures it and runs it
6. Chef-client runs our homemade run_list

Once the Chef client run completes successfully, the switch is fully configured with all the relevant configuration tailored for it. BGP sessions are established and the switch will be back in production.

Now that you have a grasp of the process, here’s what it actually takes to re-provision a switch:

onie-select -i -f && reboot

“onie-select” selects ONIE mode to use during the next boot
“-i” means “Install Boot”: at the next reboot, load ONIE in ‘install’ mode. This installs a new operating system, overwriting the current one
“-f” means “Force the operation”: assumes “yes” to any questions the script might ask
“&& reboot” will reboot the switch only if the previous onie-select command succeeded

15 minutes later, and the switch is back in the game, running an updated firmware and its own tailored config. Pretty cool, right? Now we just have to do that to 74 switches, which is what we have in a datacenter.

15 minutes times 74 is…

… a lot. Especially if you add some validations pre and post “re-provisioning”, which bring you to 30 minutes per switch. Multiplied by 74? 2220 minutes, which is 37 hours. That’s a lot of time to be babysitting switches. Not to mention failures, edge cases, missing a switch or two…

And that’s just one datacenter. Multiplied by 3, that’s 6660 minutes, which is 111 hours. So if you spend 8 hours every day doing just this, one switch at a time, with nothing failing at all, it would take you about 2 weeks of repetitive work to do the full upgrade.
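The arithmetic above can be reproduced with a few lines of Ruby. This is only a back-of-the-envelope sketch: the constant and method names are made up for illustration, and the figures are the ones quoted in the text.

```ruby
# Figures taken from the paragraphs above; names are illustrative only.
MINUTES_PER_SWITCH = 30 # re-provisioning plus pre/post validations
SWITCHES_PER_DC    = 74
DATACENTERS        = 3
WORK_HOURS_PER_DAY = 8

# Total hands-on hours when switches are handled one at a time.
def upgrade_hours(switches, minutes_per_switch = MINUTES_PER_SWITCH)
  switches * minutes_per_switch / 60.0
end

one_dc_hours = upgrade_hours(SWITCHES_PER_DC)
all_dc_hours = upgrade_hours(SWITCHES_PER_DC * DATACENTERS)
work_days    = (all_dc_hours / WORK_HOURS_PER_DAY).ceil

puts "One datacenter:  #{one_dc_hours} hours"
puts "All datacenters: #{all_dc_hours} hours"
puts "Days of work at #{WORK_HOURS_PER_DAY}h/day: #{work_days}"
```

With 8-hour days and zero failures, that comes out at 14 days of babysitting.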

There must be a better way. A smarter way. A faster way. A way that is:

Automated
Robust
Hands-free
Auditable

A better way

3 days later and we have something to show for it. Allow me to introduce you to our “Switch Upgrade Automation” process (notice that “has a catchy name” wasn’t part of the requirements).

The “switch upgrade automation” process is a Ruby-based program running inside a docker container. It is built as a Jenkins job which is executed periodically. The program covers all the necessary validations to pick a valid candidate switch, re-provision it and make sure the switch is ready to be put back into production.

Diving into the process, each Jenkins build will spawn a docker container, getting the following arguments (there are actually more, but we’ve chosen the main ones):

Datacenter name
Switch type: leaf or spine
Switch OS version to upgrade from
The 1st or 2nd switch in the rack

Automated and Robust

We first check if there is already a running build, as we don’t want two switches re-provisioning at the same time. We took the conservative approach and decided to run the process serially. Also, we don’t want to move on to the next switch if the previous one failed, as there may be a problem with the process, environment or switch itself.

Next, we build a dynamic Prometheus query that returns a list of all switches running the OS version we’re upgrading from. Using this list we choose a random switch and trigger yet another Prometheus query. The additional query allows us to retrieve the following information about the candidate switch:

Global health status – a metric built out of many local tests running on the switch, including hardware sensors, ztp_exit_status, services status, and PTM
BGP peer status – number of established BGP peers
Number of Active links

Using these values we can then build a data structure containing the switch name, the metric values, and the validation phase: the pre-upgrade status. Remember this structure; we’ll come back to it later.

switch_list.each do |switch|
  puts "Candidate switch for upgrade: #{switch}"

  switch_data_structure = {
    switch => {
      'pre' => {},
      'post' => {}
    }
  }

  switch_data = fetch_switch_data(switch_data_structure, switch, switch_type, 'pre')
  ...

We now have the pre-upgrade status of the candidate switch. However, this is not enough. We also need to validate the functionality of the candidate’s sibling switch (the other switch in the rack) before taking any action. We run the same Prometheus query again to retrieve the metrics we mentioned above, but this time for the sibling switch. The name of the sibling switch and its metric values are inserted into the pre-upgrade status.

  sibling_switches = get_sibling_switches(switch, switch_type, switch_list)

  sibling_switches.each do |sibling|
    sibling_data_structure = {
      sibling => {
        'pre' => {},
        'post' => {}
      }
    }
    sibling_data = fetch_switch_data(sibling_data_structure, sibling, switch_type, 'pre')
    switch_data.merge!(sibling_data)
  end
  ...

Compare the candidate switch with its sibling: if they match, or the upgrade candidate has fewer active links / BGP neighbors than its sibling (amongst other metrics), we’re good to go – we know we won’t lose connectivity to any of the machines in the rack. This gives us the confidence to run the re-provisioning process while there are active workloads in the rack.

Cross-checking switch metrics with sibling switch: leaf-r32-p1-pod1 <> leaf-r32-p2-pod1

leaf-r32-p1-pod1: ----> {"bgp_peer_status"=>"32","interface_link_status"=>"37","health_global_status"=>"0"}
leaf-r32-p2-pod1: ----> {"bgp_peer_status"=>"32","interface_link_status"=>"37","health_global_status"=>"0"}
Status: Pre upgrade validation passed!

Sending changelog for switch leaf-r32-p1-pod1
Sending leaf-r32-p1-pod1 to Re-provisioning, command is '/root/run_switch_provision.sh -F'

And if validation fails? The process skips this candidate switch and elects a new candidate instead.
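To make the cross-check concrete, here is a minimal sketch of the comparison. It is not our production code: the method name is hypothetical, only the three metrics from the log above are considered, and it assumes that a health_global_status of 0 means "healthy".

```ruby
# Metrics where a higher count means more connectivity.
COUNT_METRICS = %w[bgp_peer_status interface_link_status].freeze

# True when the sibling is healthy and carries at least as many BGP peers
# and active links as the candidate, so the rack keeps its connectivity
# while the candidate is down.
def sibling_can_cover?(candidate, sibling)
  # Assumption: a global health status of 0 means "healthy".
  return false unless sibling.fetch('health_global_status').to_i.zero?

  COUNT_METRICS.all? do |metric|
    candidate.fetch(metric).to_i <= sibling.fetch(metric).to_i
  end
end

candidate = { 'bgp_peer_status' => '32', 'interface_link_status' => '37', 'health_global_status' => '0' }
sibling   = { 'bgp_peer_status' => '32', 'interface_link_status' => '37', 'health_global_status' => '0' }
puts sibling_can_cover?(candidate, sibling) ? 'Pre upgrade validation passed!' : 'Skipping candidate'
```

Comparing with "less than or equal" rather than strict equality is what lets a partially degraded candidate still be upgraded, as long as its sibling is the stronger of the two.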

Hands-free and auditable

Just before rebooting our candidate switch, we silence alerts, so as to not disrupt our on-call while the process runs in the background. Also, to make sure we know “what happened, when and why”, we use our in-house changelog mechanism to fire a “machinelog” event for future auditing. What are these changelog and machinelog things? They’re another blog post waiting to be written.

With all pre-provisioning validations, alert masking and audit events out of the way, it’s showtime! We trigger the Switch Re-provisioning Process for our elected candidate switch and send it for a reboot.

Once rebooted, we monitor whether the switch accepts connections on TCP port 9100, which is the Prometheus node exporter port. Why? Because we’re using Chef to install the Prometheus node exporter, which gives us a good indication that Chef (which is the final provisioning step) completed successfully. In addition, node exporter is in charge of exposing switch metrics, on which we rely for post-provisioning validation. Once we manage to establish a connection, we know the switch is up again, running the new OS.
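The port check itself needs nothing beyond Ruby's standard library. The sketch below shows one way to poll for it; the method names, retry count and interval are assumptions for illustration, not the values we run with.

```ruby
require 'socket'

# True if a TCP connection to host:port can be established.
def port_open?(host, port, connect_timeout: 2)
  Socket.tcp(host, port, connect_timeout: connect_timeout) { true }
rescue SystemCallError, SocketError
  false
end

# Polls until the port accepts connections or we run out of attempts.
def wait_for_port(host, port, attempts: 60, interval: 15)
  attempts.times do
    return true if port_open?(host, port)

    sleep interval
  end
  false
end

# Example (hypothetical host name): node exporter listens on 9100.
# wait_for_port('leaf-r32-p1-pod1', 9100)
```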

Last but not least, we need to validate that our switch is in the same state as it was prior to re-provisioning. We want to make sure we haven’t lost any links, peers, etc. How do we do that?

Remember the pre-upgrade status we saved when we started the validation process? It’s time to get the post-upgrade status and compare the two. Basically, it means running the same query as before, building the data structure and running a diff. If all is well, the process completed successfully and we can move on to the next candidate. If not, we abort and notify a team member that something went wrong.
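As a rough illustration of that diff, the sketch below walks a structure shaped like the one built earlier (switch name, then 'pre' and 'post' hashes) and reports every metric whose value changed. The method name is hypothetical and the real program may compare things differently.

```ruby
# Returns { switch => { metric => { 'pre' => x, 'post' => y } } } for every
# metric that differs between the two phases. An empty hash means the
# switch and its sibling came back in the same state.
def changed_metrics(switch_data)
  switch_data.each_with_object({}) do |(switch, phases), changes|
    pre  = phases.fetch('pre')
    post = phases.fetch('post')
    changed = (pre.keys | post.keys).reject { |metric| pre[metric] == post[metric] }
    next if changed.empty?

    changes[switch] = changed.to_h do |metric|
      [metric, { 'pre' => pre[metric], 'post' => post[metric] }]
    end
  end
end
```

An empty result lets the job move on to the next candidate; anything else is a reason to abort and page a human.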

Retrieving switch metrics: leaf-r32-p1-pod1 Upgrade_phase: post
Prometheus query: count(fabric_switch_bgp_peer_status{instance=~"leaf-r32-p1-pod1"}==0) - Prometheus response: 2
Prometheus query: count(fabric_switch_bgp_peer_status{instance=~"leaf-r32-p2-pod1"}==0) - Prometheus response: 2

Prometheus query: count(fabric_switch_interface_link_status{instance=~"leaf-r32-p1-pod1"}==2) - Prometheus response: 7
Prometheus query: count(fabric_switch_interface_link_status{instance=~"leaf-r32-p2-pod1"}==2) - Prometheus response: 7

Prometheus query: fabric_switch_health_global_status{instance=~"leaf-r32-p1-pod1"}==0 - Prometheus response: 0
Prometheus query: fabric_switch_health_global_status{instance=~"leaf-r32-p2-pod1"}==0 - Prometheus response: 0

Data for switch: leaf-r32-p1-pod1.nydc1.outbrain.com was validated SUCCESSFULY after upgrade
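The query strings in the log above follow a fixed pattern per metric, so they are easy to generate. A minimal sketch, assuming a template table and a method name of our own invention (the metric names and comparison values are copied from the log):

```ruby
# One PromQL template per metric, copied from the log output above.
QUERY_TEMPLATES = {
  'bgp_peer_status' =>
    'count(fabric_switch_bgp_peer_status{instance=~"%<switch>s"}==0)',
  'interface_link_status' =>
    'count(fabric_switch_interface_link_status{instance=~"%<switch>s"}==2)',
  'health_global_status' =>
    'fabric_switch_health_global_status{instance=~"%<switch>s"}==0'
}.freeze

# Returns { metric_name => query_string } for a single switch.
def build_queries(switch)
  QUERY_TEMPLATES.transform_values { |template| format(template, switch: switch) }
end

build_queries('leaf-r32-p1-pod1').each_value { |query| puts "Prometheus query: #{query}" }
```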

So what just happened?

If you’ve read carefully so far, you might notice that we haven’t gone into time savings in this whole process. We chose not to parallelise re-provisioning at this time, to make sure our blast radius, in case of failure, is limited to a single rack. However, we removed the human factor from the process, which means it can run the full 2 weeks in the background and let us know when things are done (or not working as expected). This, in turn, frees us to deal with the more important aspects of running a production system at scale.

December 3, 2019

Or Gerson

Frustration, Immortal TEZ Queries And Hope

My name is Or Gerson, and I am part of the DataInfra team in Outbrain.
DataInfra is responsible for a job scheduling system that lets teams define jobs that run queries on a Hadoop cluster hosting about 2 PB of data.
Sometimes these queries can be inefficient – resulting in high processing time and extra load on the cluster.
Therefore we define a timeout limit for query execution. The idea is to kill the query when the timeout occurs – however, things are not so simple…

Kill them! Close their resource! Let none escape!

We use Apache DBCP as our connection pool library and the Hive2 driver to submit queries to Hive: a pretty simple setup.
Apache DBCP provides connection pool services and is generally recommended as a connector to many relational databases.

Queries are run using Spring JdbcTemplate, but the underlying connections are managed by Apache DBCP.

These queries have a specific timeout set by the calling thread.
When the timeout occurs, the calling thread shuts down the async executor and calls the close() function on a “javax.sql.DataSource” object:

dataSource.close()

This works well when using the MapR engine, but when using TEZ the job lives on in the cluster, even after the VM is gone.

Seems like a common problem.
To my surprise, my online search found this to be a recurring issue, but without a good solution.

I started drilling down into the “org.apache.commons.dbcp.PoolableConnection” class to understand the problem better.

Shouldn’t closing the datasource be enough?

Under the DBCP and JDBC abstractions I found “org.apache.hive.jdbc.HiveStatement” which uses a thrift client to execute operations on Hive.

When a query was submitted using MapR and reached a timeout, the running job recognized that its handler had died and closed, resulting in “Query Cancelled” status on the waiting thread in HiveStatement.

TEZ did not recognize that its handling connection died, and continued to finish its run (even though the GC had definitely cleared this object).

Moreover, I found out that closing the datasource (using the close() method) did not call the close() method on the connection object.
Searching the APIs in “org.apache.commons.dbcp” revealed that it did not expose its connection pool or the objects borrowed from it, so I had no way of interacting with them.

Now What?

First I needed to verify that I could actually close the connection from the client, without interacting directly with the resource manager running the TEZ job (Yarn).
Luckily, I found that the “HiveStatement” class behaves well: when its close() method is called, the thrift session closes and does indeed kill the TEZ job.

I decided to create a data source that will keep references to the connections being used, allowing it to close them.

https://gist.github.com/kazabubu/5ff8d6f5faddb01dce9f93fd98f19458

Now our datasource keeps references to its connections, and all we need to do is call the “register” and “unregister” methods, which are defined by a simple “ConnectionRegister” interface.

https://gist.github.com/kazabubu/ffcdc774307abf810c0d3d80886e4f62

After implementing this solution, queries using TEZ die on timeout: we explicitly close the open connection, which delegates the close to the thrift layer.
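The actual implementation is the Java code in the gists above. Purely to illustrate the idea, here is a minimal, language-agnostic sketch written in Ruby: a data source that remembers every connection it hands out, so that closing the data source also closes whatever is still in flight. The class and method names (TrackingDataSource, borrow, release) are hypothetical and do not mirror the DBCP or Hive APIs.

```ruby
require 'set'

# Hands out connections and remembers which ones are still in use.
class TrackingDataSource
  def initialize(&connection_factory)
    @connection_factory = connection_factory
    @in_use = Set.new.compare_by_identity
    @lock = Mutex.new
  end

  # "register": create a connection and keep a reference to it.
  def borrow
    conn = @connection_factory.call
    @lock.synchronize { @in_use << conn }
    conn
  end

  # "unregister": forget the connection once the caller is done with it.
  def release(conn)
    @lock.synchronize { @in_use.delete(conn) }
    conn.close
  end

  # Closes every connection that is still registered, e.g. on timeout.
  # Returns the number of connections that were closed.
  def close
    pending = @lock.synchronize { @in_use.to_a.tap { @in_use.clear } }
    pending.each(&:close)
    pending.size
  end
end
```

The point of the registry is that close() on the data source now reaches the connection, and through it the statement and the thrift session, which is what actually stops the job.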

November 11, 2019

Avi Youkhananov

Oh my Guava! We are moving to Caffeine.

Caching is extremely important! It provides fast response time, enabling effortless performance improvements in certain use cases.
At Outbrain, we have recently moved to Caffeine caching, after having used Guava in-memory caching for many years.

Background

The Caffeine library is a rewrite of Guava’s cache. It offers a Guava-inspired API that can return CompletableFutures, allowing asynchronous automatic loading of entries into a cache. The library was written by Ben Manes, who is also the author of ConcurrentLinkedHashMap, on which Guava’s cache is based.

Guava OUT

Guava blocks during loading when a key is not present in the cache.
We wanted to change the API to work asynchronously, but the added complexity made the code difficult to understand and troubleshoot.
To make Guava non-blocking, we overrode the load and loadAll methods of CacheLoader to make the API return a future.
Since we changed the API to return a future, we encountered a problem. Guava is unaware that we use it to store futures, so it stores futures completed with an exception as well. To prevent the exceptions from being stored in the cache, we needed to add more complexity to our code.

Caffeine IN

Caffeine is a rewrite of Guava’s cache that uses an API that returns CompletableFutures out of the box, allowing asynchronous automatic loading of entries into a cache.
Caffeine removes futures that complete with an exception from the cache.
Caffeine uses both a Least Recently Used (LRU) eviction policy and a frequency-based admission policy relying on CountMin sketch. It has a better hit rate than LRU for many workloads.
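To show why the "don't keep failed futures" behaviour matters, here is a toy model of it. It is not Caffeine and not Java: it is a small Ruby sketch in which a Thread stands in for a future, and the class name AsyncCache is made up. A failed load removes its own entry, so the next lookup retries instead of replaying the cached exception.

```ruby
# Toy async cache: get returns immediately with a Thread acting as a future.
class AsyncCache
  def initialize(&loader)
    @loader = loader
    @entries = {}
    @lock = Mutex.new
  end

  # Call #value on the returned Thread to wait for the loaded result.
  def get(key)
    @lock.synchronize { @entries[key] ||= start_load(key) }
  end

  def cached?(key)
    @lock.synchronize { @entries.key?(key) }
  end

  private

  def start_load(key)
    Thread.new do
      Thread.current.report_on_exception = false
      begin
        @loader.call(key)
      rescue StandardError
        # Drop the failed entry so a later get triggers a fresh load.
        @lock.synchronize { @entries.delete(key) if @entries[key].equal?(Thread.current) }
        raise
      end
    end
  end
end
```

With a plain "store whatever the loader returned" cache, which is effectively what we had built on top of Guava, that rescue branch is the extra code you end up writing and maintaining yourself.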

Guava’s hit-rate benchmark vs Caffeine’s

We analyzed service behavior with real traffic under different cache configurations to get an idea of how production services will behave.

The benchmarked cache contains image URLs keyed by UUID and follows an access pattern of most-frequent / least-frequent data, which is our most common use case at Outbrain.

In the benchmarks below, we did not measure memory usage, so we draw no conclusions about it. But analyzing hit rates leads to some interesting insights.

1. Cache size: 10k items
Expiration after write: 5 min

Hit rate with 10k items:

Caffeine 28.33%
Guava 20.95%

2. Cache size: 50k items
Expiration after write: 20 min

Hit rate with 50k items:

Caffeine 56.04%
Guava 50.01%

3. Cache size: 100k items
Expiration after write: 20 min

Hit rate with 100k items:

Caffeine 70.10%
Guava 66.77%

4. Cache size: 300k items
Expiration after write: 20 min

Hit rate with 300k items:

Caffeine 87.19%
Guava 84.85%
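To make the four results easier to compare, the short sketch below computes Caffeine's advantage in percentage points for each cache size. The numbers are the ones listed above; the constant and variable names are just for illustration.

```ruby
# Hit rates (in percent) from the four benchmark runs above.
HIT_RATES = [
  { items: 10_000,  caffeine: 28.33, guava: 20.95 },
  { items: 50_000,  caffeine: 56.04, guava: 50.01 },
  { items: 100_000, caffeine: 70.10, guava: 66.77 },
  { items: 300_000, caffeine: 87.19, guava: 84.85 }
].freeze

# Caffeine's lead over Guava, in percentage points, per cache size.
gains = HIT_RATES.map { |row| (row[:caffeine] - row[:guava]).round(2) }

HIT_RATES.zip(gains).each do |row, gain|
  puts "#{row[:items].to_s.rjust(7)} items: Caffeine ahead by #{gain} points"
end
```

The lead shrinks from 7.38 points at 10k items to 2.34 points at 300k items: the smaller the cache, the more the smarter admission policy pays off.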

Conclusion

We did it! With minor changes to infrastructure code (as the Caffeine and Guava APIs are almost identical), we improved our hit rate and reduced code complexity for critical services in our system.

Honestly, Caffeine smells better than Guava.

November 5, 2019

Alex Balk

Why we’re not using Kubernetes (kind of)

Intro

Kubernetes is the best thing since sliced bread. Everyone is talking about its internal parts, how to use it, best practices and the latest and greatest supporting tools for it.

This isn’t a story about how great Kubernetes is (yes, okay, it’s great). It’s about our journey into the realm of large scale deployments, and why we’re not using Kubernetes. Okay, we are, but we hide it.

This blog post was put together by Shahaf Sages, Dafna Frank and Alex Balk.

In the beginning

A long time ago in the early 2000s, in a startup far-far away, there was a lone Java developer. The developer had a mission. It was an ambitious mission, one that was not for the faint of heart. His mission was to write the most beautiful Java code ever written, and make some sweet startup money along the way.

Actually, this developer wasn’t alone. And frankly, his code wasn’t that good, but it worked, and he, and his fellow “lone Java developers” needed to get this code to production. And so they did, using the tools they knew best: copy & paste… but in the Linux variant, called “scp”. And the sweet startup money flowed. Or at least trickled a bit.

It’s good, but is it good enough?

This worked well for a while, but not very well or for very long. Fairly quickly, problems started showing up. Problems that needed more code to be written, uploaded and managed on production machines. Which too started to add up. Version control, which was so common in the process of writing code, was desperately needed in the process of deployment. And rollbacks, because the beautiful code was sometimes moody. And some way to keep tabs on these (also moody) machines where the code was running. Because sometimes they had an annoying tendency to die or just stop responding.

And so it was decided by the great powers of infrastructure development that be. There will be no more “scp”ing, as it was declared to be manual and thus the root of all evil. Instead, a system shall be born – a deployment system, with artifact version control, build management, rollbacks, progress bars, a model to account for all machines and services, and access control to grant permissions to the lucky few who shall unleash their code onto production. And it was dubbed GluFeeder, for it was based on the Open Source Glu framework from LinkedIn which was state of the art at the time. And it was GOOD.

It works so well, why touch it?

Until, 8 years later, it wasn’t all that good anymore. But that took a long time, and everyone got used to it and it hid away many problems. So if it works don’t touch it, right?

Maybe not. Problems were abundant:

The small (not so much) startup ran on physical machines, and each service had an entire machine of its own, which meant quite a bit of waste
The small (not so much) startup was moving towards microservices because they’re cool and scalable and async and shiny, so said waste was about to get out of hand because there were now a lot of services
The moody physical machines were still moody and whenever they decided they didn’t want to work anymore, all the services running on them just died in nasty, nasty ways
The model for describing “what service runs where” was handled in one big yaml file and all the developers had their sticky fingers touching it directly, with no validation
Adding more service instances was really really REALLY painful because the new physical machines that were ordered took a long, long time to arrive at the (not so small) startup’s datacenters

And so it was decided by the great powers of infrastructure development that be. There will be no more GluFeeding because it worked but wasn’t “10x scale”. Instead, a system shall be born – a deployment system, with:

Support for containers, because the Java developers were making friends with JavaScript and Python and Go developers
A per-service model stored in a real database with all the metadata goodness that describes how to build, run and operate micro, macro or mega services
A resource management system, the best Open Source had to offer, to manage the moody server resources and ensure code was running even if servers went away – the almighty Kubernetes
Orchestration of many, many resource managers (Kubernetes clusters), across many, many datacenters, or at least 3
A well defined contract between the services and the environment they run in, so that everyone gets their metrics, logs, environment variables and properties in the format and flavor they prefer, without dealing with the gory, gruesome details

A poll was run and the people voted. The name that was chosen was Dyploma (DYnamic dePLOyment MAnagement system), which only shows that democracy doesn’t always work very well.

One small step for dev, one giant leap for devops

In the course of a year and a half, the infrastructure developers and the Java early adopter developers met every week to present, discuss and test what was being developed. A Python CLI was chosen as the initial interface for the system, as it was quick to develop and required little UX skills. Little by little the features were added, tweaked and tuned. And little by little confidence was built in the new system and the great Kubernetes beast which it controlled through the scriptures of the fabric8 java client. Until the Python CLI was no longer enough and a Vue.js Web UI was added instead.

Great care was taken in the design to ensure only metadata was kept within Dyploma, so as not to contaminate it with duplicate state of the Kubernetes beasts. And so Dyploma was lean on data and mostly just passed orders to Kubernetes, Prometheus, Consul, Jenkins, TeamCity, Bitbucket and anyone else it could boss around. The system would let the developer:

Describe the service’s endpoints
Provide special build parameters
Set runtime information, such as environment, cluster, number of instances
Build, deploy, scale up/down and disable with one click
View what’s running, where, how much, why and who gave the order
And just plain hide away all of the underlying systems details because the developers were lazy and spoiled and we LOVE them that way

No, really, let’s unfold that last statement

The developers didn’t need to know anything about making the underlying infrastructure work. They just got it all for free. Gift wrapped with a nice web UI. Once they had their service defined, all they had to do was decide how much, which version and where, and pay the bill (okay, not yet, but it’s coming). There were no yaml files, no Helm charts, no configMaps or anyone called Jason. There was a single place to view the runtime status of a service, its history of changes, its logs, graphs and a single place to control it all, which even had batch operations but that’s just showing off.

And so simplicity was restored, velocity was increased, stability was a welcome side-effect and much cost was saved through better resource utilisation and reuse of aging moody machine hardware. And it was GOOD.

Until, one day, a lone Java developer had an idea.

“Why don’t we ditch Dyploma and use Kubernetes instead?”, he said. “I’ve read that it’s the best thing since sliced bread.” And the infrastructure developers just stared.

Show me the money

Every Kubernetes blog post that respects itself shows off some numbers and yaml files.

We have no yaml files to show you, but we do respect ourselves, so here are some numbers:

GluFeeder

Machines managed: 2000

Services running on the managed machines: 2500

Unique types of services: 150

Dyploma

Machines managed: 1600

Services running on the managed machines: 7500

Unique types of services: 400

Kubernetes deployments: 2300

EPILOGUE

Since you’ve read up to here, we’ll assume that you’re interested in getting some insights around building a deployment system (vs yet more Kubernetes tips & tricks), so we’ll give you some of our inputs:

We wanted flexibility AND simplicity. This isn’t cheap. If you want developers to “just write code”, you’ll have to have other developers “just write infra”.
Deployment systems are built for users. Bring the users onboard for the ride if you’re building one.
Use the terminology of the system you’re relying on. If it’s called a pod, call it a pod. Don’t call it a FLDSMDFR. We called a “deployment” a “service”. Don’t.
Kubernetes is complicated. So is your runtime context (at least when you’re big). The challenge is in using the former to contain the latter, while simplifying it for the users. This means that the user should be able to say the absolute minimum and get sane defaults, but also be able to override everything in the runtime without having to speak any Kubernetes. Simple, right?
Protect yourself. People will make mistakes. They will put the SVN version number in the replicas field. And you will cry.
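Two of these points, sane defaults and protecting yourself, can be sketched in a few lines. The example below is an illustration only: the default values, the upper bound and the method name are invented for this post and are not how Dyploma is implemented.

```ruby
# Hypothetical defaults a user gets without saying anything.
DEPLOYMENT_DEFAULTS = { 'replicas' => 2, 'environment' => 'test' }.freeze
# Hypothetical guard rail against "SVN revision in the replicas field".
MAX_REPLICAS = 200

# Merges user overrides onto the defaults and rejects obviously bad input.
def resolve_deployment(overrides = {})
  spec = DEPLOYMENT_DEFAULTS.merge(overrides)
  replicas = spec['replicas']
  unless replicas.is_a?(Integer) && replicas.between?(1, MAX_REPLICAS)
    raise ArgumentError,
          "replicas must be an integer between 1 and #{MAX_REPLICAS}, got #{replicas.inspect}"
  end
  spec
end

puts resolve_deployment('environment' => 'prod')
```

The user says the minimum, can override anything, and a fat-fingered five-digit replica count is rejected before it ever reaches a cluster.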

Looking forward, this is what we’re working on these days:

Horizontal autoscaling. Because after you’re done migrating, you start optimising. And you want it simple enough for devs to “click here”.
Deployment A/B testing. All the levers are there, but you have to build them into a usable tool.
Requests, limits and “let me set that for you”. Because they don’t necessarily mean what you think they mean.
Jobs. Because crons are running wild and it doesn’t hurt us now, but only because we’re not looking.
Exposing cost to owners. Because nothing is free, not even your own bare metal.
Open Source. Because the world needs this.

September 18, 2019

Davorin Kopic

Sharing Data Science Knowledge and Experience

Two years ago Outbrain and Zemanta joined forces. This, among many other great things, also resulted in big bi-directional knowledge exchange between Zemanta’s Data Science and Outbrain’s Recommendations teams. Inevitably, a lot of important progress has been made in both our algorithms and understanding. And of course, we still haven’t exhausted the huge backlog of new ideas which are sprouting from our discussions.

At Outbrain and Zemanta we know how important internal sharing of knowledge and experience is – but we also believe it is crucial to share knowledge with the wider community. And in our aspirations to do so, among other projects, we also started Zemanta’s Data Science Summer School.

The second annual Data Science Summer School

In July we hosted the second annual Data Science Summer School in Zemanta’s office in Ljubljana, Slovenia. Among many applicants, we selected a group of very promising young professionals and students and invited them to join us for a week of data science-flavored activities where they learned how we apply data science and machine learning in this data-rich industry.

The structure of the summer school

The week-long curriculum was set to be very practical and hands-on, with theoretical lectures intertwined. Participants first learned about the tools and techniques we use day-to-day as data scientists in the industry. They learned how to use git for version control, set up Python environments correctly, and use Python libraries such as numpy and pandas for crunching data, matplotlib for visualization and scikit-learn to build some basic predictors.

Then, after setting up their environments, they got their feet wet by participating in a Kaggle challenge. Some participants had already participated in Kaggle challenges before so they shared their experiences and know-how, and for some, it was their first time so they tried to sponge up as much information as they could.

Finally, we provided them with a massive real dataset extracted from production, on which they had a chance to build their own predictors for estimating probabilities of clicks (CTR). After careful examination and analysis of 50+ provided features, they had an opportunity to use a tool of their choice to make predictions – some explored scikit-learn in more detail, while others chose various libraries like XGBoost for gradient boosted trees, XLearn for factorization machines or TensorFlow for neural networks. Finally, all teams presented their work and shared the gained knowledge.

Mixed in between hands-on experimentation they participated in many interesting talks and discussions on topics ranging from how programmatic advertising works, what is real-time bidding, theory powering auctions and what kind of algorithms and systems we are developing at Zemanta; all the way to data analysis, deploying machine learning models to production and some of our real-life scenarios and stories.

What the participants had to say

After successfully completing the week-long curriculum, participants received their certificates and filled in anonymous feedback forms, saying things like “Great way to spend a week – the atmosphere was excellent!”, “The talks were especially interesting since they give a nice insight into the company”, “Working on real data gave me the opportunity to experience first-hand the problems data scientists are working on” – so we can say with great certainty the participants learned a lot and had tons of fun doing it.

Conclusion

This was the second iteration of Zemanta’s Data Science Summer School in our Ljubljana office in Slovenia. Mentors Robert, Luka, and Anže and I had a great time sharing knowledge with the students, who gained important insights into the processes behind applying data science and machine learning to solve real problems in the industry, so we are very excited to host more such events in the future.

Davorin Kopic
Head of Data Science @ Zemanta, an Outbrain Company

September 2, 2019

Doug Chimento

The First Rule of Distributed Tracing Is…

Distributed Tracing

Distributed Tracing is a mechanism to collect individual requests across micro-services boundaries. It also enables instrumentation of application latency (how long each request took), tracking of the life-cycle of network calls (HTTP, RPC, etc.) and identification of performance issues by giving visibility into bottlenecks. Plenty of articles on the inter-webs describe various implementations of Distributed Tracing. This blog will focus on our experience rolling out tracing into our ecosystem, with tales of what not to do.

What we need to solve

Outbrain has hundreds of micro-services running on top of Kubernetes. We have metrics via Prometheus and the Netflix Hystrix dashboard to monitor overall health, but we lack visibility into business requests. For example, when a user clicks an ad, we want to see the exact flow. In particular, we may want to see the bid request/response to our partners for performance or fraud reasons. But we don’t have something that allows us to examine an individual request. This is what tracing solves for us: the ability to capture exact business logic for a specific request. Tracing can answer questions such as:

Did I hit cache or go to a datastore? What services were used for this specific business logic request?

We implemented tracing with Jaeger to visualize and store our traces:

The First Rule

The first rule of tracing is no one should know about tracing. Tracing should be behind a magical curtain of abstraction, all without developers’ knowledge. Depending on your organization’s development practices and usage of core infrastructure (e.g. HTTP clients, RPC framework, database drivers, etc.), this may be difficult. But if the majority of your organization uses one or two HTTP clients (e.g. OkHttp, Apache HTTP), it is rather trivial to instrument those clients with tracing capabilities without much developer intervention. By instrumenting only HTTP clients and servers you will get a huge benefit from tracing. Think of the 80/20 rule when you plan your tracing implementation.

What not to do

Conceptually, tracing is easy to understand. Operationally, it can turn into a mess. There is an overhead to tracing, albeit a small one, but you still don’t want to get OutOfMemory errors because of tracing. So don’t start by tracing everything!

Besides, if you start tracing everything, what are you going to do with all that data? Depending on your request load this could be huge. Sure, you solved tracing your micro-services, but you created a massive amount of data (and hardware) that probably no one is going to look at. In the beginning, keep your operational costs to a bare minimum. You could start by dumping traces into your existing logging infrastructure (e.g. put the trace ID as MDC context in all logging statements). No extra hardware (collectors) required! Like designing software, start small and increment your changes over time.
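To illustrate the "trace ID in every log line" idea, here is a minimal sketch in Python rather than the Java MDC mentioned above. It assumes a hypothetical `start_trace` helper that is called once per request; the names `TraceIdFilter`, `start_trace` and `build_logger` are illustrative and not part of our tracing code.

```python
import contextvars
import logging
import uuid

# Holds the trace ID of the request currently being handled ("-" when none).
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True


def start_trace(trace_id=None):
    """Bind a trace ID (new, or inherited from the caller) to this context."""
    trace_id = trace_id or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def build_logger(name, stream=None):
    """Logger whose output lines all start with the current trace ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, searching the existing log store for one trace ID returns every statement logged while that request was being handled, without any dedicated collectors.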

Slow and steady wins the race

Begin your tracing quest by providing a mechanism to “turn on” tracing manually. Suppose you have a REST API: a special request header, say “X-AdHocTrace”, can tell the service that tracing has been requested. This enables tracing on all downstream services as well. We use Swagger at Outbrain and added the header automatically:

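window.onload = function() {
  // Build the Swagger UI so that every request it sends asks for ad-hoc tracing
  const ui = SwaggerUIBundle({
    url: "../api/swagger/apiDocs",
    // requestInterceptor is called with each request before it is sent
    requestInterceptor: function(request) {
      request.headers['X-AdHocTrace'] = "true";
      return request;
    }
    // other SwaggerUIBundle options omitted
  });
};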

When developers are testing their services (locally or remotely) they automatically get all their calls traced when using swagger.
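On the service side, the header check and its propagation to downstream calls could look roughly like the sketch below. It is an illustration only: the header name comes from the example above, while `tracing_requested` and `downstream_headers` are hypothetical helpers, and plain dictionaries stand in for whatever request objects your HTTP framework provides.

```python
ADHOC_HEADER = "X-AdHocTrace"


def tracing_requested(headers):
    """True when the incoming request carries the ad-hoc tracing header."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get(ADHOC_HEADER.lower(), "").strip().lower() == "true"


def downstream_headers(incoming, outgoing=None):
    """Headers for a downstream call, propagating the flag when it was set."""
    result = dict(outgoing or {})
    if tracing_requested(incoming):
        result[ADHOC_HEADER] = "true"
    return result
```

Doing this once inside the shared HTTP client keeps the first rule intact: individual services never have to know the header exists.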

Over time you can add more sophisticated sampling mechanisms. For instance, you could sample all requests that result in a 500 HTTP status code.

August 11, 2019

Daria Litvinov

Understanding Spark Streaming with Kafka and Druid

As a Data Engineer I’m dealing with Big Data technologies, such as Spark Streaming, Kafka and Apache Druid. All of them have their own tutorials and RTFM pages. However, when combining these technologies at high scale, you can find yourself searching for a solution that covers more complicated production use-cases.
In this blog post I’m going to share the knowledge I gained by combining Spark Streaming, Kafka and Apache Druid to build a real-time analytics dashboard that guarantees precise data representation.

Before we dive in… a few words about Real Time Analytics

Real-time analytics is a new trend in Big Data technologies, and it usually has a significant business effect. When analyzing fresh data, the insights are more precise. For example, providing a real-time analytics dashboard for Data Analyst, BI and Account Manager teams can help these teams make fast decisions.
The commonly used architecture for real-time analytics at scale is based on Spark Streaming and Kafka. Both technologies scale very well. They run on clusters and divide the load between many machines. The output of Spark jobs can go to many different destinations, depending on the specific use case and the architecture. Our goal was to provide a visual tool displaying real-time events. For this purpose we chose the Apache Druid database.

Data visualization in Apache Druid

Druid is a high-performance real-time analytics database. One of its benefits is the ability to consume real-time data from a Kafka topic and build powerful visualizations on top of it using the Pivot module. Its visualizations enable running various ad-hoc “slice and dice” queries and getting visual results quickly. This is very useful for analyzing various use cases, for example how specific campaigns perform in certain countries. Data is retrieved in real time, with a 1-2 minute delay.

The architecture

So we decided to build our real-time analytics system based on Kafka events and Apache Druid. We already had events in a Kafka topic. However, we could not just ingest them into Druid as is: we needed to enrich every event with more dimensions in order to see it in Druid in a convenient way. Regarding scale, we’re dealing with hundreds of thousands of events per minute, so we needed a technology that can support these numbers. We decided to use a Spark Streaming job to enrich the original Kafka events.

Figure 1. Real time analytics architecture

Spark Streaming job runs forever? Not really.

The idea of a Spark Streaming job is that it is always running. The job should never stop. It constantly reads events from a Kafka topic, processes them and writes the output into another Kafka topic. However, this is an optimistic view. In real life things are more complicated. There are driver failures in the Spark cluster, in which case the job is restarted. Sometimes a new version of the Spark application is deployed into production. What happens in this case? How does the restarted job read the Kafka topic and process the events? Before we dig into these details, this figure shows what we see in Druid when the Spark Streaming job is restarted:

Figure 2. Data loss on job restart

It is definitely data loss!

What problem are we trying to solve?

We are dealing with a Spark Streaming application which reads events from one Kafka topic and writes them into another Kafka topic. These events are visualized later in Druid. Our goal is to enable smooth data visualization during the restart of our Spark Streaming application. In other words, we need to ensure that no events are lost or duplicated during the Spark Streaming job restart.

It’s all about offsets

In order to understand why data is lost when the job restarts, we need to get familiar with some terms in Kafka architecture, which the official Kafka documentation covers in detail. In a nutshell: events in Kafka are stored in topics, and each topic is divided into partitions. Each record in a partition has an offset, a sequential number which defines the order of the record. When an application consumes the topic, it can handle offsets in several ways. The default behavior is always to read from the latest offsets. Another option is to commit offsets, i.e. to persist offsets so the job can read the committed offsets on restart and continue from there. Let’s see our steps towards the solution, and get a deeper understanding of Kafka offset management with each step.
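A toy model can make the two restart behaviours concrete. The sketch below is plain Python with no Kafka client involved; it treats a single partition as a range of integer offsets, and the function names are made up for illustration.

```python
def restart_position(log_end_offset, committed_offset=None):
    """Offset a restarted consumer resumes from.

    Without a committed offset the consumer starts at the end of the
    partition ("latest"); with one it resumes exactly where it left off.
    """
    return log_end_offset if committed_offset is None else committed_offset


def skipped_on_restart(processed_upto, log_end_offset, committed_offset=None):
    """Offsets that are never processed after a restart.

    processed_upto is the next offset the job would have read when it stopped.
    """
    start = restart_position(log_end_offset, committed_offset)
    return list(range(processed_upto, start))
```

For example, if the job stopped at offset 100 and 20 records arrived while it was down, `skipped_on_restart(100, 120)` lists 20 lost offsets, whereas `skipped_on_restart(100, 120, committed_offset=100)` is empty.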

Step#1 – auto commit offsets

The default behavior is always to read from the latest offsets. This will not work, because by the time the job is restarted there are new events in the topic. If the job reads from latest, it loses all messages that were appended during the restart, as can be seen in Figure 2. There is an “enable.auto.commit” parameter in Spark Streaming. Its value is false by default. Figure 3 shows the behavior after changing its value to true, running the Spark application and restarting it.

Figure 3. Data spike on job restart

We can see that using the Kafka auto-commit feature causes a new effect. There is no “data loss”, however we now see duplicate events. There was no real “burst” of events. What actually happened is that the auto-commit mechanism commits offsets only “from time to time”, so many messages had already been written to the output topic while their offsets were not yet committed. After the restart, the job consumes messages from the latest committed offsets and processes some of these events again. That’s why we get a burst of events on the output.

Clearly, incorporating these duplications into our visualization may mislead the business consumers of this data and have an impact on their decisions and trust in the system.
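The duplicate burst can be reproduced with a few lines of arithmetic. This is a simplified model, not Kafka's real behaviour: auto-commit is actually time-based, and here it is approximated (as an assumption) by a commit every `commit_every` records.

```python
def replayed_after_restart(processed_upto, commit_every):
    """Offsets processed twice when commits only happen every N records.

    processed_upto: next offset the job would have processed when it stopped.
    commit_every:   hypothetical stand-in for the periodic auto-commit.
    """
    last_committed = (processed_upto // commit_every) * commit_every
    return list(range(last_committed, processed_upto))
```

If the job had processed 1234 records and commits land every 500 records, the restarted job resumes at offset 1000 and re-emits 234 events, which is the kind of spike shown in the figure.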

Step#2: Commit Kafka offsets manually

So we can’t rely on the Kafka auto-commit feature. We need to commit Kafka offsets ourselves. In order to do this, let’s see how Spark Streaming consumes data from Kafka topics. Spark Streaming uses an architecture called Discretized Streams, or DStream. A DStream is represented by a continuous series of RDDs (Resilient Distributed Datasets), which are one of Spark’s main abstractions. Most Spark Streaming jobs look something like this:

dstream.foreachRDD { rdd =>
    rdd.foreach { record => process(record) }
}

In our case processing the record means writing the record to the output Kafka topic. So, in order to commit Kafka offsets we need to do the following:

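import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

dstream.foreachRDD { rdd =>
   // Capture the Kafka offset ranges of this micro-batch before processing it
   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd.foreach { record => process(record) }
   // Commit only after processing; dstream must be the stream returned by createDirectStream
   dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}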

This is a straightforward approach, and before we discuss it more deeply let’s take a look at the big picture. Let’s assume that we handle offsets correctly, namely that offsets are saved after each RDD is processed. What happens when we stop the job? The job is stopped in the middle of processing an RDD. Part of the micro-batch has been written to the output Kafka topic but is not committed. Once the job runs again, it will process some messages for the second time and the spike of duplicate messages will appear (as before) in Druid:

Figure 4. Data spike on job restart

Graceful Shutdown

It turns out there is a way to ensure a job is not killed while an RDD is being processed. It is called a “Graceful Shutdown”. There are several blog posts describing how you can kill your Spark application gracefully, but most of them relate to old versions of Spark and have many limitations. We were looking for a “safe” solution that works at any scale and does not depend on a specific Spark version or operating system. To enable Graceful Shutdown, the Spark context should be created with the following parameter:
spark.streaming.stopGracefullyOnShutdown = true.
This instructs Spark to shut down StreamingContext gracefully on JVM shutdown rather than immediately.
In addition, we need a mechanism to stop our jobs intentionally, for example when deploying a new version. We’ve implemented the first version of this mechanism by simply checking the existence of an HDFS file that instructs the job to shut down. When the file appears in HDFS, the streaming context will stop with the following parameters:
ssc.stop(stopSparkContext = true, stopGracefully = true)

In this case the Spark application stops gracefully only after all received data processing is completed. This is exactly what we need.
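The marker-file mechanism can be sketched as a small polling loop. This is a Python illustration under simplifying assumptions: a local path stands in for the HDFS file, and `stop` is any callback, which in the real job would wrap the graceful `ssc.stop(...)` call shown above. The function name and the `max_polls` parameter are hypothetical.

```python
import os
import time


def wait_for_stop_marker(marker_path, stop, poll_seconds=1.0, max_polls=None):
    """Poll for a marker file and call stop() once it appears.

    Returns True if the marker was seen, False if max_polls ran out first.
    max_polls=None polls forever, which is what a long-running job would use.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if os.path.exists(marker_path):
            stop()
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

A deployment script then only has to create the marker file and wait for the job to exit, instead of killing the process in the middle of a micro-batch.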

Step #3: Kafka commitAsync

Let’s recap what we have so far. We intentionally commit Kafka offsets after processing each RDD (using the Kafka commitAsync API) and we use Spark’s graceful shutdown. As it turned out, there was another caveat. Digging into the documentation of the Kafka API and the commitAsync() source code, I learned that commitAsync() only puts the offsetRanges into a queue, which is actually processed only in the next loop of foreachRDD. Even if the Spark job is stopped gracefully and finishes processing all its RDDs, the offsets of the last RDD are not actually committed. To solve this problem, we implemented code that persists Kafka offsets synchronously and does not rely on commitAsync(). For each RDD we then stored the committed offsets in an HDFS file. When the job starts running again, it loads the offsets file from HDFS and consumes the Kafka topic from these offsets onwards.
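The essence of the synchronous approach is that the offsets are durably written before the next micro-batch starts. The sketch below shows that idea in Python with a local JSON file standing in for the HDFS file; it is not our production code, and `save_offsets`, `load_offsets` and the file layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_offsets(path, offsets):
    """Write {partition: next_offset} durably before returning."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump({str(p): o for p, o in offsets.items()}, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # Atomic rename: a reader sees either the old file or the new one, never half of each.
    os.replace(tmp_path, path)


def load_offsets(path):
    """Return the saved offsets, or {} when the job has never run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {int(p): o for p, o in json.load(f).items()}
```

On startup the job would call `load_offsets` and start consuming each partition from the returned offset; after each RDD it would call `save_offsets` and only then move on.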

Here, it works!

It was only the combination of a graceful shutdown and a synchronous storage of Kafka offsets that provided us with the desired result. No data loss, no data spikes during restarts:

Figure 5. No data loss or spikes during Spark job restart

Conclusion

Solving the integration problem between Spark Streaming and Kafka was an important milestone in building our real-time analytics dashboard. We found a solution that ensures a stable data flow, without lost or duplicated events, during Spark Streaming job restarts. We now have trustworthy data visualized in Druid. Thanks to this, we’ve added more types of events (Kafka topics) into Druid and built real-time dashboards. These dashboards provide insights for various teams, such as BI, Product and Customer Support. Our next goal is to utilize more features of Druid, like new analytical functions and alerts.

March 4, 2019

Gerardo Laracuente

Migrating Servers in Our Sleep

The Cloud is an Illusion

Cloud service providers have enabled innovations in many areas of our society from the way we watch movies to the way we share media with our family and friends. Companies can focus on amazing products without having to worry about managing data centers, servers, networking equipment, and all of the complexities therein.

But the cloud is an illusion, just an abstraction. Behind every cloud are data centers full of countless racks of servers, routers, firewalls, and engineers who design, deploy, and manage them. There is physical infrastructure that powers the technology we enjoy every day and, at Outbrain, we run the majority of our workloads on bare metal infrastructure. We are, for that matter, our own cloud provider. This post is a peek behind the curtain of how we do this, at scale.

It was a sunny day in California…

In the summer of 2017, our new West Coast data center went live, along with our first fully automated 10G network deployment. It was a great success, which is why we wanted to roll it out in our other data centers as well. The new mission was to migrate every server from our legacy 1G network to our shiny new 10G network. Essentially, this meant installing 10G network cards and running new twinax cabling to a few thousand servers. No biggie… except:

The vast majority of those servers run production workloads
Many of them run stateful applications such as data stores (mysql, elasticsearch, etc.)
Many of these stateful applications can only tolerate a limited amount of downtime before they start shuffling (a lot of) data around
Servers need to be migrated around the clock, with no downtime to any cluster
The application and networking engineers are in Israel
The physical server engineers are in NYC
There is a 7 hour time difference between NYC and Israel (the workweek also only overlaps for 4 days of the week)
The on-site technicians who work on our servers cannot log into them for reboots, health checks, etc.

How many Engineers does it take to…?

Let’s take a look at everything that goes into a manual migration of one server, so we can better understand the task at hand. We will narrow the focus to one specific case: migrating a mysql node to the new network.

The people involved:

Gerry – Data Center Server Engineer (NYC)
Yuval – Data Storage Engineer (Israel)
Adi – Networking Engineer (Israel)
Mike – On-site Remote Hands technician (non Outbrain employee)

The process:

Yuval removes the mysql server from the cluster and makes sure that the cluster is still healthy.
Gerry properly powers down the server, and sends the location details to the remote hands team. Based on the location of the server within the rack, he also lets them know which switch ports to use.
Mike opens up the server, installs the 10G network card, and runs redundant twinax cabling to the 10G switches. He powers it back up, connected to both the legacy and the 10G networks.
Adi prepares the server to join the 10G network and reboots it for the final changes to take effect. After the reboot, he checks that the server is indeed part of the new network and that both twinax connections are up and stable.
Yuval adds the server back to the mysql cluster and makes sure everything looks healthy.
Mike removes the old RJ-45 cables and waits for Yuval and Gerry to prepare and shut down the next server (to avoid multi-node shutdown in the same cluster).

* This is a simplified explanation of the process, and assumes everything goes as smoothly as possible. It requires 3 Outbrain engineers to be available across a major time zone difference, plus on-site remote hands. That’s 4 engineers to migrate a single node.

And now that we’re done, there’s only… a few thousand servers left to migrate…running on various types of hardware, with different versions of Linux, different applications and operating under different availability restrictions.

Which begs the question – Will it Scale?

Putting Remote Hands in the Driver Seat

After a few iterations, this is what the process looks like from Outbrain’s perspective:

Gerry executes a Rundeck job with one mouse click, and then emails Mike a list of server locations.
Gerry goes to sleep.

This is the process from the perspective of a remote hands technician at the data center:

Mike receives a list of server locations, and plugs an iPad into the first server on the list.
The iPad is running Slack and the chat room starts displaying new messages. It lets Mike know that the server is attempting to safely stop mysql and power itself down.
The server shuts itself down, but right before it goes down, it sends a message to the Slack channel explaining all the next steps, including which switch ports to connect the server to.
Mike installs the 10G network card, finishes up the cabling, and powers the server back up.
The server runs the network preparation scripts, reboots itself, checks the network status, starts up mysql, runs health checks, and lets Mike know that it’s time to remove the old cabling and move on to the next server.

Meanwhile, Gerry and Yuval’s teams are getting updates via email every time a server begins and completes the migration process. They can monitor the Slack channel during the process, or even look back at all of the migrations in a Kibana dashboard. If anything ever goes wrong, the iPad communicates that Mike should stop and contact Outbrain. The iPad won’t take any wrong actions if it is plugged into the wrong server or reseated at any point.

How We Built It

When the iPad is plugged into the server, it is recognized by a udev rule. This triggers a wrapper script that contains all of the migration logic. Here’s a deeper look at the individual steps and components:

Rundeck is used to put the desired servers into maintenance mode. It does this by touching empty files onto the servers that indicate that they are in maintenance mode, and they are in the first stage of the network migration process.

Chef contains all of the necessary scripts to run the migration. This includes the udev rules and the wrapper, networking, and application specific scripts. The Chef recipe chooses the proper pre and post migration scripts based on the role of the server.

UDEV Rules are very powerful, and we use them to define what happens when the iPad is plugged into a server:

SUBSYSTEM=="usb", ACTION=="add", ENV{ID_SERIAL}=="Apple_Inc._iPad_averylonguniqueserialnumber", RUN+="/path/to/wrapper_script.sh"

This rule roughly translates to: “When a USB device with this serial number is plugged into the server, run the following wrapper script”

A custom startup script is what triggers the wrapper script when the server boots back up after a shutdown or reboot to perform the next migration step.

The wrapper script holds all of the logic and functionality of the process.

It only runs if a maintenance file exists on this server.
It checks for a state file, and depending on what it finds, understands which part of the migration process to run next.
It sends the event logs (for the Kibana dashboard), emails (to the proper teams who manage this server), and Slack messages (via webhooks).
It handles the reboots and shutdowns.
When it runs an application specific pre/post migration script, it’s expecting to get an exit status of 0 (or else it will let everyone involved know that something went wrong).
When it runs the network preparation script (which is worthy of its own blog post), it can make a few decisions based on the exit status. It can ring the alarms, move on to the next step, or even let the remote hands technician know that one of the twinax cables seems loose and should be fully seated.
It cleans up by removing all migration related files, including itself.
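The maintenance-file guard and the state-file lookup described above could be sketched as follows. This is an illustration in Python, not the actual wrapper script: the stage names, the file contents and the two helper functions are all assumptions, since the post does not spell them out.

```python
import os

# Hypothetical stage names; the real wrapper script may split the work differently.
STAGES = ["pre_migration", "network_prep", "post_migration"]


def next_stage(maintenance_file, state_file):
    """Decide which stage to run, or None when the server must be left alone."""
    if not os.path.exists(maintenance_file):
        return None  # not in maintenance mode: an iPad plugged in here does nothing
    if not os.path.exists(state_file):
        return STAGES[0]  # first run on this server
    with open(state_file) as f:
        last_done = f.read().strip()
    if last_done not in STAGES:
        raise ValueError(f"unknown state: {last_done!r}")
    index = STAGES.index(last_done) + 1
    return STAGES[index] if index < len(STAGES) else None


def record_stage(state_file, stage):
    """Remember the last completed stage so the next boot can pick up from it."""
    with open(state_file, "w") as f:
        f.write(stage)
```

Because the decision depends only on files on disk, the same entry point can be triggered by the udev rule or by the startup script after a reboot, and it is harmless on a server that was never put into maintenance mode.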

More time to build more automation =]

Now we can set a bunch of servers to maintenance mode, hand off the list of server locations to remote hands, and continue our daily work while they are migrated.

And since this works so well using the current approach, work is underway to generalise the concept and make it available to many other types of physical maintenance.

So that our cloud can continue to build itself… while we sleep.

February 19, 2019

Yulia Stolin

Nurit Moscovici

Coding Game Story Two – How can Minions be used as a strategy and not just as a cartoon

This is part two of the Outbrain Haggling Game blog series. For the full description of the game we highly recommend you read the first part here.

When we heard about the game we were very excited. It’s fun to take a break from everyday work every once in a while to create something new and cool. Additionally, these games are a great opportunity to get to know coworkers we don’t get to interact with on a day to day basis. Moreover, the competitive nature of the haggling game was a huge draw to us – we love to win!

Before we describe our strategy and the implementation details, we want to reiterate the rules of the Haggling Game:

The Rules

The game is played by two players who negotiate the division of a group of items between them. The goal of each player is to maximise their score.

At the start of the game a set of ‘items’ is placed on the table (for example, a pair of sunglasses, two cups and a pencil). The players must agree on how to divide these items between them. The items are worth a different amount to each of the players, and the players know only their own preferences, and have no idea of the opponent’s. The total worth of all objects on the table is the same for both players.

The game consists of 9 rounds. In each round one of the players receives an offer of how to split the items and decides whether to accept it or return a counteroffer. If after 9 rounds neither player has accepted any offer the game ends with both players receiving 0. If at any point an offer is accepted each player receives the amount that their portion of the goods is worth to them.
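To make the scoring rule concrete, here is a small Python sketch. The item names and values are made-up examples, and `score` and `split` are hypothetical helpers rather than part of the game's API.

```python
def score(values, share):
    """Worth of `share` ({item: quantity}) to a player with `values` ({item: worth})."""
    return sum(values[item] * qty for item, qty in share.items())


def split(counts, my_share):
    """Return the opponent's share: everything on the table that I did not take."""
    for item, qty in my_share.items():
        if not 0 <= qty <= counts[item]:
            raise ValueError(f"invalid quantity for {item}")
    return {item: counts[item] - my_share.get(item, 0) for item in counts}
```

For instance, with `counts = {"sunglasses": 1, "cups": 2, "pencil": 1}`, a player who values them at 4, 2 and 2 and keeps the sunglasses and the pencil scores 6, while the opponent receives both cups and is scored with their own, different valuation.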

Strategy – General Overview

Technical Strategy

We implemented a collaborative strategy, using two types of players:

The “Overlord”, our winning player
The “Minions”, helper bots whose job was to give points to the Overlord while blocking other players.

The Overlord and Minions recognize each other using a triple handshake protocol, based on mathematical calculations of the game parameters. Whenever a handshake completes successfully the Overlord requests the entire pot, while the Minion accepts the offer. In case the Overlord sees an unsuccessful handshake it moves to a competitive strategy and tries to claim the highest possible rewards using a player-vs-competitor bargaining strategy. Similarly, when a Minion realises that its opponent isn’t the Overlord it requests the entire pot in every round until the end of the game. This effectively blocks the opponent from receiving any points in this game.

Social Engineering

Beyond the technical strategy, we also employed a sort of “social engineering” : we hid the strength of the Overlord by ensuring that for the majority of the two-week development period the Overlord went no higher than third place. Given the amount of hacking going on during the game we wanted to make sure we didn’t draw undue attention, or become a major target to beat. We did this by populating the game with “sleeper cells” — players under our control running basic strategies ready to turn into minions at the right moment.

We waited until the last couple of hours of the game to start converting the minions. During the final hour, the Overlord shot to first place, outpacing the competitors by a huge margin. The strength of the technical strategy is evident by our spectacular win. However, the social engineering aspect is not to be diminished. In the last half hour of the game, we saw an attempt to hack our code in order to beat the Overlord. Our sneaky strategy did not leave enough time for the hacker to bring about our downfall.

Strategy – The Protocol

As we described, our protocol employs a triple handshake between the Overlord and a Minion. The roles of the Overlord and the Minion in the handshake are asymmetric.

The Overlord:

Never initiates the handshake
Keeps the handshake state between rounds (Is the opponent possibly a Minion?)
If at any round the handshake fails, falls back to play the “competitive strategy”
If the opponent is believed to be a minion, demands all items

The Minions:

Always initiates a handshake in the first round
Keeps the handshake state between rounds (is the opponent possibly the Overlord?)
If at any round the handshake fails, blocks the opponent from getting any points by demanding all items.
If the opponent is believed to be an Overlord – the handshake is successful and the opponent requested the entire pot – accept the offer

Here we present the protocol flow and the code for the Overlord and Minion

For the remainder of this post “Counts” refers to the set of items that are on the table for this game.

Overlord Protocol:

We can see that the Overlord uses both the saved state and the received offer to determine whether this is a valid phase of the handshake.

This is a snippet of the Overlord’s code:

We can see that when the Overlord is the starting player it always plays the competitive strategy, and does not initiate a handshake. The Overlord checks the offers it receives to decide whether a handshake is ongoing. If at any point in the game the Overlord decides that the opponent is not a Minion it switches to the competitive strategy.

Minion Protocol:

The flow of a Minion playing against the Overlord

The flow of a Minion playing against a competitor:

This is a snippet of the Minion’s code:

We can see that the Minion always initiates a handshake on its first offer, regardless of what it received. At any point in the game, if the Minion decides that the opponent is not the Overlord, it switches to a blocking strategy.
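The Minion's behaviour can be written down as a small state machine. The Python sketch below is a reconstruction from the description in this post, not the original code: the state names, the exact number of handshake messages and the convention that an offer is a dictionary of what the receiver gets are all assumptions.

```python
ACCEPT = "accept"


def minion_move(counts, received, expected_reply, state, my_offer):
    """One Minion turn. Returns (move, new_state).

    counts:         {item: quantity on the table}
    received:       the opponent's last offer ({item: qty for us}), or None
    expected_reply: the handshake offer an Overlord would have sent this round
    state:          "start", "waiting", "confirmed" or "blocked"
    my_offer:       the handshake offer we should send this round
    """
    demand_all = {item: 0 for item in counts}  # opponent gets nothing
    if state == "blocked":
        return demand_all, "blocked"
    if state == "start":
        # Always initiate the handshake, regardless of what was received.
        return my_offer, "waiting"
    if state == "waiting":
        if received == expected_reply:
            return my_offer, "confirmed"
        return demand_all, "blocked"
    # state == "confirmed": the Overlord is expected to ask for the whole pot.
    if received == demand_all:
        return ACCEPT, "confirmed"
    return demand_all, "blocked"
```

Once the state is "blocked" it never leaves it, which is what denies a competitor any points for the rest of the game.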

Protocol Offer

The key to making the protocol work as we designed is to create a proper offer for the handshake. The offer must be deterministic and computable by both sides. Moreover, it should not be static: we want it to change from game to game, and also from round to round.

The main reason for that is to lower the probability that the opponent can learn our strategy, or that their strategy accidentally coincides with our protocol, giving them the win.

Additionally, the offer must not be too bad for us, so that we don’t give away free points. If the opponent accepts the offer while we are still trying to perform the handshake, we won’t lose too much.

Before we describe the algorithm for calculating the offer, let’s go back to the game rules:

Both players know which items are on the table for bargaining
Each player receives the round number and the opponent’s offer in each round
No external communication, such as HTTP calls or other IO, is allowed.
In every game there is a randomisation of the bargaining items

From the rules, we can see that the only information common to the two players is the group of bargaining items, the offer, and the round number. Moreover, the only information that passes between the two players is the offer.

Thus, we built the protocol offer as follows:

For each round we need to create an offer that will fulfil the requirements described above. In order to do so, we decided to create an offer by picking a single item in each round. We offer a single instance of this item to the competitor, keeping the rest of the items to ourselves.

This way we try to minimise the worth of the offer to the competitor, while also minimising our loss if they accept it.

In order to choose the item to offer the opponent we hash the items and amounts in “Counts” along with current round number. We sort the items in “Counts” by their names and amounts, then use the hashed value we calculated modulo the size of “Counts” to choose the item to offer. “Counts” changes every game, and the round number every round, meaning that the hash value will be different in every round of every game.
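The selection step can be sketched in a few lines of Python. This is an illustration of the idea rather than the original snippet: the use of SHA-256, the way "Counts" and the round number are serialised, and the function name `handshake_offer` are assumptions.

```python
import hashlib


def handshake_offer(counts, round_no):
    """Offer exactly one unit of one deterministically chosen item.

    counts: {item_name: quantity on the table}
    Returns {item_name: quantity given to the opponent}.
    """
    items = sorted(counts.items())  # sort by name, then amount
    # A stable hash is required: both bots must compute the same value
    # independently, so a per-process randomised hash would not do.
    payload = repr((items, round_no)).encode("utf-8")
    digest = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    chosen, _ = items[digest % len(items)]
    return {name: (1 if name == chosen else 0) for name, _ in items}
```

Both sides can call this with the same "Counts" and round number and compare the result with the offer they received, without any communication channel other than the offers themselves.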

This is a snippet of the offer code

Competitive Strategy

We started the game by developing a strategy for playing the game by the book – no Minions. We compared the results of several different kinds of strategies, from the simplest greedy algorithms to attempts to analyse the opponents’ preferences and build an offer accordingly. In the end, in order to emphasise the strength of the Overlord-Minion coalition we chose one of the simpler attempts. This was to show that even with a “bad” strategy the Minions could elevate the Overlord to the winning position.

We’ll describe the two basic directions we developed:

1. Greedy:

During the initialisation phase we built a static list of “best offers”. These were offers that ensured we got as many points as possible, sorted in descending order of value to us. We assumed that an offer which gave nothing to the opponent would not be accepted. Thus all the “best offers” included some items to be given to the opponent. The “best offer” options were hardcoded for each round. We tried several different configurations, sometimes allowing for more losses, sometimes for less. Finally, we compared the results of several such players and chose one for the basic master strategy.
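Building such a list amounts to enumerating every possible split and sorting by our own score. The Python sketch below shows one way to do it, under the stated assumption that an offer keeping everything for ourselves is dropped; the per-round hardcoding we experimented with is not reproduced here, and `best_offers` is a hypothetical name.

```python
from itertools import product


def best_offers(counts, values):
    """All splits that leave the opponent at least one item, best for us first.

    Each entry is (our_score, our_share) with our_share = {item: qty we keep}.
    """
    items = sorted(counts)
    offers = []
    for qtys in product(*(range(counts[i] + 1) for i in items)):
        ours = dict(zip(items, qtys))
        if all(ours[i] == counts[i] for i in items):
            continue  # keeps everything: assumed to be rejected
        offers.append((sum(values[i] * ours[i] for i in items), ours))
    offers.sort(key=lambda offer: offer[0], reverse=True)
    return offers
```

With two cups worth 4 each and one pencil worth 2, the first entry keeps both cups and gives away the pencil, for a score of 8 out of 10. A greedy player then walks down this list as the rounds progress.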

2. Algorithmic:

The algorithmic strategy was to try to give the opponent the best deal we thought we could, while ensuring that our own gains would not fall below a certain threshold.

We kept a score for each item in counts, based on the number of times the opponent was willing to offer it to us. Our assumption was that the opponent was more likely to offer us items that were worth less to them. We sorted the items by their perceived importance to the opponent and built an offer containing items we thought were important to them, while not giving over items with the highest worth to us. This was an attempt to “sweeten the deal”. In this case we both win, but hopefully we win more. Here too we played with different configurations. This was done by changing amounts of attractive and unattractive items to the opponent, and playing with our threshold value. This strategy showed promising results, but was nowhere near those of the Overlord-Minion collaboration.
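A rough Python sketch of this bookkeeping is shown below. It simplifies the strategy to its two ingredients, counting how often each item was offered to us and giving away units while our own value stays above a threshold; the class and function names, and the exact ranking rule, are assumptions rather than the code we ran.

```python
class OpponentModel:
    """Tracks how often the opponent was willing to offer us each item."""

    def __init__(self, counts):
        self.times_offered = {item: 0 for item in counts}

    def observe(self, offer_to_us):
        for item, qty in offer_to_us.items():
            if qty > 0:
                self.times_offered[item] += 1

    def ranking(self):
        """Items the opponent rarely gives up first: assumed most valuable to them."""
        return sorted(self.times_offered, key=lambda i: (self.times_offered[i], i))


def build_offer(counts, values, ranking, threshold):
    """Hand over units in ranking order while our kept value stays >= threshold.

    Returns {item: quantity we keep}; the opponent gets the rest.
    """
    keep = dict(counts)
    kept_value = sum(values[i] * counts[i] for i in counts)
    for item in ranking:
        while keep[item] > 0 and kept_value - values[item] >= threshold:
            keep[item] -= 1
            kept_value -= values[item]
    return keep
```

Raising or lowering `threshold` is the knob that trades our own gain against the chance that the opponent accepts.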

Results

The final result of the game was the crushing defeat of all those who dared to oppose us.

We can bask in our glory all day, but it is more interesting to describe the way we reached this victorious result. As we mentioned above, we felt that keeping a low profile was imperative to winning the game. The multiple hacking attempts and the general competitive atmosphere only served to strengthen this decision. Once we had finished working on the code we started to carefully test how much of a lift the Minions offered. The graph on the right is from a couple of days before the end of the competition. We enabled four minions, and saw that the Overlord immediately shot ahead of the competitors. We quickly disabled the minions, so as not to draw attention. In the final hour before the end of the game we enabled 20 minions, giving us the enormous lead seen in the graph on the left.

All hail the Overlord!

P.S.

Stay tuned to hear about how the game was hacked, and the different hacking strategies used by Roy Bass.

November 25, 2018

Avi Youkhananov

CodinGame Story One – The key for creativity and happiness in developers life

Photo by Juan Gomez on Unsplash

“Keep a developer learning and they’ll be happy working in a windowless basement eating stale food pushed through a slot in the door. And they’ll never ask for a raise.” — Rob Walling (https://robwalling.com/2006/10/31/nine-things-developers-want-more-than-money/)

The past decade has produced substantial research verifying what may come as no surprise: developers want to have fun. While we also need our salaries, salaries alone will not incentivize us developers who, in most cases, entered a field to do what we love: engage in problem-solving. We like competition. We like winning. We like getting prizes for winning. To be productive, we need job satisfaction. And job satisfaction can be achieved only if we get to have fun using the skills we were hired to use.

We wanted to keep the backend developers challenged and entertained.
That’s why Guy Kobrinsky and I created our own version of Haggling, whose basic idea we adapted from Hola, a negotiation game.

The Negotiation Game:

Haggling consists of rounds of negotiations between pairs of players. Each pair’s goal is to maximize score in the following manner:

Let’s say there are a pair of sunglasses, two tickets, and three cups on the table. Both players have to agree on how to split these objects between them. To one player, the sunglasses may be worth $4, each cup $2, and the tickets nothing. The opponent might value the same objects differently; while the total worth of all the objects is the same for both players, each player’s valuation is kept secret.

Both players take turns making offers to each other about how to split the goods. A proposed split must distribute all objects between the two players such that no items are left on the table. On each turn, a player can either accept the offer or make a counter-offer. If an agreement is reached within 9 offers, each player receives the amount that their portion of the goods is worth, according to their assigned values. If there is still no agreement after the last turn, both players receive no points.
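The scoring rule can be written down in a few lines of Java. This is a sketch rather than the game engine itself: the class name and the array representation (one entry per item type) are assumptions, and the numbers in the usage note are chosen only for illustration.

```java
// Hypothetical helper illustrating the rules, not the actual game engine.
final class HagglingScore {

    /** Maximum number of offers in one negotiation, as stated in the rules. */
    static final int MAX_OFFERS = 9;

    /** A valid split hands out every item: mine[i] + theirs[i] == counts[i]. */
    static boolean isCompleteSplit(int[] counts, int[] mine, int[] theirs) {
        for (int i = 0; i < counts.length; i++) {
            if (mine[i] < 0 || theirs[i] < 0 || mine[i] + theirs[i] != counts[i]) {
                return false;
            }
        }
        return true;
    }

    /** Worth of a portion according to one player's private values. */
    static int portionValue(int[] portion, int[] values) {
        int sum = 0;
        for (int i = 0; i < portion.length; i++) {
            sum += portion[i] * values[i];
        }
        return sum;
    }

    /** A player's score: the worth of their portion, or zero when no agreement was reached. */
    static int score(boolean agreed, int[] portion, int[] values) {
        return agreed ? portionValue(portion, values) : 0;
    }
}
```

For instance, with counts `{1, 2, 3}` and private values `{4, 0, 2}`, a player who ends up with portion `{1, 0, 1}` scores 6 if the deal is agreed and 0 if the negotiation runs out of offers.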

The Object of the Game:

Write code to obtain a collection of items with the highest value by negotiating items with an opponent player.

User Experience:

We wanted it to be as easy as possible for players to submit, play and test their code.
Therefore, we decided to keep player code simple – not relying on any third-party libraries.
To do this, we built a simple web application for testing and submitting code, supplying a placeholder with the method “accept” – the code that needs to be implemented by the different participants. The “accept” method describes a single iteration within the negotiation, in which each player must decide if they will accept the offer given to them (by returning null or the received offer) – or return a counter offer.

To assist in verifying the players’ strategy, we added a testing feature allowing players to run their code vs some random player. Developers were able to play around with it, re-implementing the code before actual submission.

Java Code Example:

[gist id= 8e870dad5baeec79cbda4be5f56617f6 file=HagglingCode.java]
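The actual placeholder lives in the gist referenced above. For readers who cannot see it, here is a rough sketch of what an “accept” implementation might look like. Everything about its shape is an assumption: the class name, the constructor, the use of `Map<String, Integer>` for offers, and the rule of accepting anything worth at least half the total. Only the contract comes from the description above: return null to accept, or return a counter-offer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical player; the real interface is defined in the gist.
final class ThresholdPlayer {
    private final Map<String, Integer> counts; // items on the table
    private final Map<String, Integer> values; // our private value per item
    private final int total;                   // worth of everything to us

    ThresholdPlayer(Map<String, Integer> counts, Map<String, Integer> values) {
        this.counts = new HashMap<>(counts);
        this.values = new HashMap<>(values);
        this.total = worth(counts);
    }

    /** Worth to us of receiving the given items. */
    private int worth(Map<String, Integer> items) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : items.entrySet()) {
            sum += e.getValue() * values.getOrDefault(e.getKey(), 0);
        }
        return sum;
    }

    /**
     * offer: the items the opponent proposes to give us (assumed null on the first turn).
     * Returns null to accept, or a counter-offer listing the items we want.
     */
    Map<String, Integer> accept(Map<String, Integer> offer) {
        if (offer != null && worth(offer) >= total / 2.0) {
            return null; // good enough, take the deal
        }
        // Counter-offer: ask for everything that has value to us, give away the rest.
        Map<String, Integer> counter = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int value = values.getOrDefault(e.getKey(), 0);
            counter.put(e.getKey(), value > 0 ? e.getValue() : 0);
        }
        return counter;
    }
}
```

A player this simple would not have ranked well, but it shows how little code was needed to take part.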

Test Your Code and Submit Online:

Tournament And Scoreboard:

Practice tournaments ran continuously for two weeks, taking all submitted players into account and allowing developers to see their rank. During this time, competitors were able to edit their code, so there was plenty of time to learn and improve.

We also provided analytics for every player. Developers were able to analyze and improve their strategy.

At the end of the two weeks, we declared a code freeze and the real tournament took place. Players’ final score was determined only from the results of the real tournament, not the practice tournaments.

Game Execution And Score:

We executed the tournament using multiple agents – each agent’s activity was reported to Kibana:

The Back-Stage:

Where did we store players’ code?
We decided to store all players’ code in AWS S3 to avoid revealing the code to other players.

What languages were supported?

We started with Java only, but players expressed interest in using Scala and Kotlin as well. So we gave these developers free rein to add support for those languages, which we then reviewed before integrating into the base code. Ultimately, developers were able to play in all three languages.

What was the scale of Haggling?

In the final tournament, 91 players competed in 164 million rounds in which 1.14 billion “accepts” were called. The tournament was executed on 45 servers, having 360 cores and using 225G of memory.

The greatest advantage of our approach was our decision to use Kubernetes, enabling us to add more nodes, as well as tune their cores and memory requirements. Needless to say, it was no problem to get rid of all these machines when the game period ended.

How did the tournament progress?

The tournament was tense, and we saw a lot of interaction with the game over the two weeks.
The player in the winning position changed every day, and the final winner was not apparent until very near the end (and even then we were surprised!).
We saw a variety of single-player strategies with sophisticated calculations and different approaches to gameplay.
Moreover, in contrast to the original game, we allowed gangs: groups of players belonging to a single team that can “help” each other to win.

So how do you win at haggling?

The winning strategy was collaborative – the winning team created two types of players: the “Overlord” which played to win, and several “Minions” whose job was to give points to the Overlord while blocking other players. The Overlord and Minions recognized each other using a triple handshake protocol, based on mathematical calculations of the game parameters. Beyond this, the team employed a human psychological strategy – hiding the strength of the Overlord by ensuring that for the majority of the development period the Overlord went no higher than third place. They populated the game with “sleeper cells” – players with basic strategies ready to turn into minions at the right moment. The upheaval occurred in the final hour of the game when all sleepers were converted to minions.

The graph shows the number of commits in the last hour before the code freeze:

Hats Off to the Hacker: who got the better of us?

During the two weeks, we noticed multiple hacking attempts. The hacker’s intent was not to crash the game, but rather to prove that it is possible and make a lesson of it.
Although it was not our initial intent, we decided to make hacking part of the challenge and to reward the hacker for demonstrated skills and creativity.

On the morning of November 7th, we arrived at the office and were faced with the following graph of the outcomes:

The game had been hacked! As can be seen in the graph, one player was achieving an impossible success rate. What we discovered was the following: the read-only hash map that we provided as a method argument to players was written in Kotlin; but when players converted the map to play in either Java or Scala, the conversion produced a mutable hash map, and this is how one of the players was able to modify it. We had failed to validate the preferences, that is, to ensure that the hash map values players handed back matched the original values.
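One defensive pattern against this kind of tampering is sketched below in Java. It is not a description of the fix we shipped: the class and method names are invented for illustration. The idea is that the engine keeps its own master copy of the preferences, hands each player a separate read-only copy, and checks anything that comes back against the master before scoring.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard, not the actual game engine code.
final class PreferenceGuard {

    /**
     * Returns a detached, unmodifiable copy. Writes through the copy throw
     * UnsupportedOperationException, and later changes to the source map
     * do not show up in the copy.
     */
    static Map<String, Integer> readOnlyCopy(Map<String, Integer> source) {
        return Collections.unmodifiableMap(new HashMap<>(source));
    }

    /** True only if the returned map has exactly the keys and values of the master copy. */
    static boolean matchesOriginal(Map<String, Integer> master, Map<String, Integer> returned) {
        return returned != null && master.equals(returned);
    }
}
```

The copy alone would have blocked the write, and the equality check alone would have flagged the impossible scores; using both means a mistake in one layer does not silently corrupt the results.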

In conclusion, this is exactly the sort of sandbox experience that makes us better, safer, and smarter developers. We embraced the challenge.

Want to play with us? Join Outbrain and challenge yourself.
