Testing | Outbrain Techblog

November 11, 2019

Avi Youkhananov

Oh my Guava! We are moving to Caffeine.

Caching is extremely important! It provides fast response time, enabling effortless performance improvements in certain use cases.
At Outbrain, we have recently moved to Caffeine caching, after having used Guava in-memory caching for many years.

Background

Caffeine library is a rewrite of Guava’s cache that uses a Guava-inspired API that returns CompletableFutures, allowing asynchronous automatic loading of entries into a cache. The library was written by Ben Manes who is the author of ConcurrentLinkedHashMap on which Guava cache is based.

Guava OUT

Guava blocks during loading when a key is not present in the cache.
We wanted to change the API to work asynchronously, but the added complexity made the code difficult to understand and troubleshoot.
To make Guava non-blocking, we overrode the load and loadAll methods of CacheLoader to make the API return a future.
Since we changed the API to return a future, we encountered a problem. Guava is unaware that we use it to store futures, so it stores futures completed with an exception as well. To prevent the exceptions from being stored in the cache, we needed to add more complexity to our code.

Caffeine IN

Caffeine is a rewrite of Guava’s cache that uses an API that returns CompletableFutures out of the box, allowing asynchronous automatic loading of entries into a cache.
Caffeine removes futures that complete with an exception from the cache.
Caffeine uses both a Least Recently Used (LRU) eviction policy and a frequency-based admission policy relying on CountMin sketch. It has a better hit rate than LRU for many workloads.

Guava’s hit-rate benchmark vs Caffein’s

We analyzed service behavior with real traffic under different cache configurations to get an idea of how production services will behave.

The benchmarked cache contains image URLs keyed by UUID and follow an access pattern of most-frequent / least-frequent data, which is our most common use case at Outbrain.

In the benchmarks below, we did not measure and therefore have no interpretation of memory usage. But analyzing hit rates leads to some interesting insights.

1. Cache size: 10k items
Expiration after write: 5 min

Hit rate with 10k items:

Caffeine 28.33 %
Guava 20.95%

2. Cache size: 50k items
Expiration after write: 20 min

Hit rate with 50k items:

Caffeine 56.04 %
Guava 50.01%

3. Cache size: 100k items
Expiration after write: 20 min

Hit rate with 100k items:

Caffeine 70.10 %
Guava 66.77%

4. Cache size: 300k items
Expiration after write: 20 min

Hit rate with 300k items:

Caffeine 87.19 %
Guava 84.85%

Conclusion

We did it! With minor changes to infrastructure code (as Caffeine and Guava API are almost identical), we improved our hit rate and reduced code complexity for critical services in our system.

Honestly, Caffeine smells better than Guava.

November 8, 2017

Avi Youkhananov

Keep bugs out of production

Production bugs are painful and can severely impact a dev team’s velocity. My team at Outbrain has succeeded in implementing a work process that enables us to send new features to production free of bugs, a process that incorporates automated functions with team discipline.

Why should I even care?

Bugs happen all the time – and they will be found locally or in production. But the main difference between preventing and finding the bug in a pre-production environment is the cost: according to IBM’s research, fixing a bug in production can cost X5 times more than discovering it in pre-production environments (during the design, local development, or test phase).

Let’s describe one of the scenarios happen once a bug reaches production:

A customer finds the bug and alerts customer service.
The bug is logged by the production team.
The developer gets the description of the bug, opens the spec, and spends time reading it over.
The developer then will spend time recreating the bug.
The developer must then reacquaint him/herself with the code to debug it.
Next, the fix must undergo tests.
The fix is then built and deployed in other environments.
Finally, the fix goes through QA testing (requiring QA resources).

How to stop bugs from reaching production

To catch and fix bugs at the most time-and-cost efficient stage, we follow these steps, adhering to the several core principles:

How to stop bugs from reaching production

Stage 1 – Local Environment and CI

Step 1: Design well. Keep it simple.

Create the design before coding: try to divide difficult problems into smaller parts/steps/modules that you can tackle one by one, thinking of objects with well-defined responsibilities. Share the plans with your teammates at design-review meetings. Good design is a key to reducing bugs and improving code quality.

Step 2: Start Coding

The code should be readable and simple. Design and development principles are your best friends. Use SOLID, DRY, YAGNI, KISS and Polymorphism to implement your code.
Unit tests are part of the development process. We use them to test individual code units and ensure that the unit is logically correct.
Unit tests are written and executed by developers. Most of the time we use JUnit as our testing framework.

Step 3: Use code analysis tools

To help ensure and maintain the quality of our code, we use several automated code-analysis tools:
FindBugs – A static code analysis tool that detects possible bugs in Java programs, helping us to improve the correctness of our code.
Checkstyle – Checkstyle is a development tool to help programmers write Java code that adheres to a coding standard. It automates the process of checking Java code.

Step 4: Perform code reviews

We all know that code reviews are important. There are many best practices online (see 7 Ways to Up-Level Your Code Review Skills, Best Practices for Peer Code Review, and Effective Code Reviews), so let’s focus on the tools we use. All of our code commits are populated to ReviewBoard, and developers can review the committed code, see at any point in time the latest developments, and share input.
For the more crucial teams, we have a build that makes sure every commit has passed a code review – in the case that a review has not be done, the build will alert the team that there was an unreviewed change.
Regardless of whether you are performing a post-commit, a pull request, or a pre-commit review, you should always aim to check and review what’s being inserted into your codebase.

Step 5: CI

This is where all code is being integrated. We use TeamCity to enforce our code standards and correctness by running unit tests, FindBugs validations Checkstyle rules and other types of policies.

Stage 2 – Testing Environment

Step 1: Run integration tests

Check if the system as a whole work. Integration testing is also done by developers, but rather than testing individual components, it aims to test across components. A system consists of many separate components like code, database, web servers, etc.
Integration tests are able to spot issues like wiring of components, network access, database issues, etc. We use Jenkins and TeamCity to run CI tests.

Step 2: Run functional tests

Check that each feature is implemented correctly by comparing the results for a given input with the specification. Typically, this is not done at the development level.
Test cases are written based on the specification, and the actual results are compared with the expected results. We run functional tests using Selenium and Protractor for UI testing and Junit for API testing.

Stage 3 – Staging Environment

This environment is often referred to as a pre-production sandbox, a system testing area, or simply a staging area. Its purpose is to provide an environment that simulates your actual production environment as closely as possible so you can test your application in conjunction with other applications.
Move a small percentage of real production requests to the staging environment where QA tests the features.

Stage 4 – Production Environment

Step 1: Deploy gradually

Deployment is a process that delivers our code into production machines. If some errors occurred during deployment, our Continuous Delivery system will pause the deployment, preventing the problematic version to reach all the machines, and allow us to roll back quickly.

Step 2: Incorporate feature flags

All our new components are released with feature flags, which basically serve to control the full lifecycle of our features. Feature flags allow us to manage components and compartmentalize risk.

Step 3: Release gradually

There are two ways to make our release gradual:

We test new features on a small set of users before releasing to everyone.
Open the feature initially to, say, 10% of our customers, then 30%, then 50%, and then 100%.

Both methods allow us to monitor and track problematic scenarios in our systems.

Step 4: Monitor and Alerts

We use the ELK stack consisting of Elasticsearch, Logstash, and Kibana to manage our logs and events data.
For Time Series Data we use Prometheus as the metric storage and alerting engine.
Each developer can set up his own metrics and build grafana dashboards.
Setting the alerts is also part of the developer’s work and it is his responsibility to tune the threshold for triggering the PagerDuty alert.
PagerDuty is an automated call, texting, and email service, which escalates notifications between responsible parties to ensure the issues are addressed by the right people at the right time.

stop bugs
All in All,
Don’t let the bugs fly out of control.

June 13, 2017

admin

Failure Testing for your private cloud – Introducing GomJabbar

TL;DR Chaos Drills can contribute a lot to your services resilience, and it’s actually quite a fun activity. We’ve built a tool called GomJabbar to help you run those drills.

Failure Testing for your private cloud

Here at Outbrain we manage quite a large scale deployment of hundreds of services/modules, and thousands of hosts. We practice CI/CD, and implemented quite a sound infrastructure, which we believe is scalable, performant, and resilient. We do however experience many production issues on a daily basis, just like any other large scale organization. You simply can’t ensure a 100% fault-free system. Servers will crash, run out of disk space, and lose connectivity to the network. Software will experience bugs, and erroneous conditions. Our job as software engineers is to anticipate these conditions, and design our code to handle them gracefully.

For quite a long time we were looking into ways of improving our resilience, and validate our assumptions, using a tool like Netflix’s Chaos Monkey. We also wanted to make sure our alerting system actually triggers when things go wrong. The main problem we were facing is that Chaos Monkey is a tool that was designed to work with cloud infrastructure, while we maintain our own private cloud.

The main motivation for developing such a tool is that failures have the tendency of occurring when you’re least prepared, and in the least desirable time, e.g. Friday nights, when you’re out having a pint with your buddies. Now, to be honest with ourselves, when things fail during inconvenient times, we don’t always roll our sleeves and dive in to look for the root cause. Many times the incident will end after a service restart, and once the alerts clear we forget about it.

Wouldn’t it be great if we could have “chaos drills”, where we could practice handling failures, test and validate our assumptions, and learn how to improve our infrastructure?

Chaos Drills at Outbrain

We built GomJabbar exactly for the reasons specified above. Once a week, at a well known time, mid day, we randomly select a few targets where we trigger failures. At this point, the system should either auto-detect the failures, and auto-heal, or bypass them. In some cases alerts should be triggered to let teams know that a manual intervention is required.

After each chaos drill we conduct a quick take-in session for each of the triggered failures, and ask ourselves the following questions:

Did the system handle the failure case correctly?
Was our alerting strategy effective?
Did the team have the knowledge to handle, and troubleshoot the failure?
Was the issue investigated thoroughly?

These take-ins lead to super valuable inputs, which we probably wouldn’t collect any other way.

How did we kick this off?

Before we started running the chaos drills, there were a lot of concerns about the value of such drills, and the time it will require. Well, since eliminating our fear from production is one of the key goals of this activity, we had to take care of that first.

"I must not fear.
 Fear is the mind-killer.
 Fear is the little-death that brings total obliteration.
 I will face my fear.
 I will permit it to pass over me and through me.
 And when it has gone past I will turn the inner eye to see its path.
 Where the fear has gone there will be nothing. Only I will remain."

(Litany Against Fear - Frank Herbert - Dune)

So we started a series of chats with the teams, in order to understand what was bothering them, and found ways to mitigate it. So here goes:

There’s an obvious need to avoid unnecessary damage.
- We’ve created filters to ensure only approved targets get to participate in the drills.
  This has a side effect of pre-marking areas in the code we need to take care of.
- We currently schedule drills via statuspage.io, so teams know when to be ready, and if the time is inappropriate,
  we reschedule.
- When we introduce a new kind of fault, we let everybody know, and explain what should they prepare for in advance.
- We started out from minor faults like graceful shutdowns, continued to graceless shutdowns,
  and moved on to more interesting testing like faulty network emulation.
We’ve measured the time teams spent on these drills, and it turned out to be negligible.
Most of the time was spent on preparations. For example, ensuring we have proper alerting,
and correct resilience features in the clients.
This is actually something you need to do anyway. At the end of the day, we’ve heard no complaints about interruptions, nor time waste.
We’ve made sure teams, and engineers on the call were not left on their own. We wanted everybody to learn
from this drill, and when they weren’t sure how to proceed, we jumped in to help. It’s important
to make everyone feel safe about this drill, and remind everybody that we only want to learn and improve.

All that said, it’s important to remember that we basically simulate failures that occur on a daily basis. It’s only that when we do that in a controlled manner, it’s easier to observe where are our blind spots, what knowledge are we lacking, and what we need to improve.

Our roadmap – What next?

Up until now, this drill was executed in a semi-automatic procedure. The next level is to let the teams run this drill on a fixed interval, at a well known time.
Add new kinds of failures, like disk space issues, power failures, etc.
So far, we were only brave enough to run this on applicative nodes, and there’s no reason to stop there. Data-stores, load-balancers, network switches, and the like are also on our radar in the near future.
Multi-target failure injection. For example, inject a failure to a percentage of the instances of some module in a random cluster. Yes, even a full cluster outage should be tested at some point, in case you were asking yourself.

The GomJabbar Internals

GomJabbar is basically an integration between a discovery system, a (fault) command execution scheduler, and your desired configuration. The configuration contains mostly the target filtering rules, and fault commands.

The fault commands are completely up to you. Out of the box we provide the following example commands, (but you can really write your own script to do what suits your platform, needs, and architecture):

Graceful shutdowns of service instances.
Graceless shutdowns of service instances.
Faulty Network Emulation (high latency, and packet-loss).

Upon startup, GomJabbar drills down via the discovery system, fetches the clusters, modules, and their instances, and passes each via the filters provided in the configuration files. This process is also performed periodically. We currently support discovery via consul, but adding other methods of discovery is quite trivial.

When a users wishes to trigger faults, GomJabbar selects a random target, and returns it to the user, along with a token that identifies this target. The user can then trigger one of the configured fault commands, or scripts, on the random target. At this point GomJabbar uses the configured CommandExecutor in order to execute the remote commands on the target hosts.

GomJabbar also maintains a audit log of all executions, which allows you to revert quickly in the face of a real production issue, or an unexpected catastrophe cause by this tool.

What have we learned so far?

If you’ve read so far, you may be asking yourself what’s in it for me? What kind of lessons can I learn from these drills?

We’ve actually found and fixed many issues by running these drills, and here’s what we can share:

We had broken monitoring and alerting around the detection of the integrity of our production environment. We wanted to make sure that everything that runs in our data-centers is managed, and at a well known (version, health, etc). We’ve found that we didn’t compute the difference between the desired state, and the actual state properly, due to reliance on bogus data-sources. This sort of bug attacked us from two sides: once when we triggered graceful shutdowns, and once for graceless shutdowns.
We’ve found services that had no owner, became obsolete and were basically running unattended in production. The horror.
During the faulty network emulations, we’ve found that we had clients that didn’t implement proper resilience features, and caused cascading failures in the consumers several layers up our service stack. We’ve also noticed that in some cases, the high latency also cascaded. This was fixed by adding proper timeouts, double-dispatch, and circuit-breakers.
We’ve also found that these drills motivated developers to improve their knowledge about the metrics we expose, logs, and the troubleshooting tools we provide.

Conclusion

We’ve found the chaos drills to be an incredibly useful technique, which helps us improve our resilience and integrity, while helping everybody learn about how things work. We’re by no means anywhere near perfection. We’re actually pretty sure we’ll find many many more issues we need to take care of. We’re hoping this exciting new tool will help us move to the next level, and we hope you find it useful too 😉

May 11, 2017

Anatoly Rabinovich

Effective Testing with Loan Pattern in Scala

Tests are crucial in systems that rely on CI/CD as part of their release cycle. One of the challenges is to write stable tests that work for you without spending a lot of time on maintaining bad tests.

Tests are Hard

They’re hard to write, hard to maintain and it’s even harder to stabilize a flaky test. At Outbrain, we take special pride in our ability (for the most part) to deliver new features to production and doing so with the confidence that only reliable tests can give you. These tests play a crucial role in our ability to deliver fast, good and stable code making sure no regression bugs were introduced in the process. It is crucial then, to not only maintain good test suites (unit tests, integration, and e2e) but also to fix any test that misbehaves (flaky tests).

We have a special environment to facilitate integration and e2e tests called simulation environment (it is only one of the set of tools we have for that purpose). This is a dedicated set of servers which we use to simulate our production environment. We deploy every new version of our services to that environment before we deploy to production, and run tests that check new flows of code, regression, and interoperability to other services.

In order to write an effective test for a new feature, we sometimes need to set up the environment with entities that are required for the feature we’re testing. If, for example, our new feature is to register a car to an owner (a Person entity). Before running the tests we need the required entities, a Car and a Person in our database. We’re not trying to test a flow for creating a new car, or a new person in this scenario. Therefore there is no need in creating the car and/or the person entities explicitly in the test before the actual test scenario happens. And in order to make our tests as clear and succinct as possible — we don’t want to be creating this data explicitly in each and every test.

Bad Practices

So, it was a common practice (albeit a bad one) to have pre-existing data on which we would rely on to run tests (for the whole simulation environment!). This led to two big (interconnected) problems:

No test isolation – a test mistakenly deleting some or all of the pre-existing data, for example, would do so for all the tests that run in that environment
Flaky tests – tests running concurrently are creating, deleting and generally changing data that affects others, which in turn would fail tests for no good reason — which makes it really hard to analyze and fix a failing test

We’ve tackled this problem by creating the needed data before the tests in a test class and deleting it after the test run. Which mitigated the problem somewhat — not only the tests in the same class were interconnected but also added boilerplate to the test class. Now, a test class looked something like (assuming these are entities autogenerated by Scalike for the relevant tables):

ScalaTest:
[gist id=a3f4c2721f3cedfc97d0b4650f41c48e file=MyTestClassScalaTestVanilla.scala]

Specs2:

[gist id=1d0ae89d8838d77a74f6262ffd8c21e1 file=MyTestClassSpecs2Vanilla.scala]

Looking at this, we were presented with a challenge. First, the data is created for all the tests that run in a class, which must be deleted only after all tests have finished running — this means that the tests are not isolated one from another and potentially may become flaky. Second, we wanted an elegant way of creating and deleting the needed entities seamlessly in order to minimize the boilerplate for each test class.

Note: It is possible however, in Specs2, to make a better solution by using the ‘Scope’ trait like so:

[gist id=e47bfe8119f8b426f24e5267badb009d file=ContextAfter.scala]

And using it in a test like so:

[gist id=54eb6ec0bbc1f034e29b8a77b0fc09f8 file=MyTestClassWithContextAfter.scala]

It’s a good solution, for a simpler problem than we faced. We needed the tests running in a single transaction, with a supplied session and a configurable db name (indicating a set of Scalike connection parameters).

Enter Loan Pattern

We first encountered this pattern when using ScalaTest and quickly moved to using it also in Specs2 (as most of our tests are written in Specs2). From ScalaTest documentation for Sharing fixtures:

“A test fixture is composed of the objects and other artifacts (files, sockets, database connections, etc.) tests use to do their work. When multiple tests need to work with the same fixtures, it is important to try and avoid duplicating the fixture code across those tests.”
“If you need to both pass a fixture object into a test and perform cleanup at the end of the test, you’ll need to use the loan pattern”

Which means, we can use fixtures to set up ‘artifacts’ for the tests to use, promoting the DRY principle by minimizing code duplication. It is also a good way to reduce boilerplate when writing tests. So, we wrote this one simple trait:

ScalaTest:

[gist id=0c629f47395900eb86c32142d95abe91 file=TestDataSupportScalaTest.scala]

Let’s go over what’s happening in this trait. We’re mixing in a custom trait called ‘DefaultGenerator’ which gives us the ‘DefaultObjects’ which are the entities we need to be pre-created for our tests to run. We have two private methods. One that calls ‘create’ on ‘DefaultObjects’ with a custom name to generate the needed entities. The other calls ‘cleanup’ on the test data to clean the environment after the test has finished running. And the star of this trait, the method (or fixture if you will) ‘withTestData’ which gets the test function as a parameter, calls the private method ‘createTestData’, calls the test and passing it the data we just generated and finally cleans up the generated data after the test finishes.

When mixing this trait in our test class, we get the following code:

[gist id=3cac004b5e5626de5a67293c233232e1 file=MyTestClassWithTestDataSupportScalaTest.scala]

‘testData’ is the data generated in our ‘withTestData’ method (a car and a person in our case).

The Specs2 version of the Loan Pattern is a bit more complex, as we’ve added some more bells and whistles to make it easier for us to create those entities in our domain. We’re using Scalike to create the entities in MySQL database, and we need a somewhat more refined control over the session we’re using, DB name etc’.

Specs2:

[gist id=4bfbbb0c62a8f89e3a0b87d6dcd6d2b5 file=DataContextName.scala]

[gist id=c1aab66a12dbf5206b0c41edde62b63c file=DataContextDbName.scala]

[gist id=38a6c3de8f78fb0cb1d2a2275e2ff42a file=PackageObjectTestData.scala]

[gist id=5230ad4a1b92a5270713a70c0bb75d17 file=DefaultDataContextSpecs2.scala]

It’s very similar to the ScalaTest flavor, but with several changes we needed to make to better facilitate our needs in the Specs2 tests. We have a mechanism to initialize a named DB connection, with a named connection pool and an explicit session. Besides these additions, it’s pretty similar to ScalaTest — generate the test data, run the test and clean the generated data.

The test class now looks like this:

[gist id=64701c4df08ecb2d29da4fc82c70c0bc file=TestClassDataContext.scala]
[gist id=d85b13035f768656f97b36f0df81c0e4 file=MyTestClassWithTestDataSupportSpecs2.scala]

Summary

We tackled several issues our team faced on a day to day basis, which made our simulation environment unstable, hard to maintain and generally very frustrating to work on. By extracting data generation and cleanup to an external trait and using a clever mechanism to reduce boilerplate, we managed to clean and simplify the test class, reduce code duplication and generally made our lives easier. Tests are still hard, but a bit easier to write and nicer to read. What do you think?

Category: Testing

Background

Guava OUT

Caffeine IN

Guava’s hit-rate benchmark vs Caffein’s

Conclusion

Why should I even care?

How to stop bugs from reaching production

Stage 1 – Local Environment and CI

Step 1: Design well. Keep it simple.

Step 2: Start Coding

Step 3: Use code analysis tools

Step 4: Perform code reviews

Step 5: CI

Stage 2 – Testing Environment

Step 1: Run integration tests

Step 2: Run functional tests

Stage 3 – Staging Environment

Stage 4 – Production Environment

Step 1: Deploy gradually

Step 2: Incorporate feature flags

Step 3: Release gradually

Step 4: Monitor and Alerts

Chaos Drills at Outbrain

How did we kick this off?

Our roadmap – What next?

The GomJabbar Internals

What have we learned so far?

Conclusion

Tests are Hard

Bad Practices

Enter Loan Pattern

Summary

Search

עברית

Categories

Archive

RSS