Category: Training

Keep bugs out of production

Production bugs are painful and can severely impact a dev team’s velocity. My team at Outbrain has succeeded in implementing a work process that enables us to send new features to production free of bugs, a process that incorporates automated functions with team discipline.

Why should I even care?

Bugs happen all the time – and they will be found locally or in production. But the main difference between preventing and finding the bug in a pre-production environment is the cost: according to IBM’s research, fixing a bug in production can cost X5 times more than discovering it in pre-production environments (during the design, local development, or test phase).

Let’s describe one of the scenarios happen once a bug reaches production:

  • A customer finds the bug and alerts customer service.
  • The bug is logged by the production team.
  • The developer gets the description of the bug, opens the spec, and spends time reading it over.
  • The developer then will spend time recreating the bug.
  • The developer must then reacquaint him/herself with the code to debug it.
  • Next, the fix must undergo tests.
  • The fix is then built and deployed in other environments.
  • Finally, the fix goes through QA testing (requiring QA resources).

How to stop bugs from reaching production

To catch and fix bugs at the most time-and-cost efficient stage, we follow these steps, adhering to the several core principles:

How to stop bugs from reaching production

Stage 1 – Local Environment and CI

Step 1: Design well. Keep it simple.

Create the design before coding: try to divide difficult problems into smaller parts/steps/modules that you can tackle one by one, thinking of objects with well-defined responsibilities. Share the plans with your teammates at design-review meetings. Good design is a key to reducing bugs and improving code quality.

Step 2: Start Coding

Code should be readable and simple. Design and development principles are your best friends. Use SOLID, DRY, YAGNI, KISS and Polymorphism to implement your code.
Unit tests are part of the development process. We use them to test individual code units and ensure that the unit is logically correct.
Unit tests are written and executed by developers. Most of the time we use JUnit  as our testing framework.

Step 3: Use code analysis tools

To help ensure and maintain the quality of our code, we use several automated code-analysis tools:
FindBugs – A static code analysis tool that detects possible bugs in Java programs, helping us to improve the correctness of our code.
Checkstyle –  Checkstyle is a development tool to help programmers write Java code that adheres to a coding standard. It automates the process of checking Java code.

Step 4: Perform code reviews

We all know that code reviews are important. There are many best practices online (see 7 Ways to Up Level Your Code Review Skills, Best Practices for Peer Code Review, and Effective Code Reviews), so let’s focus on the tools we use. All of our code commits are populated to ReviewBoard, and developers can review the committed code, see at any point in time the latest developments, and share input.
For the more crucial teams, we have a build that makes sure every commit has passed a code review – in the case that a review has not be done, the build will alert the team that there was an unreviewed change.
Regardless of whether you are performing a post-commit, a pull request, or a pre-commit review, you should always aim to check and review what’s being inserted into your codebase.

Step 5: CI

This is where all code is being integrated. We use TeamCity to enforce our code standards and correctness by running unit tests, FindBugs validations Checkstyle rules and other type of policies.

Stage 2 – Testing Environment

Step 1: Run integration tests

Check if the system as a whole works. Integration testing is also done by developers, but rather than testing individual components, it aims to test across components. A system consists of many separate components like code, database, web servers, etc.
Integration tests are able to spot issues like wiring of components, network access, database issues, etc. We use Jenkins and TeamCity to run CI tests.

Step 2: Run functional tests

Check that each feature is implemented correctly by comparing the results for a given input with the specification. Typically, this is not done at the development level.
Test cases are written based on the specification, and the actual results are compared with the expected results. We run functional tests using Selenium and Protractor for UI testing and Junit for API testing.

Stage 3 – Staging Environment

This environment is often referred to as a pre-production sandbox, a system testing area, or simply a staging area. Its purpose is to provide an environment that simulates your actual production environment as closely as possible so you can test your application in conjunction with other applications.
Move a small percentage of real production requests to the staging environment where QA tests the features.

Stage 4 – Production Environment

Step 1: Deploy gradually

Deployment is a process that delivers our code into production machines. If some errors occurred during deployment, our Continuous Delivery system will pause the deployment, preventing the problematic version to reach all the machines, and allow us to roll back quickly.

Step 2: Incorporate feature flags

All our new components are released with feature flags, which basically serve to control the full lifecycle of our features.  Feature flags allow us to manage components and compartmentalize risk.

Step 3: Release gradually

There are two ways to make our release gradual:

  1. We test new features on a small set of users before releasing to everyone.
  2. Open the feature initially to, say, 10% of our customers, then 30%, then 50%, and then 100%.

Both methods allow us to monitor and track problematic scenarios in our systems.

Step 4: Monitor and Alerts

We use the ELK stack consisting of Elasticsearch, Logstash, and Kibana to manage our logs and events data.
For Time Series Data we use Prometheus as the metric storage and alerting engine.
Each developer can set up his own metrics and build grafana dashboards.
Setting the alerts is also part of the developer’s work and it is his responsibility to tune the threshold for triggering the PagerDuty alert.
PagerDuty is an automated call, texting, and email service, which escalates notifications between responsible parties to ensure the issues are addressed by the right people at the right time.

stop bugs
All in All,
Don’t let the bugs fly out of control.

Building the Culture to build Systems

Part 1 – The Badges

If there’s one thing we could name that differentiates a good team from a great one, it’s culture. It’s not something you can buy, it’s something you need to build, grow and nurture over time.

So how do you build the right culture?

It is very much like asking “how do you keep in shape?” First you need to set the goals you want to achieve and then you need to start exercising, but remember it is an effort that never ends – once you stop investing in it and not using it, you will lose it.

First – setting the goals, i.e. defining the values

You need to set goals, so you will know what to focus on, and as we are dealing with culture – the goals are actually the values you want to adopt and enforce.

For us some of the values and behaviours we want to emphasize are: collaboration, learning, fun, getting out of your comfort zone, initiative, excellence.

Once you have that defined you can start working out! In this post, and others to come, we will share some of the exercises we use to make sure we stay in shape.

Exercise 1 – The Badges

We created several sets of badges to enforce selected behaviours and celebrate occasions. Those badges are been handed out during weekly group sync meetings, and can also be added to mail signatures if one desires so. They also makes a nice collection, and naturally there are common and rare variants. We’ve found that this drives conversation and directly affects culture.

A few examples of the badges sets:

The Tech Collection:

We wanted to encourage using different tech tools, like Vagrant, improve Chef quality, killing tech debt etc. – so we made sure we have appropriate badges:



The Production Collection:

At the end of the day we are all dealing with production – and we want to celebrate that.

We created several badges for all kind of production related happenings, whether you have broken production, or spend a sleepless night – you will get mentioned. Seeing a senior engineer being granted a “broken prod” badge, drives our “blameless” culture, but still carries the weight of responsibility. No one wants a whole set of black badges!

sheep breakprod


The Celebration Collection:

We like to celebrate – whether it is the first on-call shift, the fact that one presented in a conference, arranged a meetup, or just helped a colleague out. By celebrating behaviors we value, we drive people to adopt them. Here are a few examples:

solo present ThanksBro


The “Jewish Mother” Collection:

Adding a good laugh is always nice, and at the end of the day we all have our quirky behaviours. By celebrating them we keep smiling… and strengthen tolerance.

holdon Alon eat


It is a living exercise and we keep on adding badges as we move along. In fact, we have our own engineers suggest and even create them. Our experience so far shows that gamifying culture-building activities has a positive effect on team atmosphere, and direct effect on the specific behaviors we address.
Stay tuned for additional exercises in How to Keep Your Culture in Shape.

New Hire LaunchPad

The first few days or weeks of a new hire in the company can be critical. What they say about first impressions, surely applies in this case as well – this initial time period, where an engineer needs to understand ‘how things are done here’, is important.

As Outbrain is scaling its engineering R&D group, it became evident that the current new hire training program is lagging behind. A new hire usually went through an extensive onboarding frontal lectures, where each subject was lead by a more veteran employee with the proper knowledge. It would usually take a month or so to go through all the lectures, not because there were so many of them, but because they were scattered and opened only when there were enough people (otherwise, it could inflict an unacceptable pressure on the employees giving them)

Hands on stuff, such as how to write a standard service, how to deploy it, what is the standard way to add metrics to you application etc, AKA – the important stuff, was usually taught in an informal way, where each team leader did the best they could, under the usual tight time constraints. As this method served us well for quite some time, the recruitment scale – and, in particular, the large number of new hires in a short time – forced us to re-think this process.

What we envisioned, was a 2 week bootcamp, that each engineering R&D new hire will have to undergo. We considered 3 basic approaches:

  • A  frontal lectures class, opened once a month, with people from various teams. This will not be efficient, as new hires rate are not consistent, and the learning is not by experience. the load on the instructors is huge, and they don’t always have the time or the teaching skills

  • A by-subject self learning, consisting of a given list of things to study and learn. As this scales better than the first approach, it still lacks the hands on experience, which is so important for understanding such a diversified, dynamic and multilayered environment.

  • Tasks-based bootcamp, consisting of a list of incremental tasks (where each new task relies on the actual implementation of the previous task). The bootcampee ends up creating a fully functional, deployed service. The service is a real service, in the sense that it is actually deployed to production, uses key infrastructure systems like any other service etc.

Our choice was the tasks-based bootcamp. in the course of a few weeks, we created a plan (documented in the company’s wiki), consisting of 9 separate units. Each unit has its own page, consisting 4 sections: the unit’s overview, some general notes, the steps for the unit and a section with tools and applications for this unit. We added a big feedback link to each page, so we can get the users feedback and improve accordingly.

The feedback we got is very promising – people can work independently, team leaders can easily scale initial training and the hands on experience gained was proving to be extremely important. The bootcamp itself is also used a knowledge source, where people can go back to and understand how to do things.

Our next steps, other than constantly improving the current bootcamp, are identifying other subjects that needs to be taught (naturally, a more advanced and specific ones) and building a bootcamp program following the same lines – hands on, progress on your own, real world training.