January 20, 2021

Avi Avraham

Yet another Hadoop migration – but from the human perspective

This blog is about our last Hadoop migration but from a different angle, instead of describing the technical aspects of it (don’t worry, I will go over it in addition) I will be more focusing on the human perspective of it, how and why we took the decision to go from (attention, spoiler alert!!!) commercial to community solution.

Background

But first some context, Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month, to over 1 billion users across the world. In order to support such a large scale, we have a backend system built of thousands of micro services running inside Kubernetes containers spread over more than 7000 physical machines in 3 data centers and in public clouds (GCP & AWS). In order that the recommendations we supply will be valuable to our readers we invest in personalization as much as possible. To achieve this goal we have lots of machine learning algorithms that run in the background on top of the Hadoop ecosystem which makes it a very critical system for our business, we have 2 flavors of Hadoop clusters:

Online – we have 2 clusters in full DR mode running on bare metal machines, they are used for online serving activities.
Research – we have several clusters (per group and usage) running in GCP, they are used for research and offline activities.

Every day each cluster gets over 50TB of new data and there are over 10K job executions.

The Trigger

Few years ago we migrated our online Hadoop clusters to use the MapR commercial solution (you can read more about this in the Migrating Elephants blog), it had lots of improvements and we enjoyed the support we got. But a few years later, our Hadoop usage increased dramatically which made us very professional in supporting this system, we improved our knowledge and we could handle things on our own instead of depending on external resources for solving issues (which is one of the benefits when using a commercial solution). So, a few months before the license renewal we wanted to get a decision for what is the best way for us to proceed with this system.

We knew that it can take us some time until we will come up with new solution in production, we needed to make sure that the current system will continue to operate so we did the following actions in order to mitigate the time factor:

MapR have also a community version, we mapped the differences between it and the commercial version, there are few but the main ones that we had to solve were:
- No HA solution (for the CLDB, the main management service) – for that we implemented our own custom solution
- Support only single NFS endpoint – some of our use cases copy huge amount of data from the Hadoop into a separated DB so we needed a scalable solution, we ended up with a internal FUSE implementation
Since we had 2 clusters in DR mode, we were able to test and run the community version in parallel to the commercial version
Just for case, we made sure that we could extend the license for only several months instead of 3 years commitment

Those actions enabled us to make the right decision without any pressure, we made sure that we will be able to run the system with the community version until we will have the new solution in place.

The Alternatives

We came up with those alternatives:

Stay with the MapR commercial version – we didn’t want to stay with the community version (don’t have big users adoption and support).
Migrate to Cloudera commercial version
Migrate to Cloudera community version – this option was dropped immediately since the amount of nodes we had (>100) exceeded the limitation of using the community version
Migrate to Apache Hadoop community solution
Migrate to Google Cloud – the same we did with our research cluster

In order to be able to compare between those alternatives, we first defined the measurements that will help up to determine the chosen alternative, we summarized all in the following table:

There were other measurements like cost and required additional headcount, but they were not factors in our decision.

The Decision

After the successful migration we had with the research cluster to GCP (you can read about this in the Hadoop Research Journey blog series), lots of us thought that once we will need to migrate the online clusters, they will be migrated also to GCP. But like in real life, you need to handle each case by its own characteristics. I have 4 children and I wish that there was a single way to rule them all, but the reality is that each one requires a different approach and a different attitude. The same goes with our Hadoop clusters, each one has its own characteristics so each one has different requirements and needs, they differ by their usage so what is good for one is not necessarily good for the other. Due to the nature of the online clusters (lots of compute, less storage) we realized that we will not be able to benefit from the cloud features (elasticity and compute-storage separation) like we had with the research cluster.

Like said before, case-by-case, we needed another solution for our online clusters. Following this conclusion it was clear that the chosen solution will be from one of the bare metal alternatives. This left us to choose between the community and the commercial versions, we compared the pros and cons of each option but the main question was: do we want to pay for a license or to invest the money in training people and have better internal skills?

We figured out that the professional level we will have will have the best impact, it will be a win-win situation, our users (the engineering teams) will get better service and we will invest in our people. Following the post’s title, we took the dissection from the human perspective. Having said that, it was clear that the winning alternative was the Apache Hadoop Community.

The Migration (in short)

The migration needed to be done while the system itself continued to operate as a production system, this meant that we needed to continue to give the same service to the users while they will not be affected. As we have 2 clusters in separated DCs, so we did it one cluster at a time, in each it was done in a rolling manner, we started with a small Apache cluster and in several cycles we managed to move the workload from the MapR cluster into it.

Data migration

New data – we started to ingest the new data into the cluster, it was done by integrate the Apache cluster with our data delivery pipeline
Historical data – most of the workload needs some historical data for their processing, as the migration was in cycles, we copied only the data that was required for each cycle

Workload migration

We repeated the following steps until we managed to migrate the entire workload:

Moved some of the jobs from the MapR cluster to the Apache cluster
Stopped the migrated jobs in the MapR cluster
Moved machines from the MapR to the Apache cluster

The ability to perform those cycles was thanks to mapping we did, we defined the groups of jobs that can be moved together in one piece taking into consideration the dependencies between the jobs and the cluster capacity. The last cycle contained all the jobs that were tightly connected to each other and there was no possibility to be separated.

Of course there were lots of other technical details but they are a subject for a separate blog.

Achievements

The migration was a huge success, it was a full cooperation of all relevant engineering teams. After preparations that took us a few months, in no time (~1 month per cluster) we managed to complete it.

Now we can summarize what we achieved:

The human factor – our people improved their technical skills and got more professional in supporting the Hadoop system
We are using Open Source Software – there are lots of articles describing the benefits of using open source software which are: flexibility and agility, speed, cost effectiveness, attract better talents etc.
Improved cluster performance – with the same amount of cores (actually with a little less), all jobs finish their processing at least 2 times faster. For example, in the below graphs you can see the process time of the hourlyFactFlowFaliCreationInHiveJob in the Apache vs. MapR

Cleanup – as a preparation for the migration we mapped the storage and workloads that will be migrated, this enabled us to perform cleanups where possible
- Storage – we were able to delete more than 700TB which reduced the cluster capacity from 73% to 49%, you can see the impact in the graph below

Workloads – we disabled more than 800 jobs which was almost half of the total number (today we have ~950 jobs). In addition, we made some order with the jobs owners so instead of 19 ownership groups we now have only 16
Technical debt – one of the advantages of moving to open source software is the ability to use the latest versions of all integrated packages and in addition to be able to integrate even more packages, this ability is sometimes missing when using a commercial solution. In our case we were able to:
- Upgraded our Spark jobs (version 2.3)
- Upgraded our SparkStreaming jobs (version 2.4)
- Implemented Presto as another service for supplying a high performance query engine that can combine few data sources in a single query.

Epilog

The decision to migrate one of our critical systems from commercial to community version was not an easy decision to make, it involved some amount of risk but according to the blog’s title, we did it from the human perspective, when you invest in the people you get the investment back in big time.

Taking such a brave decision requires strong leadership, by nature there are always few sceptical people that fear changes, especially from such a big change in such a critical system. We needed to be determined and express confidence in the people and in the process, all people must be committed and engaged to this decision.

To summarize things up, the migration was a big success from the human perspective and from the technical aspects, it was a team effort that we all are enjoying its outcome. And regarding the different solution per Hadoop cluster, like in real life, try to follow the rule of case-by-case.

December 14, 2020

Avi Youkhananov

Faster release with Maven CI Friendly Versions and a customised flatten plugin

Fed up with waiting for the maven release? We’ve found a way to cut the release time by half. Each of our teams at Outbrain is responsible for its own service code in its own repository. However, our teams also share a large Maven-based repository that contains modules (libraries) that get released as Maven artifacts. After a module is released, it can be used by the teams. Thus, the shared repository — in contrast to service code, which is managed within individual team repositories — serves as a centralised place to manage team libraries.

Since our shared repository has hundreds of modules managed by multiple teams, we started to face failures during release, and release time increased dramatically. We decided to tackle this issue to boost efficiency.

The solution we came up with was to move to the Maven Ci Friendly Versions, which eliminate race conditions. (The Maven release plugin involves a git commit phase to change the pom.xml versions, but the change cannot get pushed if the commit hash set is out of date, prompting the build to fail.)

Moreover, we stopped using the Maven Release Plugin, accelerating our release process.

Bye Bye, Release Plugin.

Until recently, we had been using the Maven Release Plugin in order to release our libraries. The plugin has a two-step process, with different commands involved (release:prepare, release:perform).

The “prepare” and “perform” goals involve building the project multiple times. Moreover only one release can be triggered at a time — i.e., multiple releases cannot run simultaneously due to race conditions that were described earlier — you must wait for the current release to complete before starting the next. For projects that, like ours, have long build times, this is a deal-breaker. The Maven release plugin took far too long to run.

Welcome, Maven CI-Friendly versions
The approach we took is lightweight compared to the Maven release plugin approach and allows for multiple releases to be triggered and run simultaneously.

Here are the advantages of this approach over using the release plugin:

The Maven CI-Friendly Setup

The structure of our Monorepo (which follows a parent-child hierarchy) allowed us to easily transform all our pom.xml files from hard-coded versions to ${revision} properties as our artifact versions, which can be overridden as well.

In order to avoid redefining the revision property for each module, we defined the revision property in the parent pom.xml.

Here is a child pom.xml:

<project>
 <parent>
  <artifactId>ci-friendly-parent</artifactId
  <groupId>com.outbrain.example</groupId>
  <version>${revision}</version>
 </parent>
 <artifactId>ci-friendly-child</artifactId>
 <name>CI Friendly Child</name>
</project>

And this is the parent pom.xml:

<project>
 <groupId>com.outbrain.example</groupId>
 <artifactId>ci-friendly-parent</artifactId>
 <name>CI Friendly Parent</name>
 <version>${revision}</version>
 <properties>
   <revision>1.0.0-SNAPSHOT</revision>
 </properties>
</project>

As you can see, we moved to the CI Friendly Versions using revision property, and we are now set up to issue a local build to verify that the definition is correct.

To issue a local build, which will not be published, we invoked
mvn clean package as usual. This resulted in the artifact version 1.0.0-SNAPSHOT.

Want to change the artifact version? Easy.
Use the following command:

mvn clean package -Drevision=<REPLACE_ME>

The Maven release plugin used the revision placed in the pom.xml to define the next revision for release. If a development pom.xml holds a version value of 1.0-SNAPSHOT then the release version would be 1.0.
This value is then committed to the pom.xml file.
Finally, we can avoid those commits and hard-coded versions in pom.xml files.

Install/Deploy

In the Maven Ci Friendly Versions guidelines it is mentioned that the flatten maven plugin is necessary if you want to deploy or install your artifacts. Without this plugin the artifacts generated by this project cannot be used by other Maven projects.

This is true. But, the problem is that the flatten Maven plugin coupled with the “resolveCiFriendliesOnly” option does not work as expected due to bugs. Maven’s flatten plugin is a somewhat over-engineered, overly complex plugin that did not fit our needs. As adherers of the Unix philosophy, we decided to create our lightweight custom plugin, the ci friendly flatten maven plugin that replaces only the ${revision}, ${sha}, and ${changelist} properties.

The final pom.xml

</project>
  <project>
   <groupId>com.outbrain.example</groupId>
   <artifactId>ci-friendly-parent</artifactId>
   <name>CI Friendly Parent</name>
   <version>${revision}</version>

   <properties>
    <revision>1.0.0-SNAPSHOT</revision>
   </properties>

<modules>
...
</modules>
<build>
   <plugins>
    <plugin>
      <groupId>com.outbrain.swinfra</groupId>
      <artifactId>ci-friendly-flatten-maven-plugin</artifactId>
      <version>FIND_HERE</version>
      <executions>
       <execution>
          <goals>
            <goal>clean</goal>
            <goal>flatten</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
   </plugins>
 </build>
</project>

Building the bridge to Maven Ci Friendly Versions

We decided to transition to Maven Ci Friendly Versions. As I mentioned before, due to the bugs in the flatten plugin, we first needed to develop our custom flatten plugin. In our case, we used TeamCity to release our libraries, fetching the release version from git and tagging it (but it is worth noting that this process is suited to other build systems as well).

So, what do the release steps look like?

Fetch the latest git tag, increment it, and write the result to revision.txt file. This is the version we are going to release.

mvn ci-friendly-flatten:version

2. Set a system.version TeamCity parameter for our soon-to-be-released version, needed in order to use this version in the steps that follow.

#!/bin/bash -x
VER_PATH=”%teamcity.build.checkoutDir%/revision.txt”
REV=`cat $VER_PATH`
set +x
echo “##teamcity[setParameter name=’system.version’ value=’$REV’]”

3. Deploy the jars with the new version.

mvn clean deploy -Drevision=%system.version%

4. Tag the current commit with the updated version and push the tag.

mvn ci-friendly-flatten:scmTag -Drevision=%system.version%

The Maven-Release Plugin has two commits and, thus, triggers Clean/Compile/Test multiple times. In contrast, the new release process relying on our custom flatten plugin has zero commits, effectively eliminating build failures caused by race conditions in committed code during the build.

Moreover, in this release process, the goals Clean/Compile/Test are executed only once.

Overall, our new approach slashed release time by 50%, from 6+ min to 3+ min.

As a last note, we’ve employed this approach in other projects as well. So, while the time savings on this project was 3 minutes, in another project that was originally 22 minutes, the custom flatten plugin cut 11 minutes off the build.

So, take it from us. There is no need to get bogged down by Maven flatten plugin. Save yourself the headache. All you need is to switch to the Maven Ci Friendly Versions with our custom Ci Friendly Flatten Maven Plugin.

November 1, 2020

Avi Avraham

Our First Remote Hackathon

Background

2020 was a unique year when most of the industries in the world were forced to work remotely for a long period of time due to the COVID-19 pandemic. Besides health and employment concerns, staying at home for a long period of time is not natural for the human race as we are communicative beings by nature, causing many challenges, where one of the biggest challenges is how to maintain the social fabric. This is true in general and in particular for companies that most or all of its employees are working from home, and is becoming more significant as you go up in the organization structure, from team level to group level and above.

As managers, we were very concerned about finding ways to maintain the social connection between teams and had several discussions about it. During one of those discussions we remembered that in a different life period before the COVID-19 pandemic, we planned to have a Hackathon and we even started working on it but as usual, life has other plans. So, why not do the Hackathon we planned on doing just a little different? Besides the remote limitation, Hackathon is exactly what we need in order to solve this “disconnection” challenge.

As we already had experience on how to perform a “regular” Hackathon, we knew what are the basic things that needed to be done, but as this will be the first time we will do it remotely we understood that things need to be done a little differently, especially on the administrative side. As a rule, the participation in our Hackathons is not mandatory and the people’s engagement depends on the communication they receive. This is true for regular Hackathons and even more relevant for a remote Hackathon.

In this post I will describe the actions we took as the preparation and during the Hackathon itself, which led to a successful event. You can use it as a useful source of tips in case you plan to perform your own remote Hackathon.

Preparation

Most of the preparations for a remote Hackathon are the same as for a regular Hackathon, but there need to do some adjustments due to the nature of working remotely:

Choose the theme

The Hackathon’s theme define the Hackathon’s projects, so first thing you need to do is to define the Hackathon theme, in our case we choose the theme YACHADNESS, which is taken from the Hebrew word YACHAD which means TOGETHER, so YACHADNESS=TOGETHERNESS. We chose this theme since the target of this Hackathon was to make the people working together in order to solve the lack of communication during the WFH period, this was the reason we did this Hackathon and it was more important for us to focus on this aspect instead of any other technological subject. Due to the selected theme, there were no boundaries for the projects, anything is welcome and we let the people be as free as they want with their ideas.

Set timeline

When planning the Hackthon’s timeline need to take into account the following millstones:

Define the Hackathon’s projects
Present the selected projects
Establish the projects’ teams, each one can choose to assign himself to his favorite project
Define the winning KPIs
Establish the judges team
The Hackathon itself

Communication

As said before, it is true for regular Hacktons and even more for a remote one. Since all communication is done by mails and zoom meetings, you need to make sure that people are informed about the current status: where are we standing regarding the timeline and what are the next steps. In addition, due to the WFH situation, you need to invest more on the people’s engagement, this can be done by sending relevant mails, mentioning it during meetings and setting dedicated meetings for each milestone in the timeline.

Define the schedule

In most regular Hackathons, the people are gathered together in one place and are working in groups on their projects. In a remote Hackathon the people are working from their Homes, so the schedule needs to be prepared in advance, known to all and take into account all the limitations while working from home (kids, availability etc.). In our schedule we also added fun activities and we also arranged some giveaway that was delivered to the people’s houses during the Hackathon itself.

Working area

There are some benefits for working remotely, instead of arranging meeting rooms for the project teams (this is one of/most challenging task in a regular Hackathon), you need to create a dedicated zoom meeting and slack channel for each team to be used for all the team communication during the Hackathon period. In this way everyone could jump into each project and could say hi or to check how things are progressing, it was used also by our judges to do their periodic visits.

The Hackathon

The Hackathon itself was planned to take 2 days, since we all worked remotely we created a detailed schedule with check-in meetings and with lots of fun activities like Yoga session, Tabata workout and beer breaks, we wanted that those 2 days will be used also for fun in addition to the innovation work that was done. All was done in order to serve our initial plan, to bring the people closer even while working remotely.

So after a short kickoff session, each team started to work on their project (through their zoom & slack channel). We followed the schedule with dedicated shared meetings until we reached the final meeting that was the projects’ presentation, each team was given 10 minutes to present their project and in the end the winning project was declared, the winning was according to the results of a public voting combined with the judges decision.

And in case you wonder, the winning project was OBWho – a slack bot that can be give you a place to ask who is the owner of any internal system, the main logic of this project is a crawler that scans all our slack channels and according to that creates some mapping by using a machine learning algorithm that map the owners based on tokens that were found in each channel.

Conclusion

All the hard work that was done proved itself, we had a Successful Hackathon with results that passed our expectations. The participation ratio was very high, we had 6 great projects and the distance limitation did not influence the people who were committed and engaged as in a regular Hackathon, beside that we had lots of fun with all the fun breaks and activities.

But most importantly we achieved our initial goal, to bring people together for doing innovative work combined with fun despite the distance limitation, for 2 days we were all YACHAD again.

In order to summaries all in one place, in the below list you can find the additional actions that need to be taken into consideration when planning a remote Hackathon:

Lots of communication
Create slack channels for each project
Create zoom rooms for each project
Again, lots of communication
Define detailed schedule with sync zoom meetings
Have fun breaks through zoom
Send some giveaways to the people’s houses
In case it wasn’t clear unto now, lots of communication
Collect feedback through survey

For us, this remote Hackathon made the impact we wanted and we achieved exactly what we planned and more, I hope that you can learn from our experience in case you plan to have your own successful remote Hackathon.

May 27, 2020

Avi Avraham

Real Time Data Pipeline – More Than We Expected

When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming system, the reasons we had in mind were the obvious ones; to reduce the latency of our processing system and to make the data available in real time. But as soon as we started to work on it, we realized that there are quite a few additional good reasons to embark on this complex project.

Overview

Our data delivery pipeline is designed to deliver data from various services into our Hadoop ecosystem. There it is processed using Spark and Hive in order to produce aggregated data to be used by our algorithms and machine learning (ML) implementations, business reports, debugging, and other analysis. We call each type of this data a “datum”. We have more than 60 datums, each representing a different type of data, for example: request, click, listings, etc.

Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month, to over 1 billion users across the world. In order to support such a large scale, we have a backend system building from thousands of micro services running inside Kubernetes containers on top of an infrastructure that is spread over more than 7000 physical machines spread between 3 data centers and in public clouds (GCP & AWS).

As you can imagine, our services produce lots of data that is passed through our data delivery pipeline. On an average day, more than 40TB is moving from through it, and on peak days it can pass the 50TB.

And yes, in order to support this scale, the delivery pipeline needs to be scalable, reliable and lots of other words that end with “able”.

Pipeline Architecture

As I wrote above, we have multiple data centers (DCs), but in order to simplify the diagram and explanation, I will use only 2 DCs to present our pipeline architecture. As we go through the two architectures (legacy hourly and current RT), you will notice that the only components which remain the same are the edges. The services that produce the data, and the Hadoop where the data ends up.

Another thing worth mentioning is that in order to have full disaster recovery (DR), we have 2 independent Hadoop clusters in separated locations, each one of them gets all the data and processes the same jobs independently, in case one of them goes down, the other continues to work normally.

Hourly Files Pipeline – legacy

The data is collected by the services, and saved to files on the local file system of each service. From there, another process copies those files to our Data Processing System (DPS), which is an in house solution that collects all the data and copies it into the Hadoop. As you can see, it is a very simple architecture without a lot of components, and is quite reliable and robust since the data is saved in files which can be easily recovered in case of any malfunction.

The drawbacks of this pipeline is that the data is moved in chunks of hours and it is not accessible during the duration of the delivery time. It is only available for processing after each hour is completed, and all the data is in the Hadoop. And it also places a heavy burden on the network because the data is transferred in spikes.

Real Time Streaming Pipeline

Replacing the local file system storage, we have a Kafka cluster. And, instead of our in house DPS system we have a MirrorMaker, an aggregated Kafka cluster and a SparkStreaming cluster.

In this pipeline, the data is written directly from the service to the Kafka cluster. From this point all the data is accessible to anyone that wants to use it. Analysts, algo developers, or any other service that can find it useful. Of course it is also available to the MirrorMaker, as part of the pipeline.

The MirrorMaker job is to sync data between Kafka clusters, and in our case, to make sure that each aggregated Kafka cluster will have the entire data from all DCs.

Like before, the data in the aggregated cluster is available to all, especially to our SparkStreaming cluster which runs various jobs that consume the data and writes to the Hadoop.

So it’s clear that by implementing this pipeline, the data is available to all, and it reaches the Hadoop faster.

Obvious Benefits

Now lets go over the obvious benefits of the real time pipeline:

Data availability – since the data is passed through Kafka clusters, it is available to all implementations that need real time data. Here are 2 good examples:
- Recommendation engines can now calculate its algo models with fresher data. This led to a lift of more than 16% in our Revenue Per Mille (RPM).
- We added real time customer facing dashboards which enable them to take required actions faster according to fresh data.
Reduced the latency of the processing system – since the data reach the Hadoop faster, the hourly jobs that wait for the hourly data can start work sooner and to complete their work much closer to the end of the hour. This reduced the overall latency of the processing system and now the business reports for example, are available in a shorter time.

Additional Benefits

And those are the unexpected benefits that we gain from the real time pipeline:

Network capacity – In the hourly files pipeline, the data is moved in hourly batches and at the end of each hour. This means that the network capacity needs to support movement of all data at once. This requirement forces us to allocate the required bandwidth while it is used only for a short time during each hour, wasting expensive resources. The real time pipeline moves the same amount of data incrementally over the course of the hour.

The graphs below demonstrate the bandwidth savings that we made once we moved to the real time pipeline. For example, in the RX graph, you can see that we moved from peaks of 17 GBPs to have flat bandwidth of 7 GBPs, saving 10 GBPs.

Total RX traffic

Total TX traffic

Disaster recovery – the fact that in the hourly based pipeline the data is saved in files on the local file system of the machines that run the services have some limitations.
- Data loss – as we all know, machines have 100% chance of failing at some point. And when you have thousands of machines, the odds are against us. Each time a machine goes down, you have the risk of data loss since all the hourly files may not have been copied into the pipeline yet. In the real time pipeline, the data is written immediately to the Kafka cluster which is more resilient and the risk of the data loss is reduced.
- Late processing – if a machine has recovered from a failure, and you were lucky enough to avoid data loss, the recovered data needs to be processed and in most cases it won’t be done within the time period that the data is related to. This means that this time period will be processed again which adds extra load on the Hadoop and may result in data delays, since the Hadoop needs to process multiple time periods at the same time. Like before, the benefit of the real time pipeline in that aspect is that the data reaches the Hadoop without any delays so there is no reprocessing of any time period.

Conclusions

Having the real time pipeline, our life became much simpler. Beside the planned goals (data availability and reducing the latency), the extra goodies that we got from this change made us less sensitive to any network glitch or hardware malfunction. In the past, each one of those issues forced us to handle data delays and have the risk of data loss, and now the real time pipeline by its nature, solved all of them.

Yes, there are more components that we need to maintain and monitor, but this cost is justified if you compare it to the great results we achieved by implementing this real time system for our data delivery pipeline.

March 12, 2020

Arkadiy Verman

How you can set many spark jobs write to the same path

One of the main responsibilities of the DataInfra team at Outbrain, which I am a member of, is to build a data delivery pipeline, our pipeline is built on top of Kafka and Spark Streaming frameworks. We process up to 2 million requests per second through dozens of streaming jobs.

fig 1: Simplified data flow architecture

Problem Description

One of our requirements was to read data from different Kafka clusters and stream the data to the same path in the HDFS. This doesn’t sound complicated, several identical jobs, each provided with the Kafka address as a parameter. In practice, two concurrent jobs may delete each other files, here is how:

When we call saveAsHadoopFile() action in the spark program

https://gist.github.com/poluektik/7b15007bd94183deb0a8ea6acb71c684#file-hadoop-blog-post-gist-a

https://gist.github.com/poluektik/7b15007bd94183deb0a8ea6acb71c684

The save action will be performed by SparkHadoopWriter, a helper that writes the data and in the end issues a commit for the entire job.

See relevant documentation for SparkHadoopWriter:

https://gist.github.com/poluektik/5e54c2117f5db692c02c6472a5866d46

The commit is the part that is most relevant to our problem. The class that by default does the commit is FileOutputCommitter which, among other things, creates ${mapred.output.dir}/_temporary subdirectory where the files are written and later on, after being committed, moved to ${mapred.output.dir}.

In the end, the entire temporary folder is deleted. When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable.

(OutputCommitter Documentation)

I’ve tried to find an easy solution in the Stack Overflow and Spark community but couldn’t find anything, except for suggestions to write to different locations and use Distcp later on, which would require additional compute resources which I’d like to spare.

The solution:

We’ve created our own OBFileOutputCommitter which is almost identical to the default FileOutputCommiter, but it supports the change of the temporary configuration, fortunately, we can add our own committer through Spark configuration. This way each job will have its own temporary folder, so the cleanup won’t delete data of other jobs.

There is a catch, of course – You’ll have to use MultipleTextOutputFormat to make sure that the files have unique names. If you won’t, two jobs will have the same default names which will collide.

Here is the link to the custom committer code. Add it to your project and follow the example below.

https://gist.github.com/poluektik/d0481bbc93a8ffaf8f527516e200f9dc

Conclusion

The Hadoop framework is flexible enough to amend and customize output committers. With proper documentation,

it could have been easier, but that shouldn’t stop you from trying to alter the framework to suit your needs.

References

https://hadoop.apache.org/docs/r2.7.2/api/src-html/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.html

https://stackoverflow.com/questions/46665299/spark-avoid-creating-temporary-directory-in-s3/46690036#46690036

https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning?rq=1

https://github.com/apache/spark/pull/21286

https://issues.apache.org/jira/browse/MAPREDUCE-1471?jql=text%20~%20%22FileOutputCommitter%22

https://issues.apache.org/jira/browse/MAPREDUCE-7029

March 11, 2020

Dafna Frank Szarfman

I was a developer. Now I’m a Product Manager. Why and how.

Photo by davisco on Unsplash

I love writing code. I love everything about it. Coding is like solving puzzles. You get a problem and you need to think how to solve it. Think up the best solution, optimize it, debug it, squash bugs. I love writing code so much that I spent 20 years working as a developer. I even got used to waking up at night with a deep understanding that I just found a bug in my code. I love it! And despite having had several opportunities during my career to shift to other positions, I always decided to stick to coding. In a sense, I was afraid of not getting my hands dirty, of stepping away from the keyboard.

Product Hunt

When I joined Outbrain as a backend software developer in the Platform group, I took ownership of a backend service called Dyploma. Dyploma is a deployment gateway built on top of Kubernetes, enabling developers to build artifacts and deploy them to multiple data centers and environments with the click of a button. It makes our developers’ lives super easy.

Dyploma is a product with 200 users. A live product which is being used constantly, with feature requests, bugs and direct impact on the production services of Outbrain. With so many users, requirements and impact, we realized we wanted a product oriented approach to managing and evolving it. But we had no prior experience managing products, nor any product managers in the group. Our solution to this challenge was to introduce the Product Leads.

Product leads were an attempt to get the best of both worlds – keep the technical people doing the tech stuff they love and excel at, but also allow them to take on Product Manager responsibilities part-time. Dyploma wasn’t the only product in the Platform group where we wanted to implement a product-oriented approach, we had others as well. So when Platform leadership invited people to volunteer for these Product Lead roles, due to my deep sense of involvement with the product I was developing, I decided to take the challenge head on.

I took it upon myself to be the product lead for Dyploma, in parallel to being its backend developer. I met with the users, heard their wishes and complaints, mapped the different flows they needed, the new features they wanted, prioritized them… then switched hats to “backend developer” to do the implementations. It was fun, exciting, difficult, satisfying – it was everything I loved about writing code, but… not necessarily writing code.

Since I started doing something I was not familiar with, I decided I should start learning about what it takes to be a product manager. I have worked with product managers but to actually be one is different, and it had (and still has) its learning curve.

The Lean Startup

I learned that there is a constant loop of Learn->Build->Measure (Lean Startup anyone?).

For us, the “Learn” stage included creating a focus group, meeting the customers, documenting their requests and feedback.

One of the big advantages of working on internal products is that you can meet your customers all the time (should be handled with care, so it does not become a pain….). I would walk in the corridors and people would jump at me and start bombarding me with ideas and problems that they had. Corridor chats are nice but we needed more formalized communication channels to make communication scale, so I created focus groups and held periodic meetings with them. I used the documentation of these meetings when prioritizing features.

I used users stories to help me focus on what was really important, on the Minimum Viable Product (MVP). I then sat with a focus group of developers and we reviewed these user stories for feedback and validation. A slack channel for more ad-hoc documentation (and a paper trail), and boy, was it busy…

Next, the ‘Build’ stage. I figured this would be easy, since I was also the developer. But working with the UI team and seeing how the backend and frontend code work together toward a common goal, required going beyond just writing code. It required synchronization and team work.

Once built, came the ‘Measure’ stage. At first, this didn’t seem important – “if you build it, they will come” felt intuitively right. But that wasn’t the case. After implementing one specific, large feature, we found that user adoption was there… but we didn’t get the value we wanted. Partially because we didn’t define the value clearly enough, or had KPIs in place to measure it. In fact, once we implemented the KPIs, it seemed we were moving in the opposite direction!

Our learnings here lead us to implement a “KPIs first” approach, which wasn’t simple or necessarily intuitive. Instead of working on the feature, we would first work on the metrics, define the measurements and only then move onto building the features. This allowed us to “focus on the target” at all times and change direction quickly if needed. Another set of KPIs would focus on feature impact (for example, # of affected users) and implementation effort (high level estimation is mostly good enough). These reflect the feature’s perceived value and cost, so we could not only measure whether the feature moves the needle, but also understand how to prioritize the different features and manage our resources.

Measuring KPIs was true for customer satisfaction as well – we started gathering customer feedback, and were happy to see that our customers played along. They felt they were being heard and given a mandate to influence product direction.

To close the loop, following measurements, we returned to the ‘Learn’ stage, plotting the next steps in the product’s evolution.

Present day

Fast forward a year and a half – the Product Leads initiative was a partial success. Unfortunately, the number of Product Leads that kept the addon role over time wasn’t high. It was… one. But wherever we had a Product Lead, we got what we asked for :-). We had good user interaction, a product roadmap and a regular release cadence centered around it. This was the proof of the hypothesis we had in mind. As an infrastructure group we build products! And every product, no matter if it is internal or external, should be managed as one.

From this point on, we decided to invest more in this direction. I moved on to a role of full time Product Manager in the Platform group. While writing these lines, I manage 5 different products, developed by multiple teams, with more to come!

Looking back, here are some of my insights on what made this adventure a successful one.

Personal Skills – Like in many cases, a great human interface makes everything work better. The ability to build good relationships with different types of customers, make them feel they can approach you anytime and give feedback. Building these relationships will also enable you to facilitate better coordination and collaboration between the different teams.
Being a good Listener – As any good Product Manager, you need to collect ideas, listen to problems and needs, get feedback – and for all of that to be done well, you need to be a good listener. This usually involves keeping quiet while others talk 🙂
Being a good Investigator – Sometimes your customers are “too nice”. It might be that they don’t feel comfortable passing criticism, or they simply did not put the effort to think about it. They might not even realize they feel uncomfortable with some aspects of your product. To overcome this obstacle, you need to learn how to ask the hard questions. Look for the criticism, even if it doesn’t feel great, and focus on it. Dig deep, listen and look for hints.
Methodology – Working with different people, getting requirements all around, and choosing the right things to focus on – you must have a system if you want to get those right. When a VP has a feature request – is it more important than the feature request of the single developer? The system is more than just the priorities, it is also the KPIs, making sound decisions and holding conversations based on data and not on beliefs. The lighthouse should always be the value you bring to the product, and not necessarily the requestor’s role, how often they ask for it, or other “psychological bias factors”.

This past year was quite a ride – I learned a lot, gained exciting new experiences, failed and succeeded countless times. There’s still so much for me to learn, but one thing I already know – I love being a Product Manager!

February 10, 2020

Avi Avraham

Trusting the Infrastructure

One of the biggest challenges facing infrastructure groups is how to build TRUST with their customers. If achieved, this trust will lead to a better working environment where colleagues have good relationships and trust each other.

And like any relationship, one of the ways to gain someone’s trust is by being transparent. Transparency shows that you have nothing to hide, and allowing the free flow of information between sides.

In our engineering context, we are talking about transparency of the micro services owned by development teams and the resources that the infrastructure groups provide them.

Providing transparency for the resources’ usage is a difficult feat even in a simple system. It is even more complex when resources are shared across development teams, since you need to show each team its own relevant usage and not the overall usage of the whole physical resource.

At Outbrain we advocate for a full responsibility model – each development team has full ownership of its services, from development to production. This model gives the teams lots of power, as they have full freedom and are independent to do their job (and they do it great!) without having the infrastructure teams as a bottleneck.

This model makes the need for transparency of resources (and especially shared resources) even more critical.

A Bit Of Nostalgia

In the past everything was simpler. I grew up in the Eighties (actually it was the Seventies but we don’t have to be accurate all the time), we were young and beautiful, and everything was great and simple – there was no Internet, no mobile phones and we played outside in the street with other real human beings.

Even the computer systems were simpler. We had the classic 3 Tier Architecture: Presentation, Application & Data layers. The Application layer was a big monolithic server and the Data layer was probably an Oracle RDBMS. The infrastructure teams had full control over all the systems – no changes were made in production, unless they were the ones making it.

Back To Reality

The Eighties are gone, we are not children anymore (but still beautiful…). On the tech side, the micro services architecture was introduced, the big monolithic servers were broken into many little micro services, and the big databases followed the same trend. Each micro service can have its own database as if that wasn’t complex enough, each micro service can even have a different type of database depending on its own specific use case.

In addition, the amount of data and traffic grew exponentially, and the term BigData was invented. In order to support its growth, the systems become more and more complex, and in some cases the architecture became so complex and dynamic that no single person can describe it, not to mention to map it in a flow chart.…

The Challenge

Outbrain is the world’s leading content discovery platform. We serve over 300 billion content recommendations every month, to over 800 million users across the world. In order to support such a large scale, we have a backend system running hundreds of micro services on top of an infrastructure that is spread over more than 6000 physical machines spread between 3 data centers and in public clouds (GCP & AWS).

Our infrastructure contains, among other things, many data store clusters, data processing systems, data delivery pipelines and a BigData lake that is used by our recommendation ML algorithms, BI Engineers and analysts to produce business reports, etc.

And yes, the Cloud Platform Group which is responsible for supplying all this infrastructure has the same challenge that was described in the beginning – how do we empower our users and provide them the transparency to how much resources they are using and for what.

Conflicts

As complex as these computer systems are, the real complexity is actually the human factor. Every company has conflicts of two types: personal and structural.

As everyone knows, personal conflicts will always complicate engineering challenges. While structural conflicts are between technical teams with different roadmaps, priorities and constraints. They can quickly lead to personal conflicts and to a work environment with lots of tensions and frustrations. Such an atmosphere will inevitably affect the quality of the computer systems …

In our case the structural conflict is between the R&D teams and the Infrastructure teams. For example, infrastructure teams often feel like they are treated like warehouse workers. The R&D teams come with capacity or feature demands and the Infrastructure teams must fulfils them, the R&D teams are not aware or concerned of the effects and costs of their requests or the effect it might have on the resource which is shared with other teams.

Structural conflicts will always exist, but they can be managed effectively as long as there is a relationship built between the teams.

The Solution

The base of each relationship is TRUST and in our case the base for that is to provide visibility of all used services and resources.

The visibility enables the infrastructure teams to know who is using the infrastructure and how it is being used. The same visibility helps the R&D teams to understand the capacity of infrastructure that they are using. In fact, by providing this visibility we empower the R&D teams and taking the Infrastructure teams out of the warehouse worker role.

We realized that we must find a solution, a single place that will enable us and our users to understand the footprint of the various resources that are used and for what purposes. This will also enable us to estimate the cost of each feature/project leveraging our knowledge of resources that are used by it.

At Outbrain we created a self-serve Usage Report system, which crawls our entire infrastructure and provides visibility and transparency of resources for everyone, in order to display the required dashboards and reports.

There are lots of resources that can be monitored by the Usage Report system, but we decided to focus on the following:

Micro services deployments
Physical machines
Public cloud resources
Data processing utilization
Storage
Network usage

As you might imagine, the most challenging part of the Usage Report system is the crawling process, which needs to correlate between all the services and their users. This is done by adding an “owner” label to the services so it can be used later. Whereas the Usage Report system had to support different types of labels, based on the capability of each resource.

Conclusions

While not all services are part of the Usage Report system yet, we have already started to see the benefits of this tool.

In the process of implementing this tool, we added additional metadata which identifies the resource usage for each development team. Now we can associate the usage for each team.

In addition, by having this visibility we can estimate the cost of a certain feature. This can be achieved by calculating the cost of all the resources that it uses; in that way we can know if a feature is ROI positive or not.

But the main benefit that we gain is the TRUST. Now we have a real time dashboard which enables everyone to understand their resources usage. This knowledge increases the trust between the R&D and Infrastructure teams and empowering everyone to be more efficient and to take our ownership model one step forward.

December 3, 2019

Or Gerson

Frustration, Immortal TEZ Queries And Hope

My name is Or Gerson, and I am part of the DataInfra team in Outbrain.
DataInfra is responsible for a job scheduling system that supplies teams with the ability to define jobs that run queries on Hadoop cluster hosting about 2 PB of data.
Sometimes these queries can be inefficient – resulting in high processing time and extra load on the cluster.
Therefore we define a timeout limit for queries to execute, The idea is to kill the query when timeout occurred – however things are not so simple…

Kill them! Close their resource! Let none escape!

We use Apache-DBCP as our connection pool library and Hive2 driver to submit queries to Hive a pretty simple setup.
Apache DBCP provides connection pool services and is generally recommended as a connector to many relational databases.

Queries are done using Spring JdbcTemplate, but the implementing connections are managed by Apache DBCP.

These queries have a specific timeout set by the calling thread.
When the timeout occurs, the calling thread shuts down the async executor and calls the close() function on a “javax.sql.DataSource” object

dataSource.close()

This works well when using MapR engine, but when using TEZ the job continues to live long in the cluster, even after the VM is gone.

Seems like a common problem.
To my surprise, my online search found this to be a recurring issue, but without a good solution.

I started drilling down into the “org.apache.commons.dbcp.PoolableConnection” class to understand the problem better.

Shouldn’t closing the datasource be enough?

Under the DBCP and JDBC abstractions I found “org.apache.hive.jdbc.HiveStatement” which uses a thrift client to execute operations on Hive.

When a query was submitted using MapR and reached a timeout, the running job recognized that its handler had died and closed, resulting in “Query Cancelled” status on the waiting thread in HiveStatement.

TEZ did not recognize that its handling connection died, and continued to finish its run (even though the GC had definitely cleared this object).

Moreover, I found out that closing the datasource (using close() method) did not call the close() method on the connection object.
Searching the APIs in “org.apache.commons.dbcp” revealed that it did not expose its connection pool or the objects borrowed from it, so I had no way of interacting with them.

Now What?

First I needed to verify that I can actually close the connection from the client without interacting with the resource manager running the TEZ job (Yarn) directly.
Luckily I found that “HiveStatement” class behaves well and when calling close() method, the thrift session closes and does indeed kill the TEZ job.

I decided to create a data source that will keep references to the connections being used, allowing it to close them.

https://gist.github.com/kazabubu/5ff8d6f5faddb01dce9f93fd98f19458

Now, our datasource keeps its connection references and all we need is to use those “register” and “unregister” methods which are part of a simple “ConnectionRegister” interface supplying these methods.

https://gist.github.com/kazabubu/ffcdc774307abf810c0d3d80886e4f62

After implementing the aforementioned solution, queries using TEZ die on timeout, by explicitly closing the opened connection which delegates it to the thrift layer.

November 11, 2019

Avi Youkhananov

Oh my Guava! We are moving to Caffeine.

Caching is extremely important! It provides fast response time, enabling effortless performance improvements in certain use cases.
At Outbrain, we have recently moved to Caffeine caching, after having used Guava in-memory caching for many years.

Background

Caffeine library is a rewrite of Guava’s cache that uses a Guava-inspired API that returns CompletableFutures, allowing asynchronous automatic loading of entries into a cache. The library was written by Ben Manes who is the author of ConcurrentLinkedHashMap on which Guava cache is based.

Guava OUT

Guava blocks during loading when a key is not present in the cache.
We wanted to change the API to work asynchronously, but the added complexity made the code difficult to understand and troubleshoot.
To make Guava non-blocking, we overrode the load and loadAll methods of CacheLoader to make the API return a future.
Since we changed the API to return a future, we encountered a problem. Guava is unaware that we use it to store futures, so it stores futures completed with an exception as well. To prevent the exceptions from being stored in the cache, we needed to add more complexity to our code.

Caffeine IN

Caffeine is a rewrite of Guava’s cache that uses an API that returns CompletableFutures out of the box, allowing asynchronous automatic loading of entries into a cache.
Caffeine removes futures that complete with an exception from the cache.
Caffeine uses both a Least Recently Used (LRU) eviction policy and a frequency-based admission policy relying on CountMin sketch. It has a better hit rate than LRU for many workloads.

Guava’s hit-rate benchmark vs Caffein’s

We analyzed service behavior with real traffic under different cache configurations to get an idea of how production services will behave.

The benchmarked cache contains image URLs keyed by UUID and follow an access pattern of most-frequent / least-frequent data, which is our most common use case at Outbrain.

In the benchmarks below, we did not measure and therefore have no interpretation of memory usage. But analyzing hit rates leads to some interesting insights.

1. Cache size: 10k items
Expiration after write: 5 min

Hit rate with 10k items:

Caffeine 28.33 %
Guava 20.95%

2. Cache size: 50k items
Expiration after write: 20 min

Hit rate with 50k items:

Caffeine 56.04 %
Guava 50.01%

3. Cache size: 100k items
Expiration after write: 20 min

Hit rate with 100k items:

Caffeine 70.10 %
Guava 66.77%

4. Cache size: 300k items
Expiration after write: 20 min

Hit rate with 300k items:

Caffeine 87.19 %
Guava 84.85%

Conclusion

We did it! With minor changes to infrastructure code (as Caffeine and Guava API are almost identical), we improved our hit rate and reduced code complexity for critical services in our system.

Honestly, Caffeine smells better than Guava.

September 18, 2019

Davorin Kopic

Sharing Data Science Knowledge and Experience

Two years ago Outbrain and Zemanta joined forces. This, among many other great things, also resulted in big bi-directional knowledge exchange between Zemanta’s Data Science and Outbrain’s Recommendations teams. Inevitably, a lot of important progress has been made in both our algorithms and understanding. And of course, we still haven’t exhausted the huge backlog of new ideas which are sprouting from our discussions.

At Outbrain and Zemanta we know how important internal sharing of knowledge and experience is – but we also believe it is crucial to share knowledge with the wider community. And in our aspirations to do so, among other projects, we also started Zemanta’s Data Science Summer School.

The second annual Data Science Summer School

In July we hosted the second annual Data Science Summer School in our Zemanta’s office in Ljubljana, Slovenia. Among many applicants, we selected a group of very perspective young professionals and/or students and invited them to join us for a week of data science-flavored activities where they learned how we apply data science and machine learning in this data-rich industry.

The structure of the summer school

The week-long curriculum was set to be very practical and hands-on, but also have theoretical lectures intertwined. Participants first learned about the tools and techniques we are using in our day-to-day as data scientists in the industry. They learned how to use tools like git for version control, correctly set up python environments, use some python libraries like numpy, pandas for crunching data, matplotlib for visualization and scikit-learn to build some basic predictors.

Then, after setting up their environments, they got their feet wet by participating in a Kaggle challenge. Some participants had already participated in Kaggle challenges before so they shared their experiences and know-how, and for some, it was their first time so they tried to sponge up as much information as they could.

Finally, we provided them with a massive real dataset extracted from production, on which they had a chance to build their own predictors for estimating probabilities of clicks (CTR). After careful examination and analysis of 50+ provided features, they had an opportunity to use a tool of their choice to make predictions – some explored scikit-learn in more detail, while others chose various libraries like XGBoost for gradient boosted trees, XLearn for factorization machines or TensorFlow for neural networks. Finally, all teams presented their work and shared the gained knowledge.

Mixed in between hands-on experimentation they participated in many interesting talks and discussions on topics ranging from how programmatic advertising works, what is real-time bidding, theory powering auctions and what kind of algorithms and systems we are developing at Zemanta; all the way to data analysis, deploying machine learning models to production and some of our real-life scenarios and stories.

What the participants had to say

After successfully completing the week-long curriculum participants received their certificates and filled anonymous feedback forms saying things like “Great way to spend a week – the atmosphere was excellent!”, “The talks were especially interesting since they give a nice insight into the company”, “Working on real data gave me the opportunity to experience first-hand the problems data scientists are working on” – so we can say with great certainty the participants learned a lot and had tons of fun doing it.

Conclusion

This was the second iteration of Zemanta’s Data Science Summer School in our Ljubljana office in Slovenia. Mentors Robert, Luka, and Anže and I had a great time sharing knowledge with the students, who gained important insights into the processes behind applying data science and machine learning to solve real problems in the industry, so we are very excited to host more such events in the future.

Davorin Kopic
Head of Data Science @ Zemanta, an Outbrain Company

Category: Backend

Background

The Trigger

The Alternatives

The Decision

The Migration (in short)

Achievements

Epilog

The Maven CI-Friendly Setup

Here is a child pom.xml:

And this is the parent pom.xml:

Install/Deploy

Overview

Pipeline Architecture

Hourly Files Pipeline – legacy

Real Time Streaming Pipeline

Obvious Benefits

Additional Benefits

Conclusions

Problem Description

The solution:

Conclusion

References

Product Hunt

The Lean Startup

Present day

A Bit Of Nostalgia

Back To Reality

The Challenge

Conflicts

The Solution

Conclusions

Kill them! Close their resource! Let none escape!

Background

Guava OUT

Caffeine IN

Guava’s hit-rate benchmark vs Caffein’s

Conclusion

The second annual Data Science Summer School

The structure of the summer school

What the participants had to say

Conclusion

Search

עברית

Categories

Archive

RSS