July 9, 2020

Elazar Gur

Keren Ofek Granov

Faster Than a Missile: Intercepting Content Fraud

On May 21, 2020, thousands of Israeli websites were hacked by a group called “Hackers Of Savior”, probably including hackers from Turkey, The Gaza Strip, and North Africa. Not the first – and surely not the last – cyber-attack of its kind, the hackers targeted websites stored on uPress, an Israeli web hosting company.

The websites’ content was replaced by a page with a statement against Israel – “The countdown to Israel’s destruction began a long time ago” – and a disturbing video clip. The attack was timed to coincide with the Israeli national holiday “Jerusalem Day”, which celebrates the reunification of the city after the Six-Day War. It also coincided with the Iranian “Quds Day”, which was probably an additional motivation for the attack.

*The attacker’s content as shown on the hacked websites*

Beyond the visual corruption of the websites, the attackers also tried to delete all the websites’ data. Besides, the hackers prompted the browser to ask the viewer’s permission to take their photo with their computer’s webcam.

This attack was not your ordinary run-of-the-mill fraud event. It did not revolve around the usual advertiser-sourced content fraud. Rather, the source of the problem was an outside party and Outbrain was not even the target. The attack harmed not just the website publishers and end-users, but advertisers too.

Before diving in to see how Outbrain coped on the battleground during this event, let’s start at the beginning. What is “content fraud”, why is it such a challenge, and how does Outbrain protect its network and partners every day?

Content Fraud – The Major Challenge for Content Networks

Content fraud is an industry-wide problem that comes in many forms, including phishing, redirects, cloaking, malvertising, conspiracy-mongering and fake news. The internet is subsumed by an endless amount of fake content, such as fake quotes, doctored images, and deepfake video, making it increasingly difficult to navigate towards an authentic online experience. As content fraud is always evolving, it becomes much harder to detect and prevent over time.

As a discovery company, part of Outbrain’s lighthouse is to provide qualified content to its readers. While the overwhelming majority of content provided by advertisers is qualified, interesting and safe, Outbrain invests significant resources to ensure that no malicious or deceitful content makes its way onto our advertising platform.

When content fraud is detected, it comes from a specific and identifiable source that can be easily blocked and prevented from accessing Outbrain’s network. Generally, we use multiple technologies and people-powered techniques to verify that advertisers are legitimate and promoting qualified content. We look at a variety of properties, the identity of the advertiser, and the content itself. Based on the technologies Outbrain usually uses, the ability to respond is broad and fast.

In the recent event, the challenge for Outbrain was different.

Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

*Scott Webb (https://unsplash.com/photos/yekGLpc3vro*)

The cyber event of May 2020 presented our anti-fraud specialists with a different and unique set of obstacles.

Firstly, we were not battling against an entity that was part of our network. The threat did not come from one of our partners – rather, our publishers and advertisers were first in the line of fire.

Secondly, the attack was broad and not sourced from a specific website. We assumed that some websites in our network had been corrupted, however, we didn’t know which ones. Targeted and specific blocking is the most common anti-fraud solution, yet in this case, we had to find and block multiple domains that were yet unknown.

Thirdly, any action would be temporary. When we identify a fraudster on the network, they are permanently blocked and banished from the network. There is no going back. However, in this case, we knew that once the affected websites recovered, we had to immediately re-enable them and permit them back on the network.

Lastly, even though Outbrain was not the target, we had to stay on our toes, and to assume that the attack was still happening or could happen again at any time. Although we identified the characteristics when it occurred, we had to be ready to quickly detect any new method or pattern that the attackers could develop and deploy.

The response by our anti-fraud team was designed to protect all three pillars of the Outbrain ecosystem:

The first pillar – Users who need protection from threatening content.
The second pillar – Advertisers who need protection from being charged for harmful visits and damage to their brand image.
The third pillar – Publishers who were prevented from directing traffic to threatening content that would harm their reputation.

The Technology Challenge

In order to find the corrupted sites, we had to scan loads and loads of web pages. Once such a page was found, we had to disable it immediately so it would no longer be promoted on the Outbrain network. These scans had to be done quickly and repeatedly. If a cyber attack is ongoing, pages can be hacked even after we scan them, and if the pages have recovered, we need to know straight away in order to recommend them back to the network.

*“The hacked pages shared the same attributes” – One of the attributes we discovered from crawler logs*

How did we identify a hacked page? We had a few options:

Published lists of the hacked websites: In some cyber attacks, this can be a good solution, but not in this case. The attack was a rolling one, which meant not all the websites were hijacked at once. There was no single reliable source that contained all the hacked websites.
Third-party solutions: In order to maintain the integrity of the Outbrain network, the Anti-Fraud team uses not only internal solutions that were developed in-house, but also external solutions, to flag problematic content, like malware, cloakers, redirects, and other fraud types. A third-party solution can be used to classify specific characteristics as a website “under attack”, so it might have been an option in a case like this. However, making the adjustments necessary to detect this specific attack would take too long, and in this instance, the ability to react fast was super important.
Outbrain’s in-house crawler: In order to recommend the best content to readers, Outbrain crawls the promoted content in its network, analyzing characteristics and extract features that will be used in the NLP and recommendation algorithms. Our in-house crawler was identified as the ideal solution for this challenge, as it is already connected to our databases. In order to get immediate results, we only had to define the population we wanted to crawl. Moreover, with some additional configurations and quick development by the AppServices team, we could tackle all the challenges at once, automatically, and rapidly.

Crawling Our Way Out

So Outbrain’s in-house crawler was immediately mobilized for detecting which sites were hacked.

We started by analyzing several hacked sites found in multiple sources across the net. Using our crawler output logs, we discovered that the pages were identical in terms of both structure and visual content, meaning they shared the same attributes (title, text, links, keywords, etc.). By querying the crawler logs for those attributes, we were able to easily identify more hacked domains. The problem was that the picture we saw was only a partial one, since not all the pages were crawled yet.

We chose the relevant potentially-hacked domains list and set it to be crawled in cycles of 3 hours, using a configured crawler feature that was specially tailored for this problem. Once the crawler identified a new (specific) value for the attributes the attackers injected into the web pages, we could disable the hacked pages. Disabling a page meant that the hacked native ads were no longer available as recommendations on the Outbrain network, and were no longer accessible by our users.

The ongoing scans also allowed us to monitor and understand the impact of the attack on our inventory in real-time and to communicate it to our partners. When web pages were recovered from the cyber attack, we used the crawler to re-enable them on the Outbrain network. With the help of our unique tech capabilities, we were able to come full circle out of this pernicious cyber attack.

Rapid Response Saves the Day

The attack began on the morning of May 21. According to uPress’s announcement, by May 22, 02:28 (Israel Time), 90% of the websites were recovered. As a result of our actions, we were able to prevent Outbrain users from being exposed to hacked web pages containing disturbing content.

This rapid reaction not only protected our users from threatening content, but it also enabled affected publishers and advertisers to return to their usual Outbrain traffic services very quickly after websites were recovered.

What Doesn’t Kill Us, Makes Us Stronger

During the cyber rescue operation, we were in direct contact with several teams in the organization, enabling us to present a united front and a holistic solution. Along the way, we benefited from the ability to pivot in several different directions as necessary, both internally and externally. It was a great lesson about how to use various tools to solve an array of problems.

With the rapid, coordinated response from Outbrain’s teams, we overcame the “Hackers of Savior” attack, and also developed the infrastructure to manage similar cyberattacks in the future, if – and when – they occur.

May 27, 2020

Avi Avraham

Real Time Data Pipeline – More Than We Expected

When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming system, the reasons we had in mind were the obvious ones; to reduce the latency of our processing system and to make the data available in real time. But as soon as we started to work on it, we realized that there are quite a few additional good reasons to embark on this complex project.

Overview

Our data delivery pipeline is designed to deliver data from various services into our Hadoop ecosystem. There it is processed using Spark and Hive in order to produce aggregated data to be used by our algorithms and machine learning (ML) implementations, business reports, debugging, and other analysis. We call each type of this data a “datum”. We have more than 60 datums, each representing a different type of data, for example: request, click, listings, etc.

Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month, to over 1 billion users across the world. In order to support such a large scale, we have a backend system building from thousands of micro services running inside Kubernetes containers on top of an infrastructure that is spread over more than 7000 physical machines spread between 3 data centers and in public clouds (GCP & AWS).

As you can imagine, our services produce lots of data that is passed through our data delivery pipeline. On an average day, more than 40TB is moving from through it, and on peak days it can pass the 50TB.

And yes, in order to support this scale, the delivery pipeline needs to be scalable, reliable and lots of other words that end with “able”.

Pipeline Architecture

As I wrote above, we have multiple data centers (DCs), but in order to simplify the diagram and explanation, I will use only 2 DCs to present our pipeline architecture. As we go through the two architectures (legacy hourly and current RT), you will notice that the only components which remain the same are the edges. The services that produce the data, and the Hadoop where the data ends up.

Another thing worth mentioning is that in order to have full disaster recovery (DR), we have 2 independent Hadoop clusters in separated locations, each one of them gets all the data and processes the same jobs independently, in case one of them goes down, the other continues to work normally.

Hourly Files Pipeline – legacy

The data is collected by the services, and saved to files on the local file system of each service. From there, another process copies those files to our Data Processing System (DPS), which is an in house solution that collects all the data and copies it into the Hadoop. As you can see, it is a very simple architecture without a lot of components, and is quite reliable and robust since the data is saved in files which can be easily recovered in case of any malfunction.

The drawbacks of this pipeline is that the data is moved in chunks of hours and it is not accessible during the duration of the delivery time. It is only available for processing after each hour is completed, and all the data is in the Hadoop. And it also places a heavy burden on the network because the data is transferred in spikes.

Real Time Streaming Pipeline

Replacing the local file system storage, we have a Kafka cluster. And, instead of our in house DPS system we have a MirrorMaker, an aggregated Kafka cluster and a SparkStreaming cluster.

In this pipeline, the data is written directly from the service to the Kafka cluster. From this point all the data is accessible to anyone that wants to use it. Analysts, algo developers, or any other service that can find it useful. Of course it is also available to the MirrorMaker, as part of the pipeline.

The MirrorMaker job is to sync data between Kafka clusters, and in our case, to make sure that each aggregated Kafka cluster will have the entire data from all DCs.

Like before, the data in the aggregated cluster is available to all, especially to our SparkStreaming cluster which runs various jobs that consume the data and writes to the Hadoop.

So it’s clear that by implementing this pipeline, the data is available to all, and it reaches the Hadoop faster.

Obvious Benefits

Now lets go over the obvious benefits of the real time pipeline:

Data availability – since the data is passed through Kafka clusters, it is available to all implementations that need real time data. Here are 2 good examples:
- Recommendation engines can now calculate its algo models with fresher data. This led to a lift of more than 16% in our Revenue Per Mille (RPM).
- We added real time customer facing dashboards which enable them to take required actions faster according to fresh data.
Reduced the latency of the processing system – since the data reach the Hadoop faster, the hourly jobs that wait for the hourly data can start work sooner and to complete their work much closer to the end of the hour. This reduced the overall latency of the processing system and now the business reports for example, are available in a shorter time.

Additional Benefits

And those are the unexpected benefits that we gain from the real time pipeline:

Network capacity – In the hourly files pipeline, the data is moved in hourly batches and at the end of each hour. This means that the network capacity needs to support movement of all data at once. This requirement forces us to allocate the required bandwidth while it is used only for a short time during each hour, wasting expensive resources. The real time pipeline moves the same amount of data incrementally over the course of the hour.

The graphs below demonstrate the bandwidth savings that we made once we moved to the real time pipeline. For example, in the RX graph, you can see that we moved from peaks of 17 GBPs to have flat bandwidth of 7 GBPs, saving 10 GBPs.

Total RX traffic

Total TX traffic

Disaster recovery – the fact that in the hourly based pipeline the data is saved in files on the local file system of the machines that run the services have some limitations.
- Data loss – as we all know, machines have 100% chance of failing at some point. And when you have thousands of machines, the odds are against us. Each time a machine goes down, you have the risk of data loss since all the hourly files may not have been copied into the pipeline yet. In the real time pipeline, the data is written immediately to the Kafka cluster which is more resilient and the risk of the data loss is reduced.
- Late processing – if a machine has recovered from a failure, and you were lucky enough to avoid data loss, the recovered data needs to be processed and in most cases it won’t be done within the time period that the data is related to. This means that this time period will be processed again which adds extra load on the Hadoop and may result in data delays, since the Hadoop needs to process multiple time periods at the same time. Like before, the benefit of the real time pipeline in that aspect is that the data reaches the Hadoop without any delays so there is no reprocessing of any time period.

Conclusions

Having the real time pipeline, our life became much simpler. Beside the planned goals (data availability and reducing the latency), the extra goodies that we got from this change made us less sensitive to any network glitch or hardware malfunction. In the past, each one of those issues forced us to handle data delays and have the risk of data loss, and now the real time pipeline by its nature, solved all of them.

Yes, there are more components that we need to maintain and monitor, but this cost is justified if you compare it to the great results we achieved by implementing this real time system for our data delivery pipeline.

March 12, 2020

Arkadiy Verman

How you can set many spark jobs write to the same path

One of the main responsibilities of the DataInfra team at Outbrain, which I am a member of, is to build a data delivery pipeline, our pipeline is built on top of Kafka and Spark Streaming frameworks. We process up to 2 million requests per second through dozens of streaming jobs.

fig 1: Simplified data flow architecture

Problem Description

One of our requirements was to read data from different Kafka clusters and stream the data to the same path in the HDFS. This doesn’t sound complicated, several identical jobs, each provided with the Kafka address as a parameter. In practice, two concurrent jobs may delete each other files, here is how:

When we call saveAsHadoopFile() action in the spark program

https://gist.github.com/poluektik/7b15007bd94183deb0a8ea6acb71c684#file-hadoop-blog-post-gist-a

https://gist.github.com/poluektik/7b15007bd94183deb0a8ea6acb71c684

The save action will be performed by SparkHadoopWriter, a helper that writes the data and in the end issues a commit for the entire job.

See relevant documentation for SparkHadoopWriter:

https://gist.github.com/poluektik/5e54c2117f5db692c02c6472a5866d46

The commit is the part that is most relevant to our problem. The class that by default does the commit is FileOutputCommitter which, among other things, creates ${mapred.output.dir}/_temporary subdirectory where the files are written and later on, after being committed, moved to ${mapred.output.dir}.

In the end, the entire temporary folder is deleted. When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable.

(OutputCommitter Documentation)

I’ve tried to find an easy solution in the Stack Overflow and Spark community but couldn’t find anything, except for suggestions to write to different locations and use Distcp later on, which would require additional compute resources which I’d like to spare.

The solution:

We’ve created our own OBFileOutputCommitter which is almost identical to the default FileOutputCommiter, but it supports the change of the temporary configuration, fortunately, we can add our own committer through Spark configuration. This way each job will have its own temporary folder, so the cleanup won’t delete data of other jobs.

There is a catch, of course – You’ll have to use MultipleTextOutputFormat to make sure that the files have unique names. If you won’t, two jobs will have the same default names which will collide.

Here is the link to the custom committer code. Add it to your project and follow the example below.

https://gist.github.com/poluektik/d0481bbc93a8ffaf8f527516e200f9dc

Conclusion

The Hadoop framework is flexible enough to amend and customize output committers. With proper documentation,

it could have been easier, but that shouldn’t stop you from trying to alter the framework to suit your needs.

References

https://hadoop.apache.org/docs/r2.7.2/api/src-html/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.html

https://stackoverflow.com/questions/46665299/spark-avoid-creating-temporary-directory-in-s3/46690036#46690036

https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning?rq=1

https://github.com/apache/spark/pull/21286

https://issues.apache.org/jira/browse/MAPREDUCE-1471?jql=text%20~%20%22FileOutputCommitter%22

https://issues.apache.org/jira/browse/MAPREDUCE-7029

December 3, 2019

Or Gerson

Frustration, Immortal TEZ Queries And Hope

My name is Or Gerson, and I am part of the DataInfra team in Outbrain.
DataInfra is responsible for a job scheduling system that supplies teams with the ability to define jobs that run queries on Hadoop cluster hosting about 2 PB of data.
Sometimes these queries can be inefficient – resulting in high processing time and extra load on the cluster.
Therefore we define a timeout limit for queries to execute, The idea is to kill the query when timeout occurred – however things are not so simple…

Kill them! Close their resource! Let none escape!

We use Apache-DBCP as our connection pool library and Hive2 driver to submit queries to Hive a pretty simple setup.
Apache DBCP provides connection pool services and is generally recommended as a connector to many relational databases.

Queries are done using Spring JdbcTemplate, but the implementing connections are managed by Apache DBCP.

These queries have a specific timeout set by the calling thread.
When the timeout occurs, the calling thread shuts down the async executor and calls the close() function on a “javax.sql.DataSource” object

dataSource.close()

This works well when using MapR engine, but when using TEZ the job continues to live long in the cluster, even after the VM is gone.

Seems like a common problem.
To my surprise, my online search found this to be a recurring issue, but without a good solution.

I started drilling down into the “org.apache.commons.dbcp.PoolableConnection” class to understand the problem better.

Shouldn’t closing the datasource be enough?

Under the DBCP and JDBC abstractions I found “org.apache.hive.jdbc.HiveStatement” which uses a thrift client to execute operations on Hive.

When a query was submitted using MapR and reached a timeout, the running job recognized that its handler had died and closed, resulting in “Query Cancelled” status on the waiting thread in HiveStatement.

TEZ did not recognize that its handling connection died, and continued to finish its run (even though the GC had definitely cleared this object).

Moreover, I found out that closing the datasource (using close() method) did not call the close() method on the connection object.
Searching the APIs in “org.apache.commons.dbcp” revealed that it did not expose its connection pool or the objects borrowed from it, so I had no way of interacting with them.

Now What?

First I needed to verify that I can actually close the connection from the client without interacting with the resource manager running the TEZ job (Yarn) directly.
Luckily I found that “HiveStatement” class behaves well and when calling close() method, the thrift session closes and does indeed kill the TEZ job.

I decided to create a data source that will keep references to the connections being used, allowing it to close them.

https://gist.github.com/kazabubu/5ff8d6f5faddb01dce9f93fd98f19458

Now, our datasource keeps its connection references and all we need is to use those “register” and “unregister” methods which are part of a simple “ConnectionRegister” interface supplying these methods.

https://gist.github.com/kazabubu/ffcdc774307abf810c0d3d80886e4f62

After implementing the aforementioned solution, queries using TEZ die on timeout, by explicitly closing the opened connection which delegates it to the thrift layer.

August 11, 2019

Daria Litvinov

Understanding Spark Streaming with Kafka and Druid

As a Data Engineer I’m dealing with Big Data technologies, such as Spark Streaming, Kafka and Apache Druid. All of them have their own tutorials and RTFM pages. However, when combining these technologies together at high scale you can find yourself searching for the solution that covers more complicated production use-cases.
In this blogpost I’m going to share the knowledge I gained by combining Spark Streaming, Kafka and Apache Druid all together for building real time analytics dashboard, guaranteeing precise data representation.

Before we dive in…few words about Real Time Analytics

The Real-time analytics is a new trend in Big Data technologies, and usually has significant business effect. When analyzing fresh data, the insights are more precise. For example, providing real time analytics dashboard for Data Analysts, BI and Account Managers teams can help these teams to make fast decisions.
The commonly used architecture for real time analytics at scale is based on Spark Streaming and Kafka. Both these technologies are very well scalable. They run on clusters and divide the load between many machines. The output of Spark jobs can go to many various destinations, it depends on the specific use case and the architecture. Our goal was to provide the visual tool displaying real-time events. For this purpose we chose Apache Druid database.

Data visualization in Apache Druid

Druid is a high performance real-time analytics database. One of its benefits is the ability to consume real time data from Kafka topic and build powerful visualizations on top of it using Pivot module. Its visualizations enable running various ad-hoc “slice and dice” queries and get visual results quickly. It is very useful for analyzing various use cases, for example how specific campaigns perform in certain countries. Data is retrieved at real-time, with 1-2 minutes delay.

The architecture

So we decided to build our Real Time analytics system based on Kafka events and Apache Druid. We already had events in Kafka topic. However we could not just ingest them into Druid as is. We needed to add more dimensions to each event. We needed to enrich every event with more data in order to see it in Druid in a convenient way. Regarding the scale, we’re dealing with hundreds of thousands of events per minute, so we needed to use technology that can support these numbers. We decided to use Spark Streaming job for the enrichment of original Kafka events.

Figure 1. Real time analytics architecture

Spark Streaming job runs forever? Not really.

The idea of Spark Streaming job is that it is always running. The job should never stop. It constantly reads events from Kafka topic, processes them and writes the output into another Kafka topic. However, this is an optimistic view. In real life things are more complicated. There are driver failures in the Spark cluster, in which case the job is restarted. Sometimes the new version of spark application is deployed into production. What happens in this case? How does the restarted job read Kafka topic and process the events? Before we dig into these details, this figure shows what we see in Druid when the Spark Streaming job is restarted:

Figure 2. Data loss on job restart

It is definitely data loss!

What problem are we trying to solve?

We are dealing with Spark Streaming application which reads events from one Kafka topic and writes them into another Kafka topic. These events are visualized later in Druid. Our goal is to enable smooth data visualization during the restart of our Spark Streaming application. In other words, we need to ensure that no events are lost or duplicated during the Spark Streaming job restart.

It’s all about offsets

In order to understand why data is lost on jobs restart, we need to get familiar with some terms in Kafka architecture. Here you can find an official Kafka documentation. In a nutshell: events in Kafka are stored in topics; each topic has divided into partitions. Each record in a partition has an offset – a sequential number which defines the order of the record. When application consumes the topic, it can handle offsets in several ways. The default behavior is always to read from the latest offsets. Another option is to commit offsets, i.e. to persist offsets so the job can read the committed offsets on restart and continue from there. Let’s see our steps towards the solution, and get a deeper understanding of Kafka offsets management with each step.

Step#1 – auto commit offsets

The default behavior is always to read from the latest offsets. This will not work because when the job is restarted, there are new events in the topic. If the job reads from latest, it loses all messages that were appended during the restart, as can be seen in Figure 2. There is a “enable.auto.commit” parameter in Spark Streaming. Its value is false by default. Figure 3 shows the behavior after changing its value to true, running the Spark application and restarting it.

Figure 3. Data spike of job restart

We can see that using Kafka auto-commit feature causes a new effect. There is no “data loss” , however we now see duplicate events. There was no real “burst” of events. What actually happened is that auto commit mechanism commits offsets “from time to time” . There are many messages in the output topic that were not committed. After the restart the job consumes messages from the latest committed offsets and processes some of these events again. That’s why on the output we get a burst of events.

Clearly, incorporating these duplications into our visualization may mislead the business consumers of this data and have an impact on their decisions and trust in the system.

Step#2: Commit Kafka offsets manually

So we can’t rely on Kafka auto-commit feature. We need to commit Kafka offsets by ourselves. It order to do this, let’s see how Spark Streaming consumes data from Kafka topics. Spark Streaming uses an architecture called Discretized Streams, or DStream. DStream is represented by a continuous series of RDDs (Resilient Distributed Datasets), which is one of the Spark’s main abstractions. Most Spark Streaming jobs look something like this:

dstream.foreachRDD { rdd =>
    rdd.foreach { record => process(record) }
}

In our case processing the record means writing the record to the output Kafka topic. So, in order to commit Kafka offsets we need to do the following:

dstream.foreachRDD { rdd =>
   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd.foreach { record => process(record)}
   stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) 
}

This is a straight-forward approach and before we discuss it deeper let’s take a look at the big picture. Let’s assume that we handle offsets correctly. Namely, that offsets are saved after each processing of RDD. What happens when we stop the job? The job is stopped in the middle of the processing of RDD. The part of micro-batch is written to the output Kafka topic and is not committed. Ones the job runs again, it will process some messages for the second time and the spike of duplicate messages will appear (as before) in Druid:

Figure 4. Data spike on job restart

Graceful Shutdown

Turns out there is a way to ensure a job is not killed during an RDD processing. It is called a “Graceful Shutdown”. There are several blog posts describing how you can kill your Spark application gracefully, but most of them relate to old versions of Spark and have many limitations. We were looking for a “safe” solution that works for any scale and does not depend on a specific Spark version or operating system. To enable Graceful Shutdown, the Spark context should be created with the following parameters:
spark.streaming.stopGracefullyOnShutdown = true.
This instructs Spark to shut down StreamingContext gracefully on JVM shutdown rather than immediately.
In addition, we need a mechanism to stop our jobs intentionally, for example when deploying a new version. We’ve implemented the first version of this mechanism by simply checking the existence of an HDFS file that instructs the job to shut down. When the file appears in HDFS, the streaming context will stop with the following parameters:
ssc.stop(stopSparkContext = true, stopGracefully = true)

In this case the Spark application stops gracefully only after all received data processing is completed. This is exactly what we need.

Step #3: Kafka commitAsync

Let’s recap on what we have so far. We intentionally commit Kafka offsets in each RDD processing (using Kafka commitAsync API) and we use spark graceful shutdown. Apparently, there was another caveat. Digging into the documentation of Kafka API and Kafka commitAsync() source code, I’ve learned that commitAsync() only puts the offsetRanges into a queue, which is actually processed only in the next loop of foreachRDD. Even if the Spark job is stopped gracefully and finishes processing of all its RDDs, the offsets of last RDD are actually not committed. To solve this problem, we’ve implemented a code that persists Kafka offsets synchronously and does not rely on Kafka commitAsync(). For each RDD we then stored the committed offsets in an HDFS file. When the job starts running again, it loads the offsets file from the HDFS and consumes Kafka topic from these offsets and on.

Here, it works!

It was only the combination of a graceful shutdown and a synchronous storage of Kafka offsets that provided us with the desired result. No data loss, no data spikes during restarts:

Figure 5. No data loss of spikes during Spark job restart

Conclusion

Solving the integration problem between Spark Streaming and Kafka was an important milestone for building our real-time analytics dashboard. We’ve found the solution that ensures stable dataflow without loss of events or duplicates during the Spark Streaming job restarts. We now have the trustworthy data which is visualized in Druid. Thanks to this, we’ve added more types of events (Kafka topics) into Druid and built real time dashboards. These dashboards provide insights for various teams, such as BI, Product and Customer Support. Our next goal is to utilize more features of Druid, like new analytical functions and alerts.

August 27, 2018

Sergey Yegournov

Upgrade Railway Tracks Under a Moving Train

Outbrain has quite some relations with Cassandra. We have around 30 production clusters of
different sizes, ranging from small ones to cluster with 100 nodes across 3 datacenters.

Background

Cassandra has proven to be a very reliable choice as a datastore which employs an eventual consistency model. We used the Datastax Enterprise 4.8.9 distribution (which is equivalent to Cassandra 2.1) and as Cassandra 3 evolved clearly at some point we had to plan out an upgrade. We ended up with two options, either upgrade the commercial distribution version provided by Datastax or migrate to the Apache Cassandra distribution. Since Outbrain uses Cassandra extensively it was an important decision to make.
At first, we identified that we do not use any of the Datastax enterprise features and we were quite happy with the basics that are still offered by Apache Cassandra. So, at this point, we decided to give Apache Cassandra 3.1.11 a try.
But when planning an upgrade aside from the necessity to stay on the supported version there are always some goals, like leveraging a new feature and seeking for performance improvement. Our main goal was to get more from the hardware that we have and potentially reduce the number of nodes per cluster.

In addition, we wanted to have in-place upgrade which means that instead of building new clusters we could perform the migration in the production clusters themselves while they are still running, it is like upgrading the railway tracks under a moving train and we are talking about the very big train moving in high speed.

The POC

But before defining the in-place upgrade we needed to validate that the Apache Cassandra distribution fit to our use cases, and for that, we performed a POC. After selecting one of our clusters that will be used as a POC we build a new Apache Cassandra 3.1.11 cluster, following the new storage format and hence read path optimisations in Cassandra 3 we started with a small cluster of 3 nodes in order to see if we can benefit from this feature for reducing the cluster size – that is if we could achieve acceptable read and write latencies with fewer nodes.
The plan was to copy data from the old cluster using sstableloader – while the application would employ dual writes approach before we start the copying. This way we would have two clusters in sync, and we could direct our application to do some/all reads from the new cluster.
And then the problems started… at first it became clear that with our multiple datacenter’s configurations we do not want sstableloader to stream data to foreign data centers, we wanted to be able to control the target datacenter for the stream processing, we reported a bug and instead of waiting for a new release of Apache Cassandra we build our own custom build.
After passing this barrier we reached another one that was related to LeveledCompactionStrategy (LCS) and how Cassandra performs compactions. With a cluster of 3 nodes and 600GB of data in a table with LCS it took few weeks to put data into the right levels. Even with 6 nodes it took days to complete compactions following sstablesloader data load. That was a real pity, because read and write latencies were quite good even with a fewer number of nodes. Luckily Cassandra has a feature called multiple data directories. At first it does look like something to do with allowing more data but this feature actually helps enormously with LCS. So once we tried it and played with an actual number of data directories we identified that we can compact 600GB of data on L0 in less than a day with three data directories.

Encouraged with these findings we conducted various tests on our Cassandra 3.1 cluster and realised that not only we can benefit from storage space savings and faster read path – but also potentially put more data per node for clusters where we use LCS.

Storage before and after the migration

The plan

The conclusions were very good and indicated that we are ok with Apache Cassandra distribution and we can start preparing our migration plan using in-place upgrades. At the high level the
migration steps should be:

Upgrade to the latest DSE 4.x version (4.8.14)
Upgrade to DSE 5.0.14 (Cassandra 3)
Upgrade sstables
Upgrade to DSE 5.1.4 (Cassandra 3.11)
Replace DSE with Apache Cassandra 3.11.1

After defining the plan, we were ready to start migrating our clusters. We mapped our clusters and categorised them according to if they are used for online serving activities or for offline activities.
Then we were good to go and started to perform the upgrade cluster by cluster.

Upgrade to the latest DSE 4.x version (4.8.14)

The first upgrade to DSE 4.8.14 (we had 4.8.9) is a minor upgrade. The reason this step is needed at all because DSE 4.8.14 includes all the important fixes to co-exist with Cassandra 3.0 during upgrade and our upgrade experience tells that indeed DSE 4.8.14 and DSE 5.0.10 (Cassandra 3.0.14) co-existed very well. The way it should be done is to upgrade nodes in a rolling fashion upgrading one node at a time. One node at a time applies to a datacenter only, datacenters could be processed in parallel (which is what we actually did).

Upgrade to DSE 5.0.14 (Cassandra 3)

Even if you aim to use DSE 5.1 you cannot really jump to it from 4.x, so 4.x – 5.0.x – 5.1.x chain is mandatory. The steps and pre-requisites on upgrade to DSE 5.0 are documented quite well by Datastax . What is probably not documented that well is which driver version to use and how to configure it so that application smoothly works both with DSE 4.x and DSE 5.x nodes during upgrade. What we did is that we instructed our developers to use cassandra-driver-core version 3.3.0 and enforce the protocol version 3 to be used so that the driver can successfully talk to both DSE 4.x and DSE 5.x nodes. Another thing to make sure is that before starting the upgrade you should ensure that all your nodes are healthy. Why? Because one of the most important limitations about clusters running both DSE 4.x and 5.x is the fact that you cannot stream data, which means that bootstrapping a node during node replacement is simply not possible. However, for clusters where we had SLAs on latencies we wanted more confidence and for those clusters what we did was that in one datacenter we upgraded a single node to DSE 5.0.x. This is 100% safe operation, since it’s the first node with DSE 5.0.x it can be replaced using the regular node replacement procedure with DSE 4.x. One very important point here is that DSE 5.0 does not really force any schema migration till all nodes in the cluster are running DSE 5.0.x. What this means is that system keyspaces remain untouched and this is what guarantees that DSE 5.0.x node can be replaced with DSE 4.x. Since we have 2 kinds of clusters (serving and offline) with different SLA we took 2 different approaches, for the serving clusters we upgraded the datacenters sequentially leaving some datacenter running DSE 4.x, we did it because if something goes really wrong, we still would have a healthy datacenter with DSE 4.x running. And for the offline clusters we upgraded all the datacenters in parallel, both approaches worked quite well.

Upgrade sstables

Whether clusters were upgraded in parallel or sequentially we always wanted to start sstablesupgrade as quick as possible after DSE 5.0 was installed. Sstables upgrade is the most important phase of upgrade because it’s most time consuming. There are not so many knobs to tune:

control number of compaction threads with -j option for the nodetool upgradesstables command.
control the compaction throughput with nodetool setcompactionthroguhput.

Whether -j would allow to speed up sstables upgrade depends a lot on how many column families you have, how big they are and what kind of compaction strategy is configured. As for the compaction throughput we determined that for our storage system it was ok to set it to 200 MB/sec to allow regular read/write operation to be carried out at reasonable higher percentile latencies, but for sure this has to be identified for each storage system individually.

Upgrade to DSE 5.1.4 (Cassandra 3.11)

In theory, upgrade from DSE 5.0.x to DSE 5.1.x is a minor one, just need to upgrade the nodes in a rolling fashion exactly the same way as minor upgrades were done within DSE 4.x, storage format stays the same and no sstablesupgrade is required. But when upgrading our first reasonably large cluster (in terms of number of nodes) we noticed something really bad, nodes running DSE 5.1.x suffered from never-ending compactions in the system keyspace which killed their performance while DSE 5.0.x were running well. For us, it was a tough moment because that was the serving cluster with SLAs and we did not notice anything like that on our previous clusters upgrades! We suspected that something is potentially wrong with schema migrations mechanism and the command nodetool describecluster confirmed that there are two different schema versions, one from 5.0.x nodes and one from 5.1.x nodes. So, we decided to speed up DSE 5.1.x upgrade under assumption that once all nodes are 5.1.x there would no issues with different schema versions and cluster would settle. This indeed happened.

Replace DSE with Apache Cassandra 3.11.1

So how was the final migration step of replacement of DSE 5.1.4 with Apache Cassandra? In the end it worked well across all our clusters. Basically, DSE packages have to be uninstalled, Apache Cassandra has to be installed and pointed to existing data and commit_log directories.

However, two extra steps were required:

DSE creates dse-specific keyspaces (e.g. dse_system). This keyspace is created with EverywhereStrategy which is not supported by the Apache Cassandra. So before replacing DSE with Apache Cassandra on a given node the keyspace replication strategy has to be changed.
There are few system column families that receive extra columns in DSE deployments (system_peers and system_local). So, if you attempt to start Apache Cassandra after DSE is removed – it will fail on schema check against these column families. There are two ways this could be worked around – the first option is to delete all system keyspaces, set auto_bootstrap to false and also set the allow_unsafe_join JVM option to true – assuming the node’s ip address stays the same Cassandra will re-create system keyspaces and jump into the ring. However we used alternative approach – we had a bunch of scripts which first exported these tables with extra columns to json – and then we imported them on a helper cluster which had the same schema as system tables do in Apache Cassandra. Then we imported data from json files and copied the resultant sstables to the node being upgraded

Minor version, big impact

So Cassandra 3.11 clearly has vast improvements in terms of storage format efficiency and better read latency. However even within a certain Cassandra major version a minor version can introduce a bug which would heavily affect performance. So the latency improvements were achieved on Cassandra 3.11.1. But this very specific version introduced an unpleasant bug (CASSANDRA-14010, fix sstable ordering by max timestamp in singlePartitionReadCommand)

This bug was introduced during refactoring – and led to that for tables with STCS almost all sstables were queried for every read. Typically Cassandra sorts sstables by a timestamp in a descending order – and it starts answering a query by looking at the most recent sstables until it’s able to fulfil a query (with some restrictions, like for counters or unfrozen collections). The bug however made sstables to be sorted in the ascending order. It was fixed in 3.11.2 – and here is the impact – number of sstables per read for 99th percentile has dropped after the upgrade. Reading less sstables for sure leads to lower local read latencies.

Drop in sstables per read

Local read latency drop on Cassandra 3.11.2

Always mind page cache

On one of our newly built Cassandra 3.11.2 clusters we noticed unexpectedly high local read latencies. This came as a surprise because this cluster was specifically dedicated to a table with LCS. We spent quite some time in blktrace/blkparse/btt to confirm that our storage by itself is performing well. However we identified later that high latency is only subject for 99th and 98th percentiles – while from btt output it became clear that there write IO bursts which result in a very large blocks thanks to a scheduler merge policy. Next step was to exclude compactions – so even with compactions disabled we still had these write bursts. Looking further it became evident that page cache defaults were different from what we typically had – and the culprit is vm.dirty_ratio which was set to 20%. With 256GB of RAM flushing 20% of it certainly would result in IO burst which would affect reads. So after tuning page cache (we opted for vm.dirty_bytes and vm.dirty_background_bytes instead of vm.dirty_ratio).

Page cache flush settings impact on local read latency

The results

So, in the end – what were our gains from Apache Cassandra 3.11?
Well aside from any new features we got lots of performance improvements, it’s a major storage savings which immediately transform into lower read latencies because more data fits into OS cache which improved a lot the clusters’ performance.

In addition, the newer storage format results in lower JVM pressure during reads (object creation ratio decreasing 1.5 – 2 times). The following image describe the decrease in the read latency in our datacenters (NY, CHI & SA) during load tests (LT) that we performed before and after the migration

Read latency before and after the migration

Another important aspect are new metrics exposed. Few interesting ones:

Ability to identify consistency level requested for both reads and writes. This can be useful to track if occasionally the consistency level requested is not what would be expected for a given column family. For example, if QUORUM is requested instead of LOCAL_QUORUM due to a bug in app configuration.
Metrics on the network messaging.

Read metrics before the migration

Read metrics after the migration

We did something that lots of people thought that it’s impossible, to perform such in-place migration in running production clusters. Although there were some bumps in the way, it’s totally possible to conduct an in-place upgrade of DSE 4.8.x to Apache Cassandra 3.11, while maintaining acceptable latencies and avoiding any cluster downtime.

May 14, 2018

Avi Avraham

Hadoop Research Journey from Bare Metal to Google Cloud – Episode 3

Previously on our second episode of the trilogy “Hadoop Research Journey from bare metal to Google Cloud – Episode 2”, we covered the POC we had.

In this episode we will focus on the migration itself, building a POC environment is all nice and easy, however migrating 2 PB (the raw part out of 6 PB which include the replication) of data turned to be a new challenge. But before we jump into technical issues, lets start with the methodology.

The big migration

We learned from our past experience that in order for such a project to be successful, like in many other cases, it is all about the people – you need to be minded to the users and make sure you have their buy-in.

On top of that, we wanted to complete the migration within 4 months, as we had a renewal of our datacenter space coming up, and we wanted to gain from the space reduction as result of the migration.

Taking those two considerations in mind, we decided that we will have the same technologies which are Hadoop and Hive on the cloud environment, and only after the migration is done we would look into leveraging new technologies available on GCP.

Now after the decision was made we started to plan the migration of the research cluster to GCP, looking into different aspects as:

Build the network topology (VPN, VPC etc.)
Copy the historical data
Create the data schema (Hive)
Enable the runtime data delivery
Integrate our internal systems (monitoring, alerts, provision etc.)
Migrate the workflows
Reap the bare metal cluster (with all its supporting systems)

All in the purpose of productizing the solution and making it production grade, based on our standards. We made a special effort to leverage the same management and configuration control tools we use in our internal datacenters (such as Chef, Prometheus etc.) – so we would treat this environment as yet just another datacenter.

Copying the data

Sound like a straightforward activity – you need to copy your data from location A to location B.

Well, turns out that when you need to copy 2 PB of data, while the system is still active in production, there are some challenges involved.

The first restriction we had, was that the copy of data will not impact the usage of the cluster – as the research work still need to be performed.

Second, once data is copied, we also need to have data validation.

Starting with data copy

Option 1 – Copy the data using Google Transfer Appliance

Google can ship their transfer appliance (based on the location of your datacenter), that you would attach to the Hadoop Cluster and be used to copy the data. Ship it back to Google and download the data from the appliance to GCS.

Unfortunately, from the capacity perspective we would need to have several iterations of this process in order to copy all the data, and on top of that the Cloudera community version we were using was so old – it was not supported.

Option 2 – Copy the data over the network

When taking that path, the main restriction is that the network is used for both the production environment (serving) and for the copy, and we could not allow the copy to create network congestion on the lines.

However, if we restrict the copy process, the time it would take to copy all the data will be too long and we will not be able to meet our timelines.

Setup the network

As part of our network infrastructure, per datacenter we have 2 ISPs, each with 2 x 10G lines for backup and redundancy.

We decided to leverage those backup lines and build a tunnel on those lines, to be dedicated only to the Hadoop data copy. This enabled us to copy the data in relatively short time on one hand, and assure that it will not impact our production traffic as it was contained to specific lines.

Once the network was ready we started to copy the data to the GCS.

As you may remember from previous episodes, our cluster was set up over 6 years ago, and as such acquired a lot of tech debt around it, also in the data kept in it. We decided to take advantage of the situation and leverage the migration also to do some data and workload cleanup.

We invested time in mapping what data we need and what data can be cleared, although it didn’t significantly reduce the data size we managed to delete 80% of the tables, we also managed to delete 80% of the workload.

Data validation

As we migrated the data, we had to have data validation, making sure there are no corruptions / missing data.

More challenges on the data validation aspects to take into consideration –

The migrated cluster is a live cluster – so new data keeps been added to it and old data deleted
With our internal Hadoop cluster, all tables are stored as files while on GCS they are stored as objects.

It was clear that we need to automate the process of data validation and build dashboards to help us monitor our progress.

We ended up implementing a process that creates two catalogs, one for the bare metal internal Hadoop cluster and one for the GCP environment, comparing those catalogs and alerting us to any differences.

This dashboard shows per table the files difference between the bare metal cluster and the cloud

In parallel to the data migration, we worked on building the Hadoop ecosystem on GCP, including the tables schemas with their partitions in Hive, our runtime data delivery systems adding new data to the GCP environment in parallel to the internal bare metal Hadoop cluster, our monitoring systems, data retention systems etc.

The new environment on GCP was finally ready and we were ready to migrate the workloads. Initially, we duplicated jobs to run in parallel on both clusters, making sure we complete validation and will not impact production work.

After a month of validation, parallel work and required adjustments we were able to decommission the in-house Research Cluster.

What we achieved in this journey

Upgraded the technology
Improve the utilization and gain the required elasticity we wanted
Reduced the total cost
Introduced new GCP tools and technologies

Epilogue

This amazing journey lasted for almost 6 months of focused work. As planned the first step was to use the same technologies that we had in the bare metal cluster but once we finished the migration to GCP we can now start planning how to further take advantage of the new opportunities that arise from leveraging GCP technologies and tools.

May 7, 2018

Avi Avraham

Hadoop Research Journey from Bare Metal to Google Cloud – Episode 2

Previously on our first episode of the trilogy “Hadoop Research Journey from bare metal to Google Cloud – Episode 1”, we covered our challenges.

In this episode, I am looking to focus on the POC that we did in order to decide whether we should rebuild the Research cluster in-house or migrate it to the cloud.

The POC

As we had many open questions around migration to the cloud, we decided to do a learning POC, focusing on 3 main questions:

Understand the learning curve that will be required from the users
Compatibility with our in-house Online Hadoop clusters
Estimate cost for running the Research cluster in the Cloud

However, before jumping into the water of the POC, we had some preliminary work to be done.

Mapping workloads

As the Research cluster was running for over 6 years already, there were many different use cases running on it. Some of which are well known and familiar to users, but some are old tech debts which no one knew if needed or not, and what is their value.

We started with mapping all the flows and use cases running on the cluster, mapped users and assigned owners to the different workflows.

We also created distinction between ad-hoc queries and batch processing.

Mapping technologies

We mapped all the technologies we need to support on the Research cluster in order to assure full compatibility with our Online clusters and in-house environment.

After collecting all the required information regarding the use cases and mapping the technologies we selected representative workflows and users to participate in the POC and take active part in it, collecting their feedback regarding the learning curve and ease of use. This approach will also serve us well later on, if we decide to move forward with the migration, having in house ambassadors.

Once we mapped all our needs, it was also easier to get from the different cloud vendors high level cost estimation, to give us a general indication if it makes sense for us to continue and invest time and resources in doing the POC.

We wanted to complete the POC within 1 month, so on one hand it will run long enough to cover all types of jobs, but on the other hand it will not be prolonged.

For the POC environment we built Hadoop cluster, based on standard technologies.

We decided not to leverage at this point special proprietary vendor technologies, as we wanted to reduce the learning curve and were careful not to get into a vendor lock-in.

In addition, we decided to start the POC only with one vendor, and not to run it on multiple cloud vendors.

The reason behind it was our mindfulness to our internal resources and time constraints.

We did theoretical evaluation of technology roadmap and cost for several Cloud vendors, and choose to go with GCP option, looking to also leverage BigQuery in the future (once all our data will be migrated).

The execution

Once we decided on the vendor, technologies and use cases we were good to go.

For the purpose of the POC we migrated 500TB of our data, build the Hadoop cluster based on Data Proc, and build the required endpoint machines.

Needless to say, that already in this stage we had to create the network infrastructure to support the secure work of the hybrid environment between GCP and our internal datacenters.

Now that everything was ready we started the actual POC from the users perspective. For a period of one month the participate users will perform their use cases twice. Once on the in-house Research cluster (the production environment), and second time on the Research cluster build on GCP (the POC environment). The users were required to record their experience, which was measured according to the flowing criteria:

Compatibility (did the test run seamlessly, any modifications to code and queries required, etc.)
Performance (execution time, amount of resources used)
Ease of use

During the month of the POC we worked closely with the users, gathered their overall experience and results.

In addition, we documented the compute power needed to execute those jobs, which enabled us to do better cost estimation for how much it would cost to run the full Research Cluster on the cloud.

The POC was successful

The users had a good experience, and our cost analysis proved that with leveraging the cloud elasticity, which in this scenario was very significant, the cloud option would be ROI positive compared with the investment we would need to do building the environment internally. (without getting into the exact numbers – over 40% cheaper, which is a nice incentive!)

With that we started our last phase – the actual migration, which is the focus of our last episode in “Hadoop Research Journey from Bare Metal to Google Cloud – Episode 3”. Stay tuned!

April 26, 2018

Avi Avraham

Hadoop Research Journey from Bare Metal to Google Cloud – Episode 1

Outbrain is the world’s leading discovery platform, serving over 250 billion personal recommendations per month. In order to provide premium recommendations at such a scale, we leverage capabilities in analyzing a large amount of data. We use a variety of data stores and technologies such as MySql, Cassandra, Elasticsearch, and Vertica, however in this post trilogy (all things can be split to 3…) I would like to focus on our Hadoop eco-system and our journey from pure bare metal into a hybrid cloud solution.

Hadoop Research Journey from Bare Metal to Google Cloud

The bare metal period

In a nutshell, we keep two flavors of Hadoop clusters:

Online clusters, used for online serving activities. Those clusters are relatively small (2 PB of data per cluster) and are kept in our datacenters on bare metal clusters, as part of our serving infrastructure.
Research cluster, surprisingly, used mainly for research and offline activities. This cluster keeps large amount of data (6 PB), and by nature the workload on this cluster is elastic. Most of the time it was not utilized, but there were times of peaks when there was a need to query huge amount of data.

History lesson

Before we move forward in our tale, it may be worthwhile to spend a few words about the history.

We first started to use the Hadoop technology at Outbrain over 6 years ago – starting as a technical small experiment. As our business rapidly grow, so did the data, and the clusters were adjusted in size, however a tech debt had been built up around it. We continued to grow the clusters, based on scale out methodology, and after some time, found ourselves with clusters running old Hadoop version, not being able to support new technologies, build from hundreds of servers, some of which are very old.

We decided we need to stop being fire fighters, and to get super proactive about the issue. We first took care of the Online clusters, and migrated them to a new in-house bare metal solution (you can read more about on this in the Migrating Elephants post on Outbrain Tech Blog site)

Now it was time to move forward and deal with our Research cluster.

Research cluster starting point

Our starting point for the Research cluster was a cluster build out of 500 servers, holding about 6 PB of data, running CDH4 community version.

As mentioned before, the workload on this cluster is elastic – at times, requires a lot of compute power and most of the time fairly under utilized (see graph below).

Research cluster starting point

This graph shows the CPU utilization for 2 weeks, as it seen the usage is not constant, most of the time is barely used, with some periodic peaks

The cluster was unable to support new technologies (such as SPARK and ORC), which were already in use with the Online clusters, reducing our ability to use it for real research.

On top of that, some of the servers in this cluster were becoming very old, and as we grow the cluster on the fly, its storage:CPU:RAM ratio was suboptimal, causing us to waste expensive foot print in our datacenter.

On top of all of the above, it caused so much frustration to the team!

We mapped our options moving forward:

Do in-place upgrade to the Research cluster software
Rebuild the research cluster from scratch on bare metal in our datacenters (similar to the project we did with the Online clusters)
Leverage cloud technologies and migrate the research cluster to the Cloud.

The dilemma

Option #1 was dropped immediately since it answered only a fraction of our frustration at best. It did not address the old hardware issues, and it did not address our concerned regarding non optimal storage:CPU:RAM ratios – which we understood would only get worse when we come to use RAM intensive technologies such as SPARK.

We had a dilemma between option #2 and option #3, both viable options with pros and cons.

Building the Research cluster in house was a project we were very familiar with (we just finished our Online clusters migration), our users were very familiar with the technology, so no learning curve on this front. On the other hand, it required a big financial investment, and we were unable to leverage the elasticity to the extent we wanted.

Migrating to the cloud answered our elasticity needs, however presented a non-predictable cost model (something very important to the finance guys), and had many unknowns as it was new for us, and for the users that would need to work with the environment. It was clear that learning and education will be needed, but it was not clear as to how steep this learning curve would be.

On top of that, we knew that we must have full compatibility between the Research cluster and the Online cluster, but it was hard for us to estimate the effort required to get there, and the number of processes that require data transition between the clusters.

So, what do we do when we don’t know which option is better?

We study and experiment! And this is how we entered the 2nd period – the POC.

You are invited to read about the POC we did and how we did it on our next episode of “Hadoop Research Journey from Bare Metal to Google Cloud – Episode 2”.

Category: BigData

Content Fraud – The Major Challenge for Content Networks

Why and How Did Outbrain Respond to the “Hackers of Savior” Cyber Attack?

The Technology Challenge

Crawling Our Way Out

Rapid Response Saves the Day

What Doesn’t Kill Us, Makes Us Stronger

Overview

Pipeline Architecture

Hourly Files Pipeline – legacy

Real Time Streaming Pipeline

Obvious Benefits

Additional Benefits

Conclusions

Problem Description

The solution:

Conclusion

References

Kill them! Close their resource! Let none escape!

Before we dive in…few words about Real Time Analytics

Data visualization in Apache Druid

The architecture

Spark Streaming job runs forever? Not really.

What problem are we trying to solve?

It’s all about offsets

Step#1 – auto commit offsets

Step#2: Commit Kafka offsets manually

Graceful Shutdown

Step #3: Kafka commitAsync

Here, it works!

Conclusion

Background

The POC

The plan

Upgrade to the latest DSE 4.x version (4.8.14)

Upgrade to DSE 5.0.14 (Cassandra 3)

Upgrade sstables

Upgrade to DSE 5.1.4 (Cassandra 3.11)

Replace DSE with Apache Cassandra 3.11.1

Minor version, big impact

Always mind page cache

The results

The big migration

Copying the data

Starting with data copy

Setup the network

Data validation

What we achieved in this journey

Epilogue

The POC

Mapping workloads

Mapping technologies

The execution

The POC was successful

The bare metal period

History lesson

Research cluster starting point

The dilemma

Search

עברית

Categories

Archive

RSS