
Outbrain Challenges the Research Community with Massive Data Set


Outbrain is responsible for the content recommendations you see on thousands of online publishers' sites, including CNN, Fox News, ESPN, and the Washington Post. These recommendations must satisfy a wide variety of user interests, which ultimately requires a lot of data and advanced algorithms.

Introducing the Outbrain Challenge

Today, we are excited to announce the release of our anonymized dataset that discloses the browsing behavior of hundreds of millions of users who engage with our content recommendations. This data, which was released on the Kaggle platform, includes two billion page views across 560 sites, document metadata (such as content categories and topics), served recommendations, and clicks.

Our “Outbrain Challenge” is a call to the research community to analyze our data and model user reading patterns in order to predict individuals’ future content choices. We will reward the three best models with cash prizes totaling $25,000 (see full contest details below).

The sheer size of the data we’ve released is unprecedented on Kaggle, the competition’s platform, and is considered extraordinary for such competitions in general. Crunching all of the data may be challenging for some participants, though Outbrain does it on a daily basis.

Sample Test Case: Predicting What Content Recommendation a User Will Click On

On the left: content recommendations presented to the user. On the right: the user’s reading history

Content categories previously visited


Here are some key questions researchers will need to consider, based on this test case.

1. User interest matters, but how and to what extent?

Consider your own content consumption tastes and choices. If you tend to read content about sports and technology, and you specifically read more about Usain Bolt and Elon Musk, would you be more inclined to click on a story about the latest Kim Kardashian scandal or a story about SpaceX? And what if the choice was between a Red Bull video about extreme sports and the SpaceX story? What is ultimately the best formula for mixing the interests of one user with the wisdom you gain from a larger pool of users?
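One simple way to think about mixing a single user's interests with the wisdom of the crowd is to shrink the user's own click rate toward the rate observed across all users. The sketch below is only an illustration, not Outbrain's method: the function name and the `prior_strength` knob are hypothetical, and real models would be far richer.

```python
def blended_click_rate(user_clicks, user_views, global_rate, prior_strength=20.0):
    """Shrink a user's own click rate toward the global click rate.

    prior_strength is a hypothetical tuning knob: it behaves like a number of
    pseudo-views at the global rate, so users with little history lean on the
    crowd and users with long histories lean on their own behavior.
    """
    if user_clicks < 0 or user_views < 0 or user_clicks > user_views:
        raise ValueError("need 0 <= user_clicks <= user_views")
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    return (user_clicks + prior_strength * global_rate) / (user_views + prior_strength)
```

With no history the estimate equals the global rate; as a user's views on a topic accumulate, the estimate moves toward that user's own click rate.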

2. Context matters, but how and to what extent?

If you are reading an article about a new restaurant, would you prefer to read next about another restaurant? Or would you rather read about cooking? Or perhaps you would be interested in a completely different topic. Outbrain currently captures context from many different variables: for example, the user's device, the time of day, her current location, the content she is currently reading, and whether she came from another section of the publisher's site or from Facebook.
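Context variables like these are often turned into categorical features before modeling. The sketch below shows one possible encoding; the feature names, the four-way time-of-day bucketing, and the function names are assumptions made for illustration and do not describe the released data's schema.

```python
def daypart(hour):
    """Map an hour of day (0-23) to a coarse, hypothetical bucket."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    # 0-5 night, 6-11 morning, 12-17 afternoon, 18-23 evening
    return ("night", "morning", "afternoon", "evening")[hour // 6]


def context_features(device, hour, location, referrer):
    """Build a sparse one-hot feature dict from a page view's context."""
    return {
        f"device={device}": 1.0,
        f"daypart={daypart(hour)}": 1.0,
        f"location={location}": 1.0,
        f"referrer={referrer}": 1.0,
    }
```

A dictionary of this shape can be fed to most sparse linear models after hashing or vectorizing the keys.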

3. Breaking news matters, but how long before a story is no longer interesting?

Some content is evergreen. Other news grows stale almost immediately. How can you determine when interest in a story diminishes?
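One common way to express fading interest is an exponential decay with a per-story half-life. This is a minimal sketch under that assumption; the half-life values are hypothetical and would have to be estimated from the data, and nothing here claims that interest actually decays exponentially.

```python
def freshness_weight(age_hours, half_life_hours):
    """Weight in (0, 1] that halves every half_life_hours.

    A breaking-news story might get a short half-life (hours), while an
    evergreen story might get a very long one (months). Both are assumptions.
    """
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    age = max(float(age_hours), 0.0)  # treat negative ages as brand new
    return 0.5 ** (age / half_life_hours)
```

Multiplying a story's predicted click rate by such a weight is one way to let stale news drop out of the running while evergreen content keeps competing.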

Official Contest Details

Analyze and Model Data: Win Cash Prizes and Glory

The Outbrain Challenge will be hosted on Kaggle, an online platform for predictive modelling competitions. Companies and organizations such as Facebook, Airbnb, Microsoft, Walmart, and even the European Organization for Nuclear Research upload data challenges on Kaggle. Data scientists from all over the world experiment with different techniques and compete against each other to produce the best models, winning both prestige and monetary rewards.

The Outbrain Challenge will award the best three models with cash prizes: $12,000 for first place, $8,000 for second, and $5,000 for the third best model.

Data is Completely Anonymized:

Our data is anonymized across multiple fronts. First, user identifiers are opaque. Outbrain neither collects nor holds personally identifiable information (PII), and the user identifiers we are releasing here are further obscured. Second, to protect our publisher partners and advertisers, we are not releasing URLs of viewed or clicked stories, but rather opaque document and site identifiers. We are also not releasing the text of the stories, only some of their semantic attributes, such as encoded categories and the entities these stories pertain to.

Important dates:

The Outbrain Challenge will open on October 5, 2016, and close at 11:59 PM UTC on January 18, 2017.

View the Outbrain Challenge page on Kaggle here

Ran Locar


Ran is a Senior Data Scientist at Outbrain, where he mines data and applies machine learning to terabyte-sized datasets to optimize Outbrain's recommendations. Ran has over 20 years of software development, troubleshooting, reverse engineering, data analysis, and data science experience. He has a B.Sc. in Computer Science and Business Administration from Tel Aviv University, and is a frequent participant in data science competitions. Prior to joining Outbrain, Ran was a Big Data Engineer at Amdocs, building predictive analytics solutions for leading telecommunications providers.

Ronny Lempel


Ronny Lempel joined Outbrain in May 2014 as VP of Outbrain's Recommendations Group, where he oversees the computation and delivery of the company's organic and paid recommendations, as well as the auction mechanisms that run in the paid recommendations' marketplace. Prior to joining Outbrain, Ronny spent 6.5 years as a Senior Director at Yahoo Labs. Ronny joined Yahoo in October 2007 to open and establish its Research Lab in Haifa, Israel. During his tenure at Yahoo, Ronny led R&D activities in diverse areas, including Web Search, Web Page Optimization, Recommender Systems and Ad Targeting. In January 2013 Ronny was appointed Yahoo Labs’ Chief Data Scientist in addition to his managerial duties. Prior to joining Yahoo!, Ronny spent 4.5 years at IBM's Haifa Research Lab with the Information Retrieval Group, where his duties included research and development in the area of enterprise search systems. During his tenure at IBM, Ronny managed the Information Retrieval Group for two years. Ronny received his PhD, which focused on search engine technology, from the Faculty of Computer Science at Technion, Israel Institute of Technology in early 2003. During his PhD studies, Ronny spent two summer internships at the AltaVista search engine. Ronny has authored over 40 research papers in leading conferences and journals, and holds 17 granted US patents. He regularly serves on program and organization committees of Web-focused conferences, and has taught advanced courses on Search Engine Technologies and Big Data Technologies at Technion.