The first part of any data science task is getting a good dataset to work with. We have a lot of data, but preparing the datasets can be hard work – you really have to “get your hands dirty” to pull the data from all the sources and tables and convert it into an easy-to-use dataset.
What are the main challenges in creating a good dataset to use?
- Many output tables – tables that store recommendation requests, served recommendations, user clicks, profiles for users and documents, and more.
- Number of data stores – these tables live in a number of data stores, due to their different natures. Some are Hive tables, while other data is stored in MySQL or Cassandra.
- Long queries – some of these tables are very big. Querying them, especially for a long date range, can take a while.
- Irrelevant data – we rarely want data from our entire traffic. Usually we only want some portion of it which is relevant for the current modeling task.
Silos and partitioning:
Beyond addressing these challenges, a good data collection process offers additional advantages.
We want the ability to train models on different silos – different population groups that may behave differently and require different models.
To make this easy, we add a number of columns and partitions to our output aggregation tables – such as platform, country, language and more.
This allows us to quickly try out our models on different groups.
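As a sketch of what selecting a silo from such partitioned data looks like, here is a minimal plain-Python example (in the real job this would be a partition-pruned query against the aggregation tables; the column names and the `select_silo` helper are illustrative, not the actual schema):

```python
# Minimal sketch: selecting a "silo" (population group) from records that
# carry partition columns. Plain dicts stand in for table rows here.
def select_silo(records, **partition_filters):
    """Return only the records whose partition columns match the filters."""
    return [
        r for r in records
        if all(r.get(col) == val for col, val in partition_filters.items())
    ]

records = [
    {"platform": "mobile", "country": "US", "clicked": 1},
    {"platform": "desktop", "country": "US", "clicked": 0},
    {"platform": "mobile", "country": "FR", "clicked": 0},
]

# Grab the US-mobile silo to train a model for that population.
us_mobile = select_silo(records, platform="mobile", country="US")
```

Because each partition column maps to a physical partition of the table, such a filter only reads the relevant slice of the data rather than scanning all of the traffic.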
We decided to split our output into two main parts:
First, a dataset for building models: it contains only the served recommendations we want (from specific variants of the traffic) – all of the clicked recommendations plus a sample of the non-clicks, in order to have a balanced dataset for learning.
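This keep-all-clicks, sample-the-non-clicks idea can be sketched in a few lines of plain Python (the function name and sample rate are hypothetical; the real job does this at Spark scale):

```python
import random

def build_training_set(recommendations, non_click_sample_rate, seed=42):
    """Keep every clicked recommendation and a random sample of the
    non-clicks, so the resulting dataset is roughly balanced."""
    rng = random.Random(seed)
    clicks = [r for r in recommendations if r["clicked"]]
    non_clicks = [r for r in recommendations if not r["clicked"]]
    sampled = [r for r in non_clicks if rng.random() < non_click_sample_rate]
    return clicks + sampled

# Toy example: 2 clicks among 102 served recommendations.
data = [{"id": i, "clicked": i < 2} for i in range(102)]
training = build_training_set(data, non_click_sample_rate=0.1)
```

Since clicks are rare relative to serves, sampling the non-clicks down keeps the positive/negative ratio reasonable without losing any positive examples.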
Second, a dataset that will be used for simulation of the business metrics.
Our recommendations are served in widgets, showing a small batch of recommendations.
For this use case, we take only recommendations from widgets that received at least one click.
With this dataset, we can apply our model to just these clicked widgets, and see how well we graded the clicked recommendation compared to the other, non-clicked recommendations.
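The filtering and grading steps above can be sketched as follows (plain Python, with illustrative field names – the real simulation runs over the Hive output):

```python
def clicked_widgets(widgets):
    """Keep only widgets that received at least one click; the
    simulation dataset is built from these."""
    return [w for w in widgets if any(r["clicked"] for r in w["recs"])]

def clicked_rec_rank(widget, score):
    """Rank (1 = best) that the model's score assigns to the clicked
    recommendation within its widget."""
    ranked = sorted(widget["recs"], key=score, reverse=True)
    return 1 + next(i for i, r in enumerate(ranked) if r["clicked"])

widgets = [
    {"recs": [{"id": "a", "clicked": 0}, {"id": "b", "clicked": 1}]},
    {"recs": [{"id": "c", "clicked": 0}, {"id": "d", "clicked": 0}]},
]
sim = clicked_widgets(widgets)  # the all-non-click widget is dropped
rank = clicked_rec_rank(sim[0], score=lambda r: {"a": 0.2, "b": 0.9}[r["id"]])
```

If the model consistently ranks the clicked recommendation near the top of its widget, that is a good offline signal before pushing the model to production.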
The solution – Automatic Spark job
Our solution to these challenges was to create an automatic data collection job.
The job runs hourly, triggered by our ETL engine.
An hourly Apache Spark job aggregates a dataset with the relevant data, creates the needed partitions, and creates the two outputs described above.
Using Spark was very convenient for this use case. It allowed us to create a readable flow that queries the different input sources and holds the data common to both tables in memory before writing the final output to Hive.
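A rough sketch of that flow, with plain-Python lists standing in for Spark DataFrames and the Hive writes (the function and field names are illustrative, not the actual job's code):

```python
def run_hourly_job(serves, clicks):
    """Sketch of the hourly aggregation: join the inputs once, keep the
    joined data around (a cache() in the real Spark job), then derive
    both outputs from that shared intermediate."""
    clicked_ids = {c["rec_id"] for c in clicks}
    # Shared intermediate: served recommendations labeled with clicks.
    labeled = [{**s, "clicked": s["rec_id"] in clicked_ids} for s in serves]

    # Output 1: modeling dataset (all clicks plus non-clicks; the real
    # job also samples the non-clicks down for balance).
    modeling = labeled

    # Output 2: simulation dataset – only widgets with at least one click.
    widgets = {}
    for r in labeled:
        widgets.setdefault(r["widget_id"], []).append(r)
    simulation = [recs for recs in widgets.values()
                  if any(r["clicked"] for r in recs)]
    return modeling, simulation

serves = [{"rec_id": 1, "widget_id": "w1"},
          {"rec_id": 2, "widget_id": "w1"},
          {"rec_id": 3, "widget_id": "w2"}]
modeling, simulation = run_hourly_job(serves, clicks=[{"rec_id": 1}])
```

The key design point is that the click-labeled data is computed once and reused for both outputs, instead of re-querying the input sources per output.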
A quick note on how we monitor our Spark jobs:
It is somewhat of a challenge to understand how a Spark job behaves beyond the basic error messages and checking the job’s output.
To make the job’s progress and status more visible, we send metrics from the job’s driver, using HTTP, to our monitoring server, which aggregates them.
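A minimal sketch of that driver-side reporting, using only the standard library (the endpoint, payload shape, and metric names are assumptions for illustration – the actual monitoring server's API may differ):

```python
import json
import urllib.request

def build_metrics_payload(job_name, metrics):
    """Shape the driver-side counters into a JSON document for the
    monitoring server (field names here are illustrative)."""
    return json.dumps({"job": job_name, "metrics": metrics}).encode("utf-8")

def report_metrics(endpoint, job_name, metrics):
    """POST the metrics over HTTP; called periodically from the driver."""
    req = urllib.request.Request(
        endpoint,
        data=build_metrics_payload(job_name, metrics),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

payload = build_metrics_payload("hourly-collector", {"rows_written": 12000})
```

Sending from the driver keeps the reporting simple: executors stream their counters back to the driver anyway, so one process owns the HTTP connection to the monitoring server.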
This allows us to create simple-to-use dashboards and easily understand what is going on.
In addition, we send textual logs to our monitoring server, which indexes them into an Elasticsearch cluster. We can then view the logs easily using Kibana.
Below is a dashboard example for a demo job, collecting a portion of our data.
Stay tuned for the third and final part of this blog post series, where we will cover our Spark-based machine learning framework, which allows us to be highly agile with our research tasks, and dynamic as well as robust in pushing our models to production.