This post was written by Stas Levin
Outbrain is proud to announce Aletheia, our solution for uniform data delivery and flow monitoring across data-producing and data-consuming subsystems. At Outbrain we have large amounts of data constantly being moved and processed by various real-time and batch-oriented mechanisms. To allow fast recovery and maintain a high SLA, we need to detect problems in our data crunching mechanisms as fast as we can, preferably in near real time. The later problems are detected, the harder they are to investigate (and thus fix), and the greater the chance of business impact.
To address these issues, we built Aletheia, a framework providing a uniform way to deliver and consume data, with built-in monitoring capabilities that allow both the producing and consuming sides to report statistics, which can be used to monitor the pipeline's state in a timely fashion.
Overview of Aletheia
Aletheia enables one to easily deliver domain entities (represented as classes) to, and from, what we call EndPoints. Aletheia consists of two main components, a DatumProducer and a DatumConsumer. As their names imply, each is responsible for either delivering data to, or consuming it from, some endpoint. Both the DatumProducer and the DatumConsumer report their ongoing statistics using what we call “breadcrumbs”, which are essentially messages consisting of metadata about the produced/consumed data. By comparing breadcrumbs reported by DatumProducers and DatumConsumers, one can get a picture of what portion of the data is available for consumption and what portion has already been consumed. This is even more useful when pipelines have more than one producer/consumer component, forming a more complex graph structure, with breadcrumbs reported at each stage.
Aletheia consists of two main components:
DatumProducer – the component responsible for producing (delivering) and auditing data to a certain EndPoint (a Kafka topic, a file, etc.)
DatumConsumer – the component responsible for consuming and auditing data from a certain EndPoint (a Kafka topic, a file, etc.)
In addition, there are also senders and receivers, responsible for communicating with particular endpoint types, be it Kafka, log files, or other custom endpoint types.
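To make the sender idea concrete, here is a minimal sketch of what a pluggable sender abstraction could look like. The interface and class names (`DatumSender`, `InMemorySender`) are illustrative assumptions, not Aletheia's actual API; the point is only that endpoint types can be swapped behind a common interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Aletheia's real API): a sender abstracts the
// act of delivering a datum to some endpoint type (Kafka, a file, ...).
interface DatumSender<T> {
    void send(T datum);
}

// A trivial in-memory endpoint, useful for tests: it just collects
// everything it is asked to deliver.
class InMemorySender<T> implements DatumSender<T> {
    final List<T> delivered = new ArrayList<>();

    @Override
    public void send(T datum) {
        delivered.add(datum);
    }
}

public class SenderSketch {
    public static void main(String[] args) {
        InMemorySender<String> sender = new InMemorySender<>();
        sender.send("click-event-1");
        sender.send("impression-event-1");
        System.out.println(sender.delivered.size()); // prints 2
    }
}
```

A Kafka- or file-backed implementation would implement the same interface, which is what makes adding a new endpoint type a local change rather than a pipeline rewrite.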
Aletheia is all about getting a datum from one place to another, where a datum is a single unit of information, typically a client's domain entity, say a click or an impression event. A datum is packed into a “DatumEnvelope”, which consists of some metadata and the actual serialized datum. Aletheia comes with native support for Kafka and text log file endpoints, but was built with extensibility in mind, making the introduction of new endpoint types easy.
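The envelope idea can be sketched as a plain value class: some metadata fields alongside the serialized datum bytes. The field names below are illustrative assumptions, not Aletheia's actual envelope schema.

```java
import java.nio.charset.StandardCharsets;

public class DatumEnvelopeSketch {

    // Hypothetical envelope: metadata plus the serialized datum.
    // Real Aletheia envelopes carry more (and differently named) fields.
    static final class DatumEnvelope {
        final String datumTypeId;     // which domain entity type this is
        final long creationTime;      // when the envelope was created
        final byte[] serializedDatum; // the datum itself, serialized

        DatumEnvelope(String datumTypeId, long creationTime, byte[] serializedDatum) {
            this.datumTypeId = datumTypeId;
            this.creationTime = creationTime;
            this.serializedDatum = serializedDatum;
        }
    }

    public static void main(String[] args) {
        byte[] payload = "{\"url\":\"/home\"}".getBytes(StandardCharsets.UTF_8);
        DatumEnvelope envelope =
            new DatumEnvelope("ClickEvent", System.currentTimeMillis(), payload);
        System.out.println(envelope.datumTypeId); // prints ClickEvent
    }
}
```

Keeping the payload opaque (just bytes) is what lets the same envelope travel through any endpoint type without the transport layer knowing anything about the client's domain entities.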
Both the DatumProducer and the DatumConsumer keep a record of their progress and periodically send out their aggregated monitoring information in the form of “breadcrumbs”, which can be used to form a comprehensive real-time picture of your pipeline's state.
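One simple way to use such breadcrumbs, sketched below under the assumption that each side reports a datum count per time bucket: subtract consumer-side counts from producer-side counts to spot buckets where data was delivered but not yet consumed. This is an illustration of the comparison idea, not Aletheia's actual breadcrumb format or aggregation logic.

```java
import java.util.Map;
import java.util.TreeMap;

public class BreadcrumbCompareSketch {

    // Given produced/consumed datum counts keyed by time bucket,
    // return the per-bucket gap (produced minus consumed).
    public static Map<String, Long> lagPerBucket(Map<String, Long> produced,
                                                 Map<String, Long> consumed) {
        Map<String, Long> lag = new TreeMap<>();
        for (Map.Entry<String, Long> e : produced.entrySet()) {
            long consumedCount = consumed.getOrDefault(e.getKey(), 0L);
            lag.put(e.getKey(), e.getValue() - consumedCount);
        }
        return lag;
    }

    public static void main(String[] args) {
        Map<String, Long> produced = Map.of("10:00", 1000L, "10:05", 1200L);
        Map<String, Long> consumed = Map.of("10:00", 1000L, "10:05", 900L);
        System.out.println(lagPerBucket(produced, consumed)); // {10:00=0, 10:05=300}
    }
}
```

A persistent non-zero gap in recent buckets is exactly the kind of near-real-time signal the post describes: it points at a lagging or stuck consumer long before downstream reports go wrong.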
Use cases at Outbrain
Producing data to Kafka clusters in multiple data centers
Consuming data from Kafka as part of our Storm topologies
Producing distributed log files generated by frontend servers
(Work in progress) Loading data files into Hadoop
Aletheia has consolidated the way we manage data production and consumption here at Outbrain. Its uniform API for delivering and consuming data from different sources, along with built-in breadcrumb emission, makes for what we have found to be a convenient abstraction layer.
Aletheia can be found at https://github.com/outbrain/Aletheia