Live Tail in Kubernetes / Docker Based environment

At Outbrain we are big believers in Observability.

What is Observability, and what is the difference between Observability and Monitoring? I will leave the explanation to Baron Schwartz @xaprb:

“Monitoring tells you whether the system works.  Observability lets you ask why it’s not working.”

At Outbrain we are currently in the midst of migrating to a Kubernetes / Docker based environment.

This presented many new challenges around understanding why things don’t work.

In this post I will share our logging implementation, which is the first tool we use to understand the why.

But first things first, a short review of our current standard logging architecture:

We use a standard ELK stack for the majority of our logging needs. By standard I mean Logstash on bare metal nodes, Elasticsearch for storage and Kibana for visualization and analytics. Apache Kafka is the transport layer for all of the above.

A very simplified sketch of the system:

[Figure: Live Tail in Kubernetes]

Of course the setup is a bit more complex in real life since Outbrain’s infrastructure is spread across thousands of servers, in multiple physical data centers and cloud providers; and there are multiple Elasticsearch clusters for different use cases.

Add to the equation that these systems are used in a self-serve model, meaning the engineers are creating and updating configurations by themselves – and you end up with a complex system which must be robust and resilient, or the users will lose trust in the system.

The move to Kubernetes presented new challenges and requirements, specifically related to the logging tools:

  • Support multiple Kubernetes clusters and data centers.
  • We don’t want to use “kubectl”, because managing keys is a pain, especially in a multi-cluster environment.
  • Provide a way to tail logs and even edit a log file. This should be available on a single pod or across a service deployed in multiple pods.
  • Leverage existing technologies: Kafka, ELK stack and Log4j on the client side
  • Support all existing logging sources like multiline and JSON.
  • Don’t forget services which don’t run in Kubernetes; yes, we still need to support those.

 

So how did we meet all those requirements? Time to talk about our new Logging design.

The new architecture is based on a standard Kubernetes logging setup: a Fluentd DaemonSet running on each Kubernetes node, with all services configured to send logs to stdout / stderr instead of a file.

The Fluentd agent collects the pods’ logs and adds the Kubernetes level labels to every message.

The Fluentd plugin we’re using is the kubernetes_metadata_filter.

After the messages are enriched they are stored in a Kafka topic.

A pool of Logstash agents (running as pods in Kubernetes) consumes and parses messages from Kafka as needed.

Once parsed, messages can be indexed into Elasticsearch or routed to another topic.

A sketch of the setup described:

And now it is time to introduce CTail.

CTail stands for Containers Tail. It is an Outbrain homegrown tool written in Go, built from a server-side and a client-side component.

A CTail server-side component runs per datacenter or per Kubernetes cluster. It consumes messages from a Kafka topic named “CTail” and, based on the Kubernetes app label, creates a stream which can be consumed via the CTail client component.

Since order is important for log messages, and since Kafka only guarantees order for messages in the same partition, we had to make sure messages are partitioned by the pod_id.
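To make the partitioning idea concrete, here is a minimal sketch in Go. It is not the actual CTail or Fluentd code: the FNV hash, the helper name partitionFor and the partition count of 12 are all assumptions made for illustration. The only point is that a deterministic hash of the pod ID sends every message from a given pod to the same partition, so Kafka's per-partition ordering applies to that pod's log lines.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a pod ID to a partition index in [0, numPartitions).
// Hypothetical helper for illustration; real Kafka clients ship their own
// key-based partitioners.
func partitionFor(podID string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(podID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numPartitions
}

func main() {
	pods := []string{
		"ob1ktemplate-test-ssages-2751568960-n1kwd",
		"ob1ktemplate-test-ssages-2751532409-n1kxv",
	}
	for _, pod := range pods {
		// The same pod ID always yields the same partition.
		fmt.Printf("%s -> partition %d\n", pod, partitionFor(pod, 12))
	}
}
```

Messages from different pods may still interleave in any order, which is fine for tailing: ordering only matters within a single pod's output.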

With this new setup and tooling, when Outbrain engineers want to live tail their logs, all they need to do is launch the CTail client.

Once the CTail client starts, it queries Consul, which is what we use for service discovery, to locate all of the CTail servers. It then registers to their streams and performs aggregations in memory, resulting in a live stream of log entries.
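The in-memory aggregation on the client can be pictured as a fan-in of several server streams into one. The sketch below is an assumption about the general shape, not the real CTail client: the LogEntry type and the merge function are hypothetical, and plain Go channels stand in for the network streams a client would open to each CTail server.

```go
package main

import (
	"fmt"
	"sync"
)

// LogEntry is a hypothetical shape for one streamed log line.
type LogEntry struct {
	Timestamp string
	Pod       string
	Message   string
}

// merge fans in entries from several server streams into one channel.
// The output channel is closed once every input stream has been closed.
func merge(streams ...<-chan LogEntry) <-chan LogEntry {
	out := make(chan LogEntry)
	var wg sync.WaitGroup
	for _, s := range streams {
		wg.Add(1)
		go func(s <-chan LogEntry) {
			defer wg.Done()
			for e := range s {
				out <- e
			}
		}(s)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	// Two buffered channels stand in for streams from two CTail servers.
	dc1 := make(chan LogEntry, 1)
	dc2 := make(chan LogEntry, 1)
	dc1 <- LogEntry{Timestamp: "2017-06-13T19:16:25.525Z", Pod: "pod-a", Message: "Running 5 self tests now..."}
	dc2 <- LogEntry{Timestamp: "2017-06-13T19:16:25.529Z", Pod: "pod-b", Message: "returned status code 200"}
	close(dc1)
	close(dc2)

	for e := range merge(dc1, dc2) {
		fmt.Printf("%s %s: %s\n", e.Timestamp, e.Pod, e.Message)
	}
}
```

Note that this sketch makes no attempt to order entries across streams; it only preserves the order within each stream.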

Here is a sketch demonstrating the environment and an example of the CTail client output:

CTail client output

 

To view logs from all pods of a service called “ob1ktemplate”, all you need to run is:

# ctail-client -service ob1ktemplate -msg-only

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: uri http://localhost:8181/Ob1kTemplate/ returned status code 200
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751532409-n1kxv: Getting uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate'
2017-06-13T19:16:25.531Z ob1ktemplate-test-ssages-2751568954-n1rte: uri http://localhost:8181/Ob1kTemplate/api/echo?name='Ob1kTemplate' returned status code 200

Or logs of a specific pod:

# ctail-client -service ob1ktemplate -msg-only -pod ob1ktemplate-test-ssages-2751568960-n1kwd

2017-06-13T19:16:25.525Z ob1ktemplate-test-ssages-2751568960-n1kwd: Running 5 self tests now...
2017-06-13T19:16:25.527Z ob1ktemplate-test-ssages-2751568960-n1kwd: Getting uri http://localhost:8181/Ob1kTemplate/
2017-06-13T19:16:25.529Z ob1ktemplate-test-ssages-2751568960-n1kwd: uri http://localhost:8181/Ob1kTemplate/ returned status code 200

 

This is how we solved this challenge.

Interested in reading more about other challenges we encountered during the migration? Either wait for our next blog post, or reach out to visibility at outbrain.com.

Monitoring APIs with ELK

The Basics

One of the main challenges we’ve dealt with during the last couple of years was opening our platform and recommendation engine to the developer community. With the amount of data that Outbrain processes, direct relations with hundreds of thousands of sites and a reach of more than 600M users a month, we can drive the next wave of content innovation. One of Outbrain’s main drivers for enabling an automated, large-scale recommendation system is giving application developers the option to interact with our system via an API.

Developers build applications, and those applications are used by users in different locations and at different times. When exposing an API for external usage, you can rarely predict how people will actually use it.

These variations can have different causes:

  1. Unpredictable scenarios
  2. Unintentional misuse of the API. Either for lack of proper documentation, a bug, or simply because a developer didn’t RTFM.
  3. Intentional misuse of the API. Yeah, you should expect people will abuse your API or use it for fraudulent activity.

In all those cases, we need to know how the developer community is using the APIs and how the end users (applications) are using them, so that we can take proactive measures.

Hello ELK.

The Stack

Elasticsearch, Logstash and Kibana (AKA ELK) are great tools for collecting, filtering, processing, indexing and searching through logs. The setup is simple: our service writes logs (using Log4J), and the logs are picked up by a Logstash agent that sends them to an Elasticsearch index. Kibana is set up to visualize the data in the ES index.

The Data

Web server logs are usually too generic. Application debug logs are usually too noisy. In our case, we have added a dedicated log with a single line for every API request. Since we’re in application code, we can enrich the log with interesting fields, such as the country of request origin (by translating the IP to a country).

Here’s a list of useful fields:

  • Request IP – Don’t forget about the X-Forwarded-For (XFF) header
  • Country / City – We use a 3rd party database for translating IPs to a country.
  • Request User-Agent
  • Request Device Type – Resolved from the User-Agent
  • Request HTTP Method – GET, POST, etc.
  • Request Query Parameters
  • Request URL
  • Response HTTP Status Code – 200, 204, etc.
  • Response Error Message – The API service can fill in extra details on errors.
  • Developer Identifier / API Key – If you can identify the Developer, Application or User, add these fields.

What can you get out of this?

So we’ve got the data in ES, now what?

Obvious – Events over time

This is pretty trivial. You want to see how many requests are made. With Kibana’s slice ’n’ dice capabilities, you can easily break it down per Application, Country, or any other field that you’ve bothered to add. In case an application is abusing your API and calling it a lot, you can see whose request volume suddenly jumped and handle it.

Request Origin

If you’re able to resolve the request IP (or XFF header IP) to a country, you’ll get a cool looking map / table and see where requests are coming from. This way you can detect anomalies such as fraud.

Http Status Breakdown

By itself, this is nice to have. When combined with Kibana’s slice ’n’ dice capabilities, this lets you see an overview for any breakdown. In many cases you can see that an application/developer is making the wrong API call. Be proactive and lend some assistance in near real time. Trust us, they’ll be impressed.
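Kibana does this breakdown for you, but the underlying computation is simple enough to sketch. The Go snippet below is only an illustration under assumed names (StatusEvent and statusBreakdown are hypothetical, as are the sample API keys): it counts responses per API key and HTTP status code, which is the table you would scan to spot a developer who keeps getting errors.

```go
package main

import "fmt"

// StatusEvent is a hypothetical reduced view of one API request log line.
type StatusEvent struct {
	APIKey string
	Status int
}

// statusBreakdown counts events per API key and HTTP status code.
func statusBreakdown(events []StatusEvent) map[string]map[int]int {
	out := make(map[string]map[int]int)
	for _, e := range events {
		if out[e.APIKey] == nil {
			out[e.APIKey] = make(map[int]int)
		}
		out[e.APIKey][e.Status]++
	}
	return out
}

func main() {
	events := []StatusEvent{
		{APIKey: "app-1", Status: 200},
		{APIKey: "app-1", Status: 200},
		{APIKey: "app-2", Status: 404},
		{APIKey: "app-2", Status: 404},
		{APIKey: "app-2", Status: 200},
	}
	for key, byStatus := range statusBreakdown(events) {
		fmt.Println(key, byStatus)
	}
}
```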

IP Diversity

Why would you care about this? Consider the following: a developer creates an application using your API, but all requests are made from a limited number of IPs. This could be intentional, for example, if all requests are made through some cloud service. This could also hint at a bug in the integration of the API. Now you can investigate.
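The IP diversity check boils down to counting distinct request IPs per application. Here is a minimal Go sketch of that computation; the Request type, the distinctIPs function and the sample data are hypothetical and only meant to show the idea on in-memory data.

```go
package main

import "fmt"

// Request is a hypothetical reduced view of one API request log line.
type Request struct {
	APIKey string
	IP     string
}

// distinctIPs returns, per API key, the number of unique request IPs.
func distinctIPs(reqs []Request) map[string]int {
	seen := make(map[string]map[string]struct{})
	for _, r := range reqs {
		if seen[r.APIKey] == nil {
			seen[r.APIKey] = make(map[string]struct{})
		}
		seen[r.APIKey][r.IP] = struct{}{}
	}
	counts := make(map[string]int, len(seen))
	for key, ips := range seen {
		counts[key] = len(ips)
	}
	return counts
}

func main() {
	reqs := []Request{
		{APIKey: "app-1", IP: "203.0.113.7"},
		{APIKey: "app-1", IP: "203.0.113.7"},
		{APIKey: "app-1", IP: "203.0.113.7"},
		{APIKey: "app-2", IP: "198.51.100.1"},
		{APIKey: "app-2", IP: "198.51.100.2"},
	}
	for key, n := range distinctIPs(reqs) {
		fmt.Printf("%s: %d distinct IPs\n", key, n)
	}
}
```

On real data volumes you would more likely push this into Elasticsearch, which offers a cardinality aggregation for approximate distinct counts.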

Save the Best for Last

The data exists in Elasticsearch. Kibana is just one way of using it. Here are a few awesome ways to use the data.

Automated Validations (or Anomaly detection)

Once we had identified key anomalies in API usage, we set up automated tests to search for these anomalies on a daily basis. Automatic anomaly detection in API usage has proved incredibly useful when scaling a product. These tests can be run on demand or scheduled, and a daily report is produced.

Abuse Detection

Elasticsearch is (as the name suggests) very elastic. It enables querying and aggregating the data in a variety of ways. Security experts can (relatively) easily slice & dice the data to find abuse patterns. For example, we detect when the same user-id is used in two different locations and trigger an alert.
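The same-user-in-two-locations rule can be sketched as a small Go function. This is an illustration under assumptions, not our production check: UserEvent and usersInMultipleCountries are hypothetical names, location is reduced to a country code, and the time window a real rule would presumably apply is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// UserEvent is a hypothetical reduced view of one API request log line.
type UserEvent struct {
	UserID  string
	Country string
}

// usersInMultipleCountries returns the sorted IDs of users whose events
// come from more than one country.
func usersInMultipleCountries(events []UserEvent) []string {
	countries := make(map[string]map[string]struct{})
	for _, e := range events {
		if countries[e.UserID] == nil {
			countries[e.UserID] = make(map[string]struct{})
		}
		countries[e.UserID][e.Country] = struct{}{}
	}
	var flagged []string
	for user, set := range countries {
		if len(set) > 1 {
			flagged = append(flagged, user)
		}
	}
	sort.Strings(flagged) // map iteration order is random, so sort for stable output
	return flagged
}

func main() {
	events := []UserEvent{
		{UserID: "u1", Country: "US"},
		{UserID: "u1", Country: "DE"},
		{UserID: "u2", Country: "IL"},
		{UserID: "u2", Country: "IL"},
	}
	for _, user := range usersInMultipleCountries(events) {
		fmt.Println("alert: user seen in multiple countries:", user)
	}
}
```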

Key Takeaways

  • Use ELK for analyzing your API usage
  • Have the application write the events (not a generic web-server).
  • Provide application-level information, e.g. additional error information or resolved geolocation.
  • Share the love