When I was a kid, my parents used to tell me that I can’t have my cake and eat it too. Actually, that’s a lie, they never said that. Still, it is something I hear parents say quite often. And not just parents. I meet the same attitude everywhere I go: people constantly taking a firm, almost religious stance about choosing one thing over another. Mac vs PC, Android vs iOS, Chocolate vs Vanilla (obviously Chocolate!).
So I’d like to take a moment to suggest a different, more inclusive approach.
Forget Mac vs PC. Forget Chocolate vs Vanilla.
I don’t want to choose. I want it all!
At Outbrain, the core of our compute infrastructure is based on bare metal servers. With a fleet of over 6000 physical nodes, spread across 3 data centers, we’ve learned over the years how to manage an efficient, tailored environment that caters to our unique needs, one of which is processing and serving over 250 Billion personalized recommendations a month to over 550 Million unique users.
Still, we cannot deny that the Cloud brings forth advantages that are hard to achieve in bare metal environments. And in the spirit of inclusiveness (and maximizing value), we want to leverage these advantages to complement and extend what we’ve already built, whether for workloads that require a high level of elasticity, such as ad-hoc research projects involving large amounts of data, or simply for external services that can increase our productivity. We’ve come to view Cloud Solutions as supplemental to our tailored infrastructure rather than a replacement.
Over recent months, we’ve been experimenting with 3 different vectors involving the Cloud: absorbing traffic surges, running ad-hoc research, and analysing access logs.
Our world revolves around publications, especially news. As such, whenever a major news event occurs, we feel an immediate, potentially high impact. Users rush to publisher sites, where we are installed. They want their news, they want their recommendations, and they want them all now.
For example, when Carrie Fisher, AKA Princess Leia, passed away last December, we saw a 30% traffic increase on top of our usual peak traffic. That’s quite a spike.
Since we usually do not know when a breaking news event will happen, we are required to keep enough extra capacity on hand to support such surges.
By leveraging the cloud, we can keep that extra capacity to a bare minimum, relying instead on the inherent elasticity of the cloud and provisioning only what we need, when we need it.
Doing this can improve the efficiency of our environment and cost model.
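To make the trade-off concrete, here is a rough back-of-the-envelope sketch. Only the 30% surge comes from the example above; the node count, the surge duration and the function name are made up for illustration and are not our real numbers.

```python
# Back-of-the-envelope comparison: always-on headroom vs. bursting on demand.
# Everything except the 30% surge figure is a hypothetical illustration.

HOURS_PER_MONTH = 730  # average number of hours in a month


def extra_nodes(peak_nodes: int, surge_percent: int) -> int:
    """Nodes needed on top of peak capacity to absorb a surge (rounded up)."""
    return (peak_nodes * surge_percent + 99) // 100


PEAK_NODES = 1000    # hypothetical: nodes serving the usual peak traffic
SURGE_PERCENT = 30   # the surge we saw on top of peak traffic
SURGE_HOURS = 6      # hypothetical: how long one surge lasts

headroom = extra_nodes(PEAK_NODES, SURGE_PERCENT)

# Standing headroom sits there all month, whether or not news breaks.
always_on_node_hours = headroom * HOURS_PER_MONTH

# Elastic headroom is only provisioned while the surge lasts.
on_demand_node_hours = headroom * SURGE_HOURS

print(f"extra nodes needed:   {headroom}")
print(f"always-on node-hours: {always_on_node_hours}")
print(f"on-demand node-hours: {on_demand_node_hours}")
```

Node-hours are not the same thing as cost, since an on-demand cloud node-hour is usually priced differently from an owned bare metal one. The sketch only shows how much idle capacity is at stake.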
A couple of months back, one of our researchers came up with an interesting behavioral hypothesis. For the discussion at hand, let’s say that it was “people who like chocolate are more likely to raise pet gerbils.” (drop a comment with the word “gerbils” to let me know that you’ve read this far). That sounded interesting, but raised a challenge. To validate or disprove this, we needed to analyze over 600 Terabytes of data.
We could have run it on our internal Hadoop environment, but that came with a not-so-trivial price tag. Not only would we have had to provision additional capacity in our Hadoop cluster to support the workload, we also expected the analysis to impact existing workloads running in the cluster. And all this before getting into operational aspects such as labor and lead time.
Instead, we chose to upload the data into Google’s BigQuery. This gave us both shorter lead times for the setup and very nice performance. In addition, 3 months into the project, when the analysis was completed, we simply shut down the environment and were done with it. As simple as that!
We use Fastly for dynamic content acceleration. Given the scale we mentioned, this has the side-effect of generating about 15 Terabytes of Fastly access logs each month. For us, there’s a lot of interesting information in those logs. And so, we had 3 alternatives when deciding how to analyse them:
- SaaS-based log analysis vendors
- An internal solution, based on the ELK stack
- A cloud-based solution, based on BigQuery and DataStudio
After performing a PoC and running the numbers, we found that the BigQuery option, if done right, was the most effective for us in terms of both cost and required effort.
There are challenges when designing and running a hybrid environment. For example, you have to make sure you have consolidated tools to manage both on-prem and Cloud resources. The predictability of your monthly cost isn’t as trivial as before (no one likes surprises there!), controls around data can demand substantial investments… but that doesn’t make the fallback to “all Vanilla” or “all Chocolate” a good one. It just means that you need to be mindful and prepared to invest in tooling, education and processes.
I’d like to revisit my parents’ advice, and try to improve on it a bit (which I’m sure they won’t mind!):
Be curious. Check out what is out there. If you like what you see – try it out. At worst, you’ll learn something new. At best, you’ll have your cake… and eat it too.