August | 2016 | Outbrain Techblog

August 28, 2016

Ori Lahav

ScyllaDB POC – (not so :) live blogging – Update #3.

: Scylla attacking Olysseus’s ship

It has been a long time (more than 4 months) since we last updated.

You can read the previous update here.

It is not that we abandoned the POC, we actually continued to invest time and effort on it since there is a good progress. It is just that we did not yet ended it and got the proofs that we wanted. While there was a lot of progress in both Outbrain system side and ScyllaDB side on those 4 months, there is one things that is holding us back from showing trying to prove the main point of this POC. Our current bottleneck is the network. The network on the datacenter where we are running the tests on is 1Gbps ethernet network. We found out that although Scylla is not loaded and works with good latencies we are saturating the NICs. We did some improvements along the way to still show that Scylla is behaving better than C* but if we want to show that we can significantly reduce the number of nodes in the cluster, we need to upgrade to 10Gbps ethernet.

This upgrade will come shortly.

This is where we currently stand. However – a lot was done in those 4 months and there is a lot of learnings I want to share. The rest of the post is the way Doron and the Scylla guys describes what happened. It looks more like captain’s log but it tells the story pretty well.

Scylla:

23/6 – We created special app server cluster to call Scylla, and delegated all calls both to C* and Scylla cluster. We wanted to do that so there will be less coupling between the C* path and the Scylla path and less mutual interruptions that will interfere our conclusions. The app servers for Scylla were configured not to use cache, so entire load (1.5M-3M RPM) went directly to Scylla. C* stayed behind cache and actually handled ~10% of the load. This went smoothly.
In the following ~3 weeks we tried to load the snapshot again from C* and stumbled with some difficulties, some related to bugs in Scylla, some to networking limits (1Gpbs). During this time we had to stop the writes to Scylla for few days, so the data was not sync again. Some actions we have done to resolve
1. We investigated bursts of usage we had and decreased them in some use-cases (both for C* and Scylla). They caused the network usage to be very high for few seconds, sometimes for a few tens of milliseconds. This also helped C* a lot. The tool is now open source.
2. We added client-server compression (LZ4). It was supported by Scylla, but client needed to configure it.
3. Scylla added server-server compression during this period.
4. Changed the “multi” calls back to N parallel single calls (instead of one IN request) – it better utilize the network.
5. Scylla client was (mistakably) using latency aware over the token aware. This caused app to go to the “wrong” node a lot – causing more traffic within Scylla nodes. Removing the latency-aware helped reducing the server-server network usage and the overall latency.

14/7 – with all the above fixes (and more from Scylla) we were able to load the data and stabilize the cluster with the entire load.
Until 29/7 I see many spikes in the latency. We are not sure what we did to fix it… but on 29/7 the spikes stopped and the latency is stable until today.
During this period we have seen 1-5 errors from Scylla per minute. Those errors were explained by trying to reach partitions coming from very old C* version. It was verified by logging the partitions we fail for in the app server side. Scylla fixed that on 29/7.
7/8-15/8 – we have changed the consistency level of both Scylla and C* to local one (to test C*) – this caused a slight decrease in the (already) low latencies.
Up to 19/8 we have seen occasional momentarily errors coming from Scylla (few hundreds every few hours). This has not happened since 19/8.. I don’t think we can explain why.
Current latencies – Scylla holds over 2M RPM with latency of 2 ms (99%ile) for single requests and 40-50 ms (99%ile) for multi requests of ~100 partitions in avg per request. Latencies are very stable with no apparent spikes. All latencies are measured from the app servers.

Next steps on Scylla:

Load the data again from C* to sync.
Repeat the data consistency check to verify the results from C* and Scylla are still the same.
Run repairs to see cluster can hold while running heavy tasks in the background..
Try to understand the errors I mentioned above if they repeat.
Create the cluster in Outbrain’s new Sacramento datacenter that have 10Gbps network. with minimum nodes (3?) and try the same load there.

Cassandra:

7/8 – we changed consistency level to local-one and tried to remove cache from Cassandra. The test was successful and Cassandra handled the full load with latency increasing from 5 ms (99%ile for single requests) to 10-15ms in the peak hours.
15/8 – we changed back to local-quorum (we do not like to have local-one for this cluster… we can explain in more details why) and set the cache back.
21/8 – we removed the cache again, this time with local-quorum. Cassandra handled it, but single requests latency increased to 30-40 ms for the 99%ile in the peak hours. In addition, we have started timeouts from Cassandra (timeout is 1 second) – up to 500 per minute, in the peak hours.

Next steps on C*:

Run repairs to see cluster can hold while running heavy tasks in the BG.
Try compression (like we do with Scylla).
Try some additional tweaks by the C* expert.
In case errors continue, will have to set cache back.

Current status comparison – Aug 20th:

The following table shows comparison under load of 2m RPM in peak hours.

Latencies are in 99%ile.

	Cassandra	ScyllaDB
Single call latency	30-40 ms (spikes to 70)	2 ms
Multi call latency	150-200 ms (spikes to 600)	40-50 ms
Errors (note: these are query times exceeding 1 second, not necessarily database failures)	Up to 150 a minute every few minutes timeouts per minute, with some higher spikes every few days	few hundreds every few days

Below are the graphs showing the differences.

Latency and errors graphs showing both C* and Scylla getting requests without cache (1M-2M RPM):

Latency comparison of single requests

Latency comparison of multi requests

Errors (timeouts for queries > 1 second)

Summary

There are very good signs that Scylla DB does make a difference in throughput but due to the network bottleneck, we could not verify it. We will update as soon as we have results on a faster network. Scylla guys are working on solution for slower networks too.

Hope to update soon.

August 21, 2016

Daniel Sternlicht

חדשות מעולם הפרונט אנד – ניוזלטר #5

השבוע בניוזלטר – לוגו האולימפיאדה בCSS3, האק בCanvas שיאפשר לכם לפרוץ את גבולות חלון הדפדפן, פודקסט שמוקדש למפתחי פרונט אנד, ועוד…

הלינקים הנבחרים מהפורום

ברוח האולימפיאדה, הנה מימוש של הלוגו בעזרת CSS3
http://codepen.io/AllThingsSmitty/pen/cHjIC/

כלי מצוין שיאפשר לכם לקבל סטטיסטיקות על הCSS באתר שלכם ממנו תוכלו להסיק מסקנות על יעילות הקוד שלכם
http://cssstats.com/

תפסיקו להשתמש בng-repeat, תעברו לng-options. הנה למה:
http://www.codelord.net/2016/08/12/stop-ng-repeating-options-use-ng-options/

מאמר מעניין על יצירת טמות (Themes) בעזרת פיצ’רים חדשים של המאפיין Color
https://cloudfour.com/thinks/building-themes-with-css4-color-features/

פודקסט חדש ומעניין שמעלה כל שבוע שלושה מפתחי JavaScript שיתנו לכם טיפים איך להתפתח בתחום הקליינט
http://careerjs.com/

מאמר מרתק על CSS Reflections, תוך כדי ניסיון לשחזור הLoader הזה (http://codepen.io/TheDutchCoder/pen/IKqpA).
https://css-tricks.com/state-css-reflections/

האק מגניב לפריצת הגבולות של הדפדפן בעזרת Canvas
https://twitter.com/benjaminbenben/status/762929148503347200

מכירים את הטי-רקס שמופיע בכרום כשאתם אופליין? במגאזין פיסקל פרפקט הישראלי החליטו לנסות להבין למה בחרו דווקא בו
http://www.pixelperfect.co.il/posts/googlechromest-rex/

אם אי פעם תצטרכו ליצור מכונת תופים רספונסיבית (WAT?!), הנה מדריך שיוכל לעזור לכם במשימה
https://www.smashingmagazine.com/2016/08/how-to-create-a-responsive-8-bit-drum-machine-using-web-audio-svg-and-multitouch/

בהמשך ליצירת הפיקאצ’ו מהניוזלטר האחרון, הנה אוסף פוקמונים בCodepen
http://codepen.io/collection/XbrqEP/#

כנראה שיצא לכם להשתמש ב:hover לא פעם, המאמר הבא ינסה להניא אתכם מלהשתמש בו שוב
https://medium.com/instacart-design/hover-is-dead-long-live-hover-37a89d3795df#.ff6i8qefv

כאילו אין מספיק, הנה אוסף של 67 פלאגינים למימוש Modal/Popup בJavaScript
http://www.appcues.com/blog/67-open-source-modal-window-plugins-made-with-jquery-javascript-css-and-more/

לולאת for נגד לולאת forEach, מתי להשתמש במה
http://thejsguy.com/2016/07/30/javascript-for-loop-vs-array-foreach.html

מודול Node נחמד שיאפשר לכם לעשות אינטגרציה עם Slack בקלות
https://github.com/BeepBoopHQ/slapp

גוגל מודיעה על חסימת סרטוני Flash החל מגירסה 53 של כרום!
https://chrome.googleblog.com/2016/08/flash-and-chrome.html

התחרות השנתית של מוזילה לפיתוח משחקי JavaScript ב13k יוצאת לדרך
https://hacks.mozilla.org/2016/08/js13kgames-code-golf-for-game-devs/

ספריה קטנה וחמודה למימוש Rating בשלל אנימציות
http://lunarlogic.github.io/starability/

קצת על הניוזלטר והפורום

אחת לשבועיים מתקיים פורום בקרב מפתחי הפרונט אנד באאוטבריין. הפורום מתחלק לשני חלקים, הראשון: סבב שיתוף ידע בין אנשי הפורום, השני: הרצאה או דיון מעמיק על נושא מסוים בעולם הפרונט אנד. בניוזלטר הדו-שבועי שלנו תוכלו לקבל תקציר לכל הדברים עליהם דיברנו בפורום.

August 8, 2016

Daniel Sternlicht

חדשות מעולם הפרונט אנד – ניוזלטר #4

השבוע בניוזלטר – טירוף הפוקימון גו מגיע לעולם הפרונט אנד, אימולטור של Terminal שכתוב בJS בלבד, איך אינסטגרם היה נראה בווינדוס 95, ועוד…

הלינקים הנבחרים מהפורום

ברוח הטירוף של פוקימון גו, הנה פיקאצ’ו בCSS
https://codepad.co/snippet/cRS9VPZ6

מאמר מעניין על למה תפריטי DropDown מאטים את העבודה שלנו במובייל
https://medium.com/theuxblog/mobile-dropdowns-revisited-17358ceb5340#.lwo1i1b3w

פרויקט AMP של גוגל מתרחב, ויאפשר להפוך את הווב במובייל ליותר מהיר גם בדפים שאינם דפי תוכן
https://techcrunch.com/2016/08/02/googles-amp-expands-beyond-news/

ילידי שנות ה90? הנה איך אינסטגרם היה נראה בווינדוס 95
https://www.behance.net/gallery/41023081/Instagram-for-Win95

כל מה שרציתם לדעת על אפקטים בעזרת CSS Filter
https://www.sitepoint.com/css-filter-effects-blur-grayscale-brightness-and-more-in-css/

אימולטור לTerminal שכתוב 100% בJavaScript
https://hyperterm.org/

מדריך מצוין שיעזור לכם ללמוד איך לכתוב Unit Tests שיהיו Cross-browser
https://philipwalton.com/articles/learning-how-to-set-up-automated-cross-browser-javascript-unit-testing/

אם אתם עובדים עם React ובמקרה יוצא לכם לכתוב טמפלייט לאימייל, המאמר הבא יסביר לכם איך לכתוב את הטמפלייט בעזרת React
https://assertible.com/blog/creating-email-templates-with-react-components

שימושים מעניינים ליחידות Viewport בCSS
https://falkus.co/2016/07/neat-uses-for-csss-awesome-viewport-units/

Blog Posts - August 2016

ScyllaDB POC – (not so :) live blogging – Update #3.

Scylla:

Next steps on Scylla:

Cassandra:

Next steps on C*:

Current status comparison – Aug 20th:

Latencies are in 99%ile.

Summary

Search

עברית

Categories

Archive

RSS