Skills Matter Performance Summit 2021

Daniel Moraru

As someone with a background in performance engineering, I was asked to cover this year’s Performance Summit conference and share my experience with the team. There was no socializing between presentations this time - the conference was held entirely online, in keeping with the times.

I grouped the summaries by the subject of the presentations (mobile, machine learning and system/Linux) so that readers can jump to the category that interests them.


Mobile

Nian Sun, Tech Lead at ByteDance, opened the first day with Task scheduling optimizations in Mobile Applications. It was a good choice to start the summit with, as it was pretty low level and required full focus, yet the high level conclusions were very useful for mobile application developers. He discussed the scheduling customizations that are possible on the heterogeneous CPU architectures common in mobile phones, and the approaches taken to ensure maximum application responsiveness while consuming as little power as possible. He mentioned that a multi-threaded approach might not work so well for reducing execution time on mobile phones: on some schedulers, multiple threads could fully utilize the big cores while leaving the little ones unused.

Another problem he described was very short-lived threads, the type one would use to try to speed up application startup. The schedulers on mobile phone CPUs might not deal very well with context switches between threads, which can make the multi-threaded approach slower than single-threaded execution. While there are optimization strategies at the system level, as an application developer you might want to minimize context switches and run longer tasks on the same threads - that should give better performance on most phones in the Android ecosystem. He also mentioned speculative page faults, an optimization of concurrent access to the memory map that would speed up applications in memory constrained environments.
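To make the "longer tasks on the same threads" advice concrete, here is a minimal sketch. It is written in Python purely for illustration (an Android application would use its own executors), and the function names are hypothetical, not taken from the talk. It contrasts one short-lived thread per task with a few long-lived workers that each run a whole batch of tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_one_thread_per_task(tasks):
    """Baseline: spawn a short-lived thread for every tiny task."""
    results = [None] * len(tasks)

    def worker(index, task):
        results[index] = task()

    threads = [
        threading.Thread(target=worker, args=(i, t)) for i, t in enumerate(tasks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def run_batched(tasks, workers=2):
    """Alternative: a few long-lived workers, each running many tasks back to back."""
    batch_size = max(1, -(-len(tasks) // workers))  # ceiling division
    batches = [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]

    def run_batch(batch):
        return [task() for task in batch]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        nested = list(pool.map(run_batch, batches))  # map preserves batch order
    return [result for batch in nested for result in batch]


# Eight tiny "startup" tasks; both strategies return the same results.
tasks = [lambda n=n: n * n for n in range(8)]
print(run_batched(tasks))
```

Whether the batched version is actually faster depends on the device and scheduler, so, as the speakers stressed, it needs to be measured.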

Another interesting presentation, again very low level, was given by Ivica Bogosavljevic: Performance Optimization Techniques for Mobile Devices: An Overview. It also touched briefly on the difficulty of writing high performance multi-threaded applications for mobile CPUs, and he listed the usual pitfalls that might cause a single-threaded approach to work better: increased power consumption, cores that are not fully utilized yet cannot go idle and so consume power without any significant benefit, complicated scheduling, and the need to always measure to make sure you are actually benefiting from using multiple threads. To maximize energy efficiency, try to keep the CPUs idle most of the time: make your software faster :-), group data and process it in one go so that the CPU can go back to idle sooner, use events rather than polling, and minimize communication with other components. He groups applications into core bound (CPU intensive) and memory bound, and states that the performance goal is always to transform a memory bound application into a core bound one. He discussed a number of optimization strategies and low level techniques that help both memory and core bound applications. The effect of many of these techniques is not visible on small data sets, but they make a huge difference on large ones.
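The "use events rather than polling" and "process data in one go" advice can be sketched as follows. This is an illustrative Python example, not code from the talk; `consume_batches` and `STOP` are hypothetical names. The consumer blocks on a queue, so it sleeps until data arrives, and then drains everything already queued into a single batch.

```python
import queue

STOP = object()  # hypothetical sentinel that tells the consumer to finish


def consume_batches(work_queue, stop=STOP):
    """Block until data arrives, then drain the queue and handle it as one batch."""
    batches = []
    while True:
        item = work_queue.get()  # sleeps until an item is available; no busy loop
        if item is stop:
            return batches
        batch = [item]
        # Collect whatever else is already waiting so it is processed together.
        while True:
            try:
                nxt = work_queue.get_nowait()
            except queue.Empty:
                break
            if nxt is stop:
                batches.append(batch)
                return batches
            batch.append(nxt)
        batches.append(batch)


q = queue.Queue()
for value in (1, 2, 3):
    q.put(value)
q.put(STOP)
print(consume_batches(q))
```

The polling alternative would wake up on a timer to check for work, which keeps the CPU from staying idle even when nothing has arrived.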

A related mobile presentation on this day was given by Yaniv Sabo and Simon Casagranda of Facebook: Shipping 2021 user experiences on 2011 phones. They talked about Facebook Lite and the challenges they faced in providing the best possible Facebook experience to people with 10 year old phones. The architecture they chose in the end was interesting: they moved the entire application business logic off the user’s phone to Facebook’s servers, leaving the mobile application as a simple shell that fits much better within the hardware constraints. They listed the key application requirements for good performance on devices with limited resources: small binary size, very low data usage, reliable connections, acceptable performance, multi-language support and core functionality.

Machine Learning (ML)

There were three presentations related to using ML for performance engineering tasks. Two covered the optimization of deployment costs and one covered the prediction of performance regressions.

The two that covered deployment cost optimization were given by people from Akamas, and both seem to present the same or a very similar reinforcement learning algorithm, but with slightly different goals. The first was given by Stefano Doni: Using ML to Automatically Optimize Kubernetes for Cost Efficiency and Reliability. The problem they wanted to solve is one that resonates with a lot of organizations using Kubernetes: how to configure resource limits for a large deployment so as to get the best value for money, which for the application presented meant the best throughput for the money. He mentioned something very interesting: they found that resource limits start throttling CPU availability at very low utilization rates, at about 30%, and that this throttling has significant performance implications - performance can be 22 times worse. This makes solving the problem manually an even more painful job.

The approach they took was to start with a configuration, apply the workload, collect KPIs, score the run, let the ML optimizer indicate the most promising configuration for the next run, and repeat. The application shown in the presentation consisted of 10 microservices running in a Kubernetes environment, plus a load generation tool. The environment had 22 configurable parameters for CPU and memory limits and exposed 275 metrics. The objective of the optimization was to maximize the ratio of service throughput to service cost (cost efficiency), given specific constraints such as a response time percentile for the clients and the error rate. The machine learning optimizer was let loose on this environment, and at the end of 35 iterations over 24 hours it came up with a 77% improvement in cost efficiency.
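The configure, measure, score and repeat loop described above can be sketched as below. This is not the Akamas algorithm: a simple random hill-climbing search stands in for the reinforcement learning optimizer, `run_experiment` is a synthetic stand-in for deploying a configuration and load testing it, and every number in the model is made up for illustration.

```python
import random


def run_experiment(config):
    """Hypothetical stand-in for: deploy config, apply workload, collect KPIs."""
    cpu, mem = config["cpu_limit"], config["memory_gb"]
    return {
        "throughput": 100.0 * min(cpu, 4.0),        # saturates at 4 cores (made up)
        "p95_latency_ms": 400.0 / cpu,              # more CPU, lower latency (made up)
        "error_rate": 0.05 if mem < 1.0 else 0.001, # too little memory causes errors
        "cost": 10.0 * cpu + 2.0 * mem,             # made-up price model
    }


def score(kpis, max_latency_ms=200.0, max_error_rate=0.01):
    """Cost efficiency (throughput per unit of cost); constraint violations score 0."""
    if kpis["p95_latency_ms"] > max_latency_ms or kpis["error_rate"] > max_error_rate:
        return 0.0
    return kpis["throughput"] / kpis["cost"]


def propose_next(best_config, rng):
    """Stand-in for the ML optimizer: perturb the best configuration seen so far."""
    cpu = best_config["cpu_limit"] + rng.uniform(-1.0, 1.0)
    mem = best_config["memory_gb"] + rng.uniform(-2.0, 2.0)
    return {
        "cpu_limit": min(8.0, max(0.5, cpu)),
        "memory_gb": min(16.0, max(0.5, mem)),
    }


def optimize(baseline, iterations=35, seed=0):
    """Run the loop: try a candidate, score it, keep it if it beats the best so far."""
    rng = random.Random(seed)
    best_config, best_score = baseline, score(run_experiment(baseline))
    for _ in range(iterations):
        candidate = propose_next(best_config, rng)
        candidate_score = score(run_experiment(candidate))
        if candidate_score > best_score:
            best_config, best_score = candidate, candidate_score
    return best_config, best_score


baseline = {"cpu_limit": 8.0, "memory_gb": 16.0}
best_config, best_score = optimize(baseline)
print(best_config, round(best_score, 2))
```

The default of 35 iterations simply mirrors the number quoted in the talk. In a real setup each iteration is a full deployment and load test, which is why the choice of the next candidate matters so much more than it does in this toy model.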

The other presentation from Akamas was given by Luca Chiabrera - Tailoring Google Dataproc to Reduce Spark Execution Time and Cut the Bill - and covered an identical approach, this time used to find the optimum Spark cluster configuration for a specific job. Here the optimization targets were a bit different: you could optimize for cost or for job execution duration. For the latter, limiting the allowed variation in cost was essential to producing an acceptable result (a significant execution time reduction at a reasonable cost increase).

The last ML related presentation was given on the second day by Moritz Beller of Facebook - Learning to Regressions in Production. He presented Facebook’s ML approach to predicting performance regressions by analysing the application code base. This is a very interesting approach that could identify a performance regression as early as in the developer’s IDE, and it was made possible by the massive amounts of data Facebook collects through its development, deployment and test processes. Facebook was in an extremely favorable position, of their own doing, to attempt this project: over the years, their continuous delivery and performance regression testing generated large numbers of small code changes, each with its own performance regression numbers, which were used to train this deep learning algorithm. It will be interesting to see if and how this work can be generalized and maybe extended to technologies not used by Facebook. An interesting fact he mentioned was that even though the algorithm was trained on one language, it worked very well on another, simply because the languages were similar enough that processing both source codes resulted in similar tokens.
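To illustrate why similar languages can produce similar tokens, here is a toy sketch. It is in no way Facebook's model or tokenizer: the regex lexer, the two snippets (one Java-like, one C#-like) and the Jaccard overlap measure are all assumptions chosen for illustration.

```python
import re

# Naive lexer: identifiers, integer literals, or any single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_PATTERN.findall(source)


def jaccard_similarity(source_a, source_b):
    """Share of distinct tokens that the two snippets have in common."""
    tokens_a, tokens_b = set(tokenize(source_a)), set(tokenize(source_b))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical snippets in two syntactically similar languages.
snippet_a = "for (int i = 0; i < items.size(); i++) { total += items.get(i).price; }"
snippet_b = "for (int i = 0; i < items.Count; i++) { total += items[i].price; }"

print(jaccard_similarity(snippet_a, snippet_b))
```

Most of the keywords, operators and punctuation are shared between the two snippets, so a model that consumes token sequences sees largely familiar input even in the language it was not trained on.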


System/Linux

Daniele Salvatore Albano’s presentation, Cachegrand – A Super Scalar Caching Platform, discussed Cachegrand, a very fast caching platform designed with a performance first approach. Cachegrand is a Linux only application written in C, and Daniele presented the optimization techniques he used when building it: organize structs in a way that is convenient for the CPU rather than for humans, align and size structs to exploit cache lines, pin threads to cores without sharing memory, prefer memory barriers, avoid syscalls, and use fibers over OS threads. Because optimizations at this level almost always come at a cost in code complexity, his advice is to always test and measure an optimization to make sure the extra complexity is worth it (compilers are pretty good at optimizing things). And beware of microbenchmarks. The talk covered many other low level, Linux specific optimizations that should interest any developer concerned with maximizing performance on this platform.
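The point about laying out structs for the CPU rather than for humans can be demonstrated with a small sketch. Cachegrand itself is written in C; Python's `ctypes` is used here only because it follows the platform's native C alignment rules, which makes the padding easy to inspect. The struct fields and the 64-byte cache line size are assumptions for illustration, not taken from Cachegrand.

```python
import ctypes


class Padded(ctypes.Structure):
    # Fields in the order a human might group them: the 8-byte counter sits
    # between two 1-byte flags, so the compiler inserts padding around it.
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("counter", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same fields, largest first, so both flags share a single padded tail.
    _fields_ = [
        ("counter", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


CACHE_LINE_BYTES = 64  # assumption: common on current x86-64 and many ARM cores


def structs_per_cache_line(struct_type):
    """How many instances of the struct fit into one cache line."""
    return CACHE_LINE_BYTES // ctypes.sizeof(struct_type)


for struct_type in (Padded, Reordered):
    print(
        struct_type.__name__,
        ctypes.sizeof(struct_type),
        structs_per_cache_line(struct_type),
    )
```

On a typical 64-bit platform the first layout takes 24 bytes and the second 16, so more of the reordered structs fit into each cache line. As the talk advised, any such change should still be measured in the real application.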

Paul McKenney of Facebook has a great deal of experience writing critical concurrent software. In his presentation, The Price of Fast and Robust Concurrent Software, he touched on the problems he commonly encounters as the maintainer of the RCU module in the Linux kernel. He mentioned how a bug that occurs once in a million years on a single system can turn into a serious problem once your code runs on billions of devices, how bugs evolve with your tests, and how tests that are no longer failing are probably no longer effective. Overall it was a very entertaining talk and a rare opportunity to learn from someone who works on a piece of code that is probably used on billions of devices every day.
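The arithmetic behind the "one in a million years" remark is worth spelling out. The sketch below assumes exactly one billion devices and independent failures, both simplifications for illustration, and the function name is hypothetical.

```python
def expected_failures_per_day(devices, mean_years_between_failures):
    """Expected fleet-wide failures per day, assuming independent devices."""
    failures_per_device_per_year = 1.0 / mean_years_between_failures
    return devices * failures_per_device_per_year / 365.0


# A once-in-a-million-years bug on an assumed fleet of one billion devices.
rate = expected_failures_per_day(devices=1_000_000_000,
                                 mean_years_between_failures=1_000_000)
print(round(rate, 1))
```

That works out to roughly 1,000 failures a year across the fleet, or between two and three every day, which is why a bug that is negligible on one machine becomes a routine occurrence at that scale.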


I am going to wrap up this summary with a few thoughts on how we can apply what we learned from this summit. One thing that impressed me was the commitment to performance of the people developing the low level infrastructure - they go to extreme lengths to ensure that what we have at our disposal, as higher level developers or users, is of excellent quality. If we approach every new component that we develop for our applications with at least some regard for performance and optimization, if not necessarily the same obsession, we can end up with applications that behave well under stressful conditions.

Another aspect that is worth discussing is how ML has spread throughout the entire software industry - at this point, as a software engineer, you must know enough about ML to be able to recognise when a problem might require this type of solution.

And I would like to mention, before I finish, the importance of good development processes - Facebook probably started their continuous delivery and continuous performance regression testing before ML started to be used widely. The fact that they had the foresight to keep all that data now allows them to do something that would have been practically impossible without it: they are getting closer to identifying problematic code before it has even left the developer’s machine… interesting times ahead.
