Changed 3 lines of code, saved 760 server hours per month

4 minute read

Act 1, where I write Java

A while back I helped a team that was building an Android application and a Java server. My main focus was the networking and container environment, but a hackathon gave me a chance to dig deeper into the app itself. I found that the server was spending an outsized amount of CPU time, roughly 0.5% of the total, just checking a single feature flag. Seeing an easy win, I added a cache that holds the flag’s value for five seconds, saving the company a tidy sum.
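
The actual change lived in the Java server, but the idea is simple enough to sketch in the same translated-to-JavaScript style I use for the e2e snippets below. This is only a sketch: fetchFlagFromService and the flag name are stand-ins for whatever evaluates the flag for real.

// Re-use the last flag value for up to five seconds instead of
// re-evaluating it on every request.
const FLAG_TTL_MS = 5000;

let cachedValue = null; // last value we saw
let cachedAt = 0;       // when we saw it, in milliseconds since the epoch

async function isFeatureEnabled() {
    const now = Date.now();
    if (cachedValue === null || now - cachedAt > FLAG_TTL_MS) {
        // Hypothetical call to whatever actually evaluates the flag.
        cachedValue = await fetchFlagFromService("some_feature");
        cachedAt = now;
    }
    return cachedValue;
}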

During the development process, I observed that every update I made to my pull request (PR) required over 20 minutes for validation. Upon further investigation of the PR signals, I discovered that the end-to-end (e2e) signal was the clear outlier, consistently taking nearly 20 minutes, while all the other signals took less than two minutes. After the hackathon, I revisited this issue to identify the root cause.

Act 2, with big plans for great things

I consulted with the engineer responsible for our e2e infrastructure to inquire about the validation delays I had been experiencing. In response, they confirmed that the entire system was subpar, and that there were plans to restructure it in the coming months. They said the following was missing:

  1. Utilizing previously-built or partially-built artifacts
  2. Building production rather than debug code (including minified assets and no symbols)
  3. Connecting to a common log analysis framework to identify specific issues

While these sounded good to me, I couldn’t resist taking a closer look at the code myself. What I found there shocked me.

Act 3, where I spot a problem

The e2e code was hard to read (a non-standard PHP dialect sprinkled with JavaScript), but I zeroed in on one piece that looked odd to me. Translated to JavaScript, it looked a bit like this:

async function e2eTest(commit) {
    checkOut(commit);
    // ...
    // Build the server and wait for it to finish...
    await runProcess("server/build.sh");
    // ...
    // ...only then start building the client APK.
    const artifact = await androidBuilder.buildApk(commit, "client");
    // ...
    // Start the freshly built server and run the Android tests against it.
    const serverProcess = runProcess("java -jar server/server.jar localhost:8000");
    // Magic to ensure server is running
    await androidTestRunner.run(artifact, "localhost:8000");
    serverProcess.kill();
}

If you feel annoyed, it means you can see what I saw:
We’re waiting for the server build to finish before even starting the client build.

The ordering was presumably there so we wouldn’t waste time building the APK when the server build is broken, but the client build doesn’t depend on the server build at all, so running them back to back just adds the whole server build time to every run. I confirmed this and took it to the owner of the e2e process. They seemed indifferent and casually mentioned that a new build system, planned for next half, would parallelize the build steps automatically. Once I understood that nothing was happening about it right now, I went to work.

Act 4, where cut and paste is used

Using my limited knowledge of parallelism in our e2e codebase, I rewrote the above to something like:

async function e2eTest(commit) {
    checkOut(commit);
    // ...
    // Kick off the server and client builds together and wait for both.
    const [, artifact] = await Promise.all([
        runProcess("server/build.sh"),
        androidBuilder.buildApk(commit, "client"),
    ]);
    // ...
    const serverProcess = runProcess("java -jar server/server.jar localhost:8000");
    // Magic to ensure server is running
    await androidTestRunner.run(artifact, "localhost:8000");
    serverProcess.kill();
}

With this change in place, the server and client build in parallel, and we wait for both to finish before kicking off the tests. Testing and deploying it was quick and painless, since it only rearranged how a few calls are executed in one spot.

To our delight, this minor change cut the e2e signal’s run time from 20 minutes to 11, which meant engineers spent far less time waiting for their PRs to be ready. The team appreciated that, but the savings on the build machines were easier to measure.
By shortening each e2e run, we reduced the worker time our build system consumed. Going through the build system’s utilization logs, I found that this simple change eliminated 760 hours of worker time per month, which is a substantial amount.
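
As a back-of-envelope check (my own assumptions, not the exact log numbers): at roughly nine minutes saved per run, 760 hours a month works out to something like five thousand e2e runs, assuming each run ties up one build worker for its full duration.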

Epilogue, where big plans come crashing down

When I told my colleagues about the e2e improvement, the e2e specialist was not thrilled. Their six-month plan to overhaul the process was now harder to justify, because the expected gains had shrunk: they had been aiming for a 7-minute e2e suite, a big improvement over 20 minutes, but much less impressive next to the new 11. After talking it over, we agreed that since the overhaul was now less attractive, the effort would be better spent on areas with bigger potential wins. For instance, other parts of the codebase probably have similarly wasteful feature-flag checks burning CPU time, and cleaning those up could save a considerable amount of money.

I ended up moving to another team shortly after, but I will forever remember how my 3-line change saved our company a whole worker a month.