“Can you paint this apple orange?”
One of my gigs was about migrating a workload from one hyperscaler to another.
I was paired with a team lead who was chronically busy, which was why they couldn’t attend to this project themselves.
Managed Kubernetes on both ends meant I didn’t have to worry about images and server provisioning. Easy in and out, I figured.
The problem started when we discussed how to compare performance. The original setup ran 2 pods per node on a specific server type.
The providers don’t offer the same machine types, and comparing the machines themselves is a bit apples to oranges, so I figured I’d check the existing resource requests/limits and carry them over to the new provider, letting each cluster’s scheduler find machines that fit. I got a surprising pushback on this.
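For the record, “carrying over the requests/limits” just means copying the container’s resource block verbatim and letting the new scheduler place it; a sketch with made-up numbers, not the client’s actual values:

    # Illustrative only: the same requests/limits travel across providers
    # unchanged; each provider's scheduler then picks nodes that fit them.
    resources:
      requests:
        cpu: "2"        # made-up figures
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi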
“We need 2 per server”
The new provider’s k8s scheduler decided to pack 5 pods per server, having naturally picked a bigger machine type than the original one. This performed poorly - the pods were choking and failing their health checks. When I informed the tech people and started investigating, the lead told me “obviously they’re failing, they’re contending with 4 pods instead of 1!”, and was then needed elsewhere. I figured the battle-hardened lead had some wisdom hidden in that statement, and assumed there was a mystery constrained resource: network bandwidth, some sysctl limitation, open socket count.
The investigation
The service was very simple. No state, gets some HTTP requests and sends some HTTP requests upstream in response. Very microservice-y.
My setup was straightforward: I observed some real requests and synthesized my own. Then I wrote a small Python script (later a Kubernetes pod) that spams a given target with n requests per second. I put a spammer deployment and a victim pod in the same cluster (so the external network couldn’t become the bottleneck) and sent varying amounts of traffic to see when the pod breaks, and why.
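The core of the spammer looked something like this (a sketch, not the original script; aiohttp, the target URL, and the numbers are my stand-ins):

    # Sketch of the load generator, not the original script.
    # TARGET, RPS, and DURATION are placeholders.
    import asyncio
    import time

    import aiohttp

    TARGET = "http://victim:8080/endpoint"  # hypothetical in-cluster service
    RPS = 200        # requests per second to generate
    DURATION = 60    # seconds to keep firing

    async def fire(session: aiohttp.ClientSession, stats: dict) -> None:
        try:
            async with session.get(TARGET) as resp:
                stats[resp.status] = stats.get(resp.status, 0) + 1
        except Exception:
            # timeouts and resets are the interesting data points here
            stats["error"] = stats.get("error", 0) + 1

    async def main() -> None:
        stats: dict = {}
        timeout = aiohttp.ClientTimeout(total=5)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            tasks = []
            start = time.monotonic()
            while time.monotonic() - start < DURATION:
                # open loop: keep firing at a fixed rate even if earlier
                # requests haven't finished, so a choking server can't
                # slow the generator down and mask its own failure
                tasks.append(asyncio.create_task(fire(session, stats)))
                await asyncio.sleep(1 / RPS)
            await asyncio.gather(*tasks)
        print(stats)

    if __name__ == "__main__":
        asyncio.run(main())

The open-loop pacing matters: a client that waits for responses slows down together with a struggling server, and can make a drowning service look merely slow.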
After several hours, I confidently located the constrained external resource.
It was “nothing”.
Turns out the process was composed of several queues and workers and producers and consumers, and naturally the flow wasn’t homogeneous; some queues were more prone to getting full, as queues do.
When one of the queues filled up and the inserting code blocked, the traffic jam grew until it reached the part that consumes HTTP requests. Since the HTTP listener was unbounded (why? unclear; maybe “for performance”), the server would take in more and more requests until all it could do was accept the TCP handshake and hang, at which point it failed its health check, got terminated, and took all of the in-flight requests with it.
Some time was spent tweaking queue sizes, playing with configs or adding new ones, while trying to avoid completely rewriting the app.
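The principle behind those tweaks, sketched in Python (an assumed shape with made-up names and sizes, not the client’s code): bound the intake, and when the queue is full, say so immediately instead of accepting work you’ll never finish.

    # Sketch of a bounded intake with load shedding; names and sizes
    # are illustrative, not the client's actual code.
    import asyncio

    from aiohttp import web

    QUEUE_SIZE = 100  # made-up bound; tune against worker throughput

    async def worker(queue: asyncio.Queue) -> None:
        # Stands in for the downstream pipeline (queues, upstream HTTP calls).
        while True:
            _request, done = await queue.get()
            await asyncio.sleep(0.05)  # pretend to do the upstream work
            done.set_result(web.Response(text="ok"))
            queue.task_done()

    async def handle(request: web.Request) -> web.Response:
        queue: asyncio.Queue = request.app["queue"]
        done = asyncio.get_running_loop().create_future()
        try:
            # the whole point: never block on a full queue at the front door
            queue.put_nowait((request, done))
        except asyncio.QueueFull:
            # a fast, honest 503 lets callers back off or retry elsewhere;
            # a hung accept eventually takes every in-flight request with it
            return web.Response(status=503, text="overloaded")
        return await done

    async def init() -> web.Application:
        app = web.Application()
        app["queue"] = asyncio.Queue(maxsize=QUEUE_SIZE)
        # keep a reference so the worker task isn't garbage collected
        app["worker"] = asyncio.get_running_loop().create_task(worker(app["queue"]))
        app.router.add_get("/", handle)
        return app

    if __name__ == "__main__":
        web.run_app(init())

A fast 503 under overload isn’t pretty, but it keeps the pod alive and sheds exactly the traffic it can’t serve, instead of dying with everything in flight.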
Where 200 pods had been needed because each process was choking itself, I reached a setup where 10 pods covered peak traffic, so let’s say 20 overall with headroom. Since pod count is not an interesting metric by itself, put it in money: a ~$7K monthly spend could be reduced to ~$1K.
Not bad, I figured.
“It’s not the same”
The lead could not be convinced that 20 pods were enough. “What about resiliency?” they said, worried that a crashing server would take 5% of the in-flight requests with it, much more dangerous than the 0.5% in a 200-pod configuration. To my suggestion of making the pods smaller and stacking them up (wasteful, but their money), they responded with “no, that’d be too small”.
This started a game of trying to make the apples look as orange-ish as possible, despite them growing from the wrong tree. We kept the pod count, costing the company more money, but looking the same as the legacy configuration if you squint hard enough.
I tried explaining that comparing servers, or pods, or process counts is apples to oranges. That it’s not a metal appliance we’re sticking in another datacenter; it’s a process. Look at it as something that costs money and outputs RPS and stability. It doesn’t matter how many copies there are (beyond, say, 3, for resiliency), and it doesn’t matter what hardware or what size. If it runs on a pumpkin but answers all of my queries, I’ll take it.
Tangent: Autoscaling
As a cost-cutting measure, the original cluster had autoscaling. A predictive metric indicating user load was used to preemptively scale the cluster between 100 and 200 pods. Since the original one had it, the new one had to as well, despite it being between 10x and 20x overprovisioned.
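In Kubernetes terms, that setup is roughly an HPA driven by an external metric; a sketch with hypothetical names and thresholds, not their actual config:

    # Roughly the shape of their autoscaler; the names, the metric, and the
    # target value are assumptions. Only the 100-200 replica range is real.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: service-hpa              # hypothetical
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: the-service            # hypothetical
      minReplicas: 100
      maxReplicas: 200
      metrics:
        - type: External
          external:
            metric:
              name: predicted_user_load   # assumed name for the predictive metric
            target:
              type: AverageValue
              averageValue: "50"          # made-up threshold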
I have to say that I strongly dislike autoscaling. Configuring it properly is hard, and it’s a big moving part that depends on several other components working correctly. I’ve found it’s often used to paper over not having sized your service properly (“what do I care how many RPS a server can handle? we can add as many as we need”), so I instead default to overprovisioning (e.g. 150% of peak traffic) unless the cost is prohibitive.
End of story
I managed to convince the tech lead that some of the changes I made were good, or at least not bad, and got them merged. The migration has been completed, and the new cluster is still overprovisioned, with autoscaling. I constantly remind myself that it’s not my circus and not my monkeys. After some further observation, I think the 2-pods-per-server insistence is a symptom of a team that grew organically without ever having an outsider review its operating procedures. Flawed-but-working methodology became “how things are done here.” Anyway, if they want to reduce their cost by ~$6K/mo, I have just the thing :)