Le château de Valère et le Haut de Cry en juillet 2022 Christian David, CC BY-SA 4.0, via Wikimedia Commons

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

Summary: this article shares the experience and learnings of migrating away from Kubernetes PodSecurityPolicy into Kyverno in the Wikimedia Toolforge platform.

Wikimedia Toolforge is a Platform-as-a-Service, built with Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc) in support of Wikimedia projects.

We provide a set of platform-specific services, command line interfaces, and shortcuts to help in the task of setting up webservices, jobs, and stuff like building container images, or using databases. Using these interfaces makes the underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some advanced users do directly interact with it.

Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in place to ensure performance, stability, and fairness of the system, including quotas, RBAC permissions, and up until recently PodSecurityPolicies (PSP). At the time of this writing, we had around 3.500 Toolforge tool accounts in the system. We early adopted PSP in 2019 as a way to make sure Pods had the correct runtime configuration. We needed Pods to stay within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the option to use 3rd party agents, like OpenPolicyAgent Gatekeeper, but we decided not to invest in them, and went with a native, built-in mechanism instead.

In 2021 it was announced that the PSP mechanism would be deprecated, and removed in Kubernetes 1.25. Even though we had been warned years in advance, we did not prioritize the migration of PSP until we were in Kubernetes 1.24, and blocked, unable to upgrade forward without taking actions.

The WMCS team explored different alternatives for this migration, but eventually we decided to go with Kyverno as a replacement for PSP. And so with that decision it began the journey described in this blog post.

First, we needed a source code refactor for one of the key components of our Toolforge Kubernetes: maintain-kubeusers. This custom piece of software that we built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on Kubernetes to accommodate each one: create namespace, RBAC, quota, a kubeconfig file, etc. With the refactor, we introduced a proper reconciliation loop, in a way that the software would have a notion of what needs to be done for each account, what would be missing, what to delete, upgrade, and so on. This would allow us to easily deploy new resources for each account, or iterate on their definitions.

The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers was doing more filesystem interaction than the previous version, resulting in a slow reconciliation loop over all the accounts. We used NFS as the underlying storage system for Toolforge, and it could be very slow because of reasons beyond this blog post. This was corrected in the next few days after the initial refactor rollout. A side note with an implementation detail: we stored a configmap on each account namespace with the state of each resource. Storing more state on this configmap was our solution to avoid additional NFS latency.

I initially estimated this refactor would take me a week to complete, but unfortunately it took me around three weeks instead. Previous to the refactor, there were several manual steps and cleanups required to be done when updating the definition of a resource. The process is now automated, more robust, performant, efficient and clean. So in my opinion it was worth it, even if it took more time than expected.

Then, we worked on the Kyverno policies themselves. Because we had a very particular PSP setting, in order to ease the transition, we tried to replicate their semantics on a 1:1 basis as much as possible. This involved things like transparent mutation of Pod resources, then validation. Additionally, we had one different PSP definition for each account, so we decided to create one different Kyverno namespaced policy resource for each account namespace — remember, we had 3.5k accounts.

We created a Kyverno policy template that we would then render and inject for each account.

For developing and testing all this, maintain-kubeusers and the Kyverno bits, we had a project called lima-kilo, which was a local Kubernetes setup replicating production Toolforge. This was used by each engineer in their laptop as a common development environment.

We had planned the migration from PSP to Kyverno policies in stages, like this:

  1. update our internal template generators to make Pod security settings explicit
  2. introduce Kyverno policies in Audit mode
  3. see how the cluster would behave with them, and if we had any offending resources reported by the new policies, and correct them
  4. modify Kyverno policies and set them in Enforce mode
  5. drop PSP

In stage 1, we updated things like the toolforge-jobs-framework and tools-webservice.

In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately. Surprise. All the monitoring went red, the Kubernetes apiserver became irresponsibe, and we were unable to perform any administrative actions in the Kubernetes control plane, or even the underlying virtual machines. All Toolforge users were impacted. This was a full scale outage that required the energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred.

This incident happened despite having tested before in lima-kilo and in another pre-production cluster we had, called Toolsbeta. But we had not tested that many policy resources. Clearly, this was something scale-related. After the incident, I went on and created 3.5k Kyverno policy resources on lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a few errors in our infrastructure, reached out to the Kyverno upstream developers, asking for advice, and at the end we did the following to accommodate the setup to our needs:

  • corrected the external HAproxy kubernetes apiserver health checks, from checking just for open TCP ports, to actually checking the /healthz HTTP endpoint, which more accurately reflected the health of each k8s apiserver.
  • having a more realistic development environment. In lima-kilo, we created a couple of helper scripts to create/delete 4000 policy resources, each on a different namespace.
  • greatly over-provisioned memory in the Kubernetes control plane servers. This is, bigger memory in the base virtual machine hosting the control plane. Scaling the memory headroom of the apiserver would prevent it from running out of memory, and therefore crashing the whole system. We went from 8GB RAM per virtual machine to 32GB. In our cluster, a single apiserver pod could eat 7GB of memory on a normal day, so having 8GB on the base virtual machine was clearly not enough. I also sent a patch proposal to Kyverno upstream documentation suggesting they clarify the additional memory pressure on the apiserver.
  • corrected resource requests and limits of Kyverno, to more accurately describe our actual usage.
  • increased the number of replicas of the Kyverno admission controller to 7, so admission requests could be handled more timely by Kyverno.

I have to admit, I was briefly tempted to drop Kyverno, and even stop pursuing using an external policy agent entirely, and write our own custom admission controller out of concerns over performance of this architecture. However, after applying all the measures listed above, the system became very stable, so we decided to move forward. The second attempt at deploying it all went through just fine. No outage this time 🙂

When we were in stage 4 we detected another bug. We had been following the Kubernetes upstream documentation for setting securityContext to the right values. In particular, we were enforcing the procMount to be set to the default value, which per the docs it was ‘DefaultProcMount’. However, that string is the name of the internal variable in the source code, whereas the actual default value is the string ‘Default’. This caused pods to be rightfully rejected by Kyverno while we figured the problem. I sent a patch upstream to fix this problem.

We finally had everything in place, reached stage 5, and we were able to disable PSP. We unloaded the PSP controller from the kubernetes apiserver, and deleted every individual PSP definition. Everything was very smooth in this last step of the migration.

This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages took roughly three months to complete.

For me there are a number of valuable reasons to learn from this project. For one, the scale is something to consider, and test, when evaluating a new architecture or software component. Not doing so can lead to service outages, or unexpectedly poor performances. This is in the first chapter of the SRE handbook, but we got a reminder the hard way 🙂

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.