vault app deployment guide

Deployment | /deployments/vault
Type | Helm
Parent app | security
Overview
The vault application is an installation of Vault used as the primary LSST secrets management tool.
This installation currently provides secrets for all LSST environments and clusters.
Eventually, we may install a separate production Vault at the LDF and use the Roundtable Vault only for non-LDF services.
See DMTN-112 for the LSST Vault design.
This application deploys the Vault server. The Vault Agent Injector is not enabled since we’re instead using the Vault Secrets Operator.
Vault is configured in HA mode with a public API endpoint accessible at vault.lsst.codes. TLS termination is done at the nginx ingress layer using a Let’s Encrypt server certificate. The UI is available at vault.lsst.codes.
To manipulate the secrets stored in this Vault instance, use lsstvaultutils.
Upgrading Vault
Unlike many applications managed by Argo CD, upgrading Vault requires manual intervention.
The Vault Helm chart is monitored by WhiteSource Renovate and can be upgraded via the normal pull requests that tool creates. Those updates change only the chart, not the version of the Vault software itself. After a chart update is merged, Argo CD will get confused about the upgrade status of Vault, and fixing this requires the manual intervention explained below. It's good practice to check whether Vault itself has an update each time the Vault Helm chart is updated.
To upgrade Vault itself, first change the pinned version in the server.image.tag setting in /deployments/vault/values.yaml. Then, after that change is merged and Argo CD has applied the changes in Kubernetes, follow the Vault upgrade instructions. You can skip the helm upgrade step, since Argo CD has already made the equivalent change. Where those instructions say to delete a pod, deleting it through Argo CD works correctly; you don't need to resort to kubectl.
Kubernetes will not automatically upgrade the servers because the Vault StatefulSet uses OnDelete as the update mechanism.
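To double-check the state of the rollout from the command line, the commands below show the configured update strategy and the image each pod is currently running. This is a sketch: the StatefulSet name vault is an assumption based on the upstream Vault Helm chart, so adjust it if your release is named differently.
# Confirm the StatefulSet uses the OnDelete update strategy (name assumed to be "vault")
$ kubectl get statefulset vault --namespace=vault -o jsonpath='{.spec.updateStrategy.type}'
# Show which image each pod in the vault namespace is running; after the upgrade,
# the Vault pods should all match the new server.image.tag
$ kubectl get pods --namespace=vault -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'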
After the upgrade is complete, you will notice that Argo CD thinks a deployment is in progress and will show the application as Progressing rather than Healthy. This is due to an Argo CD bug. To work around that bug:
- Open the Argo CD dashboard and locate the StatefulSet. You will see two or more ControllerRevision resources with increasing versions, and three Pod resources. Check all the Pod resources and confirm they're running the latest revision. (This should be the end result of following the above-linked upgrade instructions.)
- Delete the unused ControllerRevision objects using Argo CD, leaving only the latest, which matches the revision the pods are running.
- Delete one of the stand-by pods and let Kubernetes recreate it. See the Vault upgrade instructions above for how to identify which pods are stand-by.
This should clear the erroneous Progressing status in Argo CD.
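If you prefer the command line to the dashboard for the inspection part of this workaround, the following commands (assuming the vault namespace and the standard controller-revision-hash label that the StatefulSet controller applies to its pods) list the revisions and show which one each pod is running. Deleting the stale ControllerRevision objects should still be done through Argo CD as described above.
# List the ControllerRevision objects in the vault namespace
$ kubectl get controllerrevisions --namespace=vault
# Show which revision each pod is running via its controller-revision-hash label
$ kubectl get pods --namespace=vault -L controller-revision-hash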
Changing Vault configuration
When making configuration changes, be aware that Argo CD will not detect a change to the configuration in values.yaml and automatically relaunch the vault-* pods. You will need to delete the pods and let Kubernetes recreate them in order to pick up changes to the configuration.
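For example, to restart the servers after a configuration change, delete the pods one at a time and wait for each to come back before deleting the next, so the service stays available (pod names follow the vault-0, vault-1, vault-2 pattern used elsewhere in this guide):
$ kubectl delete pod vault-0 --namespace=vault
$ kubectl get pods --namespace=vault --watch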
If you change the number of HA replicas, Argo CD will fail to fully synchronize the configuration because the PodDisruptionBudget resource cannot be updated; Argo CD will show an error saying that a resource failed to synchronize. To fix this problem, delete the PodDisruptionBudget resource in Argo CD and then resynchronize the vault app, after which Argo CD will be happy.
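The same fix can be done from the command line if that is more convenient. This is a sketch: the PodDisruptionBudget name depends on the Helm release, so list the resources first, and the argocd CLI must be logged in to the relevant Argo CD instance.
# Find and delete the stale PodDisruptionBudget
$ kubectl get poddisruptionbudgets --namespace=vault
$ kubectl delete poddisruptionbudget <name> --namespace=vault
# Resynchronize the application
$ argocd app sync vault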
Seal configuration
A Vault database is “sealed” by encrypting the stored data with an encryption key, which in turn is encrypted with a master key. In a default Vault installation, the master key is then split with Shamir secret sharing and a quorum of key fragments is required to manually unseal the Vault each time the Vault server is restarted. This is a poor design for high availability and for Kubernetes management, so we instead use an “auto-unseal” configuration.
Auto-unsealing works by using a Google KMS key as the master key. That KMS key is stored internally by Google and cannot be retrieved or downloaded, but an application can request that data be encrypted or decrypted with that key. Vault has KMS decrypt the encryption key on startup and then uses that to unseal the Vault database. The Vault server uses a Google service account with permission on the relevant key ring to authenticate to KMS to perform this operation.
In auto-unseal mode, there is still a manual key, but this key is called a “recovery key” and cannot be used to unseal the database. It is, however, still needed for certain operations, such as seal key migration.
The recovery key for the Vault is kept in 1Password.
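To confirm that auto-unseal is working on a running server, check the seal status. With the GCP KMS seal, vault status should report a seal type of gcpckms and list recovery keys rather than unseal keys:
$ kubectl exec --namespace=vault -ti vault-0 -- vault status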
Changing seal keys
It is possible to change the key used to seal Vault (if, for instance, Vault needs to be migrated to another GCP project), but it’s not well-documented and is moderately complicated. Here are the steps:
- Add disabled = "true" to the seal configuration stanza in values.yaml.
- Change vault.server.ha.replicas to 1 in values.yaml.
- Push the changes and wait for Argo CD to sync and remove the other running Vault containers. Argo CD will complain about synchronizing the affinity configuration; this is harmless and can be ignored.
- Relaunch the vault-0 pod by deleting it and letting Kubernetes recreate it.
- Get the recovery key from 1Password.
- Run kubectl exec --namespace=vault -ti vault-0 -- vault operator unseal -migrate and enter the recovery key. This will disable auto-unseal and convert the unseal recovery key to a regular unseal key using Shamir. Vault is no longer using the KMS key at this point.
- Change the KMS seal configuration stanza in values.yaml to point to the new KMS keyring and key that you want to use. Remove disabled = "true" from the seal configuration. Push this change and wait for Argo CD to synchronize it.
- Relaunch the vault-0 pod by deleting it and letting Kubernetes recreate it.
- Run kubectl exec --namespace=vault -ti vault-0 -- vault operator unseal -migrate and enter the recovery key. This will reseal Vault using the KMS key and convert the unseal key you have been using back to a recovery key.
- Change vault.server.ha.replicas back to 3 in values.yaml, push, and let Argo CD start the remaining Vault pods.
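After the final step, it's worth confirming that the migration took effect and that every replica is unsealed with the new key. Using the same kubectl exec pattern as above, each pod should report Sealed: false and the gcpckms seal type:
$ kubectl exec --namespace=vault -ti vault-0 -- vault status
$ kubectl exec --namespace=vault -ti vault-1 -- vault status
$ kubectl exec --namespace=vault -ti vault-2 -- vault status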
External configuration
This deployment is currently only tested on GCP. Vault requires some external resources and a Kubernetes secret to be created and configured before deployment.
- Create a GCP service account named vault-server and download credentials for that service account.
- Store those credentials as credentials.json in a Kubernetes secret named vault-kms-creds in the vault namespace. You may have to manually create the vault namespace first.
- Create a GCP KMS keyring in the global region named vault and a symmetric key named vault-seal.
- Grant the vault-server service account the Cloud KMS CryptoKey Encrypter/Decrypter role in IAM. This should be constrained to only the vault keyring, but I was unable to work out how to do that correctly in GCP.
- Create a GCS bucket named storage.vault.lsst.codes. (This will require domain validation in Google Webmaster Tools.) Configure this bucket in the permissions tab to use uniform bucket-level access (there is no meaningful reason to use per-object access for this application, and this configuration is simpler). Grant the vault-server service account the Storage Object Admin role on this bucket in the bucket permissions tab.
- Create a DNS entry for the new cluster pointing to the external IP of the nginx ingress for the Kubernetes cluster in which this application is deployed. We have vault.lsst.codes for the primary Vault installation and vault-1.lsst.codes and vault-2.lsst.codes for test installations. Update the list of hostnames configured in the Vault values.yaml file for the ingress to match the DNS entries that you want to use. The DNS entry has to match the hostname for cert-manager to be able to get TLS certificates for Vault.
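Most of this setup can also be scripted. The following is a sketch of the equivalent gcloud, kubectl, and gsutil commands, with example-project standing in for the real GCP project ID. The IAM binding is shown scoped to the key as one possible approach; the existing deployment may instead have the role granted more broadly in IAM. The DNS entry still has to be created in whatever service manages the domain.
# Service account and credentials
$ gcloud iam service-accounts create vault-server --project=example-project
$ gcloud iam service-accounts keys create credentials.json --iam-account=vault-server@example-project.iam.gserviceaccount.com
# Kubernetes namespace and secret holding the credentials
$ kubectl create namespace vault
$ kubectl create secret generic vault-kms-creds --namespace=vault --from-file=credentials.json
# KMS keyring and symmetric key used for auto-unseal
$ gcloud kms keyrings create vault --location=global --project=example-project
$ gcloud kms keys create vault-seal --keyring=vault --location=global --purpose=encryption --project=example-project
# Grant the service account the Encrypter/Decrypter role (here scoped to the key)
$ gcloud kms keys add-iam-policy-binding vault-seal --keyring=vault --location=global --project=example-project --member=serviceAccount:vault-server@example-project.iam.gserviceaccount.com --role=roles/cloudkms.cryptoKeyEncrypterDecrypter
# Storage bucket with uniform bucket-level access, plus Storage Object Admin for the service account
$ gsutil mb -p example-project -b on gs://storage.vault.lsst.codes/
$ gsutil iam ch serviceAccount:vault-server@example-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://storage.vault.lsst.codes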
When deploying Vault elsewhere, at least the storage bucket name will have to change because bucket names are globally unique in GCP.
Note that the GCP project and region are also encoded in values.yaml and will need to be changed if deploying elsewhere.
Backups
Google Cloud Storage promises a very high level of durability, so backups are primarily to protect against human errors, upgrade disasters, or dangerous actions by Vault administrators.
The first line of protection is to set a lifecycle rule that deletes objects if there are more than three newer versions, and then enable object versioning in the storage.vault.lsst.codes bucket:
$ gsutil versioning set on gs://storage.vault.lsst.codes/
That will allow restoring a previous version of the Vault database.
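The lifecycle rule can also be set with gsutil by writing it to a JSON file (the lifecycle.json name is arbitrary) and applying it to the bucket:
# Delete noncurrent versions once an object has more than three newer versions
$ cat > lifecycle.json <<'EOF'
{"rule": [{"action": {"type": "Delete"}, "condition": {"numNewerVersions": 3}}]}
EOF
$ gsutil lifecycle set lifecycle.json gs://storage.vault.lsst.codes/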
To also protect against problems that weren’t caught immediately, and against human error such as deleting the volume, create a backup:
- Create a GCS bucket named backup.vault.lsst.codes. It's unclear what settings to use for this to minimize cost. In particular, nearline or cold storage may be cheaper, or may not be given that backups are done by data transfer, and it's impossible to figure this out from the documentation. The current configuration uses a single-region bucket (in us-central1) with standard storage.
- Set a lifecycle rule to delete objects if there are more than twenty newer versions.
- Enable object versioning on that bucket:
  $ gsutil versioning set on gs://backup.vault.lsst.codes/
- Create a daily periodic transfer from storage.vault.lsst.codes to backup.vault.lsst.codes. The current configuration schedules this for 2am Pacific time. The time zone shown is the local time zone. That time was picked to be in the middle of the night for US project staff.
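The bucket and lifecycle steps can be sketched with gsutil as below (the project ID is a placeholder, and the storage-class trade-offs are the ones discussed above); the daily transfer itself is most easily created through the Storage Transfer Service page in the Cloud Console.
# Single-region bucket in us-central1 with standard storage and uniform bucket-level access
$ gsutil mb -p example-project -l us-central1 -c standard -b on gs://backup.vault.lsst.codes/
# Keep up to twenty noncurrent versions of each object
$ cat > backup-lifecycle.json <<'EOF'
{"rule": [{"action": {"type": "Delete"}, "condition": {"numNewerVersions": 20}}]}
EOF
$ gsutil lifecycle set backup-lifecycle.json gs://backup.vault.lsst.codes/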
Migrating Vault
If you want to migrate a Vault deployment from one GCP project and Kubernetes cluster to another, do the following:
- Create the external configuration required for the new Vault server in the new GCP project.
- Grant the new service account access to the KMS keyring and key used for unsealing in the old GCP project. This is necessary to be able to do a seal migration later. See this StackOverflow answer for how to grant access.
- Copy the data from the old GCS bucket to the new GCS bucket using a GCS transfer.
- Configure the new vault to point to the KMS keyring and key in the old project.
- Perform a seal migration to switch from the old seal key in KMS in the old GCP project to the new seal key in the new GCP project.
- Change DNS for the Vault server name (generally vault.lsst.codes) to point to the new installation.
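For the grant of KMS access to the new service account in the second step above, one way to do it with gcloud is sketched below; old-project and new-project stand in for the real project IDs, and the key and keyring names follow the external configuration section above.
# Allow the new project's Vault service account to use the old seal key during migration
$ gcloud kms keys add-iam-policy-binding vault-seal --keyring=vault --location=global --project=old-project --member=serviceAccount:vault-server@new-project.iam.gserviceaccount.com --role=roles/cloudkms.cryptoKeyEncrypterDecrypter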