vault app deployment guide

Type: Helm
Parent application: security
Source: /deployments/vault

Overview

The vault application is an installation of Vault used as the primary LSST secrets management tool. This installation currently provides secrets for all LSST environments and clusters. Eventually, we may install a separate production Vault at the LDF and use the Roundtable Vault only for non-LDF services.

See DMTN-112 for the LSST Vault design.

This application deploys the Vault server. The Vault Agent Injector is not enabled since we’re instead using the Vault Secrets Operator.

Vault is configured in HA mode with a public API endpoint accessible at vault.lsst.codes. TLS termination is done at the nginx ingress layer using a Let’s Encrypt server certificate. The UI is available at vault.lsst.codes.

To manipulate the secrets stored in this Vault instance, use lsstvaultutils.

Changing Vault configuration

When making configuration changes, be aware that Argo CD will not detect a change to the configuration in values.yaml and automatically relaunch the vault-* pods. You will need to delete the pods and let Kubernetes recreate them in order to pick up changes to the configuration.
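
For example, the pods can be relaunched one at a time to preserve availability (the pod names follow the StatefulSet created by the Vault Helm chart):

$ kubectl delete pod --namespace=vault vault-0

Wait for vault-0 to be recreated and unsealed before repeating for vault-1 and vault-2.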

If you change the number of HA replicas, Argo CD will fail to fully synchronize the configuration because the poddisruptionbudget resource cannot be updated. Argo CD will show an error saying that a resource failed to synchronize. To fix this problem, delete the poddisruptionbudget resource in Argo CD and then resynchronize the vault app; Argo CD will then be happy.
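
The resource can also be deleted from the command line before resynchronizing. The resource name below is an assumption based on the default name from the Vault Helm chart; confirm it first with kubectl get poddisruptionbudget --namespace=vault.

$ kubectl delete poddisruptionbudget --namespace=vault vault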

Seal configuration

A Vault database is “sealed” by encrypting the stored data with an encryption key, which in turn is encrypted with a master key. In a default Vault installation, the master key is then split with Shamir secret sharing and a quorum of key fragments is required to manually unseal the Vault each time the Vault server is restarted. This is a poor design for high availability and for Kubernetes management, so we instead use an “auto-unseal” configuration.

Auto-unsealing works by using a Google KMS key as the master key. That KMS key is stored internally by Google and cannot be retrieved or downloaded, but an application can request that data be encrypted or decrypted with that key. Vault has KMS decrypt the encryption key on startup and then uses that to unseal the Vault database. The Vault server uses a Google service account with permission on the relevant key ring to authenticate to KMS to perform this operation.
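
For reference, the seal stanza in the Vault server configuration embedded in values.yaml looks roughly like the following. The project name and credentials path are illustrative (the path assumes the vault-kms-creds secret is mounted by the Helm chart under /vault/userconfig/); the keyring and key names match the external configuration described below.

seal "gcpckms" {
  credentials = "/vault/userconfig/vault-kms-creds/credentials.json"
  project     = "example-gcp-project"
  region      = "global"
  key_ring    = "vault"
  crypto_key  = "vault-seal"
}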

In auto-unseal mode, there is still a manual key, but this key is called a “recovery key” and cannot be used to unseal the database. It is, however, still needed for certain operations, such as seal key migration.

The recovery key for the Vault is kept in 1Password.

Changing seal keys

It is possible to change the key used to seal Vault (if, for instance, Vault needs to be migrated to another GCP project), but it’s not well-documented and is moderately complicated. Here are the steps:

  1. Add disabled = "true" to the seal configuration stanza in values.yaml (see the sketch after this list).
  2. Change vault.server.ha.replicas to 1 in values.yaml.
  3. Push the changes and wait for Argo CD to sync and remove the other running Vault containers. Argo CD will complain about synchronizing the affinity configuration; this is harmless and can be ignored.
  4. Relaunch the vault-0 pod by deleting it and letting Kubernetes recreate it.
  5. Get the recovery key from 1Password.
  6. Run kubectl exec --namespace=vault -ti vault-0 -- vault operator unseal -migrate and enter the recovery key. This will disable auto-unseal and convert the unseal recovery key into a regular Shamir unseal key. Vault is no longer using the KMS key at this point.
  7. Change the KMS seal configuration stanza in values.yaml to point to the new KMS keyring and key that you want to use. Remove disabled = "true" from the seal configuration. Push this change and wait for Argo CD to synchronize it.
  8. Relaunch the vault-0 pod by deleting it and letting Kubernetes recreate it.
  9. Run kubectl exec --namespace=vault -ti vault-0 -- vault operator unseal -migrate and enter the recovery key. This will reseal Vault using the new KMS key and convert the unseal key you have been using back into a recovery key.
  10. Change vault.server.ha.replicas back to 3 in values.yaml, push, and let Argo CD start the remaining Vault pods.
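
As a sketch of step 1, temporarily disabling auto-unseal only requires adding the disabled setting to the existing seal stanza:

seal "gcpckms" {
  disabled    = "true"
  # existing credentials, project, region, key_ring, and crypto_key settings
}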

External configuration

This deployment is currently only tested on GCP. Vault requires some external resources and a Kubernetes secret to be created and configured before deployment (a command sketch follows the list).

  1. Create a GCP service account named vault-server and download credentials for that service account.
  2. Store those credentials as credentials.json in a Kubernetes secret in the vault namespace named vault-kms-creds. You may have to manually create the vault namespace first.
  3. Create a GCP KMS keyring in the global region named vault and a symmetric key named vault-seal.
  4. Grant the vault-server service account the Cloud KMS CryptoKey Encrypter/Decrypter role in IAM. This should be constrained to only the vault keyring, but I was unable to work out how to do that correctly in GCP.
  5. Create a GCS bucket named storage.vault.lsst.codes. (This will require domain validation in Google Webmaster Tools.) Configure this bucket in the permissions tab to use uniform bucket-level access (there is no meaningful reason to use per-object access for this application, and this configuration is simpler). Grant the vault-server service account the Storage Object Admin role on this bucket in the bucket permissions tab.
  6. Create a DNS entry for the new cluster pointing to the external IP of the nginx ingress for the Kubernetes cluster in which this application is deployed. We have vault.lsst.codes for the primary Vault installation and vault-1.lsst.codes and vault-2.lsst.codes for test installations. Update the list of hostnames configured in the Vault values.yaml file for the ingress to match the DNS entries that you want to use. The DNS entry has to match the hostname for cert-manager to be able to get TLS certificates for Vault.
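
As a rough command-line sketch of steps 1 through 5, assuming a project named example-gcp-project (substitute the real project; the domain-validation requirement for the bucket name still applies):

$ gcloud iam service-accounts create vault-server --project=example-gcp-project
$ gcloud iam service-accounts keys create credentials.json \
    --iam-account=vault-server@example-gcp-project.iam.gserviceaccount.com
$ kubectl create namespace vault
$ kubectl create secret generic vault-kms-creds --namespace=vault \
    --from-file=credentials.json
$ gcloud kms keyrings create vault --location=global --project=example-gcp-project
$ gcloud kms keys create vault-seal --keyring=vault --location=global \
    --purpose=encryption --project=example-gcp-project
$ gcloud projects add-iam-policy-binding example-gcp-project \
    --member=serviceAccount:vault-server@example-gcp-project.iam.gserviceaccount.com \
    --role=roles/cloudkms.cryptoKeyEncrypterDecrypter
$ gsutil mb -p example-gcp-project -b on gs://storage.vault.lsst.codes/
$ gsutil iam ch \
    serviceAccount:vault-server@example-gcp-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
    gs://storage.vault.lsst.codes/

This grants the KMS role at the project level, matching the current configuration; constraining it to only the vault keyring would be preferable if a working IAM setup for that can be found.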

When deploying Vault elsewhere, at least the storage bucket name will have to change because bucket names are globally unique in GCP. Note that the GCP project and region are also encoded in values.yaml if deploying elsewhere.
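
For reference, the corresponding storage stanza in values.yaml uses Vault's GCS storage backend and looks roughly like this:

storage "gcs" {
  bucket     = "storage.vault.lsst.codes"
  ha_enabled = "true"
}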

Backups

Google Cloud Storage promises a very high level of durability, so backups are primarily to protect against human errors, upgrade disasters, or dangerous actions by Vault administrators. The first line of protection is to set a lifecycle rule that deletes objects if there are more than three newer versions, and then enable object versioning in the storage.vault.lsst.codes bucket:

$ gsutil versioning set on gs://storage.vault.lsst.codes/

That will allow restoring a previous version of the Vault database.
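
A lifecycle policy along these lines can be applied with gsutil. The JSON follows the standard GCS lifecycle schema; check that the numNewerVersions threshold matches the intended retention before applying it.

$ cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 3}
    }
  ]
}
EOF
$ gsutil lifecycle set lifecycle.json gs://storage.vault.lsst.codes/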

To also protect against problems that weren’t caught immediately, and against human error such as deleting the volume, create a backup:

  1. Create a GCS bucket named backup.vault.lsst.codes. It's unclear what settings minimize cost for this bucket: Nearline or Coldline storage may be cheaper, or may not be given that the bucket is populated by a daily data transfer, and this is difficult to determine from the documentation. The current configuration uses a single-region bucket (in us-central1) with standard storage.

  2. Set a lifecycle rule to delete objects if there are more than twenty newer versions.

  3. Enable object versioning on that bucket:

    $ gsutil versioning set on gs://backup.vault.lsst.codes/
    
  4. Create a daily periodic transfer from storage.vault.lsst.codes to backup.vault.lsst.codes. The current configuration schedules this for 2am Pacific time (the time shown in the transfer configuration is in the local time zone). That time was picked to be in the middle of the night for US project staff.

Migrating Vault

If you want to migrate a Vault deployment from one GCP project and Kubernetes cluster to another, do the following:

  1. Create the external configuration required for the new Vault server in the new GCP project.
  2. Grant the new service account access to the KMS keyring and key used for unsealing in the old GCP project. This is necessary to be able to do a seal migration later. See this StackOverflow answer for how to grant access; a command sketch also follows this list.
  3. Copy the data from the old GCS bucket to the new GCS bucket using a GCS transfer.
  4. Configure the new vault to point to the KMS keyring and key in the old project.
  5. Perform a seal migration to switch from the old seal key in KMS in the old GCP project to the new seal key in the new GCP project.
  6. Change DNS so that the Vault server name (generally vault.lsst.codes) points to the new installation.
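
One way to do step 2 from the command line (old-gcp-project and new-gcp-project are placeholders for the real project names, and the key names match the external configuration described above):

$ gcloud kms keys add-iam-policy-binding vault-seal \
    --keyring=vault --location=global --project=old-gcp-project \
    --member=serviceAccount:vault-server@new-gcp-project.iam.gserviceaccount.com \
    --role=roles/cloudkms.cryptoKeyEncrypterDecrypter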