Production hardening

While the tezos-on-gke project lets you quickly deploy a cluster with a running baker setup, running a production baker requires careful planning and execution.

Terraform service account

Using Google Application Default Credentials is not recommended in a production environment.

Instead:

  • ensure that you have set up an Organization; this can be done by registering a domain name and adding it to Google Cloud
  • create a Terraform Admin Project, Terraform Service Account and Service Account Credentials following this Google guide
  • do not pass project as a variable when deploying the resources. Instead, pass org_id and billing_account as variables
  • pass the path to the credentials JSON file of the service account terraform@${TF_ADMIN}.iam.gserviceaccount.com as the terraform_service_account_credentials Terraform variable

That will create the cluster in a new project, created by the terraform service account.
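
As a minimal sketch of the variables listed above, a terraform.tfvars file could look like the following. All values are placeholders. The variable names follow the module arguments used in the cluster definition further down in this document; adjust them if your version of the module names them differently.

```hcl
# terraform.tfvars (placeholder values, replace with your own)
# Note: no "project" entry; the project is created by the Terraform service account.
org_id                                = "<my org id>"
billing_account                       = "<my billing account>"
terraform_service_account_credentials = "~/.config/gcloud/terraform-service-account-credentials.json"
```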

You may then grant people in your organization access to the project. It is recommended to write more terraform manifests to do so.
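
A minimal sketch of such a manifest, assuming the Google provider's google_project_iam_member resource. The variable name, the role and the e-mail address are illustrative assumptions, not values taken from this project.

```hcl
# Hypothetical variable: ID of the project created by the Terraform service account.
variable "project_id" {
  type = string
}

# Grants one operator read-only access to the Kubernetes Engine resources of the project.
# The role and the member below are examples only; choose what fits your organization.
resource "google_project_iam_member" "operator_viewer" {
  project = var.project_id
  role    = "roles/container.viewer"
  member  = "user:operator@example.com"
}
```

Keeping access grants in terraform means they are reviewed and versioned like the rest of the setup.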

Separate cluster definition from baker definition

While you can create a baker in one shot, that approach is best suited for demos and testnets. A production baker should be defined declaratively.

You would normally write all the parameters defining your baker in a terraform.tfvars file on your laptop.

Instead, it is recommended to create and maintain a private terraform manifest that declares a cluster and every deployment that lives within it. This way, the parameters defining your cluster can also be committed to git (except secrets, which should be handled separately; more on this below).

This keeps the setup maintainable by letting you define several deployments within the same cluster. For example, you may deploy the Tezos baker setup, and the Tezos monitoring setup, within the same cluster.

Define the cluster

The terraform-gke-blockchain repository contains boilerplate terraform code to deploy a kubernetes cluster.

Start by declaring one empty cluster and one terraform provider:

module "terraform-gke-blockchain" {
  source                                = "github.com/midl-dev/terraform-gke-blockchain?ref=v2.0"
  org_id                                = "<my org id, defined above>"
  billing_account                       = "<my billing account, defined above>"
  project_prefix                        = "mybakingop"
  monitoring_slack_url                  = var.monitoring_slack_url
  terraform_service_account_credentials = "~/.config/gcloud/terraform-service-account-credentials.json"
  node_pools = { "baking_pool" : { "node_count": 1, "instance_type": "e2-standard-2" },
    "monitoring_pool" : { "node_count": 1, "instance_type": "e2-standard-1" } }
}

# Exposes the access token of the Google credentials in use
data "google_client_config" "current" {
}

# This file contains all the interactions with Kubernetes
provider "kubernetes" {
  host                   = module.terraform-gke-blockchain.kubernetes_endpoint
  cluster_ca_certificate = module.terraform-gke-blockchain.cluster_ca_certificate
  token                  = data.google_client_config.current.access_token
}

Notice that we created two node pools. Each pool is a distinct set of virtual machines in your kubernetes cluster, and you can schedule your pods on either one. We will use them to separate the baker setup from the payout/monitoring setup.

Define the tezos baker

Within the tezos-on-gke repository, the terraform-no-cluster-create folder will deploy the baker on a pre-existing cluster.

The output parameters of the terraform-gke-blockchain module become the input parameters of the Tezos baker module.

All variables will appear in the terraform manifest itself, except secrets. Secrets should be kept as variables, and handled appropriately.

It looks like:

module "tezos-baker" {
  source                          = "github.com/midl-dev/tezos-on-gke//terraform-no-cluster-create?ref=v2.0"
  region                          = module.terraform-gke-blockchain.location
  node_locations                  = module.terraform-gke-blockchain.node_locations
  kubernetes_endpoint             = module.terraform-gke-blockchain.kubernetes_endpoint
  cluster_ca_certificate          = module.terraform-gke-blockchain.cluster_ca_certificate
  cluster_name                    = module.terraform-gke-blockchain.name
  kubernetes_access_token         = data.google_client_config.current.access_token
  kubernetes_pool_name            = "baking_pool"
  project                         = module.terraform-gke-blockchain.project
  full_snapshot_url               = "https://mainnet.xtz-shots.io/full"
  rolling_snapshot_url            = "https://mainnet.xtz-shots.io/rolling"
  kubernetes_namespace            = "tezos"
  kubernetes_name_prefix          = "xtz"
  tezos_version                   = "v9.2"
  tezos_network                   = "mainnet"
  baking_nodes = {
    "mybaker" : {
      "mynode" : {
        "public_baking_key_hash": "tz1YmsrYxQFJo5nGj4MEaXMPdLrcRf2a5mAU",
        "public_baking_key": "edpk...",
        "insecure_private_baking_key": "edsk3cftTNcJnxb7ehCxYeCaKPT7mjycdMxgFisLixrQ9bZuTG2yZK"
      }
    }
  }
}

With remote signer

It is recommended to use a remote signer for secure operations.

Below is an example of baker with remote signer configured:

module "tezos-baker" {
  source                          = "github.com/midl-dev/tezos-on-gke//terraform-no-cluster-create?ref=v2.0"
  region                          = module.terraform-gke-blockchain.location
  node_locations                  = module.terraform-gke-blockchain.node_locations
  kubernetes_endpoint             = module.terraform-gke-blockchain.kubernetes_endpoint
  cluster_ca_certificate          = module.terraform-gke-blockchain.cluster_ca_certificate
  cluster_name                    = module.terraform-gke-blockchain.name
  kubernetes_access_token         = data.google_client_config.current.access_token
  kubernetes_pool_name            = "baking_pool"
  project                         = module.terraform-gke-blockchain.project
  full_snapshot_url               = "https://mainnet.xtz-shots.io/full"
  rolling_snapshot_url            = "https://mainnet.xtz-shots.io/rolling"
  kubernetes_namespace            = "tezos"
  kubernetes_name_prefix          = "xtz"
  tezos_version                   = "v9.2"
  tezos_network                   = "mainnet"
  signer_target_host_key          = var.signer_target_host_key
  baking_nodes = {
    "mybaker" : {
      "mynode" : {
        "public_baking_key_hash": "tz1YmsrYxQFJo5nGj4MEaXMPdLrcRf2a5mAU",
        "public_baking_key": "edpk...",
        "ledger_authorized_path": "ledger://my-four-key-words/ed25519/0h/1h",
        "authorized_signers": [
          {
            "ssh_pubkey": "ssh-rsa AAAAB<snip>==",
            "signer_port": 8443,
            "tunnel_endpoint_port": 51756
          }
        ]
      }
    }
  }
}

With payout config

The baking_nodes section also accepts a config for TRD payouts. See the TRD payouts section for details.

Terraform remote state

Terraform stores its state in a local file by default. Accidentally losing this file makes your setup unmaintainable. It is therefore good practice to store the state in a remote Storage Bucket.

The best location for this storage bucket is the Terraform Admin project created above.

In the private terraform file, add the following:

terraform {
  backend "gcs" {
    bucket  = "terraform-state-midl-prod"
    prefix  = "terraform/state"
  }
}

The state will now be stored remotely.

More info in the Terraform documentation.
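
The bucket has to exist before terraform init can use it as a backend, so it is typically created once, outside of the manifest that uses it. A minimal sketch of such a bootstrap manifest, assuming the Google provider's google_storage_bucket resource; the variable name and the location are assumptions.

```hcl
# Hypothetical variable: ID of the Terraform Admin project.
variable "terraform_admin_project" {
  type = string
}

resource "google_storage_bucket" "terraform_state" {
  # Bucket names are globally unique; pick your own.
  name     = "terraform-state-midl-prod"
  project  = var.terraform_admin_project
  location = "US" # assumed location

  uniform_bucket_level_access = true

  # Keep previous versions of the state so an accidental overwrite can be recovered.
  versioning {
    enabled = true
  }
}
```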

Putting it all together

This terraform manifest deploys a full Tezos baker.

It creates a cluster with two node pools: one for the baker and one for the remaining containers.

It deploys a baker and an auxiliary cluster handling the payouts, external monitoring and website.

terraform {
  backend "gcs" {
    bucket  = "terraform-state-midl-prod"
    prefix  = "terraform/state"
  }
}

data "google_client_config" "current" {
}

module "terraform-gke-blockchain" {
  source                                = "github.com/midl-dev/terraform-gke-blockchain?ref=v1.0"
  org_id                                = "<my org id, defined above>"
  billing_account                       = "<my billing account, defined above>"
  project_prefix                        = "mybakingop"
  monitoring_slack_url                  = var.monitoring_slack_url
  terraform_service_account_credentials = "~/.config/gcloud/terraform-service-account-credentials.json"
  node_pools = { "baking_pool" : { "node_count": 1, "instance_type": "e2-standard-2" },
    "monitoring_pool" : { "node_count": 1, "instance_type": "e2-standard-1" } }
}

module "tezos-baker" {
  source                          = "github.com/midl-dev/tezos-on-gke//terraform-no-cluster-create?ref=v2.0"
  region                          = module.terraform-gke-blockchain.location
  node_locations                  = module.terraform-gke-blockchain.node_locations
  kubernetes_endpoint             = module.terraform-gke-blockchain.kubernetes_endpoint
  cluster_ca_certificate          = module.terraform-gke-blockchain.cluster_ca_certificate
  cluster_name                    = module.terraform-gke-blockchain.name
  kubernetes_access_token         = data.google_client_config.current.access_token
  project                         = module.terraform-gke-blockchain.project
  kubernetes_pool_name            = "baking_pool"
  kubernetes_namespace            = "tezos"
  kubernetes_name_prefix          = "xtz"
  full_snapshot_url               = "https://mainnet.xtz-shots.io/full"
  rolling_snapshot_url            = "https://mainnet.xtz-shots.io/rolling"
  tezos_version                   = "v9.2"
  tezos_network                   = "mainnet"
  signer_target_host_key          = var.signer_target_host_key
  baking_nodes = {
    "mynode" : {
      "mybaker" : {
        "public_baking_key_hash": "tz1YmsrYxQFJo5nGj4MEaXMPdLrcRf2a5mAU",
        "public_baking_key": "edpk...",
        "ledger_authorized_path": "ledger://my-four-key-words/ed25519/0h/1h",
        "authorized_signers": [
          {
            "ssh_pubkey": "ssh-rsa AAAAB<snip>==",
            "signer_port": 8443,
            "tunnel_endpoint_port": 51756
          }
        ],
        "payout_config": {
          "schedule": "06 */3 * * *",
          "initial_cycle": 370,
          "release_override": -5,
          "network": "MAINNET",
          "reward_data_provider": "tzkt",
          "dry_run": "false",
          "payment_address": "tz1",
          "rewards_type": "actual",
          "service_fee": 5,
          "rules_map": {}
        }
      }
    }
  }
}

module "tezos-mainnet-monitoring" {
  source                          = "github.com/midl-dev/tezos-auxiliary-cluster//terraform-no-cluster-create?ref=v2.0"
  region                          = module.terraform-gke-blockchain.location
  kubernetes_endpoint             = module.terraform-gke-blockchain.kubernetes_endpoint
  cluster_ca_certificate          = module.terraform-gke-blockchain.cluster_ca_certificate
  cluster_name                    = module.terraform-gke-blockchain.name
  kubernetes_access_token         = data.google_client_config.current.access_token
  project                         = module.terraform-gke-blockchain.project
  kubernetes_pool_name            = "monitoring_pool"
  kubernetes_namespace            = "tezos"
  kubernetes_name_prefix          = "xtz"
  tezos_network                   = "mainnet" 
  tezos_version                   = "v9.2"
  protocol                        = "009-PsFLoren"
  protocol_short                  = "PsFLoren"
  rolling_snapshot_url            = "https://mainnet.xtz-shots.io/rolling"
  bakers = {
    "baker001": {
      public_baking_key="tz1xxxx",
      slack_url="https://hooks.slack.com/services/xxxxx",
      slack_channel="#general",
    }
  }
}

Note that none of the values above are secrets, so it is fine to commit them to a private repository. The secrets are passed as variables and must be handled separately.
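
The manifest references var.monitoring_slack_url and var.signer_target_host_key, so both need to be declared. A minimal sketch of the declarations, assuming both are plain strings and Terraform 0.14 or later (for the sensitive flag); the descriptions are guesses based on the variable names.

```hcl
variable "monitoring_slack_url" {
  type        = string
  description = "Slack webhook URL for monitoring alerts (assumed meaning)"
  sensitive   = true # hides the value in plan and apply output
}

variable "signer_target_host_key" {
  type        = string
  description = "Host key used by the remote signer setup (assumed meaning)"
  sensitive   = true
}
```

The values can then be supplied through TF_VAR_monitoring_slack_url and TF_VAR_signer_target_host_key environment variables, or through a .tfvars file that is kept out of git.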

Going further

A production baker should be operated with an on-call rotation, meaning several operators have access to the setup.

Specifically:

  • secrets should be moved from a file in the operator workspace to a production secret store such as HashiCorp Vault
  • terraform deploys should be done by a CI system
  • any manual change in the kubernetes environment should be recorded in an audit log and committed to the code:
    • the terraform private file above can be applied with continuous integration
    • the intermediate kubernetes code generated with kustomize could be stored in a CI pipeline and deployed in an auditable way as well (see GitOps).
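
As an illustration of the first point, the sketch below reads a secret from Vault with the Terraform Vault provider instead of a local file. The secret path and key name are hypothetical, and the exact path format depends on how your secrets engine is mounted.

```hcl
# Assumes the Vault address and token are provided through the
# VAULT_ADDR and VAULT_TOKEN environment variables.
provider "vault" {
}

# Hypothetical path to a secret holding the baker's sensitive parameters.
data "vault_generic_secret" "baker" {
  path = "secret/tezos/baker"
}

locals {
  # Hypothetical key inside the secret; can replace var.signer_target_host_key in the manifest.
  signer_target_host_key = data.vault_generic_secret.baker.data["signer_target_host_key"]
}
```

Secrets read this way still end up in the terraform state, so access to the state bucket should be restricted accordingly.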