Terraform in Production

Secrets management is one of the areas where naive Terraform usage goes wrong most often — the tempting shortcuts (hardcoding a password, committing a .tfvars file) are exactly the ones that cause real incidents.

What not to do

# Never do this:
resource "aws_db_instance" "main" {
  password = "SuperSecret123!"   # plaintext in .tf, committed to git
}

# terraform.tfvars: never commit this either if it contains real secrets
db_password = "SuperSecret123!"

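Both put the secret into version control history, where it stays even after the line is later removed (it is still reachable through git log) and remains visible to anyone with repo access. Once that has happened, the only real fix is to rotate the secret.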

Better patterns

1. Pull from a secrets manager at apply time, via a data source:

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

2. Inject via environment variables in CI, sourced from a secret store:

export TF_VAR_db_password="$(vault kv get -field=password secret/db)"
terraform apply

# variables.tf: the matching declaration that receives TF_VAR_db_password
variable "db_password" {
  type      = string
  sensitive = true
}

3. Let Terraform generate the secret itself, and immediately hand it to a secrets manager rather than exposing it as a plain output:

resource "random_password" "db" {
  length  = 24
  special = true
}

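# The secret container that the version below attaches to; the name is a placeholder.
resource "aws_secretsmanager_secret" "db" {
  name = "prod/db/password"
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}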
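
The first pattern assumes the secret is a single plain string. Secrets Manager entries are often stored as a JSON document instead, so here is a minimal sketch of that variant. The secret name prod/db/credentials, its username and password keys, and the instance settings are all hypothetical placeholders rather than anything from the original configuration.

```hcl
# Assumption: a secret named "prod/db/credentials" already exists and holds JSON
# such as {"username": "...", "password": "..."}. Name and keys are hypothetical.
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "prod/db/credentials"
}

locals {
  # jsondecode turns the JSON string into an object; the result stays sensitive.
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

resource "aws_db_instance" "from_json_secret" {
  # Placeholder sizing and engine values, present only to make the block complete.
  allocated_storage   = 20
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  skip_final_snapshot = true

  username = local.db_credentials.username
  password = local.db_credentials.password
}
```

As with the plain-string version, the decoded values still end up in the state file, so the caveats about protecting state apply unchanged.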

Remaining caveats

  • sensitive = true only redacts CLI/log output — it does not encrypt the value inside the state file, so the backend itself (S3 + KMS, Terraform Cloud) still needs to be properly access-controlled and encrypted at rest.
  • Restrict who can run terraform output or read the raw state file, since sensitive values remain retrievable there even when hidden from normal plan/apply output.
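
To make the first caveat concrete, this is roughly what an encrypted S3 backend could look like. It is a sketch under assumptions: the bucket, key path, KMS key ARN, and lock table names are invented for illustration and would need to exist already.

```hcl
terraform {
  backend "s3" {
    # All names below are hypothetical placeholders.
    bucket = "example-tf-state"
    key    = "prod/db/terraform.tfstate"
    region = "us-east-1"

    # Encrypt the state object at rest with a customer-managed KMS key.
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

    # DynamoDB table used for state locking.
    dynamodb_table = "example-tf-locks"
  }
}
```

Encryption alone does not restrict readers: the IAM policies on the bucket and on the KMS key decide who can actually fetch and decrypt the state, which is where the second caveat is enforced.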

Interview-ready summary

Never hardcode secrets in configuration or commit them in tfvars; pull them from a secrets manager at apply time or inject via CI-managed environment variables, and treat the backend/state storage itself as the real security boundary that needs encryption and access control.

Running Terraform safely from CI/CD is less about the YAML pipeline syntax and more about enforcing a review gate before anything actually touches infrastructure.

A typical pipeline shape

On every pull request:

- run: terraform fmt -check
- run: terraform init
- run: terraform validate
- run: terraform plan -out=tfplan
- run: <post tfplan output as a PR comment for human review>

On merge to main (or after manual approval):

- run: terraform init
- run: terraform apply tfplan   # apply the *exact* plan that was reviewed

Key practices

  1. Never auto-apply from an unreviewed PR. plan runs on every PR so reviewers see the intended infrastructure diff alongside the code diff, but apply is gated behind merge (or an explicit manual approval step for production).
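  2. Apply the plan you reviewed, not a freshly recomputed one. Saving the plan to a file (-out=tfplan) and running apply tfplan means what gets applied is exactly what was shown in review, provided the plan file is carried from the plan job to the apply job as a build artifact. Re-running plan right before apply risks a subtly different plan if something in the environment changed in between, whereas Terraform refuses to apply a saved plan that has gone stale because the state changed after it was created.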
  3. Use a dedicated, least-privilege CI identity. The pipeline's AWS/Azure/GCP role should have only the permissions needed for the resources it manages, distinct from any individual engineer's broad access, and ideally distinct per environment (a prod-scoped role separate from dev).
  4. Remote backend with locking is mandatory, since multiple pipeline runs (or a pipeline run overlapping with a manual apply) could otherwise race on the same state.
  5. Pin the Terraform CLI and provider versions in CI to match what's used locally — a version mismatch between a developer's laptop and the CI runner is a classic source of "plan looks different in CI" surprises.
  6. Require explicit approval for production applies — a manual "approve" gate (GitHub Environments, a Slack approval bot, Terraform Cloud's run approval) before the prod apply step executes, even after the PR is merged.
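
Practice 3 can be sketched with a provider that assumes a role supplied per environment, so the pipeline never runs with an engineer's personal credentials. The variable name ci_role_arn and the session name are assumptions made for this example.

```hcl
# Hypothetical input: each environment's pipeline passes its own role ARN,
# for example through TF_VAR_ci_role_arn.
variable "ci_role_arn" {
  type        = string
  description = "ARN of the least-privilege role the CI pipeline assumes for this environment"
}

provider "aws" {
  region = "us-east-1" # placeholder region

  # Every API call Terraform makes goes through the assumed role,
  # so its IAM policy bounds what a pipeline run can change.
  assume_role {
    role_arn     = var.ci_role_arn
    session_name = "terraform-ci" # shows up in CloudTrail for auditing
  }
}
```

Keeping a prod-scoped role separate from the dev one means that a misconfigured dev pipeline cannot reach production resources even if it runs the same configuration.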

Why this matters

The whole point of putting Terraform in CI is to make infrastructure changes go through the same rigor as application code changes: a visible diff, required review, and a controlled, auditable execution — never a human running apply -auto-approve from their laptop against production.

Terraform Cloud (now branded HCP Terraform) and Terraform Enterprise (the self-hosted version) are HashiCorp's hosted and self-hosted platforms built around the same open-source Terraform engine, adding a layer of collaboration, governance, and execution infrastructure that teams would otherwise have to assemble themselves.

What open-source Terraform alone requires you to build

A typical DIY setup: an S3 bucket + DynamoDB table for remote state and locking, a CI pipeline (GitHub Actions/GitLab CI) to run plan/apply, a way to review plan output before merge, some ad-hoc convention for module versioning, and manually-managed IAM roles for who can trigger applies against which environment.
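
For a sense of the state-and-locking part of that DIY setup, here is a minimal sketch of the two resources involved. The bucket and table names are hypothetical; the LockID hash key is the attribute name the S3 backend expects for DynamoDB locking.

```hcl
# Bucket that holds the state files (name is a placeholder and must be globally unique).
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-tf-state"
}

# Versioning keeps earlier state revisions recoverable after a bad apply.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: the S3 backend writes an item keyed by LockID while a run holds the lock.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "example-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

These resources are usually created once in a small bootstrap configuration, since the backend they provide cannot store its own state before it exists.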

Pointing a configuration at Terraform Cloud

Instead of an s3/azurerm/gcs backend block, a configuration opts into the platform with a cloud block:

terraform {
  cloud {
    organization = "my-org"

    workspaces {
      name = "prod-network"
    }
  }
}

From here, terraform login authenticates the CLI, and with the workspace's default remote execution mode, terraform plan and terraform apply run on Terraform Cloud's infrastructure rather than locally; the CLI just streams the log back to your terminal.

What Terraform Cloud/Enterprise adds out of the box

  • Remote state storage with locking — no S3/DynamoDB setup needed; state lives in the platform, versioned and locked automatically.
  • Remote/consistent execution — runs (plan/apply) execute on HashiCorp-managed (or self-hosted, for Enterprise) infrastructure rather than an individual's laptop or a bespoke CI runner, so environment/tooling drift between "my machine" and "the pipeline" disappears.
  • Private module and provider registry — teams publish internal modules with proper versioning and discoverability, rather than relying on ad-hoc Git refs scattered across configs.
  • Policy as code — Sentinel or OPA policies can block an apply that violates organizational rules (e.g., "no security group may allow 0.0.0.0/0 on port 22") before it ever executes, enforced centrally rather than hoping every reviewer catches it.
  • Run history and audit logs — a searchable record of every plan/apply, who triggered it, and what changed, which is otherwise scattered across CI logs and ad-hoc conventions.
  • Team/workspace-level RBAC — fine-grained control over who can plan vs. apply vs. manage variables for a given workspace, tied into SSO.
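
As an illustration of the private registry item, a module published there is consumed by registry address and version constraint instead of a Git URL. The organization my-org matches the cloud block earlier in this section; the module name vpc, the version, and the absence of inputs are assumptions for the sketch.

```hcl
# Hypothetical private-registry module: <hostname>/<organization>/<name>/<provider>.
module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "~> 2.3" # accepts 2.3 and later minor releases, but not 3.0

  # Module inputs would go here; they depend on the module's own interface.
}
```

Because the registry understands semantic versions, a constraint like this replaces hand-maintained ?ref= tags on Git sources.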

A minimal Sentinel policy that rejects security group rules open to 0.0.0.0/0. The platform evaluates it after the plan and, when the policy is enforced as mandatory, stops the run before apply if the rule fails:

import "tfplan/v2" as tfplan

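# Passes only if no security group rule lists 0.0.0.0/0 among its CIDR blocks.
# Membership is checked rather than exact list equality, so a rule that mixes
# 0.0.0.0/0 with other ranges is caught as well.
no_open_ingress = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is not "aws_security_group_rule" or
    (rc.change.after.cidr_blocks else null) is null or
    "0.0.0.0/0" not in rc.change.after.cidr_blocks
  }
}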

main = rule { no_open_ingress }

When teams reach for it

Once a Terraform footprint grows past a handful of engineers and a couple of environments, the DIY combination of S3+DynamoDB+CI+ad-hoc policy checks becomes its own maintenance burden. Terraform Cloud/Enterprise replaces that patchwork with a single, opinionated, governed platform — the tradeoff being cost (for Cloud's paid tiers or Enterprise licensing) and some loss of flexibility versus a fully custom-built pipeline.

Testing infrastructure code is fundamentally harder than testing application code — there's no fast in-memory unit test equivalent when the thing under test is "did a real cloud API create a real VPC correctly?" Real Terraform testing strategies work in layers, from fast/cheap to slow/thorough.

Layer 1 — static checks (fast, run on every commit)

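terraform fmt -check
terraform init -backend=false   # install providers and modules without touching remote state
terraform validate

fmt -check enforces consistent style; validate catches syntax errors and internal inconsistencies (referencing an undeclared variable, type mismatches) without needing any provider credentials or real infrastructure. It does need an initialized working directory, which init -backend=false provides without configuring the backend.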

Layer 2 — security/best-practice linting

Tools like tflint, tfsec, and checkov statically scan configuration for misconfigurations before anything is ever planned: an S3 bucket without encryption, a security group open to 0.0.0.0/0 on a sensitive port, a resource missing required tags. These catch a large class of real-world mistakes without provisioning anything.

Layer 3 — plan review

terraform plan itself is a form of testing — reviewing the diff before apply catches "this would delete a resource I didn't expect" style mistakes, especially when automated in CI as a required PR check.

Layer 4 — real integration testing

For genuinely verifying that a module provisions what it claims:

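# Using the built-in `terraform test` framework (.tftest.hcl files, Terraform 1.6+).
# command = plan only evaluates planned values; command = apply (the default)
# provisions real resources and destroys them when the test run finishes.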
run "creates_vpc_with_expected_cidr" {
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block did not match expected value"
  }
}

Alternatively, Terratest (a Go library) actually applies a module against real (or LocalStack-mocked) infrastructure, asserts on real outputs/resource properties via SDK calls, and tears everything down afterward — genuinely exercising the module end-to-end rather than just checking the plan.

Why the layered approach matters

Static checks and linting run in seconds and catch the majority of common mistakes cheaply; real apply-based integration tests are slow and cost real cloud resources, so they're reserved for critical, reusable modules rather than run on every single PR. A mature testing strategy runs the cheap checks constantly and the expensive ones on a schedule or before publishing a new module version.
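
The cheap-versus-expensive split can also live inside a single test file. The sketch below assumes a module that declares a vpc_cidr variable and an aws_vpc.main resource; both names are hypothetical. The first run block only plans, while the second actually provisions the VPC and therefore needs real credentials.

```hcl
# vpc.tftest.hcl (hypothetical file name)

variables {
  vpc_cidr = "10.0.0.0/16"
}

# Fast check: evaluates the plan only, nothing is created.
run "plan_uses_requested_cidr" {
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == var.vpc_cidr
    error_message = "Planned VPC CIDR does not match the vpc_cidr variable"
  }
}

# Slow check: creates the VPC for real; terraform test destroys it afterwards.
run "apply_creates_vpc" {
  command = apply

  assert {
    condition     = startswith(aws_vpc.main.id, "vpc-")
    error_message = "VPC was not created with a valid ID"
  }
}
```

Run blocks execute in order, so a failure in the plan-only check is reported before any real infrastructure is requested.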

Interviewers often ask about Terraform anti-patterns and team practices to see whether you've actually operated Terraform at scale rather than only used it on a solo project; the failure modes here are specific and recurring.

Common anti-patterns

1. One monolithic state file for the entire organization.

# Anti-pattern: every team's resources in one root module / one state file
resource "aws_vpc" "shared" { ... }
resource "aws_eks_cluster" "team_a" { ... }
resource "aws_rds_cluster" "team_b" { ... }
# ...hundreds more, all sharing one terraform.tfstate

Every team's resources live in a single apply, so a mistake anywhere blocks (or corrupts) everyone, applies get slower as the resource count grows, and the blast radius of any single change is enormous.

2. Hardcoded values instead of variables/data sources.

# Anti-pattern
resource "aws_instance" "web" {
  ami = "ami-0abcdef1234567890"   # only valid in one region, one account
}

# Better
resource "aws_instance" "web" {
  ami = data.aws_ami.latest.id     # resolved per-environment via a data source
}

Account IDs, AMI IDs, and CIDR ranges baked directly into resource blocks make the same configuration impossible to reuse across environments and force copy-paste-and-edit instead of parameterization.
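
The "Better" example references data.aws_ami.latest without defining it. One possible definition is sketched here; the owner and the name filter are assumptions that target Amazon Linux 2023 images and would change for other distributions.

```hcl
# Looks up the newest matching AMI in whatever region the provider is configured for.
data "aws_ami" "latest" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"] # assumed naming pattern, adjust as needed
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

One tradeoff of most_recent = true is that a newly published image changes the resolved ID, so the next plan may propose replacing the instance; teams that want stability often pass a reviewed AMI ID in as a variable instead.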

3. Unpinned provider/module versions.

# Anti-pattern: no ref, no version — silently tracks whatever is newest
module "vpc" {
  source = "git::https://github.com/my-org/modules.git//vpc"
}

# Better
module "vpc" {
  source  = "git::https://github.com/my-org/modules.git//vpc?ref=v2.3.0"
}

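A bare source with no ?ref= means every fresh terraform init (for example on a clean CI runner) fetches whatever the default branch currently contains, and a provider with no version constraint lets the next terraform init -upgrade silently pull in breaking changes.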

4. Secrets committed to .tfvars or hardcoded in .tf files. Permanently exposes credentials in git history — see the secrets-management question for the fix (pull from a secrets manager or inject via TF_VAR_* in CI).

5. Manual console changes alongside Terraform-managed resources. Causes drift that erodes trust in plan output over time (see the drift-detection question).

6. No plan review step — applying directly from a local machine without anyone else seeing the diff first.

Best practices for large teams

  • Split state by environment and by service/domain, not one file per org — this limits blast radius and lets teams operate independently:
    environments/
      prod/
        network/    # own state
        compute/    # own state
        data/       # own state
    
  • Pin every version — providers, modules, and the Terraform CLI itself:
    terraform {
      required_version = ">= 1.7.0, < 2.0.0"
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    
  • Enforce fmt, validate, linting (tflint/tfsec/checkov), and plan review in CI before any merge that would trigger an apply.
  • Use a remote backend with locking, always, even for small teams — the moment more than one person touches a configuration, local state is a liability.
  • Keep modules small, composable, and independently versioned, with a clear, minimal interface (variables in, outputs out).
  • Require PR review for every change that can trigger apply, with mandatory approval gates for production, mirroring the rigor applied to application code.
  • Restrict console/manual access to Terraform-managed resources so drift can't creep in silently.
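
Splitting state raises the question of how one stack reads values owned by another. A common approach is the terraform_remote_state data source, sketched below for a compute stack that reads the network stack's outputs. The bucket, key, output name private_subnet_id, and the ami_id variable are all hypothetical.

```hcl
# Read-only view of the outputs published by the network stack's state.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-tf-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

variable "ami_id" {
  type        = string
  description = "AMI to launch, supplied per environment"
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Assumes the network stack declares an output named private_subnet_id.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```

Only values the network stack explicitly declares as outputs are visible this way, which keeps the interface between the two stacks small and deliberate.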

Interview-ready summary

Nearly every anti-pattern above boils down to treating infrastructure code with less rigor than application code. The fix is almost always to apply the same engineering discipline (review, versioning, testing, isolation) that you'd already insist on for an application codebase.
