Secrets management is one of the areas where naive Terraform usage goes wrong most often — the tempting shortcuts (hardcoding a password, committing a .tfvars file) are exactly the ones that cause real incidents.
What not to do
# Never do this:
resource "aws_db_instance" "main" {
  password = "SuperSecret123!" # plaintext in .tf, committed to git
}
# terraform.tfvars: also never commit this if it contains real secrets
db_password = "SuperSecret123!"
Both put the secret into version control history, where it stays even after a later commit removes it, readable by anyone with repo access.
Better patterns
1. Pull from a secrets manager at apply time, via a data source:
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}
resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
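Secrets Manager entries often hold a JSON document rather than a bare string. A minimal sketch of that case, assuming a hypothetical secret named prod/db/credentials whose value is a JSON object with username and password keys (the secret name, the keys, and the instance sizing are illustrative assumptions, not part of the example above):

```hcl
# Assumption: "prod/db/credentials" is a hypothetical secret whose value is JSON,
# for example {"username": "app", "password": "..."}.
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "prod/db/credentials"
}

locals {
  # jsondecode parses the JSON string into an object; the result stays marked sensitive.
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

resource "aws_db_instance" "main" {
  # Illustrative sizing values only.
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20

  username = local.db_credentials["username"]
  password = local.db_credentials["password"]
}
```

The value read this way is still written to the state file, so the backend has to be protected accordingly.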
2. Inject via environment variables in CI, sourced from a secret store:
# In the CI job (shell):
export TF_VAR_db_password="$(vault kv get -field=password secret/db)"
terraform apply
# In the configuration (.tf):
variable "db_password" {
  type      = string
  sensitive = true
}
3. Have Terraform generate the secret itself, and immediately hand it to a secrets manager rather than exposing it as a plain output:
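resource "random_password" "db" {
  length  = 24
  special = true
}
# The secret container referenced below; the name is an illustrative assumption.
resource "aws_secretsmanager_secret" "db" {
  name = "prod/db/password"
}
resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}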
Remaining caveats
- sensitive = true only redacts CLI/log output; it does not encrypt the value inside the state file, so the backend itself (S3 + KMS, Terraform Cloud) still needs to be properly access-controlled and encrypted at rest.
- Restrict who can run terraform output or read the raw state file, since sensitive values remain retrievable there even when hidden from normal plan/apply output.
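As a sketch of what an encrypted, locked backend can look like with S3 and KMS, assuming hypothetical bucket, key ARN, and lock table names (none of these come from the text above, and the IAM policy on the bucket and key is what actually enforces who can read state):

```hcl
terraform {
  backend "s3" {
    # All names below are hypothetical placeholders.
    bucket = "my-org-terraform-state"
    key    = "prod/db/terraform.tfstate"
    region = "us-east-1"

    # Server-side encryption of the state object with a customer-managed KMS key.
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:111122223333:alias/terraform-state"

    # DynamoDB table used for state locking.
    dynamodb_table = "terraform-locks"
  }
}
```

Encryption at rest only covers the stored object; anyone whose role can read the bucket and use the key still sees the secrets in plaintext, which is why access to the backend should be as narrow as access to the secrets themselves.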
Interview-ready summary
Never hardcode secrets in configuration or commit them in tfvars; pull them from a secrets manager at apply time or inject via CI-managed environment variables, and treat the backend/state storage itself as the real security boundary that needs encryption and access control.
Running Terraform safely from CI/CD is less about the YAML pipeline syntax and more about enforcing a review gate before anything actually touches infrastructure.
A typical pipeline shape
On every pull request:
- run: terraform fmt -check
- run: terraform init
- run: terraform validate
- run: terraform plan -out=tfplan
- run: <post tfplan output as a PR comment for human review>
On merge to main (or after manual approval):
- run: terraform init
- run: terraform apply tfplan # apply the *exact* plan that was reviewed
Key practices
- Never auto-apply from an unreviewed PR. plan runs on every PR so reviewers see the intended infrastructure diff alongside the code diff, but apply is gated behind merge (or an explicit manual approval step for production).
- Apply the plan you reviewed, not a freshly recomputed one. Saving the plan to a file (-out=tfplan) and running apply tfplan guarantees that what gets applied is exactly what was shown in review; re-running plan right before apply risks a subtly different plan if something in the environment changed in between.
- Use a dedicated, least-privilege CI identity. The pipeline's AWS/Azure/GCP role should have only the permissions needed for the resources it manages, distinct from any individual engineer's broad access, and ideally distinct per environment (a prod-scoped role separate from dev).
- Remote backend with locking is mandatory, since multiple pipeline runs (or a pipeline run overlapping with a manual apply) could otherwise race on the same state.
- Pin the Terraform CLI and provider versions in CI to match what's used locally; a version mismatch between a developer's laptop and the CI runner is a classic source of "plan looks different in CI" surprises.
- Require explicit approval for production applies: a manual "approve" gate (GitHub Environments, a Slack approval bot, Terraform Cloud's run approval) before the prod apply step executes, even after the PR is merged.
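The least-privilege CI identity from the list above might be wired in roughly like this. This is a sketch under assumptions: the variable name ci_role_arn, the session name, and the region are hypothetical, and the role itself is expected to be created and scoped outside this configuration.

```hcl
# Hypothetical variable: each environment's pipeline supplies its own role ARN,
# for example through TF_VAR_ci_role_arn.
variable "ci_role_arn" {
  type        = string
  description = "ARN of the environment-scoped role the pipeline assumes"
}

provider "aws" {
  region = "us-east-1"

  # The pipeline's base credentials only need permission to assume this role;
  # the role carries the permissions for the resources this configuration manages.
  assume_role {
    role_arn     = var.ci_role_arn
    session_name = "terraform-ci"
  }
}
```

Keeping the role ARN as an input means the same configuration can run against dev and prod with different identities, without either pipeline holding the other's permissions.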
Why this matters
The whole point of putting Terraform in CI is to make infrastructure changes go through the same rigor as application code changes: a visible diff, required review, and a controlled, auditable execution — never a human running apply -auto-approve from their laptop against production.
Terraform Cloud (now branded HCP Terraform) and Terraform Enterprise (the self-hosted version) are HashiCorp's commercial platforms built around the same open-source Terraform engine, adding a layer of collaboration, governance, and execution infrastructure that teams would otherwise have to assemble themselves.
What open-source Terraform alone requires you to build
A typical DIY setup: an S3 bucket + DynamoDB table for remote state and locking, a CI pipeline (GitHub Actions/GitLab CI) to run plan/apply, a way to review plan output before merge, some ad-hoc convention for module versioning, and manually-managed IAM roles for who can trigger applies against which environment.
Pointing a configuration at Terraform Cloud
Instead of an s3/azurerm/gcs backend block, a configuration opts into the platform with a cloud block:
terraform {
  cloud {
    organization = "my-org"
    workspaces {
      name = "prod-network"
    }
  }
}
From here, terraform login authenticates the CLI, and terraform plan/apply execute as remote runs on Terraform Cloud's infrastructure rather than locally — the CLI just streams the log back to your terminal.
What Terraform Cloud/Enterprise adds out of the box
- Remote state storage with locking: no S3/DynamoDB setup needed; state lives in the platform, versioned and locked automatically.
- Remote/consistent execution: runs (plan/apply) execute on HashiCorp-managed (or self-hosted, for Enterprise) infrastructure rather than an individual's laptop or a bespoke CI runner, so environment/tooling drift between "my machine" and "the pipeline" disappears.
- Private module and provider registry: teams publish internal modules with proper versioning and discoverability, rather than relying on ad-hoc Git refs scattered across configs.
- Policy as code: Sentinel or OPA policies can block an apply that violates organizational rules (e.g., "no security group may allow 0.0.0.0/0 on port 22") before it ever executes, enforced centrally rather than hoping every reviewer catches it.
- Run history and audit logs: a searchable record of every plan/apply, who triggered it, and what changed, which is otherwise scattered across CI logs and ad-hoc conventions.
- Team/workspace-level RBAC: fine-grained control over who can plan vs. apply vs. manage variables for a given workspace, tied into SSO.
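To illustrate the private registry item, a module published there is consumed with a registry-style source and a version constraint instead of a Git URL. A minimal sketch, assuming a hypothetical vpc module published under the my-org organization and a hypothetical cidr_block input:

```hcl
module "vpc" {
  # Format: <hostname>/<organization>/<module name>/<provider>.
  # "my-org" and the "vpc" module are illustrative assumptions.
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "~> 2.3"

  # Hypothetical module input.
  cidr_block = "10.0.0.0/16"
}
```

The version argument only works with registry sources, which is one practical reason teams prefer a registry over bare Git references.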
A minimal Sentinel policy blocking public ingress, enforced automatically before any apply that violates it is allowed to proceed:
import "tfplan/v2" as tfplan
no_open_ingress = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is not "aws_security_group_rule" or
    rc.change.after.cidr_blocks is not ["0.0.0.0/0"]
  }
}
main = rule { no_open_ingress }
When teams reach for it
Once a Terraform footprint grows past a handful of engineers and a couple of environments, the DIY combination of S3+DynamoDB+CI+ad-hoc policy checks becomes its own maintenance burden. Terraform Cloud/Enterprise replaces that patchwork with a single, opinionated, governed platform — the tradeoff being cost (for Cloud's paid tiers or Enterprise licensing) and some loss of flexibility versus a fully custom-built pipeline.
Testing infrastructure code is fundamentally harder than testing application code — there's no fast in-memory unit test equivalent when the thing under test is "did a real cloud API create a real VPC correctly?" Real Terraform testing strategies work in layers, from fast/cheap to slow/thorough.
Layer 1 — static checks (fast, run on every commit)
terraform fmt -check
terraform validate
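fmt -check enforces consistent style; validate catches syntax errors and internal inconsistencies (referencing an undeclared variable, type mismatches) without needing any provider credentials or real infrastructure. It does need an initialized working directory, which terraform init -backend=false provides without touching remote state.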
Layer 2 — security/best-practice linting
Tools like tflint, tfsec, and checkov statically scan configuration for misconfigurations before anything is ever planned: an S3 bucket without encryption, a security group open to 0.0.0.0/0 on a sensitive port, a resource missing required tags. These catch a large class of real-world mistakes without provisioning anything.
Layer 3 — plan review
terraform plan itself is a form of testing — reviewing the diff before apply catches "this would delete a resource I didn't expect" style mistakes, especially when automated in CI as a required PR check.
Layer 4 — real integration testing
For genuinely verifying that a module provisions what it claims:
# Using the built-in `terraform test` framework (.tftest.hcl files)
run "creates_vpc_with_expected_cidr" {
  command = plan
  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block did not match expected value"
  }
}
Alternatively, Terratest (a Go library) actually applies a module against real (or LocalStack-mocked) infrastructure, asserts on real outputs/resource properties via SDK calls, and tears everything down afterward — genuinely exercising the module end-to-end rather than just checking the plan.
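The built-in framework can also run apply-based checks, which sits between a plan-only assertion and a full Terratest suite. A sketch, assuming the module under test declares a cidr_block variable and a vpc_id output (both hypothetical here) and that the test runs with credentials for a sandbox account:

```hcl
# Hypothetical file: tests/vpc.tftest.hcl

# Values passed to the module under test for every run block in this file.
variables {
  cidr_block = "10.0.0.0/16"
}

run "applies_vpc_and_exposes_id" {
  # Creates real infrastructure; terraform test destroys it when the run finishes.
  command = apply

  assert {
    condition     = aws_vpc.main.cidr_block == var.cidr_block
    error_message = "Applied VPC CIDR did not match the input variable"
  }

  assert {
    condition     = output.vpc_id != ""
    error_message = "Module did not expose a VPC ID output"
  }
}
```

Because command = apply provisions real resources, a test like this belongs with the slower, scheduled checks rather than on every commit.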
Why the layered approach matters
Static checks and linting run in seconds and catch the majority of common mistakes cheaply; real apply-based integration tests are slow and cost real cloud resources, so they're reserved for critical, reusable modules rather than run on every single PR. A mature testing strategy runs the cheap checks constantly and the expensive ones on a schedule or before publishing a new module version.
Interviewers often ask this to see whether you've actually operated Terraform at scale versus only used it on a solo project — the failure modes here are specific and recurring.
Common anti-patterns
1. One monolithic state file for the entire organization.
# Anti-pattern: every team's resources in one root module / one state file
resource "aws_vpc" "shared" { ... }
resource "aws_eks_cluster" "team_a" { ... }
resource "aws_rds_cluster" "team_b" { ... }
# ...hundreds more, all sharing one terraform.tfstate
Every team's resources live in a single apply, so a mistake anywhere blocks (or corrupts) everyone, applies get slower as the resource count grows, and the blast radius of any single change is enormous.
2. Hardcoded values instead of variables/data sources.
# Anti-pattern
resource "aws_instance" "web" {
  ami = "ami-0abcdef1234567890" # only valid in one region, one account
}
# Better
resource "aws_instance" "web" {
  ami = data.aws_ami.latest.id # resolved per-environment via a data source
}
Account IDs, AMI IDs, and CIDR ranges baked directly into resource blocks make the same configuration impossible to reuse across environments and force copy-paste-and-edit instead of parameterization.
3. Unpinned provider/module versions.
# Anti-pattern: no ref, no version, so it silently tracks whatever is newest
module "vpc" {
  source = "git::https://github.com/my-org/modules.git//vpc"
}
# Better
module "vpc" {
  source = "git::https://github.com/my-org/modules.git//vpc?ref=v2.3.0"
}
A bare source with no ?ref=, or no version constraint on a provider, means the next terraform init -upgrade can silently pull in breaking changes.
4. Secrets committed to .tfvars or hardcoded in .tf files. Permanently exposes credentials in git history — see the secrets-management question for the fix (pull from a secrets manager or inject via TF_VAR_* in CI).
5. Manual console changes alongside Terraform-managed resources. Causes drift that erodes trust in plan output over time (see the drift-detection question).
6. No plan review step — applying directly from a local machine without anyone else seeing the diff first.
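For anti-pattern 2, the data.aws_ami.latest reference presumes a data source defined somewhere in the configuration. One possible definition, assuming the team wants the newest Amazon-owned Amazon Linux 2023 image (the name filter is an illustrative assumption and should be tightened to the exact image family in use):

```hcl
data "aws_ami" "latest" {
  # Pick the newest image among those matching the filters below.
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"] # illustrative pattern
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.latest.id
  instance_type = "t3.micro"
}
```

The lookup runs against whichever region and account the provider is configured for, so the same code resolves a valid image ID in each environment. The tradeoff is that most_recent can change the resolved AMI between plans, which is something reviewers should watch for in the diff.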
Best practices for large teams
- Split state by environment and by service/domain, not one file per org; this limits blast radius and lets teams operate independently:
  environments/
    prod/
      network/   # own state
      compute/   # own state
      data/      # own state
- Pin every version (providers, modules, and the Terraform CLI itself):
  terraform {
    required_version = ">= 1.7.0, < 2.0.0"
    required_providers {
      aws = {
        source  = "hashicorp/aws"
        version = "~> 5.0"
      }
    }
  }
- Enforce fmt, validate, linting (tflint/tfsec/checkov), and plan review in CI before any merge that would trigger an apply.
- Use a remote backend with locking, always, even for small teams; the moment more than one person touches a configuration, local state is a liability.
- Keep modules small, composable, and independently versioned, with a clear, minimal interface (variables in, outputs out).
- Require PR review for every change that can trigger apply, with mandatory approval gates for production, mirroring the rigor applied to application code.
- Restrict console/manual access to Terraform-managed resources so drift can't creep in silently.
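Once state is split, one stack usually needs a few values from another. A minimal sketch of reading them with the terraform_remote_state data source, assuming an S3 backend, hypothetical bucket and key names, and that the network stack declares a vpc_id output:

```hcl
# In the compute stack: read outputs published by the separately managed network stack.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    # Hypothetical location of the network stack's state.
    bucket = "my-org-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  name = "app"

  # Only root-module outputs of the network stack are visible here.
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

This keeps the dependency explicit and one-directional: the compute stack can read the network stack's outputs but cannot plan or apply changes to its resources. It does require read access to the whole network state object, so teams with stricter isolation needs sometimes publish the shared values through a parameter store instead.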
Interview-ready summary
Nearly every anti-pattern above boils down to treating infrastructure code with less rigor than application code. The fix is almost always "apply the same engineering discipline (review, versioning, testing, isolation) that you'd already insist on for an application codebase."