State Management

terraform.tfstate is arguably the most important — and most misunderstood — file in a Terraform project. It's a JSON document that Terraform generates and maintains automatically; you almost never hand-edit it.

What's in it

{
  "resources": [
    {
      "type": "aws_instance",
      "name": "web",
      "instances": [
        {
          "attributes": {
            "id": "i-0abcdef1234567890",
            "ami": "ami-0abcdef1234567890",
            "instance_type": "t3.micro",
            "private_ip": "10.0.1.15"
          }
        }
      ]
    }
  ]
}

For every resource block in your configuration, state records the real-world object it maps to (here, an actual EC2 instance ID) along with a cached copy of every attribute the provider returned when it was last read.
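
As a rough illustration, the state entry above would correspond to a resource block along these lines. This is a minimal sketch: the argument values are simply copied from the sample state, and a real configuration would usually carry more arguments.

```hcl
# The address aws_instance.web (type + name) is the key Terraform uses
# to find the matching entry in state.
resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # copied from the sample state above
  instance_type = "t3.micro"
}
```

Attributes such as id and private_ip never appear in configuration; the provider reports them after creation and they exist only in state.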

Why Terraform needs it

Most cloud APIs have no concept of "which resources belong to Terraform configuration X" — an AWS account just has instances, buckets, and security groups with no inherent grouping by IaC tool. State is what lets Terraform:

  1. Know what it owns. Without state, Terraform couldn't tell "this VPC is mine to manage" from "this VPC belongs to something else."
  2. Compute diffs without re-deriving everything from scratch. plan compares configuration against the cached attributes in state (refreshing from the real API as needed) rather than needing to reverse-engineer intent from the live infrastructure alone.
  3. Know what to destroy. If you delete a resource block from your .tf files, Terraform only knows to destroy the corresponding real object because state still has a record of it — the configuration itself no longer mentions it at all.
  4. Map configuration addresses to real IDs, which is also what makes commands like terraform state mv, terraform import, and -target possible.

The takeaway

State is Terraform's memory of "what I created and what it currently looks like." Losing it (or letting it drift out of sync via manual changes) is one of the most common sources of Terraform pain — which is exactly why remote state, locking, and drift detection (covered in later questions) matter so much in real teams.

By default, Terraform uses the local backend: state lives as a plain file (terraform.tfstate) in the working directory on whatever machine ran apply. That's fine for solo experimentation, but it breaks down almost immediately for a team.

Why remote state

  • Sharing. If state only exists on one engineer's laptop, nobody else can run plan/apply against the same infrastructure without copying that file around manually (and immediately risking two divergent copies).
  • Durability. A laptop disk failure shouldn't mean losing the only record of what infrastructure exists.
  • Security. Local state files are easy to accidentally commit to git in plaintext (including any sensitive attribute values); a remote backend with proper access controls avoids that.

A remote backend configuration looks like:

terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Why state locking

Without locking, imagine two engineers (or a human and a CI job) run terraform apply at the same moment. Both read the same starting state, both compute a plan, and both write back their own updated version — the second write silently clobbers the first's changes, and the real infrastructure now disagrees with what state records. Worse, concurrent writes to the same state file can corrupt it outright.

State locking prevents this: before any operation that could modify state, Terraform acquires a lock (in the example above, via a DynamoDB table using conditional writes). A second concurrent apply then fails with an "Error acquiring the state lock" message, or waits and retries if -lock-timeout is set, instead of silently racing.
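
For the DynamoDB-based lock, the table itself needs a string partition key named LockID. A minimal sketch of such a table follows. The on-demand billing mode and the idea of creating it in a separate bootstrap configuration are assumptions for illustration, not requirements stated above.

```hcl
# Assumption: created once in a separate bootstrap configuration, because a
# backend cannot keep its state in infrastructure that does not exist yet.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # matches dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST" # assumption: on-demand billing
  hash_key     = "LockID"          # the S3 backend expects this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```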

Modern approach

Newer Terraform versions (1.10 and later) support S3-native locking in the S3 backend (use_lockfile = true, built on S3 conditional writes) without a separate DynamoDB table, and platforms like Terraform Cloud/HCP Terraform provide remote state, locking, and a full run history out of the box. The specific mechanism varies by backend, but the guarantee is the same: one writer to state at a time.
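
A sketch of the same backend using S3-native locking instead of DynamoDB, assuming Terraform 1.10 or later and reusing the bucket and key from the earlier example:

```hcl
terraform {
  backend "s3" {
    bucket       = "my-org-terraform-state"
    key          = "prod/network/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # lock via an S3 lock object; no dynamodb_table needed
  }
}
```

Changing backend settings like this generally means re-running terraform init so the new backend configuration is picked up.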

A subtle but important gotcha: marking a variable sensitive = true only changes what Terraform prints — it does not encrypt or omit that value from the state file itself.

variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "main" {
  # ... other required arguments (engine, instance_class, etc.) omitted for brevity
  password = var.db_password # ends up in plaintext in terraform.tfstate
}

Run terraform apply and the CLI output will show password = (sensitive value) — but open terraform.tfstate directly and the real password is sitting there in plaintext, because Terraform must retain the actual value to know if it changed on the next plan.

How to actually protect this data

  1. Secure the backend, not just the value. Store state in a backend with encryption at rest (S3 with SSE-KMS, Terraform Cloud's encrypted storage) and strict IAM/RBAC so only the CI role and authorized operators can read it. Enable access logging/audit trails on the state bucket.
  2. Avoid putting long-lived secrets in Terraform at all where possible. Instead of a Terraform variable holding a real password, use a resource that generates a random password and immediately stores it in a secrets manager (random_password feeding aws_secretsmanager_secret_version), so the application reads the secret from Secrets Manager/Vault at runtime rather than from a Terraform output. The generated value is still recorded in state, so this complements a secured backend rather than replacing it.
  3. Never commit .tfvars files containing real secrets. Inject them via CI secret stores (TF_VAR_db_password env var) instead.
  4. Restrict who can run terraform output or read state directly — sensitive outputs are also stored in plaintext in state even though the CLI hides them by default.
  5. Consider encrypting the state file itself, on top of backend-level encryption. HCP Terraform (Terraform Cloud) encrypts stored state at rest; client-side state encryption is a feature of OpenTofu (1.7 and later) rather than of Terraform itself.
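
Point 2 can be sketched roughly as follows. The secret name and the password length are hypothetical choices for illustration, and the random provider is assumed to be available alongside the AWS provider.

```hcl
# Generate the password inside Terraform instead of passing it in as a variable.
resource "random_password" "db" {
  length  = 32    # hypothetical length
  special = false # assumption: avoids characters the database engine may reject
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/db/password" # hypothetical secret name
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db.result
}

resource "aws_db_instance" "main" {
  # ... other required arguments omitted for brevity
  password = random_password.db.result
}
```

Note that random_password.db.result is itself written to state in plaintext, so this pattern removes the secret from .tfvars files and CI variables but still depends on the backend protections in point 1.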

Interview-ready summary

sensitive = true is a UI/UX safeguard against accidental screen/log exposure, not an encryption mechanism. Real protection comes from securing the backend storage and minimizing how many genuine secrets ever flow through Terraform state in the first place.

terraform state mv, terraform state rm, and terraform import all manipulate the mapping between configuration and real infrastructure without necessarily touching the real infrastructure itself. They're the toolkit for when state and configuration need to be reconciled deliberately.

terraform state mv

Renames or moves a resource's address in state without destroying/recreating it.

terraform state mv aws_instance.web aws_instance.web_server

Common scenario: you refactor configuration (rename a resource, move it into a module) and want Terraform to understand "this is the same real object, just under a new address" — without this, Terraform would plan to destroy the old address and create a new one under the new address.
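
The same rename can also be expressed declaratively with a moved block, assuming Terraform 1.1 or later. This is a sketch of the alternative, not a replacement for the command in every case:

```hcl
resource "aws_instance" "web_server" {
  # ... same arguments as the former aws_instance.web
}

# Tells Terraform that the object tracked at the old address now lives at the
# new one, so the next plan shows a move instead of destroy + create.
moved {
  from = aws_instance.web
  to   = aws_instance.web_server
}
```

Because the block lives in configuration, the rename is reviewed and applied like any other change, rather than being run by hand against shared state.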

terraform state rm

Removes a resource from state without touching the real object.

terraform state rm aws_instance.legacy

Terraform simply "forgets" the resource — the EC2 instance keeps running, but Terraform no longer manages or tracks it. Useful when you're deliberately handing a resource off to be managed manually or by a different tool/team, or need to remove a broken state entry.
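
On Terraform 1.7 or later, a removed block offers a configuration-based way to do the same thing. A minimal sketch, assuming the aws_instance.legacy resource block has already been deleted from configuration:

```hcl
# Drops aws_instance.legacy from state on the next apply.
# destroy = false keeps the real EC2 instance running.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```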

terraform import

The reverse operation: bring an existing, unmanaged real-world object under Terraform's management.

terraform import aws_instance.web i-0abcdef1234567890

This requires a matching resource "aws_instance" "web" { ... } block already written in configuration. The import command only populates state and doesn't generate .tf code for you, though terraform plan -generate-config-out, used together with import blocks in Terraform 1.5 and later, can help scaffold it. After import, Terraform treats that resource as fully managed going forward.
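
On Terraform 1.5 or later, the same import can be declared in configuration with an import block. The sketch below reuses the instance ID from the command above; the resource arguments are illustrative values taken from the earlier sample state.

```hcl
# Planned and applied like any other change, so the import shows up in
# terraform plan before anything is written to state.
import {
  to = aws_instance.web
  id = "i-0abcdef1234567890"
}

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # illustrative, from the sample state
  instance_type = "t3.micro"
}
```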

Why these matter

All three exist to avoid the unwanted destroy or destroy-and-recreate actions that would otherwise be planned when configuration and state addresses don't line up exactly. They're "state surgery": precise, deliberate operations rather than something you'd run casually. For example, running state rm on a resource whose block is still in configuration and then re-applying makes Terraform try to create a duplicate of something that already exists.

Configuration drift is the gap between what Terraform's state file believes is true and what's actually running in the real infrastructure.

Common causes

  • A teammate manually edits a resource in the cloud console ("just this once, to fix production quickly").
  • An external automated process modifies a resource Terraform also manages (an autoscaler adjusting instance count, a security tool auto-remediating a misconfigured setting).
  • Another Terraform configuration or tool touches the same underlying resource.
  • A resource is deleted outside of Terraform (e.g., by a cleanup script or another engineer), so state still references something that no longer exists.

Detecting drift

terraform plan

By default, plan first refreshes its in-memory view of each managed resource by querying the provider's API, then diffs that refreshed data against configuration. If the console change altered something your configuration also specifies, plan reports an unexpected diff — e.g., "tags will be updated" when you didn't touch tags in code, which is a signal someone changed it manually.

For a refresh-only check that doesn't also propose config-driven changes:

terraform plan -refresh-only

This isolates just the drift (state vs. reality) from any intentional changes you've made in configuration, which is useful for scheduled drift-detection jobs in CI.

Reconciling drift

Two directions, depending on which side should "win":

  1. Terraform should win — re-run a normal apply. Terraform overwrites the manual change back to what configuration specifies, restoring the intended state.
  2. The manual change should be kept — update your .tf configuration to match the new reality, then run terraform apply -refresh-only to accept the drifted values into state without triggering an actual infrastructure change.
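
When an external process legitimately owns an attribute (the autoscaler case listed under common causes), a third option is to tell Terraform to stop treating changes to that attribute as drift. A minimal sketch using lifecycle ignore_changes, with hypothetical sizes:

```hcl
resource "aws_autoscaling_group" "web" {
  # ... other required arguments omitted for brevity
  min_size         = 2  # hypothetical
  max_size         = 10 # hypothetical
  desired_capacity = 2  # initial value only; the autoscaler adjusts it later

  lifecycle {
    # Changes to desired_capacity made outside Terraform no longer
    # produce a diff in plan.
    ignore_changes = [desired_capacity]
  }
}
```

This is best kept narrow: every ignored attribute is one that configuration no longer enforces.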

Why this matters operationally

Unmanaged drift erodes the entire premise of IaC — if the console can silently diverge from configuration, plan output stops being trustworthy. Mature teams run scheduled drift-detection (plan -refresh-only in a nightly CI job, alerting on any diff) and restrict console access precisely so Terraform-managed resources stay Terraform-managed.
