Observability & Production Readiness

Adding spring-boot-starter-actuator to a project brings in a set of built-in HTTP (and JMX) endpoints exposing operational insight into a running application — no custom monitoring code required:

Common built-in endpoints (accessible under /actuator/* by default):

  • /actuator/health — overall application health status (UP/DOWN), aggregating individual health indicators (database connectivity, disk space, message broker connectivity, and any custom indicators you register).
  • /actuator/info — arbitrary static or build-time application metadata (version, git commit, build time) you configure to be exposed.
  • /actuator/metrics — a browsable list of collected metrics (JVM memory, HTTP request counts/latencies, custom application metrics via Micrometer).
  • /actuator/env — the application's currently resolved Environment properties (configuration values, profiles) — sensitive, since it can reveal connection strings or other configuration detail.
  • /actuator/beans — every bean currently registered in the ApplicationContext.
  • /actuator/mappings — every registered @RequestMapping route in the application.
  • /actuator/threaddump and /actuator/heapdump: a live thread dump and a full heap dump, respectively. Both are very sensitive, since they can expose in-memory application data.
  • /actuator/loggers — view and even dynamically change logging levels for specific packages at runtime, without a restart.

Default exposure is deliberately conservative: out of the box, only /actuator/health is exposed over HTTP (Spring Boot versions before 2.5 also exposed /actuator/info). Everything else must be explicitly opted into:

management.endpoints.web.exposure.include=health,info,metrics,prometheus

This default-off posture exists specifically because many Actuator endpoints reveal genuinely sensitive internal detail (environment variables, full heap contents, every registered bean) — broadening exposure should always be a deliberate decision, paired with appropriate access restriction (see the Actuator-security question), not something enabled wholesale "just in case it's useful."

Actuator is also what integrates naturally with health-check-based orchestration (Kubernetes liveness/readiness probes) and metrics-scraping systems (Prometheus), covered in the following questions.

Custom health indicators let you report the health of application-specific dependencies beyond Spring Boot's automatically-detected defaults (database, disk space):

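// Package names as of Spring Boot 2.x/3.x
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
class PaymentGatewayHealthIndicator implements HealthIndicator {
    private final PaymentGatewayClient client;

    // Constructor injection; PaymentGatewayClient is the application's own client bean
    PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
        this.client = client;
    }

    @Override
    public Health health() {
        try {
            client.ping();
            return Health.up().withDetail("gateway", "reachable").build();
        } catch (Exception e) {
            // Health.down(e) records the exception alongside the DOWN status
            return Health.down(e).withDetail("gateway", "unreachable").build();
        }
    }
}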

Actuator automatically discovers and aggregates every registered HealthIndicator bean into the overall /actuator/health response — if any indicator reports DOWN, the aggregated overall status is DOWN by default.
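
The aggregation rule can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions: the class and enum below are hypothetical stand-ins, not Spring Boot's own types, and the severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN) follows the default order Spring Boot documents.

```java
import java.util.Map;

// Simplified model of status aggregation; these are not Spring Boot's actual classes.
class HealthAggregationSketch {

    // Declared most-severe first, so a lower ordinal means "worse".
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    // The aggregate is the most severe status reported by any indicator.
    static Status aggregate(Map<String, Status> indicators) {
        Status worst = Status.UNKNOWN;
        for (Status s : indicators.values()) {
            if (s.ordinal() < worst.ordinal()) {
                worst = s;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        Map<String, Status> indicators = Map.of(
                "db", Status.UP,
                "diskSpace", Status.UP,
                "paymentGateway", Status.DOWN);
        System.out.println(aggregate(indicators)); // prints DOWN
    }
}
```

A single DOWN indicator is enough to pull the overall result to DOWN, which is why the choice of indicators feeding a given health endpoint deserves care.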

Kubernetes liveness vs. readiness — and why they're genuinely different questions:

  • Liveness asks: "is this instance in a broken state that a restart would fix?" — e.g., deadlocked, stuck in an infinite loop, or otherwise unable to make forward progress. A failed liveness probe causes Kubernetes to kill and restart the pod.
  • Readiness asks: "is this instance ready to receive traffic right now?" — e.g., has it finished its startup sequence, warmed up its caches, and can it currently reach its required dependencies? A failed readiness probe doesn't restart the pod — it just tells Kubernetes to stop routing traffic to it temporarily until it reports ready again (useful during startup, or a transient dependency outage that will self-resolve).

Spring Boot Actuator has built-in support for exactly this distinction via health groups. They are enabled automatically when the application detects it is running on Kubernetes, and can be enabled explicitly elsewhere:

management.endpoint.health.probes.enabled=true
management.health.livenessstate.enabled=true
management.health.readinessstate.enabled=true

This exposes two dedicated endpoints:

  • /actuator/health/liveness: reflects the application's internal liveness state (Spring Boot's LivenessState, which application code can explicitly mark as BROKEN if it detects an unrecoverable internal condition).
  • /actuator/health/readiness: reports whether the app has fully started and can accept traffic. By default this group contains only Spring Boot's ReadinessState; other indicators join it only when listed explicitly, e.g. management.endpoint.health.group.readiness.include=readinessState,paymentGateway. This is where a custom indicator like the PaymentGatewayHealthIndicator above would typically be classified, since "can't reach the payment gateway" is a reason to stop routing traffic here, not a reason to restart the pod.

# Kubernetes deployment manifest
livenessProbe:
  httpGet: { path: /actuator/health/liveness, port: 8080 }
readinessProbe:
  httpGet: { path: /actuator/health/readiness, port: 8080 }

Why the distinction matters in practice: conflating the two (e.g., pointing both probes at the same generic /actuator/health) can cause a genuinely bad outcome — a transient, self-recovering dependency outage (which should only affect readiness, pausing traffic briefly) instead triggers a full pod restart via the liveness probe, needlessly discarding the instance's warmed-up state and potentially making an already-degraded situation worse by cycling pods that weren't actually broken.
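
The failure mode described above can be made concrete with a toy model. This is a sketch only: the react function and Action enum are hypothetical and merely mimic the orchestrator behaviour described in this section (failed liveness restarts, failed readiness pauses traffic).

```java
// Toy model of how an orchestrator reacts to probe results; all names are illustrative.
class ProbeWiringSketch {

    enum Action { NONE, STOP_ROUTING, RESTART }

    // Failed liveness leads to a restart; failed readiness alone only pauses traffic.
    static Action react(boolean livenessOk, boolean readinessOk) {
        if (!livenessOk) {
            return Action.RESTART;
        }
        if (!readinessOk) {
            return Action.STOP_ROUTING;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        boolean processHealthy = true;     // the JVM itself is fine
        boolean gatewayReachable = false;  // transient dependency outage

        // Separate probes: liveness ignores the dependency, readiness includes it.
        System.out.println(react(processHealthy, processHealthy && gatewayReachable)); // prints STOP_ROUTING

        // Conflated probes: both point at one check that includes the dependency.
        boolean combined = processHealthy && gatewayReachable;
        System.out.println(react(combined, combined)); // prints RESTART
    }
}
```

With identical inputs, the only difference between a brief traffic pause and a pod restart is which check each probe is wired to.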

Micrometer is a vendor-neutral metrics facade — conceptually similar to how SLF4J decouples application logging calls from a specific logging backend (Logback, Log4j2). Application code instruments metrics through Micrometer's API, and a separate, swappable registry determines which actual monitoring backend those metrics get exported to.

Auto-configuration: as soon as spring-boot-starter-actuator is on the classpath, Spring Boot automatically wires up Micrometer and starts capturing a substantial set of built-in metrics with zero extra code:

  • HTTP request counts, latencies, and status-code breakdowns per endpoint (http.server.requests).
  • JVM metrics — heap/non-heap memory usage, garbage collection pause counts/durations, thread counts.
  • Data source connection pool metrics (active/idle connections, wait time) for HikariCP.
  • Cache hit/miss rates, if a supported caching provider is in use.

Registering custom application metrics:

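import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
class OrderService {
    private final Counter ordersPlacedCounter;
    private final Timer orderProcessingTimer;

    // The auto-configured MeterRegistry is injected; meters are created once and reused
    OrderService(MeterRegistry registry) {
        this.ordersPlacedCounter = registry.counter("orders.placed");
        this.orderProcessingTimer = registry.timer("orders.processing.time");
    }

    void placeOrder(Order order) {
        // record(...) measures how long the wrapped work takes
        orderProcessingTimer.record(() -> {
            // ... process the order ...
            ordersPlacedCounter.increment();
        });
    }
}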

Exporting to a specific backend just means adding the corresponding registry dependency — the instrumentation code above never needs to change regardless of which backend is chosen:

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

This registers an additional endpoint, /actuator/prometheus, which serves metrics in the text-based exposition format Prometheus's scraper expects. Like any other endpoint, it is reachable over HTTP only once prometheus is listed in management.endpoints.web.exposure.include. Similar registry dependencies exist for Datadog, New Relic, CloudWatch, Graphite, InfluxDB, and many other monitoring platforms. The same underlying metrics are exported through whichever registries are on the classpath, and to several backends simultaneously if more than one registry dependency is present.

Why the facade design matters: it lets application code (and, importantly, all of Spring Boot's own built-in auto-generated metrics) remain completely decoupled from any specific monitoring vendor — switching from, say, an on-prem Prometheus setup to a hosted Datadog account (a real, not-uncommon migration) requires only a dependency and configuration change, not rewriting instrumentation scattered throughout the codebase.
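
The decoupling can be illustrated with a JDK-only stand-in. This is a sketch of the facade idea, not Micrometer's API: Backend, InMemoryBackend, and Metrics are hypothetical types invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the facade idea only; none of these types are Micrometer's.
class MetricsFacadeSketch {

    // A backend receives counter increments (hypothetical interface).
    interface Backend {
        void increment(String name, double amount);
    }

    // In-memory backend used here in place of a real exporter.
    static class InMemoryBackend implements Backend {
        final Map<String, Double> totals = new HashMap<>();

        @Override
        public void increment(String name, double amount) {
            totals.merge(name, amount, Double::sum);
        }
    }

    // The facade that application code talks to; it fans out to every registered backend.
    static class Metrics {
        private final List<Backend> backends = new ArrayList<>();

        void register(Backend backend) {
            backends.add(backend);
        }

        void increment(String name) {
            for (Backend backend : backends) {
                backend.increment(name, 1.0);
            }
        }
    }

    // Instrumented code depends only on the facade, never on a concrete backend.
    static void placeOrder(Metrics metrics) {
        metrics.increment("orders.placed");
    }

    public static void main(String[] args) {
        InMemoryBackend first = new InMemoryBackend();
        InMemoryBackend second = new InMemoryBackend();
        Metrics metrics = new Metrics();
        metrics.register(first);
        metrics.register(second);

        placeOrder(metrics);
        placeOrder(metrics);

        System.out.println(first.totals.get("orders.placed"));  // prints 2.0
        System.out.println(second.totals.get("orders.placed")); // prints 2.0
    }
}
```

Swapping or adding a backend touches only the registration lines; placeOrder is unchanged, which is the property the facade is meant to provide.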

Spring Boot uses SLF4J as the logging facade (the API application code actually calls) with Logback as the default underlying implementation — giving both a simple, quick-configuration path and a full-power path for advanced needs.

Simple configuration via application.yml — sufficient for most day-to-day needs:

logging:
  level:
    root: INFO
    com.example.myapp: DEBUG          # more verbose for your own code
    org.springframework.web: WARN     # quieter for a noisy framework package
  file:
    name: /var/log/myapp/app.log

Full Logback configuration (logback-spring.xml, the -spring suffix enabling Spring-specific extensions like profile-conditional sections) for anything beyond basic level tuning — custom appenders, rolling file policies, multiple output destinations:

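<configuration>
    <springProfile name="prod">
        <!-- Daily-rolling file appender, active only under the "prod" profile -->
        <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
                <fileNamePattern>app.%d{yyyy-MM-dd}.log</fileNamePattern>
                <maxHistory>30</maxHistory> <!-- keep 30 days of archives -->
            </rollingPolicy>
            <encoder>
                <pattern>%d{ISO8601} %-5level [%thread] %logger{36} - %msg%n</pattern>
            </encoder>
        </appender>
        <!-- An appender receives events only once a logger references it -->
        <root level="INFO">
            <appender-ref ref="FILE"/>
        </root>
    </springProfile>
</configuration>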

Structured/JSON logging matters significantly once logs are shipped to a centralized aggregation system (ELK/OpenSearch, Splunk, a cloud logging service) — a plain text log line requires the aggregator to parse an often-fragile, ad-hoc text format to extract fields, whereas emitting each log entry as a JSON object makes every field (timestamp, level, logger, message, and any custom key-value context) directly and reliably queryable:

{"timestamp":"2026-07-04T10:15:30Z","level":"INFO","logger":"OrderService","message":"Order placed","orderId":"12345","customerId":"cust-1"}

This can be achieved via a dedicated Logback encoder library (e.g., logstash-logback-encoder), or — since Spring Boot 3.4 — via Spring Boot's own built-in structured logging support, configurable with a simple property:

logging.structured.format.console=ecs

supporting common structured formats (Elastic Common Schema, Logstash's format, GELF) without needing an extra third-party encoder dependency.

Adding contextual key-value fields to individual log statements (rather than just the message text) via SLF4J's fluent API or MDC (Mapped Diagnostic Context) makes those fields available as their own structured, filterable attributes in the aggregation system. For example, attaching an orderId or the current distributed-tracing traceId to every log line emitted while handling a specific request is invaluable for correlating logs with the request/trace they belong to during a production investigation (see the distributed-tracing question).
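
The context-field pattern can be sketched with the JDK alone. This is a stand-in, not SLF4J: real code would use org.slf4j.MDC and a JSON encoder, whereas the LogContextSketch class and its methods below are hypothetical and exist only to show the shape of the technique (thread-bound context, merged into every entry, cleared when the request ends).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// JDK-only stand-in for the MDC pattern; real code would use org.slf4j.MDC instead.
class LogContextSketch {

    // Per-thread context map, analogous in spirit to MDC's thread-bound storage.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.get().clear();
    }

    // Renders one entry as JSON, emitting every context field as its own attribute.
    // Escaping is omitted for brevity; values are assumed to contain no quotes.
    static String format(String level, String message) {
        StringBuilder json = new StringBuilder();
        json.append("{\"level\":\"").append(level)
            .append("\",\"message\":\"").append(message).append('"');
        for (Map.Entry<String, String> field : CONTEXT.get().entrySet()) {
            json.append(",\"").append(field.getKey())
                .append("\":\"").append(field.getValue()).append('"');
        }
        return json.append('}').toString();
    }

    static String handleRequest(String orderId) {
        put("orderId", orderId);
        try {
            return format("INFO", "Order placed");
        } finally {
            clear(); // always clear: pooled threads are reused across requests
        }
    }

    public static void main(String[] args) {
        System.out.println(handleRequest("12345"));
        // prints {"level":"INFO","message":"Order placed","orderId":"12345"}
    }
}
```

The try/finally is the important part of the pattern: without it, a context value set for one request can leak into log lines of the next request served by the same thread.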

During a normal deployment (a rolling update) or an autoscaling scale-down event, a container orchestrator (Kubernetes, ECS, etc.) routinely sends a termination signal (SIGTERM) to application instances that are still actively handling requests — this isn't an error scenario, it's completely routine operational behavior.

Without graceful shutdown, the application process might terminate abruptly the moment it receives that signal — abandoning any in-flight requests mid-processing, which clients experience as a connection reset or an unexpected error, even though nothing was actually "wrong" with the application from a health perspective.

Graceful shutdown changes this behavior: on receiving a termination signal, the application:

  1. Immediately stops accepting new requests (or the load balancer/orchestrator stops routing new traffic to it, ideally slightly before the shutdown signal even arrives, via a readiness-probe-driven traffic drain).
  2. Allows in-flight requests a bounded grace period to finish naturally.
  3. Only then actually shuts down — either once all in-flight requests complete, or once the configured grace period elapses, whichever comes first (to guarantee the process doesn't hang indefinitely on a stuck request).
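
The three steps above can be sketched with a plain ExecutorService. This is a JDK-only illustration of the drain-then-stop pattern, not Spring Boot's implementation; the GracefulShutdownSketch class and its method are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// JDK-only sketch of the drain-then-stop pattern; not Spring Boot's implementation.
class GracefulShutdownSketch {

    // Returns true if all in-flight work finished within the grace period.
    static boolean shutdownGracefully(ExecutorService workers, long graceMillis) {
        workers.shutdown(); // step 1: reject new tasks, keep running the accepted ones
        boolean drained;
        try {
            // step 2: give in-flight tasks a bounded period to finish
            drained = workers.awaitTermination(graceMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            drained = false;
        }
        if (!drained) {
            workers.shutdownNow(); // step 3: grace period elapsed, interrupt stragglers
        }
        return drained;
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        workers.submit(() -> {
            try {
                Thread.sleep(100); // simulated in-flight request
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println(shutdownGracefully(workers, 2000));
    }
}
```

The bounded wait is what keeps a single stuck request from holding the process open indefinitely.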

Enabling it in Spring Boot (from Spring Boot 3.4 onward, graceful is already the default for the embedded web server):

server.shutdown=graceful
spring.lifecycle.timeout-per-shutdown-phase=30s

With this enabled, a SIGTERM triggers Spring Boot's embedded web server to stop accepting new connections but let existing ones drain, up to the configured timeout, before the JVM actually exits.

Why this matters specifically for zero-downtime rolling deployments: a rolling deployment continuously replaces old instances with new ones while traffic keeps flowing — if outgoing instances abruptly drop in-flight requests the moment they're told to terminate, users experience a steady trickle of failed requests throughout every single deployment, even though the overall system was never actually "down." Graceful shutdown, combined with the orchestrator giving the readiness probe a chance to fail first (so no new traffic gets routed to a terminating instance) and an appropriately generous termination grace period (matching or exceeding the application's configured shutdown timeout), is what actually makes a rolling deployment or autoscaling event invisible to end users rather than a source of a small but real error rate on every release.