What is distributed tracing, and how do tools like Micrometer Tracing/Sleuth and Zipkin help debug requests across services?
Quick Answer
Distributed tracing tracks a single logical request as it flows across multiple microservices. The request is tagged with a shared trace ID (plus a span ID per hop), so every log line and timing measurement produced along its path can be correlated. Without that, debugging a slow or failing request that touched five services means manually cross-referencing logs across five separate systems with no shared identifier. Micrometer Tracing (the successor to Spring Cloud Sleuth) instruments Spring Boot applications to propagate trace context across service calls, and a backend such as Zipkin or Jaeger collects the resulting spans and visualizes each trace as a timeline showing how long each service spent on its part of the request.
Detailed Answer
In a monolith, debugging a slow request usually means looking at one application's logs and a profiler. In a microservices architecture, a single user-facing request might touch five, ten, or more independent services. Without a shared way to correlate what happened across all of them, debugging means lining up timestamps across separate logging systems with no reliable shared identifier, and that effort grows with every service added to the call path.
Distributed tracing solves this by tagging a request with a shared trace ID the moment it enters the system, and propagating that same trace ID across every subsequent service-to-service call the request triggers. Each individual hop/operation within that trace is recorded as a span (with its own span ID, a parent-span reference, a start/end time, and relevant tags), so the full picture — every service the request touched, how long each one took, and in what order — can be reconstructed after the fact.
Trace ID: abc123
└── Span: api-gateway (160ms)
    └── Span: order-service (150ms)
        ├── Span: inventory-service call (40ms)
        └── Span: payment-service call (95ms) <- clearly the slowest part of this request
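To make the parent-span reference concrete, here is a minimal sketch of how a flat list of spans can be turned back into a tree like the one above. The `Span` record and the `TraceTreeSketch` class are hypothetical names used only for illustration; they are not Micrometer Tracing or Zipkin types, and real spans also carry timestamps and tags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TraceTreeSketch {

    // Hypothetical, simplified span: real tracers also record start/end times and tags.
    record Span(String traceId, String spanId, String parentId, String name) {}

    // Rebuilds the parent/child structure from a flat span list and renders it as indented text.
    static String render(List<Span> spans) {
        Map<String, List<Span>> childrenByParent = new LinkedHashMap<>();
        Span root = null;
        for (Span s : spans) {
            if (s.parentId() == null) {
                root = s; // the span with no parent is the entry point of the trace
            } else {
                childrenByParent.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s);
            }
        }
        if (root == null) {
            throw new IllegalArgumentException("no root span in trace");
        }
        StringBuilder out = new StringBuilder("Trace ID: " + root.traceId() + "\n");
        append(root, 0, childrenByParent, out);
        return out.toString();
    }

    private static void append(Span span, int depth,
                               Map<String, List<Span>> childrenByParent, StringBuilder out) {
        out.append("  ".repeat(depth)).append(span.name()).append("\n");
        for (Span child : childrenByParent.getOrDefault(span.spanId(), List.of())) {
            append(child, depth + 1, childrenByParent, out);
        }
    }

    public static void main(String[] args) {
        // Spans arrive in arbitrary order; the parent IDs alone determine the shape.
        List<Span> spans = List.of(
                new Span("abc123", "2", "1", "order-service"),
                new Span("abc123", "3", "2", "inventory-service call"),
                new Span("abc123", "1", null, "api-gateway"),
                new Span("abc123", "4", "2", "payment-service call"));
        System.out.print(render(spans));
    }
}
```

The point of the sketch is that no service needs to know the whole call graph: each span only records its own ID and its parent's ID, and the tree falls out when the spans are collected in one place.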
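Micrometer Tracing (the replacement for Spring Cloud Sleuth, which is end-of-life and does not support Spring Boot 3) is what instruments a Spring Boot application to generate and propagate this trace/span context automatically. It hooks into Spring MVC, RestClient/WebClient, messaging clients, and other common integration points, so trace context flows through outgoing calls via propagated HTTP headers without manual code changes at every call site. One caveat: outgoing HTTP clients are only instrumented when they are built from the auto-configured builders (for example an injected RestClient.Builder or WebClient.Builder); a client created through a plain static factory call will not carry the trace context.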
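As a rough picture of what "propagated HTTP headers" means, the sketch below hand-rolls the extract/inject cycle using the W3C Trace Context `traceparent` header, one of the header formats tracers commonly use (B3 is another). This is not Micrometer Tracing's API; `TraceparentSketch`, `TraceContext`, and the method names are assumptions made for illustration, and plain maps stand in for HTTP headers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceparentSketch {

    // Hypothetical holder for the context that travels between services.
    record TraceContext(String traceId, String spanId, boolean sampled) {}

    // traceparent layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
    private static final Pattern TRACEPARENT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    // Reads the caller's context from incoming headers; returns null if absent or malformed.
    static TraceContext extract(Map<String, String> headers) {
        String value = headers.get("traceparent");
        if (value == null) {
            return null;
        }
        Matcher m = TRACEPARENT.matcher(value);
        if (!m.matches()) {
            return null;
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 1) == 1; // lowest flag bit = sampled
        return new TraceContext(m.group(1), m.group(2), sampled);
    }

    // A child span keeps the trace ID and gets a fresh span ID.
    static TraceContext childOf(TraceContext parent) {
        String spanId = String.format("%016x", ThreadLocalRandom.current().nextLong());
        return new TraceContext(parent.traceId(), spanId, parent.sampled());
    }

    // Writes the context onto the headers of an outgoing call.
    static void inject(TraceContext ctx, Map<String, String> headers) {
        headers.put("traceparent",
                "00-" + ctx.traceId() + "-" + ctx.spanId() + "-" + (ctx.sampled() ? "01" : "00"));
    }

    public static void main(String[] args) {
        Map<String, String> incoming =
                Map.of("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        TraceContext received = extract(incoming);

        Map<String, String> outgoing = new HashMap<>();
        inject(childOf(received), outgoing);

        // Same trace ID as the incoming request, new span ID for this hop.
        System.out.println(outgoing.get("traceparent"));
    }
}
```

An instrumented client and server do this on every hop, which is why the trace ID stays constant across services while the span ID changes.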
Zipkin (or a comparable backend like Jaeger) is a collector and visualization system: instrumented services export their recorded spans to it, and Zipkin assembles them (using the shared trace ID) into a single, visual timeline showing exactly which service handled which part of the request and how long each part took.
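The assembly step can be sketched in a few lines. The code below is a simplified illustration, not Zipkin's actual data model or storage logic: `TraceAssemblySketch`, `Span`, and the sample durations are assumptions. It groups spans reported by different services by trace ID, then ranks spans by "self time" (a span's duration minus its direct children's durations), assuming children run sequentially and do not overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraceAssemblySketch {

    // Hypothetical, simplified span; durations are inclusive of child spans.
    record Span(String traceId, String spanId, String parentId, String name, long durationMs) {}

    // Groups spans reported independently by each service into one list per trace ID.
    static Map<String, List<Span>> groupByTrace(List<Span> reported) {
        Map<String, List<Span>> traces = new HashMap<>();
        for (Span s : reported) {
            traces.computeIfAbsent(s.traceId(), k -> new ArrayList<>()).add(s);
        }
        return traces;
    }

    // Self time: the span's duration minus the time spent in its direct children.
    // Assumes the children ran one after another, without overlapping.
    static long selfTimeMs(Span span, List<Span> trace) {
        long childTotal = 0;
        for (Span s : trace) {
            if (span.spanId().equals(s.parentId())) {
                childTotal += s.durationMs();
            }
        }
        return span.durationMs() - childTotal;
    }

    // The span with the largest self time is the likeliest culprit for a slow trace.
    static Span slowest(List<Span> trace) {
        return trace.stream()
                .max(Comparator.comparingLong((Span s) -> selfTimeMs(s, trace)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Illustrative data: spans from two traces, arriving interleaved and out of order.
        List<Span> reported = List.of(
                new Span("abc123", "4", "2", "payment-service call", 95),
                new Span("zzz999", "9", null, "api-gateway", 20),
                new Span("abc123", "1", null, "api-gateway", 160),
                new Span("abc123", "3", "2", "inventory-service call", 40),
                new Span("abc123", "2", "1", "order-service", 150));

        List<Span> trace = groupByTrace(reported).get("abc123");
        Span culprit = slowest(trace);
        System.out.println(culprit.name() + " self time: " + selfTimeMs(culprit, trace) + "ms");
        // prints "payment-service call self time: 95ms"
    }
}
```

Ranking by self time rather than raw duration matters because a parent span always looks slow when one of its children is slow; subtracting the children points at the span where the time was actually spent.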
Why this matters practically: given a slow or failing request, distributed tracing shows which specific service in the chain caused the delay or failure, so you no longer have to guess and check logs across every service the request might have touched. What could have been a lengthy, multi-team debugging session becomes a matter of opening one trace timeline and reading off the culprit span.