Distributed Tracing from the Edge to the Cloud—And Back
Software development has recently shifted from traditional monolithic applications to distributed microservices, containers, and serverless architectures. Unsurprisingly, system observability and performance issue troubleshooting have evolved to support this change.
A traditional monolithic application requires looking deep into the system’s code to understand what is happening internally. In contrast, dozens—if not hundreds—of distributed services call back and forth within a microservice environment. Furthermore, various teams throughout an organization often own these different services. It’s vital to view how these services connect and how requests flow through them.
When a problem arises, determining which services are affected is essential both for resolving it and for sending the right team to fix it. As a result, developers need to transform how they look at their distributed environments. Distributed tracing provides this crucial visibility in the cloud and at the edge.
Let’s explore distributed tracing and how it works. Then, let’s discuss both its necessity and benefits before taking a quick look at some helpful distributed tracing tools.
What is Distributed Tracing?
Distributed tracing tracks and observes service requests (transactions) as they propagate or flow through distributed systems. It does this by collecting unique data as the requests move from one service to another. This trace data makes it easy to understand how requests flow through your microservice environment and helps pinpoint performance issues or system failures.
How Distributed Tracing Works
Distributed tracing tracks and follows a single request, collecting and analyzing data on every interaction with every service the request touches. It does this by tagging the request with a unique identifier.
This identifier stays with the transaction as it interacts with the various microservices, containers, and other distributed infrastructure, which lets the tracing tool correlate every step of the request. The moment a request initiates, it triggers the creation of a unique trace ID and a parent span for that request. Then, as the request moves through and across services, each activity it triggers (called a segment or child span) is tracked and recorded.
Every time the request enters a service, this action generates a top-level child span, appearing as a single step on the trace. If the request makes more than one command or query within the same service, a top-level child span may act as a parent for child spans nested below it. Each span carries an ID, start and end timestamps (or a duration), error information, and other relevant activity metadata.
The tracing tool then pieces together the spans to create a trace of the request’s path through the various services. Because every span is timed, developers can see how long a request spends in each service or database and focus their troubleshooting on the exact span affected.
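The parent and child relationship described above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular tracing tool stores its data: the `Span` class, its field names, and the sample services (`gateway`, `orders`) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed unit of work inside a trace (hypothetical minimal model)."""
    trace_id: str             # shared by every span of the same request
    span_id: str              # unique per span
    parent_id: Optional[str]  # None marks the parent (root) span
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render(spans: List[Span]) -> List[str]:
    """Piece spans together into an indented trace, ordered by start time."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in sorted(spans, key=lambda s: s.start_ms):
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(f"{indent}{span.service}:{span.name} {span.duration_ms}ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans may arrive in any order; parent_id is what links them together.
spans = [
    Span("t1", "a", None, "gateway", "GET /orders", 0, 120),
    Span("t1", "c", "b", "orders", "db.query", 20, 95),
    Span("t1", "b", "a", "orders", "list_orders", 10, 110),
    Span("t1", "d", "b", "orders", "cache.get", 12, 18),
]

for line in render(spans):
    print(line)
```

In this made-up trace, the `db.query` span accounts for 75 ms of the 120 ms request, which is the kind of signal that tells a developer where to look first.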
Why We Need Distributed Tracing
Historically, organizations developed applications as monoliths, hosting all code inside a single process. Developers wrote monoliths using a single implementation language.
This approach made it easy to understand what might be happening in an application at runtime. You could attach a debugger to the app or use a tracing tool to capture any events because everything occurred inside a single process.
However, applications have become far more distributed over an increasingly complicated landscape with the advent of cloud computing, microservices, and container-based delivery. Workloads now run in the cloud through centralized servers that process the data, and all devices needing to communicate with the data must first connect to the cloud.
This approach is not without its problems. Latency is always a concern due to constant two-way communication between the end device and the cloud, with data filtering through several data centers along the way.
Further advancement introduced edge computing. Edge computing brings rich data storage and computing power closer to the end-user by placing them at the network’s edge, where they are most needed. Time-sensitive data no longer has to travel to distant data centers for processing. Instead, it is processed on demand, right at the edge.
This edge distribution reduces lag time, can lower operational costs, and saves considerable bandwidth. However, it also increases an application’s complexity, resulting in significant challenges with system observability.
It’s important to understand that cloud computing and edge computing are distinct, non-interchangeable technologies. While edge computing processes time-sensitive data, cloud computing gathers and processes non-time-sensitive data. Rather than relying on one or the other, it often makes sense to run part of your infrastructure at the edge and part in the cloud to form a cloud–edge hybrid computing infrastructure. Solutions like StackPath help you deploy, accelerate, and protect workloads right at the edge of the internet with a complete edge computing service for your every need.
Moving some of your workloads to the edge is beneficial when running back-end services near your users. This enables optimal performance and minimal latency. However, this approach makes some things more challenging, like discerning metrics and enabling tracing. In many cases, your edge services must make calls to your cloud infrastructure to complete a user’s request. It’s challenging to trace and monitor request progress and execution that spans multiple services running in various locations.
Fortunately, distributed tracing can help make the boundary between services seamless. It eliminates much of the difficulty and ensures that DevOps teams can trace and track the performance of distributed edge applications—or a cloud–edge hybrid system—as easily as with centralized cloud applications.
Benefits of Distributed Tracing
Imagine a scenario in which a customer’s edge service must call back into a cloud service. Something goes wrong on the cloud side, and the error propagates back to the edge.
It can be challenging to determine what happened just by looking at separate log entries. However, using the right tools (such as OpenTelemetry for distributed tracing, a Jaeger back end for storing trace data, and Grafana for visualizing traces), it becomes straightforward to track and trace a single request’s execution across service boundaries—even if one is at the edge and one is in the cloud.
The benefits of using distributed tracing include:
- Easily tracing how a request travels across complex distributed systems
- Quickly identifying the root cause behind each service impact
- Decreasing resolution time/mean time to recovery (MTTR)
- Determining each component’s latency when there is a delay
- Discerning where bottlenecks occur during the request process
- Evaluating and effectively measuring the system’s overall health
- Improving collaboration between various DevOps and site reliability engineering (SRE) teams
- Providing a better user experience by minimizing and resolving issues quickly
- Analyzing and determining where transaction errors occur at the individual service level
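Two of the benefits above, finding the bottleneck and finding where an error originated, can be illustrated with a small Python sketch over already-collected spans. The record layout, the service names, and the timings are hypothetical. The "self time" calculation also assumes that a span’s children do not overlap in time, which real traces do not guarantee.

```python
from collections import defaultdict
from typing import Dict, List, Optional, TypedDict


class SpanRecord(TypedDict):
    """Hypothetical span record; real back ends store many more fields."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int
    error: Optional[str]


def self_time_by_service(spans: List[SpanRecord]) -> Dict[str, int]:
    """Time each service spent itself, excluding time spent in child spans.

    Assumes sibling spans do not overlap in time.
    """
    child_time: Dict[str, int] = defaultdict(int)
    for s in spans:
        if s["parent_id"] is not None:
            child_time[s["parent_id"]] += s["end_ms"] - s["start_ms"]

    totals: Dict[str, int] = defaultdict(int)
    for s in spans:
        duration = s["end_ms"] - s["start_ms"]
        totals[s["service"]] += duration - child_time[s["span_id"]]
    return dict(totals)


def error_origin(spans: List[SpanRecord]) -> Optional[SpanRecord]:
    """Return a failed span none of whose children failed (the deepest failure)."""
    parents_of_failures = {s["parent_id"] for s in spans if s["error"]}
    for s in spans:
        if s["error"] and s["span_id"] not in parents_of_failures:
            return s
    return None


# An edge service calls two cloud services; the database call times out.
spans: List[SpanRecord] = [
    {"span_id": "e1", "parent_id": None, "service": "edge-api",
     "start_ms": 0, "end_ms": 300, "error": "upstream failed"},
    {"span_id": "a1", "parent_id": "e1", "service": "cloud-auth",
     "start_ms": 5, "end_ms": 15, "error": None},
    {"span_id": "c1", "parent_id": "e1", "service": "cloud-orders",
     "start_ms": 20, "end_ms": 290, "error": "query failed"},
    {"span_id": "d1", "parent_id": "c1", "service": "cloud-db",
     "start_ms": 30, "end_ms": 280, "error": "lock timeout"},
]

totals = self_time_by_service(spans)
bottleneck = max(totals, key=totals.get)
origin = error_origin(spans)
print(bottleneck, origin["service"] if origin else None)
```

Here the error surfaces at the edge, but both the time spent and the first failure point to the database span in the cloud, which is hard to see from separate log entries alone.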
Finding a Distributed Tracing Tool
Many open-source solutions help implement distributed tracing, including OpenTelemetry, OpenTracing, OpenCensus, Jaeger, and Zipkin.
OpenTelemetry merges the earlier OpenTracing and OpenCensus projects and is one of the most widely used observability frameworks for distributed tracing. It supports collecting metrics, traces, and logs, often called the three pillars of observability. It gathers this telemetry, including the spans that make up each trace, and then exports it to a back end for storage and analysis.
Many options can store OpenTelemetry trace data, from self-hosted Elasticsearch to fully-managed application performance monitoring (APM) services like New Relic.
Cloud–Edge Distributed Tracing Architecture
When designing cloud–edge distributed tracing, you’ll typically want to run tracing collectors both at the edge and in the cloud. Then, you can aggregate the collected data from both into a central tracing data store. This central store might be a self-hosted instance of Jaeger, or you might instead forward all of your data to a cloud-hosted APM service such as New Relic or Datadog.
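The aggregation step can be pictured with a toy in-memory store that accepts span batches from an edge collector and a cloud collector and groups them by trace ID. This is a sketch of the idea only: `CentralStore`, the row layout, and the sample batches are hypothetical, and a real deployment would use a tracing back end rather than a Python dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (trace_id, span_id, origin, name): a deliberately tiny, made-up span row.
SpanRow = Tuple[str, str, str, str]


class CentralStore:
    """Toy stand-in for a central tracing data store."""

    def __init__(self) -> None:
        self._traces: Dict[str, Dict[str, SpanRow]] = defaultdict(dict)

    def ingest(self, batch: Iterable[SpanRow]) -> None:
        """Accept a batch from any collector, edge or cloud."""
        for row in batch:
            trace_id, span_id = row[0], row[1]
            # Keying by span_id means a retried batch does not duplicate spans.
            self._traces[trace_id][span_id] = row

    def trace(self, trace_id: str) -> List[SpanRow]:
        """All spans recorded for one request, regardless of where they ran."""
        return sorted(self._traces.get(trace_id, {}).values(), key=lambda r: r[1])


edge_batch = [
    ("t1", "s1", "edge", "handle_request"),
    ("t2", "s9", "edge", "serve_cached"),
]
cloud_batch = [("t1", "s2", "cloud", "load_profile")]

store = CentralStore()
store.ingest(edge_batch)
store.ingest(cloud_batch)
store.ingest(cloud_batch)  # a retried delivery is harmless

for row in store.trace("t1"):
    print(row)
```

Because both collectors report the same trace ID for the request, the store can return the edge span and the cloud span together as one trace.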
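Note that it’s crucial to propagate the same trace context (the trace ID and the calling span’s ID) whenever a task is split between edge and cloud, for example through the W3C Trace Context `traceparent` header on the request. Without this, OpenTelemetry (or any other distributed tracing tool you use) will record the edge and cloud halves as unrelated traces and won’t be able to accurately trace the execution of a task that runs across both parts of your architecture.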
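For a task that crosses the edge–cloud boundary to appear as a single trace, the cloud side needs to know the trace ID and the span ID of the edge-side caller. One common way to carry them is the W3C Trace Context `traceparent` HTTP header, which OpenTelemetry supports. The sketch below hand-rolls that header with the standard library purely to show what travels over the wire. The helper names (`inject`, `extract`, `child_of`) are made up for this example, only version `00` of the header is handled, and in practice you would rely on your tracing library’s propagator instead.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

# traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_root_context() -> SpanContext:
    """Start a new trace, as the first service to see the request would."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """New span in the same trace: same trace ID, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the caller's context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, or None if invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


# Edge side: start the trace and attach its context to the call into the cloud.
edge_span = new_root_context()
outgoing_headers: Dict[str, str] = {}
inject(edge_span, outgoing_headers)

# Cloud side: continue the same trace instead of starting a new one.
caller = extract(outgoing_headers)
cloud_span = child_of(caller) if caller else new_root_context()

print(cloud_span.trace_id == edge_span.trace_id)
```

If the header is missing or malformed, the cloud service in this sketch falls back to starting a new trace, which is exactly the situation in which the two halves of a request show up as unrelated traces.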
Distributed tracing helps you locate issues in complex distributed systems at the edge or in a combination of the cloud and edge. Now that you know there are solutions offering visibility at the edge, you may be ready to move some of your operations to better serve your customers.
When you’re ready to get started, consider StackPath’s complete edge computing service to help deploy, accelerate, and protect your workloads right at the internet’s edge. Try StackPath today to bring your applications closer to your customers with low-latency performance.