What is Observability?

” In control theory, Observability is a measure of how well internal states of a system can be inferred by knowledge of its external outputs. ” – Wikipedia

In the world of software products and services, Observability means you can answer any questions about what’s happening on the inside of the system just by observing the outside of the system, without having to ship new code to answer new questions.

Some people argue that it’s a another buzzword for monitoring. But monitoring and observability are not same. Observability is a superset of monitoring. It provides not only high-level overviews of the system’s health but also highly granular insights into the implicit failure modes of the system. In addition, an observable system furnishes ample context about its inner workings, unlocking the ability to uncover deeper, systemic issues. More on this here. Logs, metrics, and traces are often known as the three pillars of observability.

The Need for Observability

In the DevOps and SRE world, observability has become an important term. Today infrastructure software is in the midst of a paradigm shift. Almost everyone is working with distributed systems. There are microservices, containers, cloud, serverless, and a lot of combinations of these technologies. All of these increase the number of failures that systems will have because there are too many parts interacting. And because of the distributed system’s diversity, it’s complex to understand present problems and predict future ones.

Observability is what will help you to be better when troubleshooting in production. Is is a feature that needs to be incorporated into a system at the time of system design such that:

  • A system can be tested in an easy and realistic manner and failure modes can be surfaced during the time of testing
  • A system can be deployed incrementally and in a manner such that a rollback (or roll forward) can be triggered if a key set of metrics deviate from the baseline.
  • And finally, post-release, a system can be able to report enough data points about its health and behaviour when serving real traffic, so that the system can be understood, debugged, and evolved.

Observability might mean different things to different people. For some, it’s about logs, metrics, and traces. For others, it’s the old wine of monitoring in a new bottle. The overarching goal of various schools of thought on observability, however, remains the same—bringing better visibility into systems.

Distributed Systems Observability by Cindy Sridharan

The Three Pillars of Observability

Logs, metrics, and traces are often known as the three pillars of observability.

A log is a text line that describes an event that happened at a certain time. It consists of a timestamp and a payload that helps give more context about the event. Logs are the information we look at only when things are bad.

A metric is a numeric value measured over a period of time. Metrics consist of a set of attributes like name, timestamp, value, and labels. For example, the average CPU consumption from the last five minutes of a server is a common metric. DevOps, SREs, and sysadmins use metrics to trigger alerts when a number goes above a certain threshold.

A trace is a span that represents an execution of code. A trace consists of a name, an ID, and a time value. When you combine different traces from a distributed system, you can see an end-to-end flow of an execution path. By using traces, you can know which part of the code in the system is taking more time to process inputs. Traces are useful when you want to fix latency issues. Lets talk about tracing more in detail.

Distributed Tracing

Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance. It tracks the progression of a single user request as it is handled by other services that make up an application.

Distributed Tracing is the process of tracking and analyzing what happens to a request (transaction) across all services it touches.

 

Each unit work is called a Span in a trace. Spans include metadata about the work, including the time spent in the step (latency), status, time events, attributes, links. You can use tracing to debug errors and latency issues in your applications.

Trace

Trace can be thought of as a directed acyclic graph (DAG) of Spans (see diagram below), where the edges between Spans are called References. It is a collective of observable signals showing the path of work through a system. A trace on its own is distinguishable by a unique 16 byte sequence called a TraceId. This TraceId groups and distinguishes spans.

Span

A span represents a single operation in a trace. A span could be representative of an HTTP request, a remote procedure call (RPC), a database query, or even the path that a code takes in user code, etc.

A span may or may not have a parent span:

  • A span without a parent is called a “root span” for example, span “/messages”
  • A span with a parent is called a “child span” for example, spans “auth”, “cache.Get”, “mysql.Query”, “cache.Put”

Spans are identified by a SpanID and each span belongs to a single trace. Each trace is uniquely identified by a TraceID which all constituent spans will share. These identifiers and options byte together are called Span Context. Inside the same process, Span context is propagated in a context object. When crossing process boundaries, it is serialized into protocol headers. The receiving end can read the Span context and create child spans.

Each Span encapsulates the following state:

  • An operation name
  • A start timestamp
  • A finish timestamp
  • A set of zero or more key:value Span Tags. The keys must be strings. The values may be strings, bools, or numeric types.
  • A set of zero or more Span Logs, each of which is itself a key:value map paired with a timestamp. The keys must be strings, though the values may be of any type.
  • SpanContext
  • References to zero or more causally-related Spans (via the SpanContext of those related Spans)
What is Span lifetime?

Span lifetime represents the process of recording the start and the end timestamps to the Span object:

  • The start time is recorded when the Span is created.
  • The end time needs to be recorded when the operation is ended.

It’s easier to visualize a Trace with a time axis as in the diagram below:

OpenTracing

OpenTracing is the open standard in distributed tracing. It’s a vendor-agnostic API to help developers easily instrument tracing into their code base. It’s open because no one company owns it. In fact, many tracing tooling companies are getting behind OpenTracing as a standardized way to instrument distributed tracing. A subset of these tracing system authors are part of the OpenTracing Specification Council (OTSC). Said Council members have write access across github.com/opentracing and opentracing contrib. Each OTSC member represents a tracing system which (a) maintains a public OpenTracing implementation, and (b) has substantial institutional support, and thus longevity. Various opentracing implementations are: Zipkin, Jaeger, LightStep, TraceView, Instana, DataDog, New Relic, Hawkular, SkyWalking.

Integrating OpenTracing with Java Spring-boot application

Before getting started we need to decide which tracer implementation to use. For the sake of example we will use Jaeger.

  • Installation

Include following dependecies to your pom.xml file:

  • Setting up your Tracer

When using configuration object to instantiate the tracer, the type of sampling can be selected via sampler.type and sampler.param properties. E.g Probabilistic (sampler.type=probabilistic)  sampler makes a random sampling decision with the probability of sampling equal to the value of sampler.param property. With sampler.param=0.1 approximately 1 in 10 traces will be sampled and value 1 allows for all.

  • Create a trace

Let’s create a trace that consists of just a single span. To do that we need an instance of the io.opentracing.Tracer which we will autowire. The tracer.buildspan(“say-hello”).start() starts a new span around the code and span.finish() ends it.

If we run the program now, we should see a span logged:

  • Tracing individual functions

In previous example we wrote a program that creates a trace that consists of a single span. That single span combined two operations performed by the program, formatting the output string and printing it. Let’s move those operations into standalone functions first:

And the functions are:

We can add tag or log in span.setTag() and span.log() functions.

Now we will wrap each function into its own span:

After running this:

We got three spans, but there is a problem here. The first hexadecimal segment of the output represents Jaeger trace ID, yet they are all different. If we search for those IDs in the UI each one will represent a standalone trace with a single span. That’s not what we wanted!

What we really wanted was to establish causal relationship between the two new spans to the root span started in main(). We can do that by passing an additional option asChildOf() to the span builder:

You may have noticed a few unpleasant side effects of our recent changes

  • we had to pass the Span object as the first argument to each function
  • we also had to write somewhat verbose try/finally code to finish the spans

OpenTracing API for Java provides a better way. Using thread-locals and the notion of an “active span”, we can avoid passing the span through our code and just access it via tracer.

In the above code we’re making the following changes:

  • We use startActive() method of the span builder instead of start(), which makes the span “active” by storing it in a thread-local storage.
  • startActive() returns a Scope object instead of a Span. Scope is a container of the currently active span. We access the active span via scope.span(). Once the scope is closed, the previous scope becomes current, thus re-activating previously active span in the current thread.
  • Scope is auto-closable, which allows us to use try-with-resource syntax.
  • The boolean parameter in startActive(true) tells the Scope that once it is closed it should finish the span it represents.
  • startActive() automatically creates a ChildOf reference to the previously active span, so that we don’t have to use asChildOf() builder method explicitly.

If we run this program, we will see that all three reported spans have the same trace ID.

  • Tracing RPC requests

In order to trace across process boundaries and RPC calls in distributed systems, spanContext needs to propagated over the wire. The OpenTracing Java API provides two methods in the Tracer interface to do just that, inject(SpanContext, format, carrier) and extract(format, carrier).

Injecting and Extracting HTTP

When your service calls a downstream service, it’s useful to pass down the SpanContext, so that Spans generated by this service could join the Spans from our service in a single trace. To do that, our service needs to Inject the SpanContext into the payload. On the other side of the connection, the downstream service needs then to Extract the SpanContext before it creates any Spans.

Note that manually dealing with the Inject and Extract operations is not common in Java: there are OpenTracing instrumentation libraries that take care of this for you, such as the java-web-servlet-filter for the Extract operation, or one of the following for the Inject operation:

  • java-okhttp
  • java-apache-httpclient
  • java-asynchttpclient
  • Spring’s RestTemplate

You can learn more on this topic here and here.

  • Using Aspects for cleaner code for adding Tracing

In all the above examples we are inserting tracing code into our code. We can achieve the same without touching the original code using Spring AOP.

We have used @Around advice to add spans around methods of all classes in com.saurav.webapp.service package.

  • Viewing traces on Jaeger UI

All-in-one is an executable designed for quick local testing, launches the Jaeger UI, collector, query, and agent, with an in memory storage component.

The simplest way to start the all-in-one is to use the pre-built image published to DockerHub (a single command line).

Then navigate to http://localhost:16686 to access the Jaeger UI. You can select service name from drop down on left and see the corresponding traces collected from different timestamps.

 

List of traces on Jaeger UI
A sample trace of a rest api GET request from one service to another

References and useful links