Monitoring improvements with Prometheus 2.0

Swapnil Kulkarni Thu, 09/11/2017 - 11:44

Continuous Improvement in Monitoring with Prometheus 2.0

Monitoring is the backbone of any good production environment. The Prometheus monitoring system, a project hosted by the Cloud Native Computing Foundation (CNCF), has rapidly gained popularity for containerized infrastructure and application monitoring alike, and it's taking its next step forward.

It seems only a short while ago that Prometheus reached its first significant milestone with release 1.0, which shipped with a broad set of features that make up Prometheus' simple yet extremely powerful monitoring philosophy. Since then, subsequent releases have added and improved service discovery integrations, extended PromQL, and experimented with a first iteration of remote APIs to enable pluggable long-term storage solutions.

Prometheus has a simple and robust operational model that its users have learned to love. Yet the infrastructure space has not stood still: projects like Kubernetes and Mesos are rapidly changing how software is deployed and managed, and monitored environments are becoming increasingly dynamic. It took three alphas, six betas, and three release candidates to reach the official 2.0 release of Prometheus. Let’s have a look at what has changed in this latest major release.

The Storage Challenge

The philosophy behind monitoring with Prometheus is highly detailed metrics instrumentation across all layers of the stack, e.g., the status of containers and the requests flowing through them. Prometheus even captures the deep internals of the applications running inside those containers, and a powerful query language helps aggregate these metrics into actionable insights. Prometheus is fundamentally designed to collect and store data as time series, i.e., sequences of timestamped data points collected at regular intervals; this can easily result in thousands or more time series exposed for each running container and application in the monitored infrastructure. When you are running thousands of containers, it may even result in millions of time series being tracked across a cluster.
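As a rough illustration of how series counts multiply, here is a minimal Python sketch that treats a time series identity as a metric name plus a set of labels. The metric names, label values, and fleet size are invented for the example and do not come from any real deployment.

```python
from itertools import product

# Hypothetical metrics and label values, chosen only for illustration.
METRICS = ["http_requests_total", "http_request_duration_seconds_count"]
LABELS = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
}


def series_for_container(container_id):
    """Return every (metric, labels) identity that one container exposes."""
    names = sorted(LABELS)
    series = []
    for metric in METRICS:
        # One series per combination of label values.
        for values in product(*(LABELS[n] for n in names)):
            labels = dict(zip(names, values), container=container_id)
            series.append((metric, tuple(sorted(labels.items()))))
    return series


per_container = len(series_for_container("c-0001"))
fleet = 5000  # assumed number of running containers
print(per_container, per_container * fleet)  # 18 90000
```

Even this tiny setup, with two metrics and two labels of three values each, yields 18 series per container; real instrumentation typically exposes far more.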

Although the number of time series being actively tracked at any moment is roughly fixed, continuous deployment, auto-scaling, and the scheduling of batch jobs with orchestration tools like Kubernetes and Mesos make it easy to continuously destroy containers and create new ones. The result is a continuously growing total history of time series data accessible to Prometheus, potentially amounting to billions of time series records.
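The gap between active and total series can be sketched with a small simulation. This is a toy model, not a description of how Prometheus counts series: it assumes every rollout replaces all containers and that each new container gets a fresh identity, so all of its series are new.

```python
def simulate_churn(active_containers, series_per_container, rollouts):
    """Return (active series, total series ever seen) after the rollouts."""
    seen = set()
    active = set()
    next_id = 0
    for _ in range(rollouts):
        # Each rollout destroys the old containers and creates new ones.
        active = set()
        for _ in range(active_containers):
            container = f"c-{next_id}"
            next_id += 1
            for i in range(series_per_container):
                active.add((container, i))
        seen |= active
    return len(active), len(seen)


active, total = simulate_churn(
    active_containers=100, series_per_container=50, rollouts=20
)
print(active, total)  # 5000 100000
```

The active set stays at 5,000 series while the history the storage engine must index grows linearly with the number of rollouts.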

Over the course of the 2.0 release cycle, the independent time series database for Prometheus has been stabilized and re-integrated into Prometheus itself. The result is a significantly better-performing Prometheus 2.0 with improvements along virtually all dimensions. The storage engine uses an inverted index, inspired by full-text search, to provide fast lookups across the arbitrary dimensions that time series within Prometheus may have. An all-new disk format collocates related time series data, and a write-ahead log makes Prometheus 2.0 resilient to crashes. Query latency is more consistent and scales better, especially in the face of high series churn. Resource consumption has decreased significantly, as measured in different real-world production scenarios.
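The idea of an inverted index over labels can be sketched in a few lines of Python. This is only a conceptual model with a hypothetical `LabelIndex` class; the actual storage engine uses its own on-disk structures, which are not reproduced here.

```python
from collections import defaultdict


class LabelIndex:
    """Toy inverted index: (label name, label value) -> set of series IDs."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.series = {}

    def add(self, series_id, labels):
        """Register a series under every label pair it carries."""
        self.series[series_id] = labels
        for pair in labels.items():
            self.postings[pair].add(series_id)

    def select(self, **matchers):
        """Return IDs of series that match every name=value matcher."""
        sets = [self.postings.get(pair, set()) for pair in matchers.items()]
        if not sets:
            return set(self.series)
        # A lookup is an intersection of the postings for each matcher.
        return set.intersection(*sets)


index = LabelIndex()
index.add(1, {"__name__": "http_requests_total", "job": "api", "status": "200"})
index.add(2, {"__name__": "http_requests_total", "job": "api", "status": "500"})
index.add(3, {"__name__": "http_requests_total", "job": "web", "status": "500"})
print(sorted(index.select(job="api", status="500")))  # [2]
```

Because each label pair maps directly to the series that carry it, a query touches only the postings for its matchers rather than scanning every series.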

Improved Benchmarks

Prometheus 2.0 deployments have shown impressive results with the new storage engine. The new indexing approach lowers query latency and makes it more consistent, and it greatly reduces resource consumption without compromising stability. It also significantly lowers the amount of data written to disk, which extends the lifetime of storage devices and lowers costs when series churn is high. Even though it uses the same time series compression algorithm, Prometheus 2.0 provides significant disk space savings while maintaining stable performance and delivering significantly more responsive queries.

Database Snapshot Backups

Prometheus 2.0 also comes with built-in support for a frequently requested feature: snapshot backups of the entire database through a simple native API call:

$ curl -XPOST http:///api/v2/admin/tsdb/snapshot 

{"name":"2017-11-08T21:44:35Z-3f7a688bb002e56d"}

When requested, the snapshot is written to a directory; there is also support for uploading snapshots to an archive store. To restore, move the snapshot into the new server's data directory and start a new Prometheus server with the backed-up data.
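For scripted backups, the same call can be wrapped in a few lines of Python. This is a minimal sketch using only the standard library: the endpoint path is the one shown in the curl example, the base URL is whatever address your server listens on, and the `snapshots/<name>` location under the data directory is an assumption that should be checked against your Prometheus version. The admin API may also need to be enabled on the server before the call succeeds.

```python
import json
import os
import urllib.request

# Path as given in the curl example above.
SNAPSHOT_PATH = "/api/v2/admin/tsdb/snapshot"


def snapshot_url(base_url):
    """Build the snapshot endpoint URL from a server base URL."""
    return base_url.rstrip("/") + SNAPSHOT_PATH


def parse_snapshot_name(body):
    """Extract the snapshot name from the JSON response body."""
    return json.loads(body)["name"]


def take_snapshot(base_url, timeout=30):
    """POST to the snapshot endpoint and return the snapshot name."""
    request = urllib.request.Request(
        snapshot_url(base_url), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_snapshot_name(response.read().decode("utf-8"))


def snapshot_dir(data_dir, name):
    """Assumed on-disk location: <data_dir>/snapshots/<name>."""
    return os.path.join(data_dir, "snapshots", name)
```

A call such as `take_snapshot("http://localhost:9090")` (a hypothetical address) returns the snapshot name, which `snapshot_dir` then joins with the data directory to locate the files to copy.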

Prometheus 2.0 is focused on improving performance and making operation even simpler. Many changes, small and large, make the experience more consistent and intuitive. One notable change is staleness handling, one of the oldest and most requested roadmap items. Prometheus now explicitly tracks monitoring targets that disappear, as well as series that disappear from those targets, which reduces querying artifacts and increases alerting responsiveness. The latest release also migrates recording and alerting rules from a custom format to the ubiquitous YAML format. This makes rules much easier to integrate with most configuration management tools and is invaluable for templating.
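To show why a YAML rule format helps with templating, here is a small Python sketch that renders a rule group as text. The `render_rule_group` helper, the rule names, and the expressions are hypothetical; the output follows the `groups` / `rules` layout of the 2.0 rule files as I understand it, so validate generated files before loading them. Values are quoted with `json.dumps`, which produces strings that are also valid YAML double-quoted scalars.

```python
import json


def render_rule_group(group_name, rules):
    """Render recording/alerting rules as YAML text for a rule file."""
    lines = ["groups:", f"- name: {json.dumps(group_name)}", "  rules:"]
    for rule in rules:
        # A rule is either an alerting rule or a recording rule.
        kind = "alert" if "alert" in rule else "record"
        lines.append(f"  - {kind}: {json.dumps(rule[kind])}")
        lines.append(f"    expr: {json.dumps(rule['expr'])}")
        if "for" in rule:
            lines.append(f"    for: {json.dumps(rule['for'])}")
        for section in ("labels", "annotations"):
            if rule.get(section):
                lines.append(f"    {section}:")
                for key, value in sorted(rule[section].items()):
                    lines.append(f"      {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


rules = [
    {
        "record": "job:http_errors:rate5m",
        "expr": 'sum by (job) (rate(http_requests_total{status="500"}[5m]))',
    },
    {
        "alert": "HighErrorRate",
        "expr": "job:http_errors:rate5m > 0.5",
        "for": "10m",
        "labels": {"severity": "page"},
    },
]
rendered = render_rule_group("example", rules)
print(rendered)
```

Because the rules are plain data, a configuration management tool can generate one group per team or service from the same template.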

What’s Next with Prometheus?

The new storage subsystem is designed to be accessible and extensible. This goes for new features directly integrated into Prometheus as well as custom tools that can be built on top of it. The simple and open storage format and the library allow users to build custom extensions like dynamic retention policies easily. This enables the storage layer to meet a wide array of requirements without drawing complexity into Prometheus itself, allowing it to focus on its core goals.

The remote APIs will continue to evolve to satisfy requirements for long-term storage without sacrificing Prometheus' model of reliability through simplicity.

The best part about the enhancements in Prometheus 2.0 is that it not only supports today's workloads better than ever but also has room to support tomorrow's workloads as infrastructure becomes increasingly dynamic. Prometheus 2 is still the Prometheus you have learned to love, just a lot faster and even easier to operate and use. You can try out Prometheus 2.0 as usual by downloading official binaries and container images. Rock On!