Manage infrastructure and service monitoring
This guide describes the best practices for monitoring your infrastructure and services. Monitoring your infrastructure and services lets you maintain their reliability, performance, and security. You can use what you learn here to continuously observe key metrics, such as resource usage, response times, and error rates, so you can proactively identify and fix problems before they impact users.
There are three main steps to monitoring your infrastructure:
- Identify the key metrics relevant to your infrastructure and services.
- Install and set up monitoring agents to collect metrics and logs.
- Create dashboards to view the metrics and set up alerts to notify you of issues.
This guide recommends which metrics to monitor, how to properly configure and deploy your monitoring agents, and how to set up dashboards and alerts. Following these recommendations helps you standardize and automate your entire monitoring process across your organization.
Tip
While monitoring agents can collect some metrics automatically, obtaining comprehensive telemetry data often requires instrumenting your application code directly. To understand how to implement instrumentation, refer to How OpenTelemetry facilitates instrumentation.
Identify common metrics
Choose infrastructure and service metrics that match your monitoring goals so that you can understand how healthy your infrastructure and services are. This proactive approach lets you quickly fix any problems, improve performance, and provide a smooth experience for your users.
Common infrastructure metrics include:
- CPU usage
- memory utilization
- disk I/O
- network traffic
- server uptime
Common service metrics include:
- response times
- error rates
- throughput
- latency
Analyzing infrastructure and service metrics helps you prevent issues like bottlenecks, congestion, and paying for excess unused capacity. The following examples show how different infrastructure and service metrics contribute to different insights and actions you may need to take.
To optimize performance, monitor these key metrics:
- Infrastructure metrics:
- CPU usage: High CPU usage can signal performance issues.
- Memory utilization: High memory usage can signal a need for more memory resources.
- Service metrics:
- Response times: Response times measure how quickly services respond to users. Slow response times help you identify and address bottlenecks.
For capacity planning, monitor these key metrics:
- Infrastructure metrics:
- Disk I/O: High disk I/O can signal storage bottlenecks.
- Network traffic: Dynamic traffic levels let you scale up resources during times when you expect the most traffic, and scale down when you expect lower levels of traffic.
- Service metrics:
- Service throughput: High throughput levels can signal sufficient capacity since services can sustain performance when they receive more traffic.
External resources:
The following cloud-agnostic monitoring documentation can help you identify and design a monitoring system:
Install and configure monitoring agents
Monitoring agents run on your hosts (for example, virtual machines, Kubernetes nodes, or Nomad clusters) and send metrics to a centralized monitoring platform (for example, Grafana or DataDog). Common monitoring agents include Prometheus, Telegraf, and StatsD.
If you improperly configure the monitoring agent, the centralized monitoring platform will not be able to collect data for the host and all its services.
Each monitoring agent has specific installation and configuration methods, but there are common best practices that apply to them.
- Securely manage monitoring agent secrets
- Configure and deploy your monitoring agent on VMs
- Configure monitoring agent on container orchestrators (Kubernetes and Nomad)
- Platform-agnostic monitoring through service mesh
No matter how you deploy and manage your monitoring agents, you will want to securely manage their secrets. Your deployment method depends on the platform where you run your applications and whether you network your services traditionally or through a service mesh.
Securely manage monitoring agent secrets
Most monitoring agents need authentication credentials or tokens to securely connect back to the central monitoring platform. For example, Datadog agents need an API key to associate the agent with your Datadog account. Exposing these secrets directly in Packer templates, Terraform configurations, or in your configuration management tools is insecure.
We recommend that you retrieve the required credentials from a secure secret management solution like HashiCorp Vault during the deployment process.
- If you use Packer, you can directly pull dynamic secrets from Vault when you build the golden image.
- If you use configuration management tools like Ansible, you can integrate it with Vault to securely inject monitoring credentials when you configure the monitoring agent.
- When you deploy your image with Terraform, you can leverage user scripts (startup/cloud-init scripts) to fetch the credentials from Vault, as shown in the sketch after this list.
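For example, the following sketch reads a monitoring API key from Vault with the Terraform Vault provider and renders it into a startup script. The Vault mount and secret name, the `api_key` field, the `golden_image_id` variable, and the template file path are illustrative placeholders, and the sketch assumes the Vault and AWS providers are already configured.

```hcl
# Hypothetical sketch: read an agent API key from Vault and pass it to a
# cloud-init/startup script at deploy time.
data "vault_kv_secret_v2" "monitoring" {
  mount = "secret"   # placeholder KV v2 mount
  name  = "datadog"  # placeholder secret name
}

resource "aws_instance" "app" {
  ami           = var.golden_image_id
  instance_type = "t3.small"

  # Render a startup script that configures the agent on first boot.
  # Note: values passed through user_data are visible in instance metadata
  # and Terraform state, so scope the credential accordingly.
  user_data = templatefile("${path.module}/configure-agent.sh.tftpl", {
    monitoring_api_key = data.vault_kv_secret_v2.monitoring.data["api_key"]
  })
}
```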
This secret injection approach keeps credentials out of your codebase even as you adopt “as-code” best practices to configure your monitoring agents. Vault lets you securely store and distribute secrets through strict access controls and auditing capabilities.
HashiCorp resources:
- The Packer `vault` function lets you read secrets from Vault and use the secrets within your template as user variables.
- The Terraform Vault provider lets you connect to your Vault cluster and manage its resources. You can use the Vault provider to read dynamic secrets and provide them to your user scripts.
- The Vault Agent containers tutorial shows you how to inject secrets into Kubernetes pods. Nomad supports a Vault integration that lets you read secrets from Vault and use them in your Nomad templates. You can use these methods to provide Vault secrets to configure your monitoring agents on the respective container orchestrator.
External resources:
The following resources let you read Vault secrets from configuration management tools:
- Ansible supports Vault integration with the hashi_vault lookup plugin.
- The Using HashiCorp's Vault with Chef blog post provides strategies for retrieving secrets from Vault using Chef at runtime, at configuration time, within a Chef resource/provider, and within the application logic itself.
- Puppet supports Vault integration with the `vault_lookup` module.
Refer to Retrieving CI/CD secrets from Vault for information on how to integrate Vault with your CI/CD system.
Configure and deploy your monitoring agent on VMs
You can configure and deploy monitoring agents in different ways:
- Use golden images with the monitoring agent already configured. (Recommended)
- Use a post-provisioning script or configuration management tools to configure the monitoring agent.
We recommend using golden images that include a pre-installed monitoring agent, which developers can use to build their applications. Golden images produce faster, more consistent deployments and reduce the risk of deployment errors.
Post-provisioning workflows require more time and effort to install and configure software on each individual system after provisioning, which can lead to inconsistencies and increased maintenance overhead.
We recommend you use Terraform to deploy and manage your virtual machines. Terraform can deploy virtual machines from golden images created by Packer or execute configuration management scripts to install and configure agents. Adopting infrastructure-as-code practices with Terraform provides a consistent, versionable workflow for your infrastructure.
How you deploy monitoring agents depends on how you configure and install them on virtual machines. The following sections provide resources for deploying pre-built images created with Packer, as well as deploying virtual machines and installing agents through post-provisioning scripts.
We recommend creating a base machine image that already has your chosen monitoring agent installed. This ensures consistency, since each service will automatically send monitoring data to your central monitoring platform. Your organization can use Packer with configuration management tools, like Ansible, to create consistent and centrally-managed images. These preconfigured “golden images” have all the necessary software, dependencies, and security patches to run your services.
Platform teams can use Packer to codify and build golden images across multiple platforms. Application teams then create new images using the golden images as a starting point, and add their own services to them. They can then deploy the images to cloud providers like AWS, Azure, and Google Cloud using tools like Terraform. When you build the service image from the golden image, it includes a properly configured monitoring agent that sends both node and service metrics to your central monitoring platform.
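As an illustration, the following Packer sketch builds an application image from a golden parent AMI and configures the pre-installed monitoring agent with an Ansible playbook, reading the agent's API key from Vault at build time with the Packer `vault` function. The `golden_ami_id` variable, Vault path, region, and playbook name are placeholders.

```hcl
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.0.0"
    }
    ansible = {
      source  = "github.com/hashicorp/ansible"
      version = ">= 1.0.0"
    }
  }
}

variable "golden_ami_id" {
  type = string # parent golden image built by the platform team
}

locals {
  # Requires VAULT_ADDR and VAULT_TOKEN in the build environment.
  monitoring_api_key = vault("/secret/data/datadog", "api_key")
  build_time         = formatdate("YYYYMMDDhhmmss", timestamp())
}

source "amazon-ebs" "service" {
  region        = "us-east-1"
  source_ami    = var.golden_ami_id
  instance_type = "t3.small"
  ssh_username  = "ubuntu"
  ami_name      = "service-image-${local.build_time}"
}

build {
  sources = ["source.amazon-ebs.service"]

  # Configure the monitoring agent baked into the golden image.
  provisioner "ansible" {
    playbook_file   = "./playbooks/configure-monitoring-agent.yml"
    extra_arguments = ["--extra-vars", "monitoring_api_key=${local.monitoring_api_key}"]
  }
}
```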
If platform teams need to update the monitoring agent or install security patches, they can rebuild the golden image and notify any downstream teams that rely on it so they can update their own images. HCP Packer registry automates this process by tracking artifact metadata, and providing developers the correct information through Packer or Terraform. HCP Packer also lets you revoke artifacts to remove them from use if they become outdated or have security vulnerabilities. By using HCP Packer, you can automate your image pipelines and ensure all your machine images are secure and follow your organization's rules and policies.
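For instance, a Terraform sketch along these lines pulls the current artifact from an HCP Packer channel and launches an instance from it, so downstream teams automatically pick up rebuilt images. The bucket and channel names are placeholders, and the data source and attribute names follow recent versions of the HCP provider, so they may differ in yours.

```hcl
# Assumes the HCP and AWS providers are already configured.
data "hcp_packer_artifact" "service" {
  bucket_name  = "service-image" # placeholder HCP Packer bucket
  channel_name = "production"    # placeholder release channel
  platform     = "aws"
  region       = "us-east-1"
}

resource "aws_instance" "service" {
  # Resolves to the AMI ID that HCP Packer tracks for this channel.
  ami           = data.hcp_packer_artifact.service.external_identifier
  instance_type = "t3.small"
}
```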
HashiCorp resources:
- The Build a golden image pipeline with HCP Packer tutorial guides you through using HCP Packer to build parent and child application images and deploy them with Terraform.
- The Deploy a Packer generated AMI with Terraform tutorial shows you how to build an AMI using Packer and deploy to AWS.
- The Automate Packer with GitHub Actions tutorial guides you through a complete GitHub Actions workflow to automatically build and manage different versions of an image artifact. Automating image pipelines with HCP Packer is a companion video that covers the same topic.
- The Standardize artifacts across multiple cloud providers tutorial guides you through using Packer and HCP Packer to standardize artifacts across multi-cloud and hybrid environments. You will also deploy the image artifacts to both AWS and Azure.
- The Ansible Packer provisioner runs Ansible playbooks. This provisioner expects Ansible to be installed on the machine that runs Packer.
- To learn more about immutable infrastructure, view Armon's What is mutable vs. immutable infrastructure? video.
External resources:
- DataDog provides resources to configure their agent using configuration management tools like Ansible, Chef, Puppet, and SaltStack.
- New Relic provides guides to automatically configure its agent with Ansible, Chef, and Puppet.
- AWS provides instructions on how to install its AWS Unified CloudWatch Agent on Linux virtual machines.
Configure monitoring agent on container orchestrators
Monitoring container orchestrators, like Kubernetes and Nomad, is important for keeping your clusters and services healthy, and lets you sustain high performance and reliability. The built-in telemetry data from these tools doesn't provide much value alone; you need a monitoring tool to collect, parse, and alert on raw telemetry data. By setting up monitoring agents, you can get valuable insights into how your clusters and services are functioning.
Track metrics about Kubernetes cluster nodes, like CPU and memory usage, to understand if the nodes are healthy and have enough resources. Monitor application-level metrics, like request latency and error rates, to ensure the services are running smoothly. Tools like Prometheus and Grafana let you collect and visualize these metrics.
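One way to stand this up with the tools this guide already recommends is to install the community kube-prometheus-stack chart through the Terraform Helm provider, which deploys Prometheus, Grafana, and the node and kube-state exporters in one release. The kubeconfig path, namespace, and value override below are placeholder choices, and the provider configuration syntax shown follows the 2.x Helm provider.

```hcl
provider "helm" {
  kubernetes {
    config_path = "~/.kube/config" # placeholder kubeconfig location
  }
}

# Installs Prometheus, Grafana, and cluster exporters as a single Helm release.
resource "helm_release" "monitoring" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  # Example override; available values depend on the chart version.
  set {
    name  = "grafana.enabled"
    value = "true"
  }
}
```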
Track Nomad cluster node metrics like resource usage to optimize resources and keep the cluster stable, and identify any performance bottlenecks. Nomad’s integration with Prometheus collects and analyzes cluster metrics, providing insights into the cluster health and performance. Monitor Nomad job metrics so you know if jobs execute smoothly. Using monitoring tools like Prometheus and Grafana with Nomad lets you comprehensively monitor the entire system - both the cluster itself and all running jobs.
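For reference, a minimal Nomad agent telemetry stanza that exposes Prometheus-compatible metrics looks roughly like the following sketch; the option names come from the Nomad telemetry block documentation listed in the resources below, and the collection interval is an example value.

```hcl
# Part of the Nomad agent configuration (for example, /etc/nomad.d/telemetry.hcl).
telemetry {
  collection_interval        = "10s"
  prometheus_metrics         = true # serve metrics at /v1/metrics?format=prometheus
  publish_allocation_metrics = true # per-job and per-allocation resource usage
  publish_node_metrics       = true # node-level CPU, memory, and disk metrics
}
```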
HashiCorp resources:
- The Terraform Datadog provider tutorial shows you how to use Terraform to deploy an application in EKS and install the DataDog agent across the Kubernetes cluster.
- For node-level Nomad metrics, refer to the following resources:
- The Nomad Prometheus tutorial guides you through configuring Prometheus to integrate with a Nomad cluster. This tutorial covers how to gather node-level metrics.
- The Monitoring Nomad, Metrics reference, Nomad autoscaler documentation, and Nomad telemetry block documentation provide a deep dive into the telemetry and metrics that Nomad has to offer.
- The Collect resource utilization metrics tutorial shows you how to view native Nomad job resource usage for simple service-level metrics.
External resources:
- Kubernetes provides resources to learn more about tools that help you monitor Kubernetes resources and node health.
- The Nomad integration for Grafana includes two pre-built dashboards to help monitor and visualize Nomad metrics.
Platform-agnostic monitoring through service mesh
Service meshes like Consul let you monitor the health and performance of services in your distributed system. Consul gives you visibility into service-to-service communication and access to advanced monitoring and observability capabilities.
Consul integrates with monitoring tools like Prometheus and Grafana, which let you collect and analyze metrics and logs from your mesh services. Once you enable proxy metrics and access logs in Consul, you do not need to configure or instrument your services. Consul configures the sidecar Envoy proxies to automatically send metrics and logs data directly to your centralized monitoring platform.
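For example, a global proxy-defaults configuration entry along the following lines enables Envoy's Prometheus endpoint and access logs for every sidecar. The bind address is a placeholder, and the exact fields depend on your Consul version; refer to the tutorials below for the supported options.

```hcl
# proxy-defaults.hcl, applied with: consul config write proxy-defaults.hcl
Kind = "proxy-defaults"
Name = "global"

Config {
  # Each sidecar proxy serves Prometheus metrics on this address.
  envoy_prometheus_bind_addr = "0.0.0.0:9102"
}

AccessLogs {
  Enabled = true
}
```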
A common challenge with application monitoring across distributed systems is a lack of standardized metric naming conventions as applications evolve. This makes it difficult to consistently template dashboards and alerts. Consul standardizes metric names and values based on the telemetry data emitted from its Envoy proxies.
Consul solves this by providing built-in instrumentation that captures standardized observability data from the service mesh layer. This makes it easier to create standardized monitoring dashboards, and let you monitor services on the mesh without manually instrumenting them.
HashiCorp resources:
- The Consul proxy metrics tutorial guides you through how to configure Consul Envoy proxies to help you monitor service health and performance on the mesh.
- The Consul proxy access logs tutorial guides you through how to configure Consul Envoy proxies to help you debug service mesh events and errors.
Set up dashboards and alerts
As the number of services you manage and maintain grows, manually managing monitoring components like dashboards and alerts becomes unsustainable. This can lead to inconsistencies across environments, observability gaps where issues go undetected, and potential security risks. We recommend adopting monitoring-as-code (MaC) to manage these configurations and solve these challenges as your organization scales.
With monitoring-as-code (MaC), you can adopt many of the same best practices as infrastructure-as-code (IaC), such as:
- Consistent configuration: MaC lets you consistently deploy standardized monitoring setups across teams and environments. Terraform lets you create modules that include standard monitoring configurations with built-in reasonable defaults. A range of monitoring tools also offer official Terraform providers and modules your organization can use. For example, the team responsible for ensuring data integrity can make changes to the Terraform modules and propagate those changes throughout the organization.
- Automated provisioning: As your organization scales, you can configure infrastructure and service deployments to automatically create the corresponding monitoring dashboards.
- Auditable changes: All changes to monitoring components are traceable through version control.
- Policy-compliant resources: You can use Sentinel and Open Policy Agent (OPA) to ensure your monitoring resources are secure and compliant with your organization's policies.
While the codified approach offers significant benefits, designing complex monitoring dashboards and alert rules directly in code can be challenging initially.
To balance this, we recommend an iterative workflow: First, leverage your monitoring tool's UI to visually design and build dashboards, layouts, and alert rules. This allows you to fully utilize the robust querying capabilities and intuitive interfaces provided by monitoring solutions. Once you have functional prototypes, import those configurations into Terraform code and standardize them as a Terraform module. From there, your organization can consume and modify the Terraform modules to create consistent monitoring dashboards and alerts across your infrastructure and services.
This approach combines the flexibility of visually designing your dashboards first with the consistency and maintainability of managing it as code.
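As a concrete example of that workflow, if you use Datadog you can export a dashboard you designed in the UI as JSON and wrap it in a small Terraform module that the rest of the organization consumes. The module layout, variable name, and file paths below are illustrative, and the sketch assumes the Datadog provider is configured with API and application keys.

```hcl
# modules/service-dashboard/main.tf
variable "dashboard_json_path" {
  type        = string
  description = "Path to a dashboard definition exported from the Datadog UI"
}

# Manages the dashboard from the exported JSON definition.
resource "datadog_dashboard_json" "service" {
  dashboard = file(var.dashboard_json_path)
}

# Example usage from a root module (placeholder names):
# module "payments_dashboard" {
#   source              = "./modules/service-dashboard"
#   dashboard_json_path = "${path.root}/dashboards/payments.json"
# }
```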
External resources:
- New Relic's article provides additional insights into why organizations should adopt monitoring-as-code.
Deploy vendor monitoring tools
Terraform makes it easy to deploy and manage various vendor monitoring tools through official providers and modules. Terraform has over 200 partner and community providers for logging and monitoring; popular ones include DataDog, New Relic, Grafana, and Splunk. With these providers, you can automatically provision the monitoring tool and its resources, such as dashboards and alerts.
Many monitoring vendors also contribute to OpenTelemetry (OTel). OpenTelemetry provides a standardized way to generate and export telemetry data like metrics, traces, and logs from your applications. The OpenTelemetry agent collects this data from your applications. Instead of deploying separate agents for each vendor monitoring tool, you can configure the OTel agent to export data to multiple backends simultaneously. For example, you can send metrics to Datadog, traces to Honeycomb, and logs to Splunk from the same OTel agent.
With Terraform, you have a unified approach to manage and operate all your different monitoring systems through a single, automated workflow defined in code. Terraform lets you define the exact configuration you want for everything, from the OpenTelemetry agents collecting data (refer to the GCE example in the external resources below) to backend monitoring tools like DataDog, New Relic, Grafana, and others.
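As a small illustration of managing a backend tool in code, the following sketch defines a metric alert with the Datadog Terraform provider; the query, thresholds, and notification handle are placeholders, and the sketch assumes the provider is configured with API and application keys.

```hcl
# Hypothetical metric alert managed alongside the rest of your infrastructure code.
resource "datadog_monitor" "high_cpu" {
  name    = "High CPU on web hosts"
  type    = "metric alert"
  message = "CPU above 90% for 5 minutes. Notify: @slack-ops-alerts"

  # Placeholder query; adjust the metric, scope, and window for your services.
  query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 90"

  monitor_thresholds {
    warning  = 80
    critical = 90
  }
}
```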
HashiCorp resources:
- The Terraform Datadog provider tutorial guides you through how to use Terraform to deploy an application in EKS and install the DataDog agent across the Kubernetes cluster.
- Terraform Registry hosts over 200 Logging and Monitoring partner and community providers. You should be able to find a provider to manage your monitoring tool of choice. This includes popular providers such as the New Relic Terraform provider and DataDog Terraform provider.
External resources:
The Deploying OpenTelemetry (OTel) agent to your GCE instances article provides insights from LiveRamp as they automatically deploy OTel agents using Terraform on their Google Cloud instances.
DataDog provides a quick start guide where they walk you through creating dashboards, deploying monitors and alerts, and integrating into AWS. This guide uses the resources in the DataDog provider module.
New Relic provides resources on implementing monitoring-as-code (MaC) with Terraform:
- The Automate your configuration with observability as code tutorial covers the importance of codifying monitoring using HCL.
- A three-part series guides you through using Terraform with JSON to create and dynamically generate New Relic dashboards:
- Creating dashboards with Terraform and JSON templates guides you through quickly updating New Relic dashboards with Terraform by using JSON templates
- Dynamically creating New Relic dashboards with Terraform guides you through using JSON templates to create dynamic dashboards
- Using Terraform to generate New Relic dashboards from NRQL queries guides you through using Terraform and NRQL queries to generate dashboards with dynamic data
Deploy cloud-native monitoring
Many cloud providers, such as AWS, Azure, and Google Cloud, offer their own monitoring services, which can effectively monitor infrastructure metrics and application logs. With Terraform, you can use cloud provider resources and specific monitoring modules to deploy and manage your cloud-native monitoring infrastructure without installing additional monitoring agents.
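For example, with the Terraform AWS provider you can manage a CloudWatch alarm directly, with no additional agent on the instance. The instance ID variable, SNS topic name, and thresholds below are placeholders, and the sketch assumes the AWS provider is already configured.

```hcl
variable "instance_id" {
  type = string # placeholder: the EC2 instance to watch
}

# Notification target for the alarm (placeholder topic).
resource "aws_sns_topic" "alerts" {
  name = "infrastructure-alerts"
}

# Alarm on sustained high CPU using CloudWatch's built-in EC2 metrics.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "ec2-high-cpu"
  alarm_description   = "Average CPU above 80% for 10 minutes"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    InstanceId = var.instance_id
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```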
HashiCorp resources:
- AWS maintains the AWS Integration and Automation (IA) Terraform modules. The `cloudwatch-log-group` module deploys and manages an AWS CloudWatch log group along with the corresponding IAM permissions. The Terraform AWS provider contains CloudWatch resources that Terraform can create and manage, such as the `aws_cloudwatch_dashboard` resource.
- Azure maintains Azure Verified Modules. The avm-res-operationalinsights-workspace module deploys and manages a Log Analytics Workspace with reasonable defaults. The Azure Terraform provider contains the resources you need to deploy monitoring for your application in Azure, such as `azurerm_portal_dashboard` and `azurerm_monitor_metric_alert`.
- Google maintains a cloud operations module that manages Google Cloud's operations suite (Cloud Logging and Monitoring). The Terraform Google Cloud provider page provides Google Cloud Monitoring resources that Terraform can create and manage, such as the `google_monitoring_dashboard` resource.
External resources:
- Andrei Maksimov's tutorial with a video guides you through how to automate alarms, dashboards, and logs in the AWS CloudWatch service.
- Azure's Multi-cloud monitoring article guides you through setting up Azure Monitor to monitor your services and infrastructure across different clouds, and ingest cloud native metrics and telemetry information into your existing monitoring solution.