{{< admonition type="warning" >}} We no longer recommend using the meta-monitoring Helm chart to monitor Loki. To consolidate monitoring efforts into one Helm chart, Grafana Labs recommends using the Kubernetes monitoring Helm chart. Instructions for setting up the Kubernetes monitoring Helm chart can be found under Manage. {{< /admonition >}}
<!-- vale Grafana.We = YES -->

This guide walks you through using Grafana Cloud to monitor a Loki installation set up with the meta-monitoring Helm chart. This method takes advantage of many of the chart's self-monitoring features, sending metrics, logs, and traces from the Loki deployment to Grafana Cloud. Monitoring Loki with Grafana Cloud offers the added benefit of troubleshooting Loki issues even when the Helm-installed Loki is down, because the telemetry data remains available in the Grafana Cloud instance.
These instructions are based on the meta-monitoring-chart repository.
The meta-monitoring stack will be installed in a separate namespace called `meta`. To create this namespace, run the following command:

```bash
kubectl create namespace meta
```
The meta-monitoring stack sends metrics, logs, and traces to Grafana Cloud. This requires that you know your connection credentials to Grafana Cloud. To obtain connection credentials, follow the steps below:
1. Create a new Cloud Access Policy in Grafana Cloud.
1. Click **Create**.
1. Once the policy is created, select the policy and click **Add token**.
1. Name the token, select an expiration date, then click **Create**.
1. Copy the token to a secure location, as it will not be displayed again.
1. Navigate to the Grafana Cloud Portal **Overview** page.
1. Click the **Details** button for your Prometheus or Mimir instance. Note the username and URL; these correspond to `<USERNAME METRICS>` and `<METRICS URL>` in the commands below.
1. Click the **Details** button for your Loki instance. Note the username and URL; these correspond to `<USERNAME LOGS>` and `<LOG URL>`.
1. Click the **Details** button for your Tempo instance. Note the instance ID and URL; these correspond to `<OTLP INSTANCE ID>` and `<OTLP URL>`.
Finally, generate the secrets to store your credentials for each telemetry type within your Kubernetes cluster:

```bash
kubectl create secret generic logs -n meta \
  --from-literal=username=<USERNAME LOGS> \
  --from-literal=password=<ACCESS POLICY TOKEN> \
  --from-literal=endpoint='https://<LOG URL>/loki/api/v1/push'
```
```bash
kubectl create secret generic metrics -n meta \
  --from-literal=username=<USERNAME METRICS> \
  --from-literal=password=<ACCESS POLICY TOKEN> \
  --from-literal=endpoint='https://<METRICS URL>/api/prom/push'
```

```bash
kubectl create secret generic traces -n meta \
  --from-literal=username=<OTLP INSTANCE ID> \
  --from-literal=password=<ACCESS POLICY TOKEN> \
  --from-literal=endpoint='https://<OTLP URL>/otlp'
```
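As a quick sanity check, you can confirm the secrets exist and contain the expected keys; `kubectl` prints key names and sizes only, not the values:

```bash
# Confirm the three secrets were created in the meta namespace.
kubectl get secrets logs metrics traces -n meta

# Inspect one of them to verify it has username, password, and endpoint keys.
kubectl describe secret logs -n meta
```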
To install the meta-monitoring Helm chart, you must create a `values.yaml` file. At a minimum, it needs to list the namespaces to monitor and reference the secrets created above. This example `values.yaml` file provides the minimum configuration to monitor a Loki installation in the `default` namespace:
```yaml
namespacesToMonitor:
  - default
cloud:
  logs:
    enabled: true
    secret: "logs"
  metrics:
    enabled: true
    secret: "metrics"
  traces:
    enabled: true
    secret: "traces"
```
For further configuration options, refer to the sample `values.yaml` file in the meta-monitoring-chart repository.
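Alternatively, to review every option the chart exposes, you can dump its default values locally and adapt them. This assumes the `grafana` Helm repository has been added, which the install commands below do:

```bash
# Optional: write the chart's default values to a file for reference.
helm show values grafana/meta-monitoring > default-values.yaml
```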
To install the meta-monitoring Helm chart, run the following commands:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install meta-monitoring grafana/meta-monitoring -n meta -f values.yaml
```

Or, when upgrading the configuration:

```bash
helm upgrade meta-monitoring grafana/meta-monitoring -n meta -f values.yaml
```
To verify the installation, run the following command:

```bash
kubectl get pods -n meta
```

It should return the following pods:

```console
NAME           READY   STATUS    RESTARTS   AGE
meta-alloy-0   2/2     Running   0          23h
meta-alloy-1   2/2     Running   0          23h
meta-alloy-2   2/2     Running   0          23h
```
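If the pods are running but no telemetry shows up in Grafana Cloud, one way to inspect the collectors is through the Alloy UI. The service name below is an assumption based on the pod names above; check `kubectl get svc -n meta` for the exact name:

```bash
# Forward Alloy's HTTP port locally, then open http://localhost:12345
# to inspect the pipeline components and their health.
kubectl port-forward -n meta svc/meta-alloy 12345:12345
```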
By default, Loki does not have tracing enabled. To enable tracing, edit the Loki Helm chart's `values.yaml` file and set `tracing.enabled` to `true`:
```yaml
loki:
  tracing:
    enabled: true
```
Next, instrument each of the Loki components to send traces to the meta-monitoring stack by adding the `extraEnv` configuration to each component, as in this example for the ingester (a sketch showing the same pattern on another component follows the example):
```yaml
ingester:
  replicas: 3
  extraEnv:
    - name: JAEGER_ENDPOINT
      value: "http://mmc-alloy-external.default.svc.cluster.local:14268/api/traces"
      # This sets the Jaeger endpoint where traces will be sent.
      # The endpoint points to the mmc-alloy service in the default namespace at port 14268.
    - name: JAEGER_AGENT_TAGS
      value: 'cluster="prod",namespace="default"'
      # This specifies additional tags to attach to each span.
      # Here, the cluster is labeled as "prod" and the namespace as "default".
    - name: JAEGER_SAMPLER_TYPE
      value: "ratelimiting"
      # This sets the sampling strategy for traces.
      # "ratelimiting" means that traces will be sampled at a fixed rate.
    - name: JAEGER_SAMPLER_PARAM
      value: "1.0"
      # This sets the parameter for the sampler.
      # For ratelimiting, "1.0" typically means one trace per second.
```
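The same `extraEnv` block should be repeated for every other Loki component you run. As a minimal sketch, assuming the distributed deployment keys of the Loki Helm chart (adjust the component names to your deployment mode):

```yaml
# Sketch: repeat the same extraEnv entries for each remaining Loki component,
# for example the querier.
querier:
  extraEnv:
    - name: JAEGER_ENDPOINT
      value: "http://mmc-alloy-external.default.svc.cluster.local:14268/api/traces"
    - name: JAEGER_AGENT_TAGS
      value: 'cluster="prod",namespace="default"'
    - name: JAEGER_SAMPLER_TYPE
      value: "ratelimiting"
    - name: JAEGER_SAMPLER_PARAM
      value: "1.0"
```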
Since the meta-monitoring stack is installed in the `meta` namespace, the Loki components need to be able to communicate with it. To do this, create a new ExternalName service in the `default` namespace that points to the `meta` namespace by running the following command:

```bash
kubectl create service externalname mmc-alloy-external --external-name meta-alloy.meta.svc.cluster.local -n default
```
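If you prefer to manage this declaratively, the equivalent manifest is a standard ExternalName Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mmc-alloy-external
  namespace: default
spec:
  type: ExternalName
  externalName: meta-alloy.meta.svc.cluster.local
```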
Finally, upgrade the Loki installation with the new configuration:

```bash
helm upgrade --values values.yaml loki grafana/loki
```
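To confirm the tracing environment variables were applied, inspect one of the upgraded pods. The label selector below is an assumption based on the labels the Loki Helm chart typically applies; adjust it to match your pods:

```bash
# Check that the JAEGER_* variables are set on the ingester pods.
kubectl describe pods -n default -l app.kubernetes.io/component=ingester | grep JAEGER_
```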
The meta-monitoring stack includes a set of dashboards that can be imported into Grafana Cloud. These can be found in the meta-monitoring repository.
The meta-monitoring stack includes a set of rules that can be installed to monitor the Loki installation. These rules can be found in the meta-monitoring repository. To install the rules:

1. Clone the meta-monitoring-chart repository:

   ```bash
   git clone https://github.com/grafana/meta-monitoring-chart/
   ```

1. Install `mimirtool` by following the instructions in the Mimir documentation.

1. Load the rules into your Grafana Cloud Metrics (Mimir) instance. Run this from the directory in the cloned repository that contains the rule files:

   ```bash
   mimirtool rules load --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_cloud_access_policy_token> *.yaml
   ```

1. Verify that the rules were uploaded:

   ```bash
   mimirtool rules list --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_cloud_access_policy_token>
   ```

The uploaded rules include the following recording rules:
```yaml
loki-rules:
  - name: loki_rules
    rules:
      - record: cluster_job:loki_request_duration_seconds:99quantile
        expr: histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, job))
      - record: cluster_job:loki_request_duration_seconds:50quantile
        expr: histogram_quantile(0.50, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, job))
      - record: cluster_job:loki_request_duration_seconds:avg
        expr: sum(rate(loki_request_duration_seconds_sum[5m])) by (cluster, job) / sum(rate(loki_request_duration_seconds_count[5m])) by (cluster, job)
      - record: cluster_job:loki_request_duration_seconds_bucket:sum_rate
        expr: sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, job)
      - record: cluster_job:loki_request_duration_seconds_sum:sum_rate
        expr: sum(rate(loki_request_duration_seconds_sum[5m])) by (cluster, job)
      - record: cluster_job:loki_request_duration_seconds_count:sum_rate
        expr: sum(rate(loki_request_duration_seconds_count[5m])) by (cluster, job)
      - record: cluster_job_route:loki_request_duration_seconds:99quantile
        expr: histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, job, route))
      - record: cluster_job_route:loki_request_duration_seconds:50quantile
        expr: histogram_quantile(0.50, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, job, route))
      - record: cluster_job_route:loki_request_duration_seconds:avg
        expr: sum(rate(loki_request_duration_seconds_sum[5m])) by (cluster, job, route) / sum(rate(loki_request_duration_seconds_count[5m])) by (cluster, job, route)
      - record: cluster_job_route:loki_request_duration_seconds_bucket:sum_rate
        expr: sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, job, route)
      - record: cluster_job_route:loki_request_duration_seconds_sum:sum_rate
        expr: sum(rate(loki_request_duration_seconds_sum[5m])) by (cluster, job, route)
      - record: cluster_job_route:loki_request_duration_seconds_count:sum_rate
        expr: sum(rate(loki_request_duration_seconds_count[5m])) by (cluster, job, route)
      - record: cluster_namespace_job_route:loki_request_duration_seconds:99quantile
        expr: histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, namespace, job, route))
      - record: cluster_namespace_job_route:loki_request_duration_seconds:50quantile
        expr: histogram_quantile(0.50, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, namespace, job, route))
      - record: cluster_namespace_job_route:loki_request_duration_seconds:avg
        expr: sum(rate(loki_request_duration_seconds_sum[5m])) by (cluster, namespace, job, route) / sum(rate(loki_request_duration_seconds_count[5m])) by (cluster, namespace, job, route)
      - record: cluster_namespace_job_route:loki_request_duration_seconds_bucket:sum_rate
        expr: sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, cluster, namespace, job, route)
      - record: cluster_namespace_job_route:loki_request_duration_seconds_sum:sum_rate
        expr: sum(rate(loki_request_duration_seconds_sum[5m])) by (cluster, namespace, job, route)
      - record: cluster_namespace_job_route:loki_request_duration_seconds_count:sum_rate
        expr: sum(rate(loki_request_duration_seconds_count[5m])) by (cluster, namespace, job, route)
```
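Once the rules have had a few minutes to evaluate, you can spot-check one of the recorded series against your Grafana Cloud Prometheus instance. The query path below assumes the usual hosted Prometheus API layout; confirm the exact URL on the instance's Details page:

```bash
# Query one of the recording rules to confirm it is producing samples.
curl -s -u "<your_instance_id>:<your_cloud_access_policy_token>" \
  "<your_cloud_prometheus_endpoint>/api/prom/api/v1/query" \
  --data-urlencode 'query=cluster_job:loki_request_duration_seconds:99quantile'
```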
Metrics about Kubernetes objects are scraped from kube-state-metrics, which needs to be installed in the cluster. The `kubeStateMetrics.endpoint` entry in the meta-monitoring `values.yaml` should be set to its address (without the `/metrics` part of the URL):
```yaml
kubeStateMetrics:
  # Scrape https://github.com/kubernetes/kube-state-metrics by default
  enabled: true
  # This endpoint is created when the helm chart from
  # https://artifacthub.io/packages/helm/prometheus-community/kube-state-metrics/
  # is used. Change this if kube-state-metrics is installed somewhere else.
  endpoint: kube-state-metrics.kube-state-metrics.svc.cluster.local:8080
```
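If kube-state-metrics is not already running in your cluster, one way to install it is the prometheus-community Helm chart referenced in the comment above. The namespace and release name here are chosen so the resulting service matches the default endpoint value; they are otherwise just an example:

```bash
# Install kube-state-metrics so its service matches the endpoint above.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics \
  -n kube-state-metrics --create-namespace
```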