Building Observable Systems: From Metrics to Insights in Production
25 min read
Author: Bhakta Bahadur Thapa (@Bhakta7thapa)
Table of Contents
- Building Observable Systems: From Metrics to Insights in Production
- The Three Pillars of Observability
- 1. Metrics - The Numbers That Matter
- 2. Logs - The Stories Your System Tells
- 3. Traces - The Journey Through Your System
- Building a Comprehensive Metrics Strategy
- Application Metrics: The RED Method
- Infrastructure Metrics: The USE Method
- Advanced Alerting Strategies
- Smart Alerting Rules
- Alert Fatigue Prevention
- Distributed Tracing Implementation
- Comprehensive Logging Strategy
- Structured Logging with Context
- Grafana Dashboard Best Practices
- Incident Response Workflow
- Automated Runbooks
- Key Performance Indicators (KPIs)
- Technical KPIs
- Business KPIs
- Conclusion
Building Observable Systems: From Metrics to Insights in Production
After countless late-night debugging sessions and production incidents, I've learned that you can't fix what you can't see. Building truly observable systems isn't just about collecting metrics—it's about creating a comprehensive view of your system's health that enables rapid problem detection and resolution.
In this post, I'll share the observability strategies and tools that have saved me hours of debugging and helped prevent critical outages.
The Three Pillars of Observability
Modern observability is built on three foundational pillars:
1. Metrics - The Numbers That Matter
Quantitative data about your system's performance over time.
2. Logs - The Stories Your System Tells
Detailed records of events and transactions.
3. Traces - The Journey Through Your System
End-to-end visibility into request flows across services.
But here's what I've learned: having all three pillars isn't enough. You need to correlate them effectively to get true observability.
Building a Comprehensive Metrics Strategy
Application Metrics: The RED Method
For user-facing services, I follow the RED method:
- Rate - How many requests per second
- Errors - How many of those requests are failing
- Duration - How long those requests take
Here's how I implement this in a Go microservice:
package metrics
import (
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
// Rate metrics
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status_code"},
)
// Duration metrics
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
// Error metrics
httpRequestErrors = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_request_errors_total",
Help: "Total number of HTTP request errors",
},
[]string{"method", "endpoint", "error_type"},
)
// Business metrics
userSignups = promauto.NewCounter(
prometheus.CounterOpts{
Name: "user_signups_total",
Help: "Total number of user signups",
},
)
activeUsers = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "active_users_current",
Help: "Current number of active users",
},
)
)
// Middleware to automatically collect HTTP metrics
func MetricsMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Wrap the response writer to capture status code
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(wrapped, r)
duration := time.Since(start).Seconds()
statusCode := fmt.Sprintf("%d", wrapped.statusCode)
// Record metrics
httpRequestsTotal.WithLabelValues(
r.Method,
r.URL.Path,
statusCode,
).Inc()
httpRequestDuration.WithLabelValues(
r.Method,
r.URL.Path,
).Observe(duration)
// Track errors
if wrapped.statusCode >= 400 {
errorType := "client_error"
if wrapped.statusCode >= 500 {
errorType = "server_error"
}
httpRequestErrors.WithLabelValues(
r.Method,
r.URL.Path,
errorType,
).Inc()
}
})
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
Infrastructure Metrics: The USE Method
For infrastructure components, I use the USE method:
- Utilization - How busy is the resource
- Saturation - How much work is queued
- Errors - Error events
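Each USE dimension maps directly onto a PromQL query over node-exporter metrics. As a hedged sketch of what those queries look like (and a quick way to sanity-check them from code), here's one representative query per dimension run through the Prometheus Go API client; the Prometheus address and metric names assume a standard node-exporter deployment:
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // assumed address
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	promAPI := v1.NewAPI(client)

	// One representative query per USE dimension.
	queries := map[string]string{
		"utilization (CPU %)":    `100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`,
		"saturation (load avg)":  `node_load1`,
		"errors (NIC rx errors)": `rate(node_network_receive_errs_total[5m])`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for name, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			log.Printf("query %q failed: %v", name, err)
			continue
		}
		if len(warnings) > 0 {
			log.Printf("warnings for %q: %v", name, warnings)
		}
		fmt.Printf("%s:\n%v\n\n", name, result)
	}
}
Here's the Prometheus configuration that scrapes these infrastructure (and application) metrics in the first place: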
# Prometheus configuration for infrastructure monitoring
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- 'rules/*.yml'
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Application metrics
- job_name: 'web-app'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
scrape_interval: 30s
# Infrastructure metrics
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
scrape_interval: 30s
# Database metrics
- job_name: 'postgres-exporter'
static_configs:
- targets: ['postgres-exporter:9187']
scrape_interval: 30s
# Kubernetes metrics
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['kube-state-metrics:8080']
scrape_interval: 30s
Advanced Alerting Strategies
Smart Alerting Rules
Here are the alerting rules I've refined through multiple production incidents:
# prometheus-rules.yml
groups:
- name: application-alerts
rules:
# Error rate alert with burn rate
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service)
) > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: 'High error rate detected for {{ $labels.service }}'
description: 'Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}'
runbook_url: 'https://runbooks.company.com/high-error-rate'
# Latency alert with multiple percentiles
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1.0
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: 'High latency detected for {{ $labels.service }}'
description: '95th percentile latency is {{ $value }}s for service {{ $labels.service }}'
# Traffic anomaly detection
- alert: TrafficDrop
expr: |
(
sum(rate(http_requests_total[5m])) by (service) <
0.5 * avg_over_time(sum(rate(http_requests_total[5m])) by (service)[1h:5m])
) and
(
sum(rate(http_requests_total[5m])) by (service) > 1
)
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: 'Significant traffic drop for {{ $labels.service }}'
description: 'Traffic has dropped more than 50% below the one-hour average for service {{ $labels.service }}'
- name: infrastructure-alerts
rules:
# Node resource alerts
- alert: NodeCPUHigh
expr: |
(
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
) > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: 'High CPU usage on {{ $labels.instance }}'
description: 'CPU usage is {{ $value }}% on node {{ $labels.instance }}'
- alert: NodeMemoryHigh
expr: |
(
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes
) > 0.85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: 'High memory usage on {{ $labels.instance }}'
description: 'Memory usage is {{ $value | humanizePercentage }} on node {{ $labels.instance }}'
# Disk space alerts
- alert: DiskSpaceHigh
expr: |
(
(node_filesystem_size_bytes - node_filesystem_avail_bytes) /
node_filesystem_size_bytes
) > 0.85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: 'High disk usage on {{ $labels.instance }}'
description: 'Disk usage is {{ $value | humanizePercentage }} on {{ $labels.device }} at {{ $labels.instance }}'
- name: business-metrics
rules:
# Business impact alerts
- alert: LowUserSignups
expr: |
sum(increase(user_signups_total[1h])) < 10
for: 15m
labels:
severity: warning
team: product
annotations:
summary: 'Low user signup rate'
description: 'Only {{ $value }} user signups in the last hour'
- alert: PaymentProcessingDown
expr: |
sum(rate(payment_requests_total{status="success"}[5m])) == 0
for: 2m
labels:
severity: critical
team: payments
annotations:
summary: 'Payment processing appears to be down'
description: 'No successful payments processed in the last 5 minutes'
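The PaymentProcessingDown alert assumes a payment_requests_total counter with a status label, which isn't part of the metrics package shown earlier. Here's a minimal sketch of how it might be instrumented in the same style; the helper name and label values are assumptions:
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Counter backing the PaymentProcessingDown alert. The "status" label is
// expected to carry values like "success" or "failed".
var paymentRequests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "payment_requests_total",
		Help: "Total number of payment requests by outcome",
	},
	[]string{"status"},
)

// RecordPayment is a hypothetical helper the payment handler would call
// after each attempt.
func RecordPayment(err error) {
	status := "success"
	if err != nil {
		status = "failed"
	}
	paymentRequests.WithLabelValues(status).Inc()
}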
Alert Fatigue Prevention
I've implemented several strategies to prevent alert fatigue:
# alertmanager.yml
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts@company.com'
# Routing strategy to prevent noise
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 5m
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts go directly to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
group_wait: 0s
repeat_interval: 5m
# Warning alerts to Slack during business hours
- match:
severity: warning
receiver: 'slack-warnings'
active_time_intervals:
- business-hours
# Infrastructure alerts to dedicated channel
- match:
team: infrastructure
receiver: 'slack-infrastructure'
# Time-based routing
time_intervals:
- name: business-hours
time_intervals:
- times:
- start_time: '09:00'
end_time: '18:00'
weekdays: ['monday:friday']
receivers:
- name: 'default'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK'
channel: '#alerts'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: 'slack-warnings'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK'
channel: '#warnings'
color: 'warning'
- name: 'slack-infrastructure'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK'
channel: '#infrastructure'
# Inhibition rules to reduce noise
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
Distributed Tracing Implementation
For microservices, distributed tracing is essential. Here's how I implement it with Jaeger:
package tracing
import (
	"context"
	"io"
	"log"
	"net/http"
	"time"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	"github.com/uber/jaeger-client-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)
// InitTracer initializes Jaeger tracer
func InitTracer(serviceName string) (opentracing.Tracer, io.Closer) {
cfg := jaegercfg.Configuration{
ServiceName: serviceName,
Sampler: &jaegercfg.SamplerConfig{
Type: jaeger.SamplerTypeConst,
Param: 1, // Sample 100% in development, adjust for production
},
Reporter: &jaegercfg.ReporterConfig{
LogSpans: true,
BufferFlushInterval: 1 * time.Second,
LocalAgentHostPort: "jaeger-agent:6831",
},
}
tracer, closer, err := cfg.NewTracer()
if err != nil {
log.Fatalf("Cannot initialize Jaeger: %v", err)
}
opentracing.SetGlobalTracer(tracer)
return tracer, closer
}
// HTTP middleware for tracing
func TracingMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
spanCtx, _ := opentracing.GlobalTracer().Extract(
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(r.Header),
)
span := opentracing.GlobalTracer().StartSpan(
"HTTP "+r.Method+" "+r.URL.Path,
ext.RPCServerOption(spanCtx),
)
defer span.Finish()
// Set HTTP tags
ext.HTTPMethod.Set(span, r.Method)
ext.HTTPUrl.Set(span, r.URL.String())
ext.Component.Set(span, "http-server")
// Add span to context
ctx := opentracing.ContextWithSpan(r.Context(), span)
r = r.WithContext(ctx)
// Wrap response writer to capture status code
wrapped := &tracingResponseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(wrapped, r)
// Set response tags
ext.HTTPStatusCode.Set(span, uint16(wrapped.statusCode))
if wrapped.statusCode >= 400 {
ext.Error.Set(span, true)
}
})
}
// Database tracing helper
func TraceDBQuery(ctx context.Context, query string, args ...interface{}) (opentracing.Span, context.Context) {
if span := opentracing.SpanFromContext(ctx); span != nil {
childSpan := opentracing.GlobalTracer().StartSpan(
"db-query",
opentracing.ChildOf(span.Context()),
)
ext.DBType.Set(childSpan, "postgresql")
ext.DBStatement.Set(childSpan, query)
ext.Component.Set(childSpan, "database")
return childSpan, opentracing.ContextWithSpan(ctx, childSpan)
}
return nil, ctx
}
// Service-to-service call tracing
func TraceHTTPCall(ctx context.Context, method, url string) (opentracing.Span, *http.Request) {
if span := opentracing.SpanFromContext(ctx); span != nil {
childSpan := opentracing.GlobalTracer().StartSpan(
"HTTP "+method,
opentracing.ChildOf(span.Context()),
)
ext.HTTPMethod.Set(childSpan, method)
ext.HTTPUrl.Set(childSpan, url)
ext.Component.Set(childSpan, "http-client")
req, _ := http.NewRequest(method, url, nil)
// Inject span context into HTTP headers
opentracing.GlobalTracer().Inject(
childSpan.Context(),
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(req.Header),
)
return childSpan, req
}
req, _ := http.NewRequest(method, url, nil)
return nil, req
}
type tracingResponseWriter struct {
http.ResponseWriter
statusCode int
}
func (trw *tracingResponseWriter) WriteHeader(code int) {
trw.statusCode = code
trw.ResponseWriter.WriteHeader(code)
}
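Here's a rough sketch of how those helpers compose inside a handler; the SQL, downstream URL, and import path are placeholders, and it assumes TracingMiddleware is already wrapped around the router so the request context carries the server span:
package handlers

import (
	"database/sql"
	"net/http"

	"example.com/app/tracing" // hypothetical import path for the package above
)

func getOrderHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()

		// Child span around the database call (placeholder SQL).
		query := "SELECT id, total FROM orders WHERE id = $1"
		dbSpan, dbCtx := tracing.TraceDBQuery(ctx, query)
		row := db.QueryRowContext(dbCtx, query, r.URL.Query().Get("id"))
		if dbSpan != nil {
			dbSpan.Finish()
		}
		_ = row // real code would scan the result

		// Child span around a downstream call; the span context is injected
		// into the outgoing headers so the next service continues the trace.
		callSpan, req := tracing.TraceHTTPCall(ctx, http.MethodGet, "http://inventory:8080/stock")
		if callSpan != nil {
			defer callSpan.Finish()
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, "inventory lookup failed", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		w.WriteHeader(http.StatusOK)
	}
}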
Comprehensive Logging Strategy
Structured Logging with Context
Every log line should carry enough context (trace ID, request ID, user ID) to correlate it with the metrics and traces above. Here's the logrus setup I use:
package logging
import (
	"context"
	"time"

	"github.com/opentracing/opentracing-go"
	"github.com/sirupsen/logrus"
	"github.com/uber/jaeger-client-go"
)
var logger = logrus.New()
func init() {
logger.SetFormatter(&logrus.JSONFormatter{
TimestampFormat: "2006-01-02T15:04:05.000Z07:00",
FieldMap: logrus.FieldMap{
logrus.FieldKeyTime: "timestamp",
logrus.FieldKeyLevel: "level",
logrus.FieldKeyMsg: "message",
},
})
}
// ContextLogger adds tracing information to logs
func ContextLogger(ctx context.Context) *logrus.Entry {
fields := logrus.Fields{}
if span := opentracing.SpanFromContext(ctx); span != nil {
if spanCtx, ok := span.Context().(jaeger.SpanContext); ok {
fields["trace_id"] = spanCtx.TraceID().String()
fields["span_id"] = spanCtx.SpanID().String()
}
}
// Add user context if available
if userID := ctx.Value("user_id"); userID != nil {
fields["user_id"] = userID
}
// Add request ID if available
if requestID := ctx.Value("request_id"); requestID != nil {
fields["request_id"] = requestID
}
return logger.WithFields(fields)
}
// Business event logging
func LogBusinessEvent(ctx context.Context, event string, data map[string]interface{}) {
fields := logrus.Fields{
"event_type": "business",
"event_name": event,
}
for k, v := range data {
fields[k] = v
}
ContextLogger(ctx).WithFields(fields).Info("Business event")
}
// Security event logging
func LogSecurityEvent(ctx context.Context, event string, severity string, data map[string]interface{}) {
fields := logrus.Fields{
"event_type": "security",
"event_name": event,
"severity": severity,
}
for k, v := range data {
fields[k] = v
}
ContextLogger(ctx).WithFields(fields).Warn("Security event")
}
// Performance logging
func LogPerformanceMetric(ctx context.Context, operation string, duration time.Duration, success bool) {
ContextLogger(ctx).WithFields(logrus.Fields{
"event_type": "performance",
"operation": operation,
"duration_ms": duration.Milliseconds(),
"success": success,
}).Info("Performance metric")
}
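And here's a rough sketch of what calling these helpers looks like from a handler; the import path, field values, and placeholder error are assumptions:
package handlers

import (
	"net/http"
	"time"

	"example.com/app/logging" // hypothetical import path for the package above
)

func signupHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	start := time.Now()

	// ... create the user ...
	var err error // placeholder for the real signup result

	// Duration and outcome, automatically tagged with trace/request context.
	logging.LogPerformanceMetric(ctx, "user_signup", time.Since(start), err == nil)

	if err != nil {
		logging.ContextLogger(ctx).WithError(err).Error("signup failed")
		http.Error(w, "signup failed", http.StatusInternalServerError)
		return
	}

	// Business event that downstream analytics can pick up from the log stream.
	logging.LogBusinessEvent(ctx, "user_signed_up", map[string]interface{}{
		"plan":   "free", // placeholder field
		"source": "web",  // placeholder field
	})

	w.WriteHeader(http.StatusCreated)
}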
Grafana Dashboard Best Practices
Here's a production-ready dashboard configuration:
{
"dashboard": {
"title": "Application Performance Dashboard",
"tags": ["application", "performance"],
"timezone": "UTC",
"panels": [
{
"title": "Request Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "Requests/sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"min": 0
}
}
},
{
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
"legendFormat": "Error %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{ "color": "green", "value": 0 },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
}
}
}
},
{
"title": "Response Time Distribution",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
]
},
{
"title": "Top Endpoints by Error Rate",
"type": "table",
"targets": [
{
"expr": "topk(10, sum by (endpoint) (rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum by (endpoint) (rate(http_requests_total[5m])))",
"format": "table"
}
]
}
]
}
}
Incident Response Workflow
Automated Runbooks
When an alert like HighErrorRate fires, the first round of diagnostics is scripted so the on-call engineer starts with data instead of guesswork:
#!/bin/bash
# runbooks/high-error-rate.sh
set -e
SERVICE=$1
ERROR_THRESHOLD=$2
echo "🚨 High error rate detected for service: $SERVICE"
echo "📊 Gathering diagnostic information..."
# Collect recent logs
echo "📝 Recent error logs:"
kubectl logs -l app=$SERVICE --tail=50 --since=10m | grep -i error || true
# Check resource usage
echo "💾 Resource usage:"
kubectl top pods -l app=$SERVICE
# Check recent deployments
echo "🚀 Recent deployments:"
kubectl rollout history deployment/$SERVICE
# Check service mesh metrics if available
if command -v istioctl &> /dev/null; then
echo "🌐 Service mesh metrics:"
istioctl proxy-config cluster $SERVICE-pod | grep -E "HEALTHY|UNHEALTHY" || true
fi
# Check database connections
echo "🗄️ Database connection status:"
kubectl exec deployment/$SERVICE -- curl -f http://localhost:8080/health/db || echo "Database health check failed"
# Automated mitigation options
echo "🔧 Suggested actions:"
echo "1. Check recent deployments: kubectl rollout history deployment/$SERVICE"
echo "2. Rollback if needed: kubectl rollout undo deployment/$SERVICE"
echo "3. Scale up if resource constrained: kubectl scale deployment/$SERVICE --replicas=6"
echo "4. Check dependencies: ./check-dependencies.sh $SERVICE"
# Create incident if error rate is very high
if (( $(echo "$ERROR_THRESHOLD > 10" | bc -l) )); then
echo "🆘 Creating high-priority incident..."
curl -X POST "https://api.pagerduty.com/incidents" \
-H "Authorization: Token token=$PAGERDUTY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"incident": {
"type": "incident",
"title": "High error rate: '$SERVICE'",
"service": {
"id": "'$PAGERDUTY_SERVICE_ID'",
"type": "service_reference"
},
"urgency": "high"
}
}'
fi
Key Performance Indicators (KPIs)
Track these essential metrics for observability maturity:
Technical KPIs
- Mean Time to Detection (MTTD): < 5 minutes
- Mean Time to Recovery (MTTR): < 30 minutes
- Alert Noise Ratio: < 5% false positives
- Coverage: > 90% of services instrumented
Business KPIs
- User Experience: 95th percentile response time < 2s
- Availability: 99.9% uptime (see the error-budget sketch after this list)
- Data Quality: < 0.1% data loss events
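To make that availability target concrete, here's a quick error-budget calculation over a 30-day window:
package main

import (
	"fmt"
	"time"
)

func main() {
	slo := 0.999                  // 99.9% availability target
	window := 30 * 24 * time.Hour // 30-day window
	budget := time.Duration(float64(window) * (1 - slo))
	fmt.Printf("Error budget per 30 days: %v\n", budget) // ~43m12s of allowed downtime
}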
Conclusion
Building observable systems is a journey, not a destination. The key principles I follow are:
- Start with the User Experience - Monitor what users actually experience
- Automate Everything - From data collection to incident response
- Correlate Data - Connect metrics, logs, and traces
- Reduce Noise - Only alert on actionable issues
- Learn from Incidents - Every outage is a learning opportunity
Remember: Good observability isn't about having more data—it's about having the right data at the right time to make informed decisions.
What observability challenges have you faced in your production systems? I'd love to hear about your monitoring strategies and lessons learned!
Coming Next: In my upcoming post, I'll dive into "GitOps with ArgoCD: Modern Kubernetes Deployment Patterns."
Tags: #Observability #Monitoring #Prometheus #Grafana #DevOps #SRE #Production