Monitoring & Observability¶
Overview¶
Monitoring and observability are critical for maintaining system health, detecting issues early, and enabling rapid incident response.
Monitoring Stack¶
Prometheus¶
Time-series metrics database and alerting system.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
Install Prometheus¶
# Docker
docker run -d -p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Linux
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvfz prometheus-2.40.0.linux-amd64.tar.gz
cd prometheus-2.40.0.linux-amd64
./prometheus --config.file=prometheus.yml
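Once Prometheus is running, instrumented services expose metrics over HTTP in a plain-text exposition format that the scraper parses. A pure-Python sketch of that format (the metric name, labels, and help text below are illustrative, not from the source):

```python
# Minimal sketch of the Prometheus text exposition format.
# Metric name, labels, and help string here are illustrative.

def format_metric(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Render one metric in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    if labels:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    else:
        lines.append(f"{name} {value}")
    return "\n".join(lines)

print(format_metric(
    "http_requests_total", 1027,
    labels={"method": "get", "status": "200"},
    help_text="Total HTTP requests.", metric_type="counter",
))
```

In practice a client library such as `prometheus_client` handles this; the sketch only shows what ends up on the wire at `/metrics`.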
Grafana¶
Visualization and alerting platform.
# Run Grafana
docker run -d -p 3000:3000 grafana/grafana
# Access at http://localhost:3000
# Default: admin/admin
# Configure Prometheus data source
# URL: http://prometheus:9090
Alert Rules¶
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighCPU
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: 5m
        annotations:
          summary: 'High CPU usage'
          description: 'CPU usage is above 80%'
      - alert: DiskSpaceLow
        expr: 'node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1'
        for: 10m
        annotations:
          summary: 'Low disk space'
          description: 'Less than 10% disk space available'
      - alert: HighMemory
        expr: 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1'
        for: 5m
        annotations:
          summary: 'High memory usage'
          description: 'Memory usage above 90%'
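The `for:` field means the expression must stay true for the whole window before the alert moves from pending to firing; a single breached scrape does not page anyone. A rough sketch of that state machine (simplified: real evaluation is timestamp-based, and names here are illustrative):

```python
# Rough sketch of how a `for:` clause gates firing: the condition
# must hold continuously for the window, or the timer resets.

def alert_state(breach_samples, for_seconds, step_seconds):
    """breach_samples: per-evaluation booleans of the expr being true."""
    held = 0
    state = "inactive"
    for breached in breach_samples:
        if breached:
            held += step_seconds
            state = "firing" if held >= for_seconds else "pending"
        else:
            held = 0  # any healthy evaluation resets the timer
            state = "inactive"
    return state

# A 5m `for:` with 30s evaluations needs 10 consecutive breaches.
print(alert_state([True] * 9, 300, 30))   # pending
print(alert_state([True] * 10, 300, 30))  # firing
```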
ELK Stack¶
Elasticsearch¶
Search and analytics engine.
# Docker
docker run -d -p 9200:9200 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.0.0
# Verify
curl http://localhost:9200
Logstash¶
Log processing pipeline.
input {
  file {
    path => "/var/log/application/*.log"
    start_position => "beginning"
    codec => json
  }
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  if [type] == "apache-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
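What the `%{COMBINEDAPACHELOG}` grok pattern extracts can be approximated with a plain regex. The capture names below mirror the usual grok fields, but the regex is a simplification and the sample log line is made up:

```python
import re

# Rough Python approximation of the COMBINEDAPACHELOG grok pattern.
COMBINED = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<response>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('127.0.0.1 - frank [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "curl/7.81.0"')
fields = COMBINED.match(line).groupdict()
print(fields["clientip"], fields["verb"], fields["response"])  # 127.0.0.1 GET 200
```

Grok patterns are essentially named regexes like this one, maintained as a reusable library inside Logstash.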
Kibana¶
Visualization for Elasticsearch.
# Docker
docker run -d -p 5601:5601 \
-e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" \
docker.elastic.co/kibana/kibana:8.0.0
# Access at http://localhost:5601
Docker Compose Stack¶
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.0.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.0.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  es-data:
Key Metrics¶
Application Metrics¶
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Request Rate | Traffic volume | Spike detection |
| Response Time | Latency | >500ms |
| Error Rate | Failures | >1% |
| Throughput | Operations/sec | Baseline dependent |
| Queue Depth | Pending work | >1000 items |
Infrastructure Metrics¶
| Metric | Purpose | Alert Threshold |
|---|---|---|
| CPU | Processor usage | >80% |
| Memory | RAM usage | >90% |
| Disk | Storage space | <10% free |
| Network | Bandwidth | Baseline dependent |
| Processes | Running tasks | Baseline dependent |
Logging Best Practices¶
| Practice | Benefit | Implementation |
|---|---|---|
| Structured logging | Easy parsing | JSON logs |
| Correlation IDs | Request tracking | Trace across services |
| Log levels | Filtering | DEBUG, INFO, WARN, ERROR |
| Centralized logs | Unified view | ELK, Splunk |
| Log retention | Cost control | Delete old logs |
| Sensitive data | Security | Mask passwords, tokens |
| Context | Debugging | Include user, service, timestamp |
Structured Logging Example¶
{
  "timestamp": "2023-10-01T10:30:45Z",
  "level": "INFO",
  "service": "auth-service",
  "correlation_id": "abc-123-def",
  "user_id": "user-456",
  "message": "User login successful",
  "response_time_ms": 245,
  "status_code": 200
}
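Records like this can be produced with a custom `logging` formatter. A minimal sketch, assuming a fixed service name and a `correlation_id` attached via `extra` (both choices are illustrative):

```python
import json
import logging
import time
import uuid

# Formatter emitting JSON records shaped like the example above.
# The fixed "service" value and field names are assumptions.
class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # UTC timestamps, matching the "Z" suffix

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "auth-service",
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("auth-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches the correlation id to the log record.
logger.info("User login successful", extra={"correlation_id": str(uuid.uuid4())})
```

Because the output is one JSON object per line, Logstash can ingest it directly with `codec => json`, as in the pipeline above.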
Alerting Strategy¶
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: default
      group_by: [alertname, cluster]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: pagerduty
          continue: true
        - match:
            severity: warning
          receiver: slack
    receivers:
      - name: default
        email_configs:
          - to: team@example.com
      - name: pagerduty
        pagerduty_configs:
          - service_key: xxx
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/...
            channel: '#alerts'
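`group_by` batches alerts that share the same values for the chosen labels into one notification, which is what keeps a cluster-wide incident from paging once per instance. A rough sketch of that grouping (the sample alerts are illustrative):

```python
from collections import defaultdict

# Sketch of Alertmanager's group_by behaviour: alerts with identical
# values for the grouping labels are batched into one notification.
def group_alerts(alerts, group_by=("alertname", "cluster")):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(lbl) for lbl in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighCPU", "cluster": "prod", "instance": "a"}},
    {"labels": {"alertname": "HighCPU", "cluster": "prod", "instance": "b"}},
    {"labels": {"alertname": "DiskSpaceLow", "cluster": "prod", "instance": "a"}},
]
grouped = group_alerts(alerts)
print(len(grouped))  # 2 groups -> 2 notifications instead of 3
```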
Health Checks¶
Application Health Endpoint¶
# Flask example; assumes an existing SQLAlchemy `db` and Redis `redis_client`
from flask import Flask, jsonify
from sqlalchemy import text

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }
    status = 'healthy' if all(checks.values()) else 'unhealthy'
    status_code = 200 if status == 'healthy' else 503
    return jsonify({'status': status, 'checks': checks}), status_code

def check_database():
    try:
        db.session.execute(text('SELECT 1'))
        return True
    except Exception:
        return False

def check_cache():
    try:
        redis_client.ping()
        return True
    except Exception:
        return False

def check_disk_space():
    import shutil
    total, used, free = shutil.disk_usage('/')
    return (free / total) > 0.1
Kubernetes Probes¶
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:1.0
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
Distributed Tracing¶
Jaeger Configuration¶
apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  ports:
    - name: jaeger-agent-compact   # agent compact thrift (UDP)
      port: 6831
      protocol: UDP
    - name: jaeger-collector
      port: 14268
  selector:
    app: jaeger
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 6831
              protocol: UDP
            - containerPort: 14268
          env:
            - name: COLLECTOR_ZIPKIN_HTTP_PORT
              value: "9411"
Python Tracing Example¶
from jaeger_client import Config

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,  # sample every trace
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_tracer('my-service')

with tracer.start_active_span('my-operation') as scope:
    # The active span automatically becomes the parent of nested spans.
    with tracer.start_active_span('db-query') as db_scope:
        pass  # Database operation
Prometheus Queries¶
# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Request rate
rate(http_requests_total[1m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# P95 response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
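`histogram_quantile` estimates a percentile by interpolating within cumulative `le` buckets. A simplified pure-Python sketch of the idea (the bucket data below is made up, and real PromQL also aggregates across series):

```python
import math

# Sketch of histogram_quantile's interpolation over cumulative buckets.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cap at the last finite bound
            # Linear interpolation within the bucket, as PromQL does.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 50 requests finished within 0.1s, 90 within 0.5s, 100 total.
buckets = [(0.1, 50), (0.5, 90), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.5
```

This is why quantiles from histograms are estimates: accuracy depends on how finely the bucket boundaries are chosen around the latencies you care about.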
Monitoring Best Practices¶
| Practice | Benefit | Implementation |
|---|---|---|
| Baseline metrics | Anomaly detection | Track normal patterns |
| Avoid alert fatigue | Actionable alerts | Tune thresholds |
| SLOs | Service goals | Define objectives |
| Dashboards | Quick visibility | Key metrics only |
| Documentation | Troubleshooting | Runbooks for alerts |
| Testing | Alert reliability | Test alert firing |
| Retention | Cost & compliance | Balance requirements |
| Correlation | Root cause | Link metrics, logs, traces |
SLO/SLI/SLA¶
Service Level Objectives (SLOs)¶
services:
  api:
    availability: 99.9%    # SLO: API available 99.9% of time
    response_time: 200ms   # SLO: P95 response time < 200ms
    error_budget: 0.1%     # SLO: Max 0.1% errors allowed
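An availability SLO implies an error budget: the fraction of the window the service is allowed to fail before the objective is broken. A small illustrative helper converting an SLO into allowed downtime:

```python
# Convert an availability SLO into allowed downtime per window.
# Window length and the minutes unit are illustrative choices.
def error_budget_minutes(slo_percent, window_days=30):
    budget_fraction = 1 - slo_percent / 100
    return window_days * 24 * 60 * budget_fraction

print(error_budget_minutes(99.9))   # ~43.2 minutes of downtime per 30 days
print(error_budget_minutes(99.99))  # ~4.32 minutes per 30 days
```

Teams often spend this budget deliberately: while budget remains, risky deploys are acceptable; once it is exhausted, reliability work takes priority.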
Service Level Indicators (SLIs)¶
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[1d])) / sum(rate(http_requests_total[1d]))
# Latency SLI
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1d]))
Summary Table: Monitoring Tools¶
| Tool | Purpose | Cost | Best For |
|---|---|---|---|
| Prometheus | Metrics collection | Free | Time-series metrics |
| Grafana | Visualization | Free | Dashboards |
| ELK Stack | Log analysis | Free | Log aggregation |
| Jaeger | Distributed tracing | Free | Request tracing |
| AlertManager | Alert routing | Free | Alert management |
| Datadog | Full-stack monitoring | Paid | Enterprise monitoring |
| New Relic | APM & monitoring | Paid | Application performance |
| Splunk | Log management | Paid | Enterprise logging |
| CloudWatch | AWS native | Paid | AWS-specific metrics |
| Azure Monitor | Azure native | Paid | Azure resources |