Monitoring & Observability¶
Overview¶
Monitoring and observability are critical for maintaining system health, detecting issues early, and enabling rapid incident response.
Monitoring Stack¶
Prometheus¶
Time-series metrics database and alerting system.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
Install Prometheus¶
# Docker
docker run -d -p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Linux
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvfz prometheus-2.40.0.linux-amd64.tar.gz
cd prometheus-2.40.0.linux-amd64
./prometheus --config.file=prometheus.yml
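Once Prometheus is running, instrumented services expose metrics over HTTP in a plain-text exposition format that the scraper parses. A pure-Python sketch of that format (the metric name, labels, and help text below are illustrative, not from the source):

```python
# Minimal sketch of the Prometheus text exposition format.
# Metric name, labels, and help string here are illustrative.

def format_metric(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Render one metric in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    if labels:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    else:
        lines.append(f"{name} {value}")
    return "\n".join(lines)

print(format_metric(
    "http_requests_total", 1027,
    labels={"method": "get", "status": "200"},
    help_text="Total HTTP requests.", metric_type="counter",
))
```

In practice a client library such as `prometheus_client` handles this; the sketch only shows what ends up on the wire at `/metrics`.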
Grafana¶
Visualization and alerting platform.
# Run Grafana
docker run -d -p 3000:3000 grafana/grafana
# Access at http://localhost:3000
# Default: admin/admin
# Configure Prometheus data source
# URL: http://prometheus:9090
Alert Rules¶
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighCPU
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: 5m
        annotations:
          summary: 'High CPU usage'
          description: 'CPU usage is above 80%'
      - alert: DiskSpaceLow
        expr: 'node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1'
        for: 10m
        annotations:
          summary: 'Low disk space'
          description: 'Less than 10% disk space available'
      - alert: HighMemory
        expr: 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1'
        for: 5m
        annotations:
          summary: 'High memory usage'
          description: 'Memory usage above 90%'
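The `for:` field means the expression must stay true for the whole window before the alert moves from pending to firing; a single breached scrape does not page anyone. A rough sketch of that state machine (simplified: real evaluation is timestamp-based, and names here are illustrative):

```python
# Rough sketch of how a `for:` clause gates firing: the condition
# must hold continuously for the window, or the timer resets.

def alert_state(breach_samples, for_seconds, step_seconds):
    """breach_samples: per-evaluation booleans of the expr being true."""
    held = 0
    state = "inactive"
    for breached in breach_samples:
        if breached:
            held += step_seconds
            state = "firing" if held >= for_seconds else "pending"
        else:
            held = 0  # any healthy evaluation resets the timer
            state = "inactive"
    return state

# A 5m `for:` with 30s evaluations needs 10 consecutive breaches.
print(alert_state([True] * 9, 300, 30))   # pending
print(alert_state([True] * 10, 300, 30))  # firing
```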
ELK Stack¶
Elasticsearch¶
Search and analytics engine.
# Docker
docker run -d -p 9200:9200 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.0.0
# Verify
curl http://localhost:9200
Logstash¶
Log processing pipeline.
input {
  file {
    path => "/var/log/application/*.log"
    start_position => "beginning"
    codec => json
  }
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  if [type] == "apache-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
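What the `%{COMBINEDAPACHELOG}` grok pattern extracts can be approximated with a plain regex. The capture names below mirror the usual grok fields, but the regex is a simplification and the sample log line is made up:

```python
import re

# Rough Python approximation of the COMBINEDAPACHELOG grok pattern.
COMBINED = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<response>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('127.0.0.1 - frank [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "curl/7.81.0"')
fields = COMBINED.match(line).groupdict()
print(fields["clientip"], fields["verb"], fields["response"])  # 127.0.0.1 GET 200
```

Grok patterns are essentially named regexes like this one, maintained as a reusable library inside Logstash.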
Kibana¶
Visualization for Elasticsearch.
# Docker
docker run -d -p 5601:5601 \
-e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" \
docker.elastic.co/kibana/kibana:8.0.0
# Access at http://localhost:5601
Docker Compose Stack¶
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.0.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.0.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  es-data:
Key Metrics¶
Application Metrics¶
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Request Rate | Traffic volume | Spike detection |
| Response Time | Latency | >500ms |
| Error Rate | Failures | >1% |
| Throughput | Operations/sec | Baseline dependent |
| Queue Depth | Pending work | >1000 items |
Infrastructure Metrics¶
| Metric | Purpose | Alert Threshold |
|---|---|---|
| CPU | Processor usage | >80% |
| Memory | RAM usage | >90% |
| Disk | Storage space | <10% free |
| Network | Bandwidth | Baseline dependent |
| Processes | Running tasks | Baseline dependent |
Logging Best Practices¶
| Practice | Benefit | Implementation |
|---|---|---|
| Structured logging | Easy parsing | JSON logs |
| Correlation IDs | Request tracking | Trace across services |
| Log levels | Filtering | DEBUG, INFO, WARN, ERROR |
| Centralized logs | Unified view | ELK, Splunk |
| Log retention | Cost control | Delete old logs |
| Sensitive data | Security | Mask passwords, tokens |
| Context | Debugging | Include user, service, timestamp |
Structured Logging Example¶
{
  "timestamp": "2023-10-01T10:30:45Z",
  "level": "INFO",
  "service": "auth-service",
  "correlation_id": "abc-123-def",
  "user_id": "user-456",
  "message": "User login successful",
  "response_time_ms": 245,
  "status_code": 200
}
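Records like this can be produced with a custom `logging` formatter. A minimal sketch, assuming a fixed service name and a `correlation_id` attached via `extra` (both choices are illustrative):

```python
import json
import logging
import time
import uuid

# Formatter emitting JSON records shaped like the example above.
# The fixed "service" value and field names are assumptions.
class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # UTC timestamps, matching the "Z" suffix

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "auth-service",
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("auth-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches the correlation id to the log record.
logger.info("User login successful", extra={"correlation_id": str(uuid.uuid4())})
```

Because the output is one JSON object per line, Logstash can ingest it directly with `codec => json`, as in the pipeline above.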
Alerting Strategy¶
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: default
      group_by: [alertname, cluster]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: pagerduty
          continue: true
        - match:
            severity: warning
          receiver: slack
    receivers:
      - name: default
        email_configs:
          - to: team@example.com
      - name: pagerduty
        pagerduty_configs:
          - service_key: xxx
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/...
            channel: '#alerts'
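`group_by` batches alerts that share the same values for the chosen labels into one notification, which is what keeps a cluster-wide incident from paging once per instance. A rough sketch of that grouping (the sample alerts are illustrative):

```python
from collections import defaultdict

# Sketch of Alertmanager's group_by behaviour: alerts with identical
# values for the grouping labels are batched into one notification.
def group_alerts(alerts, group_by=("alertname", "cluster")):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(lbl) for lbl in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighCPU", "cluster": "prod", "instance": "a"}},
    {"labels": {"alertname": "HighCPU", "cluster": "prod", "instance": "b"}},
    {"labels": {"alertname": "DiskSpaceLow", "cluster": "prod", "instance": "a"}},
]
grouped = group_alerts(alerts)
print(len(grouped))  # 2 groups -> 2 notifications instead of 3
```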
Health Checks¶
Application Health Endpoint¶
# Flask example; assumes an existing SQLAlchemy `db` and Redis `redis_client`
from flask import Flask, jsonify
from sqlalchemy import text

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }
    status = 'healthy' if all(checks.values()) else 'unhealthy'
    status_code = 200 if status == 'healthy' else 503
    return jsonify({'status': status, 'checks': checks}), status_code

def check_database():
    try:
        db.session.execute(text('SELECT 1'))
        return True
    except Exception:
        return False

def check_cache():
    try:
        redis_client.ping()
        return True
    except Exception:
        return False

def check_disk_space():
    import shutil
    total, used, free = shutil.disk_usage('/')
    return (free / total) > 0.1
Kubernetes Probes¶
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:1.0
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
Distributed Tracing¶
Jaeger Configuration¶
apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  ports:
    - name: jaeger-agent-compact   # agent compact thrift (UDP)
      port: 6831
      protocol: UDP
    - name: jaeger-collector
      port: 14268
  selector:
    app: jaeger
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 6831
              protocol: UDP
            - containerPort: 14268
          env:
            - name: COLLECTOR_ZIPKIN_HTTP_PORT
              value: "9411"
Python Tracing Example¶
from jaeger_client import Config

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,  # sample every trace
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_tracer('my-service')

with tracer.start_active_span('my-operation') as scope:
    # The active span automatically becomes the parent of nested spans.
    with tracer.start_active_span('db-query') as db_scope:
        pass  # Database operation
Prometheus Queries¶
# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Request rate
rate(http_requests_total[1m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# P95 response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
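`histogram_quantile` estimates a percentile by interpolating within cumulative `le` buckets. A simplified pure-Python sketch of the idea (the bucket data below is made up, and real PromQL also aggregates across series):

```python
import math

# Sketch of histogram_quantile's interpolation over cumulative buckets.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cap at the last finite bound
            # Linear interpolation within the bucket, as PromQL does.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 50 requests finished within 0.1s, 90 within 0.5s, 100 total.
buckets = [(0.1, 50), (0.5, 90), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.5
```

This is why quantiles from histograms are estimates: accuracy depends on how finely the bucket boundaries are chosen around the latencies you care about.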
Monitoring Best Practices¶
| Practice | Benefit | Implementation |
|---|---|---|
| Baseline metrics | Anomaly detection | Track normal patterns |
| Avoid alert fatigue | Actionable alerts | Tune thresholds |
| SLOs | Service goals | Define objectives |
| Dashboards | Quick visibility | Key metrics only |
| Documentation | Troubleshooting | Runbooks for alerts |
| Testing | Alert reliability | Test alert firing |
| Retention | Cost & compliance | Balance requirements |
| Correlation | Root cause | Link metrics, logs, traces |
SLO/SLI/SLA¶
Service Level Objectives (SLOs)¶
services:
  api:
    availability: 99.9%    # SLO: API available 99.9% of time
    response_time: 200ms   # SLO: P95 response time < 200ms
    error_budget: 0.1%     # SLO: Max 0.1% errors allowed
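An availability SLO implies an error budget: the fraction of the window the service is allowed to fail before the objective is broken. A small illustrative helper converting an SLO into allowed downtime:

```python
# Convert an availability SLO into allowed downtime per window.
# Window length and the minutes unit are illustrative choices.
def error_budget_minutes(slo_percent, window_days=30):
    budget_fraction = 1 - slo_percent / 100
    return window_days * 24 * 60 * budget_fraction

print(error_budget_minutes(99.9))   # ~43.2 minutes of downtime per 30 days
print(error_budget_minutes(99.99))  # ~4.32 minutes per 30 days
```

Teams often spend this budget deliberately: while budget remains, risky deploys are acceptable; once it is exhausted, reliability work takes priority.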
Service Level Indicators (SLIs)¶
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[1d])) / sum(rate(http_requests_total[1d]))
# Latency SLI
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1d]))
Summary Table: Monitoring Tools¶
| Tool | Purpose | Cost | Best For |
|---|---|---|---|
| Prometheus | Metrics collection | Free | Time-series metrics |
| Grafana | Visualization | Free | Dashboards |
| ELK Stack | Log analysis | Free | Log aggregation |
| Jaeger | Distributed tracing | Free | Request tracing |
| AlertManager | Alert routing | Free | Alert management |
| Datadog | Full-stack monitoring | Paid | Enterprise monitoring |
| New Relic | APM & monitoring | Paid | Application performance |
| Splunk | Log management | Paid | Enterprise logging |
| CloudWatch | AWS native | Paid | AWS-specific metrics |
| Azure Monitor | Azure native | Paid | Azure resources |