Monitoring & Observability

Overview

Monitoring and observability are critical for maintaining system health, detecting issues early, and enabling rapid incident response.


Monitoring Stack

Prometheus

Time-series metrics database and alerting system.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

Install Prometheus

# Docker
docker run -d -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Linux
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvfz prometheus-2.40.0.linux-amd64.tar.gz
cd prometheus-2.40.0.linux-amd64
./prometheus --config.file=prometheus.yml
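For Prometheus to have something to scrape, the application must expose metrics in the text exposition format on an HTTP endpoint. The official `prometheus_client` library is the usual choice; the stdlib-only sketch below just illustrates the format itself (`app_requests_total` is a hypothetical metric name):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counter; app_requests_total is a hypothetical metric name
REQUEST_COUNT = {"value": 0}

def render_metrics():
    # Prometheus text exposition format: HELP/TYPE comments plus samples
    return (
        "# HELP app_requests_total Total requests handled.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT['value']}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            REQUEST_COUNT["value"] += 1  # count everything else as app traffic
            self.send_response(200)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever(),
# then add a scrape_config targeting localhost:8000
```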

Grafana

Visualization and alerting platform.

# Run Grafana
docker run -d -p 3000:3000 grafana/grafana

# Access at http://localhost:3000
# Default: admin/admin

# Configure Prometheus data source
# URL: http://prometheus:9090
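Instead of adding the data source by hand in the UI, Grafana can provision it from a file at startup. A minimal sketch, assuming Prometheus is reachable at the URL from the comment above:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```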

Alert Rules

groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighCPU
        # node_cpu_seconds_total is a cumulative counter, so alert on a
        # rate-derived percentage rather than the raw value
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: 5m
        annotations:
          summary: 'High CPU usage'
          description: 'CPU usage above 80% for 5 minutes'

      - alert: DiskSpaceLow
        expr: 'node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1'
        for: 10m
        annotations:
          summary: 'Low disk space'
          description: 'Less than 10% disk space available'

      - alert: HighMemory
        expr: 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1'
        for: 5m
        annotations:
          summary: 'High memory usage'
          description: 'Memory usage above 90%'

ELK Stack

Elasticsearch

Search and analytics engine.

# Docker
docker run -d -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.0.0

# Verify
curl http://localhost:9200

Logstash

Log processing pipeline.

input {
  file {
    path => "/var/log/application/*.log"
    start_position => "beginning"
    codec => json
  }

  tcp {
    port => 5000
    codec => json
  }
}

filter {
  if [type] == "apache-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }

  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }

  stdout {
    codec => rubydebug
  }
}
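The `tcp` input above reads newline-delimited JSON on port 5000. A minimal Python shipper sketch (the host, port, and field names are assumptions matching the config above):

```python
import json
import socket
from datetime import datetime, timezone

def format_event(message, **fields):
    """Build a one-line JSON log event for Logstash's json codec."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
    }
    event.update(fields)
    return json.dumps(event) + "\n"  # tcp input expects newline-delimited JSON

def ship(event_line, host="localhost", port=5000):
    # Matches the tcp { port => 5000, codec => json } input above
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(event_line.encode("utf-8"))

# Usage: ship(format_event("user login", service="auth-service", level="INFO"))
```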

Kibana

Visualization for Elasticsearch.

# Docker
docker run -d -p 5601:5601 \
  -e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" \
  docker.elastic.co/kibana/kibana:8.0.0

# Access at http://localhost:5601

Docker Compose Stack

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.0.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.0.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  es-data:

Key Metrics

Application Metrics

| Metric | Purpose | Alert Threshold |
| --- | --- | --- |
| Request Rate | Traffic volume | Spike detection |
| Response Time | Latency | >500ms |
| Error Rate | Failures | >1% |
| Throughput | Operations/sec | Baseline dependent |
| Queue Depth | Pending work | >1000 items |

Infrastructure Metrics

| Metric | Purpose | Alert Threshold |
| --- | --- | --- |
| CPU | Processor usage | >80% |
| Memory | RAM usage | >90% |
| Disk | Storage space | <10% free |
| Network | Bandwidth | Baseline dependent |
| Processes | Running tasks | Baseline dependent |

Logging Best Practices

| Practice | Benefit | Implementation |
| --- | --- | --- |
| Structured logging | Easy parsing | JSON logs |
| Correlation IDs | Request tracking | Trace across services |
| Log levels | Filtering | DEBUG, INFO, WARN, ERROR |
| Centralized logs | Unified view | ELK, Splunk |
| Log retention | Cost control | Delete old logs |
| Redact sensitive data | Security | Mask passwords, tokens |
| Context | Debugging | Include user, service, timestamp |

Structured Logging Example

{
  "timestamp": "2023-10-01T10:30:45Z",
  "level": "INFO",
  "service": "auth-service",
  "correlation_id": "abc-123-def",
  "user_id": "user-456",
  "message": "User login successful",
  "response_time_ms": 245,
  "status_code": 200
}
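Output like the JSON above can be produced with Python's stdlib `logging` module and a custom formatter. A sketch, assuming a hypothetical auth-service logger; the field names follow the example above:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, matching the fields above."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "auth-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Optional fields supplied via logging's `extra=` keyword
        for key in ("correlation_id", "user_id", "response_time_ms", "status_code"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("auth-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage:
# logger.info("User login successful",
#             extra={"correlation_id": "abc-123-def",
#                    "user_id": "user-456", "status_code": 200})
```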

Alerting Strategy

alertmanager:
  config:
    global:
      resolve_timeout: 5m

    route:
      receiver: default
      group_by: [alertname, cluster]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h

      routes:
        - match:
            severity: critical
          receiver: pagerduty
          continue: true

        - match:
            severity: warning
          receiver: slack

    receivers:
      - name: default
        email_configs:
          - to: team@example.com

      - name: pagerduty
        pagerduty_configs:
          - service_key: xxx

      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/...
            channel: '#alerts'

Health Checks

Application Health Endpoint

# Flask example
@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }

    status = 'healthy' if all(checks.values()) else 'unhealthy'
    status_code = 200 if status == 'healthy' else 503

    return jsonify({'status': status, 'checks': checks}), status_code

def check_database():
    try:
        from sqlalchemy import text
        db.session.execute(text('SELECT 1'))  # text() required in SQLAlchemy 1.4+
        return True
    except Exception:  # narrow to expected DB errors in production
        return False

def check_cache():
    try:
        redis_client.ping()
        return True
    except Exception:
        return False

def check_disk_space():
    import shutil
    total, used, free = shutil.disk_usage("/")
    return (free / total) > 0.1

Kubernetes Probes

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:1.0
      ports:
        - containerPort: 8080

      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3

      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3

      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        failureThreshold: 30
        periodSeconds: 10

Distributed Tracing

Jaeger Configuration

apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  ports:
    - name: jaeger-agent-compact
      port: 6831
      protocol: UDP
    - name: jaeger-collector
      port: 14268
  selector:
    app: jaeger
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 6831
              protocol: UDP
            - containerPort: 14268
          env:
            - name: COLLECTOR_ZIPKIN_HTTP_PORT
              value: "9411"

Python Tracing Example

Note: the jaeger-client library is deprecated in favor of OpenTelemetry, but the span-nesting pattern below carries over.

from jaeger_client import Config

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_tracer('my-service')

# start_active_span makes the new span the active parent, so the nested
# span becomes its child without passing child_of explicitly
with tracer.start_active_span('my-operation'):
    with tracer.start_active_span('db-query'):
        pass  # database operation here

tracer.close()  # flush buffered spans before the process exits

Prometheus Queries

# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage (node_exporter exposes avail and size, not a used_bytes metric)
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Request rate
rate(http_requests_total[1m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# P95 response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Monitoring Best Practices

| Practice | Benefit | Implementation |
| --- | --- | --- |
| Baseline metrics | Anomaly detection | Track normal patterns |
| Avoid alert fatigue | Actionable alerts | Tune thresholds |
| SLOs | Service goals | Define objectives |
| Dashboards | Quick visibility | Key metrics only |
| Documentation | Troubleshooting | Runbooks for alerts |
| Testing | Alert reliability | Test alert firing |
| Retention | Cost & compliance | Balance requirements |
| Correlation | Root cause | Link metrics, logs, traces |

SLO/SLI/SLA

Service Level Objectives (SLOs)

services:
  api:
    availability: 99.9%      # SLO: API available 99.9% of time
    response_time: 200ms     # SLO: P95 response time < 200ms
    error_budget: 0.1%       # SLO: Max 0.1% errors allowed

Service Level Indicators (SLIs)

# Availability SLI (treat non-5xx responses as successful)
sum(rate(http_requests_total{status!~"5.."}[1d])) / sum(rate(http_requests_total[1d]))

# Latency SLI
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1d]))
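An availability SLO implies a concrete error budget: the downtime you are allowed before the objective is breached. A quick sketch of the arithmetic for a rolling window:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

# 99.9% over 30 days -> 43200 minutes * 0.1% = 43.2 minutes of budget
budget = error_budget_minutes(99.9)
```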

Summary Table: Monitoring Tools

| Tool | Purpose | Cost | Best For |
| --- | --- | --- | --- |
| Prometheus | Metrics collection | Free | Time-series metrics |
| Grafana | Visualization | Free | Dashboards |
| ELK Stack | Log analysis | Free | Log aggregation |
| Jaeger | Distributed tracing | Free | Request tracing |
| Alertmanager | Alert routing | Free | Alert management |
| Datadog | Full-stack monitoring | Paid | Enterprise monitoring |
| New Relic | APM & monitoring | Paid | Application performance |
| Splunk | Log management | Paid | Enterprise logging |
| CloudWatch | AWS native | Paid | AWS-specific metrics |
| Azure Monitor | Azure native | Paid | Azure resources |

Resources