[![logo](https://www.hongsnet.net/images/logo.gif)](https://www.hongsnet.net)

# Overview

This document builds a monitoring system in a `Docker Swarm` environment.

> Components: Prometheus + cAdvisor + Grafana + AlertManager

The deployment is defined with `Docker Stack`.

```bash
# cat docker-stack.yml
version: '3'

services:
  prometheus:
    image: prom/prometheus
    deploy:
      mode: global
      placement:
        constraints: [node.hostname == TB2-DOCKER]
    volumes:
      - /GLUSTERFS/PROM/DATA/prometheus/data:/prometheus
      - /GLUSTERFS/PROM/DATA/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /GLUSTERFS/PROM/DATA/prometheus/alert.rules:/etc/prometheus/alert.rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-admin-api'
      - '--storage.tsdb.retention.time=1y'
    ports:
      - '9090:9090'
    # Note: depends_on is ignored by `docker stack deploy` (Compose file v3).
    depends_on:
      - cadvisor

  cadvisor:
    image: google/cadvisor:latest
    deploy:
      mode: global
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  grafana:
    image: grafana/grafana
    deploy:
      mode: global
      placement:
        constraints: [node.hostname == TB2-DOCKER]
    volumes:
      - /GLUSTERFS/PROM/DATA/grafana/data:/var/lib/grafana
      - /GLUSTERFS/PROM/DATA/grafana/grafana.ini:/etc/grafana/grafana.ini
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=<password>
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager
    deploy:
      mode: global
      placement:
        constraints: [node.hostname == TB2-DOCKER]
    volumes:
      - /GLUSTERFS/PROM/DATA/alertmanager:/etc/alertmanager/
    depends_on:
      - prometheus
    ports:
      - '9093:9093'
```

> Important: every component except cadvisor runs only on the Manager Node, because of the following placement setting.

```yaml
placement:
  constraints: [node.hostname == TB2-DOCKER]
```

- **prometheus** configuration

```bash
# cat prometheus.yml
global:
  scrape_interval: 15s
  external_labels:
    monitor: 'hongsnet'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['172.24.0.245:9100','172.24.0.151:9100','172.16.0.158:9100','172.16.0.251:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['172.24.0.245:8080','172.24.0.151:8080','172.16.0.158:8080','172.16.0.251:8080']

rule_files:
  - 'alert.rules'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```

The **alert.rules** file is as follows.

```bash
# cat alert.rules
groups:
  - name: host
    rules:
      - alert: high_cpu_load
        expr: node_load1 > 1.5
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server under high load"
          description: "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

      - alert: high_memory_load
        expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) ) / sum(node_memory_MemTotal_bytes) * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server memory is almost full"
          description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

      - alert: high_storage_load
        expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server storage is almost full"
          description: "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

  - name: containers
    rules:
      # Test rule: the threshold of 1 byte is exceeded whenever container metrics exist.
      - alert: High Pod Memory
        expr: sum(container_memory_usage_bytes) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "memory usage test"
          description: "test container is down for more than 30 seconds."

      - alert: ContainerKilled
        expr: time() - container_last_seen > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container killed (instance {{ $labels.instance }})"
          description: "A container has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

- **alertmanager** configuration

```bash
# cat alertmanager.yml
templates:
  - '/etc/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  # repeat_interval: 1h
  receiver: containers
  routes:
    - match:
        severity: critical
      receiver: containers

receivers:
  - name: containers
    slack_configs:
      - api_url: https://chat.hongsnet.net/hooks/XXXXXX
        channel: '#grafana'
```

Now deploy the **monitor** stack as follows.

```bash
# docker stack deploy -c docker-stack.yml monitor
```

* [ **Manager Node** ]

```bash
# docker service ls
ID             NAME                   MODE     REPLICAS   IMAGE                                                      PORTS
xodct3yxupq6   monitor_alertmanager   global   1/1        prom/alertmanager:latest                                   *:9093->9093/tcp
zlay4qoq8gg7   monitor_cadvisor       global   4/4        google/cadvisor:latest                                     *:8080->8080/tcp
pfljlqixrepi   monitor_grafana        global   1/1        grafana/grafana:latest                                     *:3000->3000/tcp
1kakkg4asokp   monitor_prometheus     global   1/1        prom/prometheus:latest                                     *:9090->9090/tcp
hjsvav9409zy   web_hongsnet           global   3/3        registry.hongsnet.net/joohan.hong/docker/hongsnet:latest   *:80->80/tcp
```

* [ **Worker Nodes** ]

```bash
# docker ps
CONTAINER ID   IMAGE                                                      COMMAND                  CREATED        STATUS        PORTS      NAMES
93c7bdfd3dde   google/cadvisor:latest                                     "/usr/bin/cadvisor -…"   22 hours ago   Up 22 hours   8080/tcp   monitor_cadvisor.t3zbiuhkpam480yfqgc78tzgn.oa8wy24n0dm7t11w0lav9k5lu
dca11cc1625a   registry.hongsnet.net/joohan.hong/docker/hongsnet:latest   "/usr/bin/supervisor…"   42 hours ago   Up 42 hours   80/tcp     web_hongsnet.t3zbiuhkpam480yfqgc78tzgn.yyewdmnoun7jnq1l19jrip590
```
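
Once the services are running, it can help to confirm that Prometheus is actually scraping the targets listed in `prometheus.yml`. The following is a minimal sketch, not part of the original setup: it reads the Prometheus HTTP API endpoint `/api/v1/targets` and counts healthy targets per job. The base URL `http://127.0.0.1:9090` and the helper names `fetch_targets` and `summarize_targets` are assumptions chosen for illustration.

```python
import json
import urllib.request

# Assumption: Prometheus is reachable on the published port of the manager node.
PROMETHEUS_URL = "http://127.0.0.1:9090"


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Return the decoded JSON body of GET /api/v1/targets."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Map each job name to a (healthy, total) tuple of active targets."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        healthy, total = summary.get(job, (0, 0))
        # The API reports "health" as "up", "down", or "unknown".
        summary[job] = (healthy + (target["health"] == "up"), total + 1)
    return summary
```

Calling `summarize_targets(fetch_targets())` on the manager node should list the `prometheus`, `node-exporter`, and `cadvisor` jobs. A job whose healthy count is lower than its total points at an exporter that is unreachable from the Prometheus container.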