Commit 652385a6 authored by JooHan Hong

monitoring new update1

parent e80c8b86
Pipeline #5296 passed with stages in 52 seconds
[![logo](https://www.hongsnet.net/images/logo.gif)](https://www.hongsnet.net)
# Overview
Build a monitoring system in a `K8s` environment.
> Components: Prometheus + cAdvisor + Grafana + AlertManager
[![logo](https://www.hongsnet.net/images/logo.gif)](https://www.hongsnet.net)
# Overview
Build a monitoring system in a `Docker Swarm` environment.
> Components: Prometheus + cAdvisor + Grafana + AlertManager

The deployment is done with `Docker Stack`.
```bash
# cat docker-stack.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    deploy:
      mode: global
      placement:
        constraints: [node.hostname == TB2-DOCKER]
    volumes:
      - /GLUSTERFS/PROM/DATA/prometheus/data:/prometheus
      - /GLUSTERFS/PROM/DATA/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /GLUSTERFS/PROM/DATA/prometheus/alert.rules:/etc/prometheus/alert.rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-admin-api'
      - '--storage.tsdb.retention.time=1y'
    ports:
      - '9090:9090'
    depends_on:
      - cadvisor
  cadvisor:
    image: google/cadvisor:latest
    deploy:
      mode: global
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  grafana:
    image: grafana/grafana
    deploy:
      mode: global
      placement:
        constraints: [node.hostname == TB2-DOCKER]
    volumes:
      - /GLUSTERFS/PROM/DATA/grafana/data:/var/lib/grafana
      - /GLUSTERFS/PROM/DATA/grafana/grafana.ini:/etc/grafana/grafana.ini
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=ghdwngkstjqj
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
  alertmanager:
    image: prom/alertmanager
    deploy:
      mode: global
      placement:
        constraints: [node.hostname == TB2-DOCKER]
    volumes:
      - /GLUSTERFS/PROM/DATA/alertmanager:/etc/alertmanager/
    depends_on:
      - prometheus
    ports:
      - '9093:9093'
```
> **Important**: every component except cadvisor runs only on the Manager node, because of the following placement setting:
```yaml
placement:
  constraints: [node.hostname == TB2-DOCKER]
```
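If pinning to one literal hostname is too brittle, the same effect can be had with Swarm's built-in node attributes; a sketch, assuming any manager node is acceptable:

```yaml
deploy:
  placement:
    constraints: [node.role == manager]   # any manager node, not one fixed hostname
```

With the hostname form, the service stays pending if TB2-DOCKER is down; the role-based form lets Swarm reschedule the task onto another manager.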
- **prometheus** configuration:
```bash
# cat prometheus.yml
global:
  scrape_interval: 15s
  external_labels:
    monitor: 'hongsnet'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['172.24.0.245:9100','172.24.0.151:9100','172.16.0.158:9100','172.16.0.251:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['172.24.0.245:8080','172.24.0.151:8080','172.16.0.158:8080','172.16.0.251:8080']

rule_files:
  - 'alert.rules'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```
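As a back-of-the-envelope check on the config above, a short Python sketch (the target lists are copied from `prometheus.yml`; the arithmetic is illustrative, not a Prometheus API):

```python
# Scrape-volume estimate for the prometheus.yml above:
# with scrape_interval 15s, every target is pulled 4 times per minute.
scrape_interval_s = 15
targets = {
    "prometheus":    ["127.0.0.1:9090"],
    "node-exporter": ["172.24.0.245:9100", "172.24.0.151:9100",
                      "172.16.0.158:9100", "172.16.0.251:9100"],
    "cadvisor":      ["172.24.0.245:8080", "172.24.0.151:8080",
                      "172.16.0.158:8080", "172.16.0.251:8080"],
}

total_targets = sum(len(t) for t in targets.values())
scrapes_per_minute = total_targets * (60 // scrape_interval_s)
print(total_targets, scrapes_per_minute)  # 9 36
```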
The **alert.rules** file looks like this:
```bash
# cat alert.rules
groups:
- name: host
  rules:
  - alert: high_cpu_load
    expr: node_load1 > 1.5
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server under high load"
      description: "Docker host is under high load, the avg load 1m is at {{ $value }}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
  - alert: high_memory_load
    expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server memory is almost full"
      description: "Docker host memory usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
  - alert: high_storage_load
    expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server storage is almost full"
      description: "Docker host storage usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

- name: containers
  rules:
  - alert: HighPodMemory   # alert names may not contain spaces
    expr: sum(container_memory_usage_bytes) > 1
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "memory usage test"
      description: "test container is down for more than 30 seconds."
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Container killed (instance {{ $labels.instance }})"
      description: "A container has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```
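The `high_memory_load` expression is plain arithmetic over node_exporter gauges. A minimal Python sketch of the same formula, with made-up byte counts (not real measurements):

```python
def memory_used_percent(total, free, buffers, cached):
    """Mirror of the high_memory_load PromQL expression:
    (total - (free + buffers + cached)) / total * 100"""
    return (total - (free + buffers + cached)) / total * 100

GiB = 1024 ** 3
# Hypothetical host: 16 GiB total, 1.6 GiB free/buffers/cached combined.
used = memory_used_percent(16 * GiB, 1 * GiB, 0.2 * GiB, 0.4 * GiB)
print(round(used, 1))  # 90.0 -> above the 85% threshold, so the alert fires after 30s
```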
- **alertmanager** configuration:
```bash
# cat alertmanager.yml
templates:
- '/etc/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  # repeat_interval: 1h
  receiver: containers
  routes:
  - match:
      severity: critical
    receiver: containers

receivers:
- name: containers
  slack_configs:
  - api_url: https://chat.hongsnet.net/hooks/XXXXXX
    channel: '#grafana'
```
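Routing in this file is minimal: alerts with `severity: critical` match the single child route, and everything else falls back to the root `receiver` (both point at `containers`). A toy Python sketch of that first-match-wins fallback (illustrative only, not the Alertmanager implementation):

```python
ROOT_RECEIVER = "containers"
# Mirrors the single child route in alertmanager.yml: match severity=critical.
CHILD_ROUTES = [({"severity": "critical"}, "containers")]

def pick_receiver(labels):
    """Return the receiver of the first child route whose match labels
    all agree with the alert's labels, else the root receiver."""
    for match, receiver in CHILD_ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return ROOT_RECEIVER

print(pick_receiver({"severity": "critical"}))  # containers (child route)
print(pick_receiver({"severity": "warning"}))   # containers (root fallback)
```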
Now deploy the **monitor** stack:
```bash
# docker stack deploy -c docker-stack.yml monitor
```
* [ **Manager Node** ]
```bash
# docker service ls
ID             NAME                   MODE     REPLICAS   IMAGE                                                      PORTS
xodct3yxupq6   monitor_alertmanager   global   1/1        prom/alertmanager:latest                                   *:9093->9093/tcp
zlay4qoq8gg7   monitor_cadvisor       global   4/4        google/cadvisor:latest                                     *:8080->8080/tcp
pfljlqixrepi   monitor_grafana        global   1/1        grafana/grafana:latest                                     *:3000->3000/tcp
1kakkg4asokp   monitor_prometheus     global   1/1        prom/prometheus:latest                                     *:9090->9090/tcp
hjsvav9409zy   web_hongsnet           global   3/3        registry.hongsnet.net/joohan.hong/docker/hongsnet:latest   *:80->80/tcp
```
* [ **Worker Nodes** ]
```bash
# docker ps
CONTAINER ID   IMAGE                                                      COMMAND                  CREATED        STATUS        PORTS      NAMES
93c7bdfd3dde   google/cadvisor:latest                                     "/usr/bin/cadvisor -…"   22 hours ago   Up 22 hours   8080/tcp   monitor_cadvisor.t3zbiuhkpam480yfqgc78tzgn.oa8wy24n0dm7t11w0lav9k5lu
dca11cc1625a   registry.hongsnet.net/joohan.hong/docker/hongsnet:latest   "/usr/bin/supervisor…"   42 hours ago   Up 42 hours   80/tcp     web_hongsnet.t3zbiuhkpam480yfqgc78tzgn.yyewdmnoun7jnq1l19jrip590
```
Monitor the www.hongsnet.net containers using Prometheus.
# Installation (Each Cluster)
> Work in progress

| NO | Cluster | Link | Notes |
| ------ | ------ | ------ | ------ |
| 1 | K8s | [GO](./INSTALL/K8S/) | |
| 2 | Docker Swarm | [GO](./INSTALL/SWARM/) | |
# Monitoring Items (Common)