Build a monitoring system in a `K8s` environment.
> Components: Prometheus + cAdvisor + Grafana + AlertManager
## System Overview
- CentOS 7 (3.10.0-957.el7.x86_64)
- K8s
```bash
# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:03:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
```
```bash
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cube01 Ready control-plane,master 12d v1.20.4
cube02 Ready <none> 12d v1.20.4
cube03 Ready <none> 12d v1.20.4
cube04 Ready <none> 12d v1.20.4
```
## Ways to Set Up a Monitoring System on K8s
- Prometheus Operator
- Package install using a Helm Chart
- `Manual`
> The Operator in particular makes setup very easy, but to get closer to how K8s actually works we proceed **manually**. For reference, the manual setup is not that complicated either.
# Overview
* [ **STEP 1** ] : Create the namespace
* [ **STEP 2** ] : Create the PV and PVC
* [ **STEP 3** ] : Create the Cluster Role
* [ **STEP 4** ] : Configure and create the Prometheus ConfigMap
* [ **STEP 5** ] : Create and deploy the Prometheus Deployment, the Prometheus node-exporter, and the Prometheus Service
> The Prometheus node-exporter is not the same as the standalone node-exporter; it exposes resource status in the **kube-system** namespace. A node-exporter will therefore additionally be installed on each `Node` below.
* [ **STEP 6** ] : Deploy Kube State Metrics
* [ **STEP 7** ] : Integrate Grafana and add the DataSource
* [ **STEP 8** ] : Deploy AlertManager
> The **kubectl** command accepts both **apply** and **create**. As a rule, use create on the **first run** and apply on every run after that.
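For example, a minimal illustration using one of the manifests from this guide:
```bash
# First run: create the object from its manifest
kubectl create -f prometheus-deployment.yaml

# Later runs: apply updates the existing object in place
kubectl apply -f prometheus-deployment.yaml
```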
## Creating the namespace
Create a namespace named monitoring as follows.
```bash
# kubectl create ns monitoring
```
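Optionally verify that the namespace exists:
```bash
# The namespace should be listed with STATUS Active
kubectl get ns monitoring
```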
## Creating the PV and PVC
By default, every Pod created below keeps its data in memory, so a restart wipes all of it. To avoid this, create Persistent Volumes as follows so that the data is written to disk.
### Configuring the Prometheus PV and PVC
```bash
# cat prometheus-data.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: prometheus-pv
namespace: monitoring
labels:
type: local
app: prometheus
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: manual
hostPath:
path: /MONITOR/PROMETHEUS
type: DirectoryOrCreate
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- cube02
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-pvc
namespace: monitoring
labels:
type: local
app: prometheus
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 10Gi
selector:
matchLabels:
app: prometheus
type: local
```
### Configuring the Grafana PV and PVC
```bash
# cat grafana-data.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: grafana-pv
namespace: monitoring
labels:
type: local
app: grafana
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: manual
hostPath:
path: /MONITOR/GRAFANA
type: DirectoryOrCreate
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- cube02
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: grafana-pvc
namespace: monitoring
labels:
type: local
app: grafana
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 10Gi
selector:
matchLabels:
app: grafana
type: local
```
### Configuring the AlertManager PV and PVC
```bash
# cat alertmanager-data.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: alertmanager-pv
namespace: monitoring
labels:
type: local
app: alertmanager
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: manual
hostPath:
path: /MONITOR/ALERTMANAGER
type: DirectoryOrCreate
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- cube02
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: alertmanager-pvc
namespace: monitoring
labels:
type: local
app: alertmanager
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 10Gi
selector:
matchLabels:
app: alertmanager
type: local
```
Create the PV and PVC for the three services with the following script.
```bash
# cat data-pv_pvc_create.sh
#!/bin/bash
kubectl create -f prometheus-data.yaml
kubectl create -f grafana-data.yaml
kubectl create -f alertmanager-data.yaml
```
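After the script runs, each PV should report a `Bound` status against its PVC; a quick check (resource names follow the manifests above):
```bash
# PVs are cluster-scoped; the PVCs live in the monitoring namespace
kubectl get pv
kubectl -n monitoring get pvc
```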
To delete them, proceed as follows.
```bash
# cat data-pv_pvc_remove.sh
#!/bin/bash
kubectl delete -f prometheus-data.yaml
kubectl delete -f grafana-data.yaml
kubectl delete -f alertmanager-data.yaml
```
## Creating the Cluster Role (**first run only**)
The following step grants Prometheus permission to access the K8s API. In effect, the ClusterRole grants these permissions to the service account in the monitoring namespace.
```bash
# cat prometheus-cluster-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
namespace: monitoring
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: default
namespace: monitoring
```
Create it as follows.
```bash
# kubectl create -f prometheus-cluster-role.yaml
```
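To sanity-check the binding, the permissions of the `default` service account in the monitoring namespace can be inspected (run as a cluster admin):
```bash
# Show the binding that was just created
kubectl describe clusterrolebinding prometheus

# Impersonate the bound service account and test one of the granted verbs
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:default
```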
## Configuring and Creating the Prometheus ConfigMap
There are two key configuration files:
- prometheus.yml
  - The core Prometheus configuration file.
- prometheus.rules
  - The file that defines alerting conditions on the collected metrics; when a condition matches, an alert is sent to AlertManager.
> Because the configuration is extensive, refer to **prometheus-config-map.yaml** in the current directory.
Create it as follows.
```bash
# kubectl create -f prometheus-config-map.yaml
```
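Two optional checks, assuming the Deployment name used later in this guide (`prometheus-deployment`):
```bash
# Inspect the ConfigMap that was just created (it is defined in the monitoring namespace)
kubectl -n monitoring describe configmap prometheus-server-conf

# Once the Prometheus Pod is running, lint the configuration with the promtool
# binary shipped in the prom/prometheus image
kubectl -n monitoring exec deploy/prometheus-deployment -- promtool check config /etc/prometheus/prometheus.yml
```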
## Creating and Deploying the Prometheus Deployment, Prometheus node-exporter, and Prometheus Service
### Deploying the Prometheus Deployment
Create the **Deployment object** for the Prometheus Pod.
```bash
# cat prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-deployment
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus-server
template:
metadata:
labels:
app: prometheus-server
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus/"
- "--web.enable-lifecycle"
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-config-volume
mountPath: /etc/prometheus/
- name: prometheus-storage-volume
mountPath: /prometheus/
nodeSelector:
kubernetes.io/hostname: cube02
volumes:
- name: prometheus-config-volume
configMap:
defaultMode: 420
name: prometheus-server-conf
- name: prometheus-storage-volume
persistentVolumeClaim:
claimName: prometheus-pvc
```
> The actual creation is performed by the script below.
### Deploying the Prometheus node-exporter
Set up the node-exporter for K8s. It is defined as a `DaemonSet` object so that exactly one instance runs on each Node.
```bash
# cat prometheus-node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
labels:
k8s-app: node-exporter
spec:
selector:
matchLabels:
k8s-app: node-exporter
template:
metadata:
labels:
k8s-app: node-exporter
spec:
containers:
- image: prom/node-exporter
name: node-exporter
ports:
- containerPort: 9100
protocol: TCP
name: http
---
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: node-exporter
name: node-exporter
namespace: kube-system
spec:
ports:
- name: http
port: 9100
nodePort: 31672
protocol: TCP
type: NodePort
selector:
k8s-app: node-exporter
```
> The actual creation is performed by the script below.
### Creating and Deploying the Prometheus Service
The service is exposed using a **NodePort**.
```bash
# cat prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: prometheus-service
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9090'
spec:
selector:
app: prometheus-server
type: NodePort
ports:
- port: 8080
targetPort: 9090
nodePort: 30003
```
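Once the objects above have been applied (see the Start script below), Prometheus is reachable on port 30003 of any node. A hedged example, with the node IP as a placeholder:
```bash
NODE_IP=192.168.0.10   # replace with the IP of any cluster node

# The Service maps port 8080 -> container port 9090, exposed as NodePort 30003
curl -s "http://${NODE_IP}:30003/api/v1/query?query=up"

# Because --web.enable-lifecycle is set in the Deployment, a changed ConfigMap
# can be reloaded without restarting the Pod
curl -X POST "http://${NODE_IP}:30003/-/reload"
```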
### Creating the Start | Stop Scripts
The scripts below are nothing special; they simply make starting and stopping more convenient.
- **Start Script**
```bash
# cat start-prometheus.sh
#!/bin/bash
kubectl apply -f prometheus-config-map.yaml
kubectl create configmap prometheus-server-conf --from-file prometheus-config-map.yaml
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f prometheus-node-exporter.yaml
kubectl apply -f prometheus-svc.yaml
```
- **Stop Script**
```bash
# cat stop-prometheus.sh
#!/bin/bash
kubectl delete -f prometheus-svc.yaml
kubectl delete -f prometheus-node-exporter.yaml
kubectl delete -f prometheus-deployment.yaml
kubectl delete -f prometheus-config-map.yaml
kubectl delete configmap prometheus-server-conf
```
After running the Start script, the status is as follows.
```bash
# kubectl -n monitoring get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-8ggc9 1/1 Running 0 8h 10.244.1.184 cube02 <none> <none>
node-exporter-hlpcp 1/1 Running 0 8h 10.244.2.191 cube03 <none> <none>
node-exporter-ph4bs 1/1 Running 0 8h 10.244.3.219 cube04 <none> <none>
prometheus-deployment-544d8d78b9-qp2rf 1/1 Running 0 8h 10.244.1.183 cube02 <none> <none>
```
The ConfigMap status is as follows.
```bash
# kubectl get configmap
NAME DATA AGE
kube-root-ca.crt 1 12d
prometheus-server-conf 1 8h
```
## Deploying Kube State Metrics
> `kube-state-metrics` is a scrape target that generates resource metrics for Pods and other objects. The important point is that it is completely different from the `metrics-server` used for K8s auto-scaling; metrics-server is the service consumed by **HPA/VPA**.
The deployment consists of the following files:
- kube-state-cluster-role.yaml
- kube-state-deployment.yaml
- kube-state-svcaccount.yaml
- kube-state-svc.yaml
### Creating the Start | Stop Scripts
The scripts below are nothing special; they simply make starting and stopping more convenient.
- **Start Script**
```bash
# cat start-kube-state.sh
#!/bin/bash
kubectl apply -f kube-state-cluster-role.yaml
kubectl apply -f kube-state-deployment.yaml
kubectl apply -f kube-state-svcaccount.yaml
kubectl apply -f kube-state-svc.yaml
```
> For the contents of these files, refer to the files in the current directory.
- **Stop Script**
```bash
# cat stop-kube-state.sh
#!/bin/bash
kubectl delete -f kube-state-cluster-role.yaml
kubectl delete -f kube-state-deployment.yaml
kubectl delete -f kube-state-svcaccount.yaml
kubectl delete -f kube-state-svc.yaml
```
After running the Start script above, the status is as follows.
```bash
# kubectl get pod -n kube-system |grep kube-state
kube-state-metrics-5c544bc55-sdzkq 1/1 Running 0 5d20h
```
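To confirm that metrics are actually being exposed, the port can be forwarded locally and queried for a metric used later in the alert rules (a sketch; the deployment name follows kube-state-deployment.yaml):
```bash
# Forward the kube-state-metrics port locally and look for a known metric
kubectl -n kube-system port-forward deploy/kube-state-metrics 8080:8080 &
PF_PID=$!
sleep 2
curl -s http://localhost:8080/metrics | grep '^kube_pod_container_status_running' | head
kill "$PF_PID"
```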
## Integrating Grafana and Adding the DataSource
Prometheus is the system that collects K8s metrics. It does provide basic graphs, but it is not a dedicated visualization solution with diverse graph types and dashboards, so `Grafana` is used to visualize the data.
Create the Deployment and Service objects for the Grafana Pod as follows.
```bash
# cat grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
name: grafana
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:latest
ports:
- name: grafana
containerPort: 3000
volumeMounts:
- name: grafana-storage-volume
mountPath: /var/lib/grafana
nodeSelector:
kubernetes.io/hostname: cube02
volumes:
- name: grafana-storage-volume
persistentVolumeClaim:
claimName: grafana-pvc
securityContext:
runAsUser: 0
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '3000'
spec:
selector:
app: grafana
type: NodePort
ports:
- port: 3000
targetPort: 3000
nodePort: 30004
```
Run it as follows.
```bash
# kubectl create -f grafana.yaml
```
After running it, the status is as follows.
```bash
# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
grafana-584c994b47-7bltv 1/1 Running 0 5d19h
node-exporter-8ggc9 1/1 Running 0 8h
node-exporter-hlpcp 1/1 Running 0 8h
node-exporter-ph4bs 1/1 Running 0 8h
prometheus-deployment-544d8d78b9-qp2rf 1/1 Running 0 8h
```
The Service status is as follows.
```bash
# kubectl get service -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 10.103.8.163 <none> 3000:30004/TCP 5d19h
prometheus-service NodePort 10.100.176.157 <none> 8080:30003/TCP 8h
```
> Access: address:**30004**
> Demo URL : https://grafana-k8s-demo.hongsnet.net <- see the portfolio for login credentials
> Configuration -> DataSource
![grafana_datasource](./images/grafana-datasource.png)
> After clicking the Save & Test button, you should see "Data source is working" on a **green background**.
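When adding the Prometheus data source, the in-cluster Service DNS name can be used as the URL. A minimal connectivity check (assuming busybox `wget` is available in the Grafana image):
```bash
# Data source URL inside the cluster (Service port 8080 from prometheus-svc.yaml):
#   http://prometheus-service.monitoring.svc.cluster.local:8080
kubectl -n monitoring exec deploy/grafana -- wget -qO- http://prometheus-service:8080/-/healthy
```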
## Deploying AlertManager
AlertManager is the service that actually delivers alerts, via webhook or email, when a Prometheus rule condition is met.
The configuration consists of the following files:
- alertmanager-configmap.yaml
- alert-template-configmap.yaml
> Reference: https://twofootdog.tistory.com/17
- alertmanager.yaml
- alertmanager-svc.yaml
For the contents of these four configuration files, refer to the files in the current directory; the ConfigMap below is the one that sends webhooks to Mattermost.
```bash
# cat alertmanager-configmap.yaml
kind: ConfigMap
apiVersion: v1
metadata:
name: alertmanager-config
namespace: monitoring
data:
config.yml: |-
global:
#resolve_timeout: 1m
slack_api_url: 'https://chat.hongsnet.net/hooks/XXXXXXXXX'
route:
receiver: 'slack-notifications'
repeat_interval: 5m
group_wait: 10s
group_interval: 1m
receivers:
- name: 'slack-notifications'
slack_configs:
- channel: '#grafana'
send_resolved: true
icon_url: https://avatars3.githubusercontent.com/u/3380462
title: |-
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
{{- if gt (len .CommonLabels) (len .GroupLabels) -}}
{{" "}}(
{{- with .CommonLabels.Remove .GroupLabels.Names }}
{{- range $index, $label := .SortedPairs -}}
{{ if $index }}, {{ end }}
{{- $label.Name }}="{{ $label.Value -}}"
{{- end }}
{{- end -}}
)
{{- end }}
text: >-
#{{ range .Alerts -}}
*Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }}*{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
```
> Alerts are sent to the grafana channel of https://chat.hongsnet.net/hooks/XXXXXXXXX.
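Once AlertManager is running (deployed via the scripts below), the webhook path can be exercised end to end by posting a hand-crafted alert to its NodePort (30005); the node IP is a placeholder:
```bash
NODE_IP=192.168.0.10   # replace with the IP of any cluster node

# POST a test alert to the AlertManager v2 API; it should appear in the
# configured webhook channel after group_wait elapses
curl -XPOST "http://${NODE_IP}:30005/api/v2/alerts" \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]'
```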
### Creating the Start | Stop Scripts
The scripts below are nothing special; they simply make starting and stopping more convenient.
- **Start Script**
```bash
# cat start-alertmanager.sh
#!/bin/bash
kubectl apply -f alertmanager-configmap.yaml
kubectl apply -f alert-template-configmap.yaml
kubectl apply -f alertmanager.yaml
kubectl apply -f alertmanager-svc.yaml
```
- **Stop Script**
```bash
# cat stop-alertmanager.sh
#!/bin/bash
kubectl delete -f alertmanager-configmap.yaml
kubectl delete -f alert-template-configmap.yaml
kubectl delete -f alertmanager.yaml
kubectl delete -f alertmanager-svc.yaml
```
After running it, the status is as follows.
```bash
# kubectl -n monitoring get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-65f5c498ff-q27wj 1/1 Running 0 14h 10.244.1.90 cube02 <none> <none>
grafana-584c994b47-7bltv 1/1 Running 0 5d20h 10.244.1.199 cube02 <none> <none>
node-exporter-8ggc9 1/1 Running 0 8h 10.244.1.184 cube02 <none> <none>
node-exporter-hlpcp 1/1 Running 0 8h 10.244.2.191 cube03 <none> <none>
node-exporter-ph4bs 1/1 Running 0 8h 10.244.3.219 cube04 <none> <none>
prometheus-deployment-544d8d78b9-qp2rf 1/1 Running 0 8h 10.244.1.183 cube02 <none> <none>
```
The Service status is as follows.
```bash
# kubectl get service -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager NodePort 10.110.183.32 <none> 9093:30005/TCP 14h
grafana NodePort 10.103.8.163 <none> 3000:30004/TCP 5d20h
prometheus-service NodePort 10.100.176.157 <none> 8080:30003/TCP 8h
```
```

The manifests referenced in the sections above (the AlertManager templates, Service and Deployment, the Kube State Metrics objects, and the Prometheus ConfigMap) are included below for reference; the file names follow the lists above.

```bash
# cat alert-template-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: null
name: alertmanager-templates
namespace: monitoring
data:
default.tmpl: |
{{ define "__alertmanager" }}AlertManager{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__description" }}{{ end }}
{{ define "__text_alert_list" }}{{ range . }}Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Annotations:
{{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Source: {{ .GeneratorURL }}
{{ end }}{{ end }}
{{ define "slack.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "slack.default.username" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "slack.default.fallback" }}{{ template "slack.default.title" . }} | {{ template "slack.default.titlelink" . }}{{ end }}
{{ define "slack.default.pretext" }}{{ end }}
{{ define "slack.default.titlelink" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "slack.default.iconemoji" }}{{ end }}
{{ define "slack.default.iconurl" }}{{ end }}
{{ define "slack.default.text" }}{{ end }}
{{ define "hipchat.default.from" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "hipchat.default.message" }}{{ template "__subject" . }}{{ end }}
{{ define "pagerduty.default.description" }}{{ template "__subject" . }}{{ end }}
{{ define "pagerduty.default.client" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "pagerduty.default.clientURL" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "pagerduty.default.instances" }}{{ template "__text_alert_list" . }}{{ end }}
{{ define "opsgenie.default.message" }}{{ template "__subject" . }}{{ end }}
{{ define "opsgenie.default.description" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{ define "opsgenie.default.source" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "victorops.default.message" }}{{ template "__subject" . }} | {{ template "__alertmanagerURL" . }}{{ end }}
{{ define "victorops.default.from" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "email.default.subject" }}{{ template "__subject" . }}{{ end }}
{{ define "email.default.html" }}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
Style and HTML derived from https://github.com/mailgun/transactional-email-templates
The MIT License (MIT)
Copyright (c) 2014 Mailgun
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<head style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<meta name="viewport" content="width=device-width" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
<title style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">{{ template "__subject" . }}</title>
</head>
<body itemscope="" itemtype="http://schema.org/EmailMessage" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; -webkit-font-smoothing: antialiased; -webkit-text-size-adjust: none; height: 100%; line-height: 1.6em; width: 100% !important; background-color: #f6f6f6; margin: 0; padding: 0;" bgcolor="#f6f6f6">
<table style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; background-color: #f6f6f6; margin: 0;" bgcolor="#f6f6f6">
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0;" valign="top"></td>
<td width="600" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; display: block !important; max-width: 600px !important; clear: both !important; width: 100% !important; margin: 0 auto; padding: 0;" valign="top">
<div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; max-width: 600px; display: block; margin: 0 auto; padding: 0;">
<table width="100%" cellpadding="0" cellspacing="0" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; border-radius: 3px; background-color: #fff; margin: 0; border: 1px solid #e9e9e9;" bgcolor="#fff">
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 16px; vertical-align: top; color: #fff; font-weight: 500; text-align: center; border-radius: 3px 3px 0 0; background-color: #E6522C; margin: 0; padding: 20px;" align="center" bgcolor="#E6522C" valign="top">
{{ .Alerts | len }} alert{{ if gt (len .Alerts) 1 }}s{{ end }} for {{ range .GroupLabels.SortedPairs }}
{{ .Name }}={{ .Value }}
{{ end }}
</td>
</tr>
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 10px;" valign="top">
<table width="100%" cellpadding="0" cellspacing="0" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
<a href="{{ template "__alertmanagerURL" . }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #FFF; text-decoration: none; line-height: 2em; font-weight: bold; text-align: center; cursor: pointer; display: inline-block; border-radius: 5px; text-transform: capitalize; background-color: #348eda; margin: 0; border-color: #348eda; border-style: solid; border-width: 10px 20px;">View in {{ template "__alertmanager" . }}</a>
</td>
</tr>
{{ if gt (len .Alerts.Firing) 0 }}
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">[{{ .Alerts.Firing | len }}] Firing</strong>
</td>
</tr>
{{ end }}
{{ range .Alerts.Firing }}
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Labels</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
{{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ if gt (len .Annotations) 0 }}<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Annotations</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
<a href="{{ .GeneratorURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #348eda; text-decoration: underline; margin: 0;">Source</a><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
</td>
</tr>
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
{{ if gt (len .Alerts.Firing) 0 }}
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
<hr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
</td>
</tr>
{{ end }}
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">[{{ .Alerts.Resolved | len }}] Resolved</strong>
</td>
</tr>
{{ end }}
{{ range .Alerts.Resolved }}
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Labels</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
{{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ if gt (len .Annotations) 0 }}<strong style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">Annotations</strong><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
{{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}<br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />{{ end }}
<a href="{{ .GeneratorURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; color: #348eda; text-decoration: underline; margin: 0;">Source</a><br style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;" />
</td>
</tr>
{{ end }}
</table>
</td>
</tr>
</table>
<div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; clear: both; color: #999; margin: 0; padding: 20px;">
<table width="100%" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<tr style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 12px; vertical-align: top; text-align: center; color: #999; margin: 0; padding: 0 0 20px;" align="center" valign="top"><a href="{{ .ExternalURL }}" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 12px; color: #999; text-decoration: underline; margin: 0;">Sent by {{ template "__alertmanager" . }}</a></td>
</tr>
</table>
</div></div>
</td>
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0;" valign="top"></td>
</tr>
</table>
</body>
</html>
{{ end }}
{{ define "pushover.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "pushover.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
{{ if gt (len .Alerts.Firing) 0 }}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "pushover.default.url" }}{{ template "__alertmanagerURL" . }}{{ end }}
slack.tmpl: |
{{ define "slack.devops.text" }}
{{range .Alerts}}{{.Annotations.DESCRIPTION}}
{{end}}
{{ end }}
```

```bash
# cat alertmanager-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
prometheus.io/path: /
prometheus.io/port: '8080'
spec:
selector:
app: alertmanager
type: NodePort
ports:
- port: 9093
targetPort: 9093
nodePort: 30005
```

```bash
# cat alertmanager.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
name: alertmanager
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:latest
args:
- "--config.file=/etc/alertmanager/config.yml"
- "--storage.path=/alertmanager"
ports:
- name: alertmanager
containerPort: 9093
volumeMounts:
- name: config-volume
mountPath: /etc/alertmanager
- name: templates-volume
mountPath: /etc/alertmanager-templates
- name: alertmanager-storage-volume
mountPath: /alertmanager
nodeSelector:
kubernetes.io/hostname: cube02
volumes:
- name: config-volume
configMap:
name: alertmanager-config
- name: templates-volume
configMap:
name: alertmanager-templates
- name: alertmanager-storage-volume
persistentVolumeClaim:
claimName: alertmanager-pvc
securityContext:
runAsUser: 0
```

```bash
# cat kube-state-cluster-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups:
- ""
resources: ["configmaps", "secrets", "nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
verbs: ["list","watch"]
- apiGroups:
- extensions
resources: ["daemonsets", "deployments", "replicasets", "ingresses"]
verbs: ["list", "watch"]
- apiGroups:
- apps
resources: ["statefulsets", "daemonsets", "deployments", "replicasets"]
verbs: ["list", "watch"]
- apiGroups:
- batch
resources: ["cronjobs", "jobs"]
verbs: ["list", "watch"]
- apiGroups:
- autoscaling
resources: ["horizontalpodautoscalers"]
verbs: ["list", "watch"]
- apiGroups:
- authentication.k8s.io
resources: ["tokenreviews"]
verbs: ["create"]
- apiGroups:
- authorization.k8s.io
resources: ["subjectaccessreviews"]
verbs: ["create"]
- apiGroups:
- policy
resources: ["poddisruptionbudgets"]
verbs: ["list", "watch"]
- apiGroups:
- certificates.k8s.io
resources: ["certificatesigningrequests"]
verbs: ["list", "watch"]
- apiGroups:
- storage.k8s.io
resources: ["storageclasses", "volumeattachments"]
verbs: ["list", "watch"]
- apiGroups:
- admissionregistration.k8s.io
resources: ["mutatingwebhookconfigurations", "validatingwebhookconfigurations"]
verbs: ["list", "watch"]
- apiGroups:
- networking.k8s.io
resources: ["networkpolicies"]
verbs: ["list", "watch"]
```

```bash
# cat kube-state-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: kube-state-metrics
name: kube-state-metrics
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
spec:
containers:
- image: quay.io/coreos/kube-state-metrics:v1.8.0
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
name: kube-state-metrics
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: kube-state-metrics
```

```bash
# cat kube-state-svc.yaml
apiVersion: v1
kind: Service
metadata:
labels:
app: kube-state-metrics
name: kube-state-metrics
namespace: kube-system
spec:
clusterIP: None
ports:
- name: http-metrics
port: 8080
targetPort: http-metrics
- name: telemetry
port: 8081
targetPort: telemetry
selector:
app: kube-state-metrics
```

```bash
# cat kube-state-svcaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: kube-system
```

```bash
# cat prometheus-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-server-conf
labels:
name: prometheus-server-conf
namespace: monitoring
data:
prometheus.rules: |-
groups:
- name: Container CPU Usage
rules:
# cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
# If you want to exclude it from this alert, exclude the serie having an empty name: container_cpu_usage_seconds_total{name!=""}
- alert: ContainerCpuUsage
expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 200
for: 2m
labels:
severity: warning
annotations:
summary: "Container CPU usage (instance {{ $labels.instance }})"
description: "Container CPU usage is above 200%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Container Memory Usage
rules:
# See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
- alert: ContainerMemoryUsage
expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
for: 2m
labels:
severity: warning
annotations:
summary: "Container Memory usage (instance {{ $labels.instance }})"
description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Container Volume Usage
rules:
- alert: ContainerVolumeUsage
expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
for: 2m
labels:
severity: warning
annotations:
summary: "Container Volume usage (instance {{ $labels.instance }})"
description: "Container Volume usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Container Volume I/O Usage
rules:
- alert: ContainerVolumeIoUsage
expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
for: 2m
labels:
severity: warning
annotations:
summary: "Container Volume IO usage (instance {{ $labels.instance }})"
description: "Container Volume IO usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Container High Throttle Rate
rules:
- alert: ContainerHighThrottleRate
expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
for: 2m
labels:
severity: warning
annotations:
summary: "Container high throttle rate (instance {{ $labels.instance }})"
description: "Container is being throttled\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Instnce Down Alert
rules:
- alert: Instance Down
expr: up == 0 and up{job="kubernetes-service-endpoints"} != 0
for: 1m
labels:
#severity: warning
severity: fatal
annotations:
summary: "Instance {{ $labels.instance }} is Down"
description: "To signal, that a target (e.g. instance) is down, we simply check the up metric:"
- name: Hardware alerts
rules:
- alert: Node down
expr: up{job="node_exporter"} == 0
for: 3m
labels:
severity: fatal
annotations:
title: "Node {{ $labels.instance }} is down"
description: "Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down."
- name: Node LoadAverage 15 minute over Alerts
rules:
- alert: Node LoadAverage 15minute Usage(50% Upper)
expr: node_load15 >= 50
for: 1m
labels:
severity: fatal
annotations:
summary: "Instance {{ $labels.instance }} - high load average"
description: "{{ $labels.instance }} (measured by {{ $labels.job }}) has high load average ({{ $value }}) over 15 minutes."
# 메모리 부족
- name: Node Memory 10% Under Free
rules:
- alert: HostOutOfMemory 10% Under Free
expr: (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes)) * 100 <= 10
for: 10s
annotations:
summary: "Host out of memory instance {{ $labels.instance }}"
description: "{{ $labels.instance }} has more than 10% Under of its memory free."
- name: Node DiskFree 20% Under
rules:
- alert: Node DiskSpace 20%Free Under
expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 80
for: 1m
labels:
severity: fatal
annotations:
summary: "Instance {{ $labels.instance }} is low on disk space"
description: "{{ $labels.instance }} has only {{ $value }}% free DiskSpace Usage"
- name: Hello-World POD Container Monitoring
rules:
- alert: The number of hello-world POD(containers) is less than 5
expr: count(kube_pod_container_status_running{container="hello-world"}) < 5 or absent(kube_pod_container_status_running{container="hello-world"}) == 1
for: 1m
labels:
severity: fatal
annotations:
summary: "Hello-World POD DOWN"
description: "The number of hello-world POD( {{ $value }} ) is less than 5."
- name: Prometheus is Scraping Exporters Slowly
rules:
- alert: PrometheusTargetScrapingSlow
expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus target scraping slow instance {{ $labels.instance }}"
description: "Prometheus is scraping exporters slowly VALUE = {{ $value }} LABELS: {{ $labels }}"
- name: Prometheus not connected to alertmanager
rules:
- alert: PrometheusNotConnectedToAlertmanager
expr: prometheus_notifications_alertmanagers_discovered < 1
for: 0m
labels:
severity: critical
annotations:
summary: "Prometheus not connected to alertmanager instance {{ $labels.instance }}"
description: "Prometheus cannot connect the alertmanager VALUE = {{ $value }} LABELS: {{ $labels }}"
- name: Prometheus rule evaluation failures
rules:
- alert: PrometheusRuleEvaluationFailures
expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Prometheus rule evaluation failures instance {{ $labels.instance }}"
description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts. VALUE = {{ $value }} LABELS: {{ $labels }}"
- name: Prometheus rule evaluation Slow
rules:
- alert: PrometheusRuleEvaluationSlow
expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus rule evaluation slow instance {{ $labels.instance }}"
description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query. VALUE = {{ $value }} LABELS: {{ $labels }}"
- name: Prometheus notifications Backlog
rules:
- alert: PrometheusNotificationsBacklog
expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "Prometheus notifications backlog (instance {{ $labels.instance }})"
description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Prometheus AlertManager notification Failing
rules:
- alert: PrometheusAlertmanagerNotificationFailing
expr: rate(alertmanager_notifications_failed_total[1m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Prometheus AlertManager notification failing (instance {{ $labels.instance }})"
description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Prometheus target Empty
rules:
- alert: PrometheusTargetEmpty
expr: prometheus_sd_discovered_targets == 0
for: 0m
labels:
severity: critical
annotations:
summary: "Prometheus target empty (instance {{ $labels.instance }})"
description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Prometheus large Scrape
rules:
- alert: PrometheusLargeScrape
expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus large scrape (instance {{ $labels.instance }})"
description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Prometheus target scrape Duplicate
rules:
- alert: PrometheusTargetScrapeDuplicate
expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "Prometheus target scrape duplicate (instance {{ $labels.instance }})"
description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Memory under memory pressure
rules:
- alert: HostMemoryUnderMemoryPressure
expr: rate(node_vmstat_pgmajfault[1m]) > 1000
for: 2m
labels:
severity: warning
annotations:
summary: "Host memory under memory pressure (instance {{ $labels.instance }})"
description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node unusual Network throughput IN
rules:
- alert: HostUnusualNetworkThroughputIn
expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node unusual Network throughtput OUT
rules:
- alert: HostUnusualNetworkThroughputOut
expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Host unusual network throughput out (instance {{ $labels.instance }})"
description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node unusal Disk Read Rate
rules:
- alert: HostUnusualDiskReadRate
expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
for: 5m
labels:
severity: warning
annotations:
summary: "Host unusual disk read rate (instance {{ $labels.instance }})"
description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node unusal Disk Write Rate
rules:
- alert: HostUnusualDiskWriteRate
expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
for: 2m
labels:
severity: warning
annotations:
summary: "Host unusual disk write rate (instance {{ $labels.instance }})"
description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node High CPU Load
rules:
- alert: HostHighCpuLoad
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
for: 0m
labels:
severity: warning
annotations:
summary: "Host high CPU load (instance {{ $labels.instance }})"
description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node CPU steal Noisy Neighbor
rules:
- alert: HostCpuStealNoisyNeighbor
expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
for: 0m
labels:
severity: warning
annotations:
summary: "Host CPU steal noisy neighbor (instance {{ $labels.instance }})"
description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Context Switching
rules:
# 1000 context switches is an arbitrary number.
# Alert threshold depends on nature of application.
# Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
- alert: HostContextSwitching
expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 3000
for: 0m
labels:
severity: warning
annotations:
summary: "Host context switching (instance {{ $labels.instance }})"
description: "Context switching is growing on node (> 3000/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Swap is filling UP
rules:
- alert: HostSwapIsFillingUp
expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
for: 2m
labels:
severity: warning
annotations:
summary: "Host swap is filling up (instance {{ $labels.instance }})"
description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Network Receive Erros
rules:
- alert: HostNetworkReceiveErrors
expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
for: 2m
labels:
severity: warning
annotations:
summary: "Host Network Receive Errors (instance {{ $labels.instance }})"
description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Network Transmit Erros
rules:
- alert: HostNetworkTransmitErrors
expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
for: 2m
labels:
severity: warning
annotations:
summary: "Host Network Transmit Errors (instance {{ $labels.instance }})"
description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Network Interface Saturated
rules:
- alert: HostNetworkInterfaceSaturated
expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8
for: 1m
labels:
severity: warning
annotations:
summary: "Host Network Interface Saturated (instance {{ $labels.instance }})"
description: "The network interface \"{{ $labels.interface }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Conntrack Limit
rules:
- alert: HostConntrackLimit
expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Host conntrack limit (instance {{ $labels.instance }})"
description: "The number of conntrack is approching limit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Clock Skew
rules:
- alert: HostClockSkew
expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
for: 2m
labels:
severity: warning
annotations:
summary: "Host clock skew (instance {{ $labels.instance }})"
description: "Clock skew detected. Clock is out of sync.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- name: Node Clock not Synchronising
rules:
- alert: HostClockNotSynchronising
expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
for: 2m
labels:
severity: warning
annotations:
summary: "Host clock not synchronising (instance {{ $labels.instance }})"
description: "Clock not synchronising.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
prometheus.yml: |-
global:
scrape_interval: 5s
evaluation_interval: 5s
rule_files:
- /etc/prometheus/prometheus.rules
alerting:
alertmanagers:
- scheme: http
static_configs:
- targets:
- "alertmanager.monitoring.svc:9093"
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
- job_name: 'kubernetes-cadvisor'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- job_name: 'ext-node-exporter'
static_configs:
- targets: ['172.24.0.222:9100', '172.24.0.223:9100', '172.24.0.224:9100', '172.24.0.225:9100']
```

# Monitoring ITEM (Common)
> Metric thresholds and their configuration are defined using Prometheus Alert Rules.