SkyWalking

1 Deploying on Kubernetes

cd /opt/yaml/skywalking
kubectl apply -f .
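
Every manifest in this section targets the skywalking namespace, but none of them creates it. Assuming the namespace does not exist yet, a minimal manifest like the one below could be applied before the other files. The file name skywalking-namespace.yaml is hypothetical and not part of the original set.

```yaml
# skywalking-namespace.yaml (hypothetical file name)
# Creates the namespace that the ServiceAccount, Deployments and Services refer to.
apiVersion: v1
kind: Namespace
metadata:
  name: skywalking
  labels:
    app: skywalking
```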

1.1 skywalking-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: skywalking
  namespace: skywalking
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skywalking
  labels:
    app: skywalking
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skywalking
  labels:
    app: skywalking
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skywalking
subjects:
  - kind: ServiceAccount
    name: skywalking
    namespace: skywalking

1.2 skywalking-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: skywalking
  name: skywalking
  namespace: skywalking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking
  template:
    metadata:
      labels:
        app: skywalking
    spec:
      containers:
        - env:
            - name: SW_STORAGE
              value: elasticsearch # storage backend
            - name: SW_STORAGE_ES_CLUSTER_NODES
              value: '192.168.64.45:30092'
            - name: SW_CORE_RECORD_DATA_TTL # lifetime of record data, in days
              value: '15'
            - name: SW_CORE_METRICS_DATA_TTL # lifetime of metrics data, in days; metricsDataTTL >= recordDataTTL
              value: '15'
          # envFrom:
          #   - prefix: SW_
          #     configMapRef:
          #       name: skywalking-cm
          image: 192.168.64.33:5000/skywalking/skywalking-oap-server:9.2.0
          imagePullPolicy: IfNotPresent
          name: skywalking
          ports:
            - containerPort: 12800
              name: http
              protocol: TCP
            - containerPort: 11800
              name: grpc
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 2Gi
            requests:
              cpu: '1'
              memory: 2Gi
          volumeMounts:
            - mountPath: /etc/localtime
              name: volume-localtime
      volumes:
        - hostPath:
            path: /etc/localtime
            type: ''
          name: volume-localtime
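
The commented-out envFrom block in this manifest points at a ConfigMap named skywalking-cm, which the original does not show. The sketch below assumes the same four settings were moved into it. With prefix: SW_, Kubernetes prepends SW_ to every key, so the key STORAGE reaches the container as SW_STORAGE.

```yaml
# Assumed contents of skywalking-cm; the original only references its name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-cm
  namespace: skywalking
data:
  # Keys omit the SW_ prefix because envFrom.prefix adds it.
  STORAGE: elasticsearch
  STORAGE_ES_CLUSTER_NODES: "192.168.64.45:30092"
  CORE_RECORD_DATA_TTL: "15"
  CORE_METRICS_DATA_TTL: "15"
```

ConfigMap values must be strings, which is why the TTL values are quoted.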

1.3 skywalking-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: skywalking-svc
  namespace: skywalking
  labels:
    app: skywalking
spec:
  type: NodePort
  ports:
    - name: http
      port: 12800
      protocol: TCP
      targetPort: 12800
    - name: grpc
      port: 11800
      protocol: TCP
      targetPort: 11800
      nodePort: 32105
  selector:
    app: skywalking

1.4 skywalking-ui-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: skywalking-ui
  name: skywalking-ui
  namespace: skywalking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
        - env:
            - name: SW_OAP_ADDRESS
              value: "http://skywalking-svc:12800"
          image: 192.168.64.33:5000/skywalking/skywalking-ui:9.2.0
          imagePullPolicy: IfNotPresent
          name: skywalking-ui
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 1Gi
            requests:
              cpu: '1'
              memory: 1Gi
          volumeMounts:
            - mountPath: /etc/localtime
              name: volume-localtime
      volumes:
        - hostPath:
            path: /etc/localtime
            type: ''
          name: volume-localtime
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui-svc
  namespace: skywalking
  labels:
    app: skywalking-ui
spec:
  type: NodePort
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
      nodePort: 32104
  selector:
    app: skywalking-ui

2 Using the skywalking-agent probe

# Upload the agent to the repository: jenkins-yaml/test/skywalking-agent

# Modify the Dockerfile
COPY ./test/skywalking-agent /app/skywalking-agent
ENV JVM_OPTS="-javaagent:/app/skywalking-agent/skywalking-agent.jar -Dskywalking.agent.service_name={SERVICE_NAME} -Xss256k -Duser.timezone=Asia/Shanghai -Djava.security.egd=file:/dev/./urandom -Dspring.profiles.active=test -XX:+UseG1GC"

# Build the service with Jenkins

3 Ways to adapt the application

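# There are two ways to set up the agent:
1. Package the agent in the same image as the application: simple to implement
2. Use a Kubernetes sidecar: more flexible

SW_AGENT_NAME: the name of the application
SW_AGENT_COLLECTOR_BACKEND_SERVICES: skywalking:11800 (the OAP gRPC address; with the Service defined in 1.3, the in-cluster name is skywalking-svc)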
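
The second approach can be sketched as follows: an init container copies the agent into a shared emptyDir volume, and the application container loads it through JAVA_TOOL_OPTIONS. This is only an illustration under several assumptions. The application name demo-app, its image, and its namespace are hypothetical. The agent image tag apache/skywalking-java-agent:8.8.0-alpine and its /skywalking/agent path are assumed from the upstream Docker image and should be checked against the registry in use. The backend address assumes the skywalking-svc Service from section 1.3.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                 # hypothetical application
  namespace: default             # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      volumes:
        # Shared scratch volume that carries the agent from the init container to the app.
        - name: skywalking-agent
          emptyDir: {}
      initContainers:
        - name: agent-container
          image: apache/skywalking-java-agent:8.8.0-alpine   # assumed image and tag
          command: ["/bin/sh"]
          # Copies the agent directory into the shared volume as /agent/agent.
          args: ["-c", "cp -R /skywalking/agent /agent/"]
          volumeMounts:
            - name: skywalking-agent
              mountPath: /agent
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:latest        # hypothetical image
          env:
            # The JVM picks this up automatically, so the image needs no agent-specific changes.
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
            - name: SW_AGENT_NAME
              value: demo-app
            - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
              value: "skywalking-svc.skywalking:11800"       # assumes the Service from 1.3
          volumeMounts:
            - name: skywalking-agent
              mountPath: /skywalking
```

Compared with baking the agent into the image, this keeps the agent version independent of the application build, at the cost of a slightly longer Pod start.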

4 Downloading skywalking-agent

https://archive.apache.org/dist/skywalking/java-agent/8.8.0/

5 Collecting logs

Add logback to the Java application and include the trace ID in the log output.

6 Alarms

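metrics-name: the metric name, which is also the metric name used in the OAL script. Alarms can be configured on metrics of these scopes: service, instance, endpoint, service relation, instance relation, and endpoint relation. The long, double, and int types are supported.
op: the comparison operator
threshold: the threshold value
period: the length of the time window, in minutes, over which the alarm rule is evaluated
count: within one time window, an alarm is triggered once the number of times the value satisfies op against the threshold reaches count
silence-period: after an alarm fires at time N, no further alarm is sent between N and N + silence-period
message: the notification message sent when the alarm fires

# Add a webhook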
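
To make the fields above concrete, here is a sketch of one rule plus a webhook in the alarm-settings.yml layout used by the 9.2.0 OAP server. The rule name, threshold values, and webhook URL are illustrative assumptions, not values from the original setup. service_resp_time is one of the built-in service metrics.

```yaml
rules:
  # Rule names must end with "_rule".
  service_resp_time_rule:
    metrics-name: service_resp_time   # OAL metric: average service response time in ms
    op: ">"
    threshold: 1000                   # assumed threshold, in ms
    period: 10                        # evaluate over the last 10 minutes
    count: 3                          # fire when the threshold is exceeded in 3 of those minutes
    silence-period: 5                 # stay quiet for 5 minutes after firing
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

webhooks:
  # Hypothetical receiver; the OAP server POSTs the alarm messages here as JSON.
  - http://alert-receiver.example.com/notify/
```

With these numbers the rule fires at most once every five minutes for a given service, which keeps a sustained slowdown from flooding the receiver.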

7 Custom tracing

Add the dependency
Get the TraceId
@Trace
@Tags

7 Dashboard metrics explained

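# Service
Load (calls/min): calls per minute over the selected period
Success Rate (%): request success rate over the selected period
Latency (ms): response latency over the selected period
Apdex: Apdex performance score over the selected period

# Overview
Service Avg Response Time (ms): average service response time
Service Apdex: line chart of the Apdex score
Service Response Time Percentile (ms): latency percentiles
Service Load (calls / min): line chart of calls per minute
Success Rate (%): line chart of the request success rate
Service Instances Load (calls / min): line chart of calls per minute for each instance
Slow Service Instance (ms): average latency of each service instance
Service Instance Success Rate (%): request success rate of each service instance

# Instance metrics
Service Instance Load (CPM - calls per minute): calls per minute for the instance
Service Instance Successful Rate (%): call success rate of the instance
Service Instance Latency (ms): response latency of the instance
JVM CPU (java service) %: percentage of CPU used by the JVM
JVM Memory (java service) (MB): JVM memory usage, made up of four metrics: instance_jvm_memory_heap (heap memory in use), instance_jvm_memory_heap_max (maximum heap memory), instance_jvm_memory_noheap (non-heap memory in use), and instance_jvm_memory_noheap_max (maximum non-heap memory)
JVM GC Time (ms): JVM garbage collection time, covering young GC and old GC
JVM GC Count: number of JVM garbage collections, covering young GC count and old GC count
JVM Thread Count (java service): number of threads

# Endpoint metrics
Endpoint Load in Current Service (CPM / PPM): requests per minute for each endpoint (API)
Slow Endpoints in Current Service (ms): top N endpoints (APIs) with the slowest average response time, in ms
Successful Rate in Current Service (%): request success rate of each endpoint (API)
Endpoint Load: number of requests to the current endpoint in each time interval
Endpoint Avg Response Time: average response time of the current endpoint in each time interval
Endpoint Response Time Percentile (ms): response time percentiles of the current endpoint in each time interval
Endpoint Successful Rate (%): request success rate of the current endpoint in each time interval

# Database
Database Avg Response Time (ms): average response time of the current database, in ms
Database Access Successful Rate (%): access success rate of the current database
Database Traffic (CPM: Calls Per Minute): requests per minute to the current database
Database Access Latency Percentile (ms): database response time at different percentiles, in ms
Slow Statements (ms): top N slow queries, in ms
All Database Loads (CPM: Calls Per Minute): all databases ranked by request volume
Un-Health Databases: all databases ranked by request success rate, with the database that has the most failed requests at the top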

8 Profiling

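# Profiling works by creating a task that samples a chosen endpoint and produces a more detailed report. So far, compared with tracing, it appears to add thread stack information and slow-method hints.
Service name
Endpoint name
Monitoring time
Monitoring duration
Monitoring start time
Monitoring interval
Maximum sample count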