SkyWalking

1 Deploying on Kubernetes

cd /opt/yaml/skywalking
kubectl apply -f .
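
The namespaced manifests in this section all use namespace: skywalking, but none of them creates it. If the namespace does not already exist in the cluster, a minimal sketch of a manifest for it could look like the following. The file name is hypothetical, and the manifest would need to be applied before the others.

```yaml
# skywalking-namespace.yaml (hypothetical file name, not part of the original set)
apiVersion: v1
kind: Namespace
metadata:
  name: skywalking
```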

1.1 skywalking-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: skywalking
  namespace: skywalking
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skywalking 
  labels:
    app: skywalking
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skywalking
  labels:
    app: skywalking
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skywalking
subjects:
  - kind: ServiceAccount
    name: skywalking
    namespace: skywalking

1.2 skywalking-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: skywalking
  name: skywalking
  namespace: skywalking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking
  template:
    metadata:
      labels:
        app: skywalking
    spec:
      containers:
        - env:
          - name: SW_STORAGE
            value: elasticsearch  # storage backend
          - name: SW_STORAGE_ES_CLUSTER_NODES
            value: '192.168.64.45:30092'
          - name: SW_CORE_RECORD_DATA_TTL   # lifetime of record data, in days
            value: '15'
          - name: SW_CORE_METRICS_DATA_TTL   # lifetime of metrics data, in days; metricsDataTTL >= recordDataTTL
            value: '15'
        #- envFrom:
        #  - prefix: SW_
        #    configMapRef:
        #      name: skywalking-cm
          image: 192.168.64.33:5000/skywalking/skywalking-oap-server:9.2.0
          imagePullPolicy: IfNotPresent
          name: skywalking
          ports:
            - containerPort: 12800
              name: http
              protocol: TCP
            - containerPort: 11800
              name: grpc
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 2Gi
            requests:
              cpu: '1'
              memory: 2Gi
          volumeMounts:
            - mountPath: /etc/localtime
              name: volume-localtime
      volumes:
        - hostPath:
            path: /etc/localtime
            type: ''
          name: volume-localtime
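
The Deployment above does not reference the ServiceAccount created in skywalking-rbac.yaml, so the pod runs under the namespace's default account and the ClusterRole binding has no effect on it. If the OAP server needs those pod permissions (an assumption; these notes do not say what the RBAC rules are for), the pod template could name the account as in this fragment:

```yaml
# Fragment of skywalking-deployment.yaml (spec.template.spec); only the added key is shown.
spec:
  template:
    spec:
      serviceAccountName: skywalking   # ServiceAccount defined in skywalking-rbac.yaml
```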

1.3 skywalking-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: skywalking-svc
  namespace: skywalking
  labels:
    app: skywalking
spec:
  type: NodePort
  ports:
    - name: http
      port: 12800
      protocol: TCP
      targetPort: 12800
    - name: grpc
      port: 11800
      protocol: TCP
      targetPort: 11800
      nodePort: 32105
  selector:
    app: skywalking

1.4 skywalking-ui-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: skywalking-ui
  name: skywalking-ui
  namespace: skywalking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
        - env:
            - name: SW_OAP_ADDRESS
              value: "http://skywalking-svc:12800"          
          image: 192.168.64.33:5000/skywalking/skywalking-ui:9.2.0
          imagePullPolicy: IfNotPresent         
          name: skywalking-ui
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 1Gi
            requests:
              cpu: '1'
              memory: 1Gi
          volumeMounts:
            - mountPath: /etc/localtime
              name: volume-localtime
      volumes:
        - hostPath:
            path: /etc/localtime
            type: ''
          name: volume-localtime
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui-svc
  namespace: skywalking
  labels:
    app: skywalking-ui
spec:
  type: NodePort
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
      nodePort: 32104
  selector:
    app: skywalking-ui

2 Using the skywalking-agent probe

# Upload the agent to the repository: jenkins-yaml/test/skywalking-agent

# Modify the Dockerfile ({SERVICE_NAME} is a placeholder for the service's name)
COPY ./test/skywalking-agent /app/skywalking-agent
ENV JVM_OPTS="-javaagent:/app/skywalking-agent/skywalking-agent.jar -Dskywalking.agent.service_name={SERVICE_NAME} -Xss256k -Duser.timezone=Asia/Shanghai -Djava.security.egd=file:/dev/./urandom -Dspring.profiles.active=test -XX:+UseG1GC"

# Build the service with Jenkins
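
The JVM_OPTS line above sets the service name but not the collector address. Unless agent.config is edited, the agent falls back to its default of 127.0.0.1:11800. One way to point it at the OAP Service from section 1.3 is an environment variable on the application container. This is a minimal sketch, assuming a hypothetical application container named demo-app running in a different namespace from the OAP server.

```yaml
# Fragment of the application's Deployment (spec.template.spec).
# "demo-app" and its image path are hypothetical placeholders.
containers:
  - name: demo-app
    image: 192.168.64.33:5000/demo/demo-app:latest
    env:
      # gRPC address of the OAP Service from skywalking-service.yaml,
      # written as <service>.<namespace>:<port>
      - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
        value: "skywalking-svc.skywalking:11800"
```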

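3 Ways to attach the agent to an application

# There are two ways to set up the agent:
1. Package the agent and the application in the same image: simple to implement
2. Use a Kubernetes sidecar: more flexible

SW_AGENT_NAME: the name of the application (service)
SW_AGENT_COLLECTOR_BACKEND_SERVICES: skywalking:11800 (host:port of the OAP gRPC endpoint; with the Service from section 1.3 this would be skywalking-svc.skywalking:11800)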
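
The sidecar approach is commonly realized with an init container that copies the agent into a volume shared with the application container. The fragment below is a minimal sketch of that pattern, not the setup used in these notes. It assumes the public apache/skywalking-java-agent:8.8.0-alpine image (which ships the agent under /skywalking/agent) and a hypothetical application named demo-app. JAVA_TOOL_OPTIONS is read by the JVM itself, so the application image needs no changes.

```yaml
# Pod template fragment (spec.template.spec of the application's Deployment).
# Image names and "demo-app" are assumptions for illustration.
initContainers:
  - name: skywalking-agent
    image: apache/skywalking-java-agent:8.8.0-alpine
    # Copy the agent directory into the shared volume; it ends up at /agent/agent
    command: ["/bin/sh", "-c", "cp -R /skywalking/agent /agent/"]
    volumeMounts:
      - name: skywalking-agent
        mountPath: /agent
containers:
  - name: demo-app
    image: 192.168.64.33:5000/demo/demo-app:latest
    env:
      # The shared volume is mounted at /skywalking, so the jar is at /skywalking/agent/
      - name: JAVA_TOOL_OPTIONS
        value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
      - name: SW_AGENT_NAME
        value: "demo-app"
      - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
        value: "skywalking-svc.skywalking:11800"
    volumeMounts:
      - name: skywalking-agent
        mountPath: /skywalking
volumes:
  - name: skywalking-agent
    emptyDir: {}
```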

4 Downloading skywalking-agent

https://archive.apache.org/dist/skywalking/java-agent/8.8.0/

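5 Collecting logs

Add logback support to the Java application so that the trace ID is included in the log output, for example with the apm-toolkit-logback-1.x toolkit, whose %tid pattern prints the trace ID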

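6 Alarms

metrics-name: the metric name, which is also the metric name defined in the OAL script. Alarms can be configured on metrics of these scopes: service, service instance, endpoint, service relation, service instance relation, endpoint relation. long, double, and int values are supported
op: the comparison operator
threshold: the threshold value
period: the length, in minutes, of the time window over which the rule is checked
count: an alarm fires once the metric has matched op against the threshold count times within one time window
silence-period: after an alarm fires at time N, no further alarm is sent during N to N + silence-period
message: the notification message sent when the alarm fires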
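
As an illustration of how these fields fit together, here is a minimal sketch of one rule in alarm-settings.yml, assuming the rule layout used by OAP 9.2.0. The values are illustrative and not taken from this deployment.

```yaml
# alarm-settings.yml fragment; rule name and values are illustrative
rules:
  # Rule names must be unique and end with "_rule"
  service_resp_time_rule:
    metrics-name: service_resp_time   # metric defined in the OAL script
    op: ">"
    threshold: 1000                   # milliseconds for this metric
    period: 10                        # evaluate over the last 10 minutes
    count: 3                          # fire after 3 matches inside the window
    silence-period: 5                 # stay quiet for 5 minutes after firing
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
```

With these values, an alarm fires when the service's response time exceeds 1000 ms in at least 3 of the last 10 minutes.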

# Add a webhook
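
In the 9.2.0 alarm-settings.yml layout, webhooks are a top-level list of URLs, and the OAP server sends triggered alarms to each one as an HTTP POST with a JSON body. A minimal sketch, assuming a hypothetical receiver URL:

```yaml
# alarm-settings.yml fragment; the receiver URL is a hypothetical placeholder
webhooks:
  - http://alarm-receiver.example.internal/notify/
```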

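7 Custom tracing

Add the dependency (apm-toolkit-trace)
Get the trace ID: TraceContext.traceId()
@Trace: marks a method so that it is recorded as a span
@Tags: attaches tags to that span; wraps one or more @Tag entries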

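8 Dashboard metrics

# Service
Load (calls/min): calls per minute over a period of time
Success Rate (%): request success rate over a period of time
Latency (ms): response latency over a period of time
Apdex: Apdex performance score over a period of time

# Overview
Service Avg Response Time (ms): average response time of the service
Service Apdex: line chart of the Apdex score
Service Response Time Percentile (ms): response time percentiles
Service Load (calls / min): line chart of calls per minute
Success Rate (%): line chart of the request success rate
Service Instances Load (calls / min): line chart of calls per minute for each instance
Slow Service Instance (ms): average latency of each service instance
Service Instance Success Rate (%): request success rate of each service instance

# Instance metrics
Service Instance Load (CPM - calls per minute): calls per minute for the instance
Service Instance Successful Rate (%): call success rate of the instance
Service Instance Latency (ms): response latency of the instance
JVM CPU (java service) %: percentage of CPU used by the JVM
JVM Memory (java service) (MB): JVM memory usage, made up of four metrics: instance_jvm_memory_heap (heap memory in use), instance_jvm_memory_heap_max (maximum heap memory), instance_jvm_memory_noheap (non-heap memory in use), instance_jvm_memory_noheap_max (maximum non-heap memory)
JVM GC Time (ms): JVM garbage collection time, covering young GC and old GC
JVM GC Count: number of JVM garbage collections, covering young GC count and old GC count
JVM Thread Count (java service): number of threads

# Endpoint metrics
Endpoint Load in Current Service (CPM / PPM): requests per minute for each endpoint (API)
Slow Endpoints in Current Service (ms): top N slowest endpoints (APIs) by average response time, in ms
Successful Rate in Current Service (%): request success rate of each endpoint (API)
Endpoint Load: number of requests to the current endpoint in each time interval
Endpoint Avg Response Time: average response time of the current endpoint in each time interval
Endpoint Response Time Percentile (ms): response time percentiles of the current endpoint in each time interval
Endpoint Successful Rate (%): request success rate of the current endpoint in each time interval

# Database
Database Avg Response Time (ms): average response time of the current database, in ms
Database Access Successful Rate (%): access success rate of the current database
Database Traffic (CPM: Calls Per Minute): requests per minute to the current database
Database Access Latency Percentile (ms): database response time at different percentiles, in ms
Slow Statements (ms): top N slow queries, in ms
All Database Loads (CPM: Calls Per Minute): all databases ranked by request volume
Un-Health Databases: all databases ranked by request success rate, with the database that has the most failed requests at the top

9 Profiling

# Profiling samples a chosen endpoint through a newly created task and produces a more detailed report. So far, compared with tracing, it appears to add thread stack information and hints about slow methods
Service name
Endpoint name
Monitor time
Monitor duration
Min duration threshold
Dump period
Max sampling count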