
基础知识

VLAN基础原理

VLAN(Virtual LAN)通过在以太网帧中插入 802.1Q 标签(Tag) 来区分不同的逻辑网络。

该标签包含一个 VLAN ID(12 位,取值范围 0-4095,其中 0 和 4095 保留,实际可用 1-4094),标识数据帧属于哪个虚拟局域网。

通过 VLAN,多个逻辑网络可以在同一物理链路上隔离传输,避免广播风暴,保证安全和带宽分配。

802.1Q VLAN Tag 帧结构简述

+------------+------------+--------------+-------------+
| 目的MAC | 源MAC | 802.1Q Tag | 以太类型 |
+------------+------------+--------------+-------------+

802.1Q Tag 由以下部分组成:
- TPID (Tag Protocol Identifier):固定为 0x8100,表示该帧带有 VLAN 标签
- TCI (Tag Control Information):
  - PCP (Priority Code Point):3 位,帧优先级
  - DEI (Drop Eligible Indicator):1 位,丢弃指示
  - VLAN ID (VID):12 位,标识 VLAN 号
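
下面用一个最小的 Linux 示例直观感受 802.1Q 打标签的过程(示意性质:网卡名 eth0、VLAN ID 100、IP 均为假设值,请按实际环境替换):

# 创建 VLAN 子接口,内核会在经由该子接口发出的帧中插入 802.1Q 标签
ip link add link eth0 name eth0.100 type vlan id 100
ip link set eth0.100 up
ip addr add 192.168.100.2/24 dev eth0.100
# 查看子接口详情,可以看到 vlan protocol 802.1Q id 100
ip -d link show eth0.100
# 在物理口抓包,-e 显示链路层头部,能看到 vlan 100 以及 TPID 0x8100
tcpdump -e -i eth0 vlan 100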

实际案例

千兆家庭宽带IPTV+上网单线复用案例

目前网络现状
flowchart TD
    光猫 --> 路由器
    路由器 --> 机顶盒
    路由器 --> 其他设备
方案对比
方案(光猫模式) | 路由器 | VLAN标签 | 说明 | 运营商
桥接(多 VLAN) | 主路由(拨号) | 光猫桥接,路由器处理 | 光猫需创建多个桥接接口(如 VLAN 621/835),各自桥接到同一物理口,由路由器解析 | 需运营商将光猫配置为桥接模式,并支持 IPTV 与上网 VLAN 同时桥接到一个 LAN 口,提供各业务的 VLAN ID
Trunk | 主路由(拨号) | 光猫透传,路由器处理 | 光猫作为 802.1Q Trunk 设备,VLAN 全标签透传 | 需运营商配置为 Trunk 模式,允许所有业务 VLAN 透传,并提供上网与 IPTV 的 VLAN ID
混合(光猫拨号) | 中继/AP 模式 | 光猫处理,路由器透传 | 光猫直接拨号并打/解 VLAN,路由器相当于交换机 | 需运营商配置为混合模式

在沟通时,建议对运营商明确说明:

“请将光猫配置为桥接模式(或 Trunk 模式),将 IPTV VLAN 和上网 VLAN 都桥接/透传到同一个 LAN 口上,我用自己的路由器拨号并分流。”
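
若最终采用桥接/Trunk 方案,路由器侧的处理思路可以用下面的 Linux 命令示意(仅为思路草图:接口名 wan0/lan4 与 VLAN 621/835 均为假设值,实际路由器固件的配置入口各不相同):

# 上网业务:在 WAN 口上建 621 子接口,PPPoE 拨号跑在该子接口上
ip link add link wan0 name wan0.621 type vlan id 621
ip link set wan0.621 up
# IPTV 业务:835 子接口与机顶盒所在的 LAN 口二层桥接,帧经子接口进出时自动解/打标签
ip link add link wan0 name wan0.835 type vlan id 835
ip link set wan0.835 up
ip link add br-iptv type bridge
ip link set wan0.835 master br-iptv
ip link set lan4 master br-iptv
ip link set br-iptv up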

单线复用时序图对比
sequenceDiagram
    autonumber
    participant Internet
    participant OpticalModem as 光猫
    participant Router as 路由器
    participant IPTVBox as 机顶盒
    participant Terminal as 上网终端
    rect rgb(220, 240, 255)
    note over Internet, Router: 桥接(多 VLAN)模式:光猫桥接多个 VLAN,路由器拨号
        Internet->>OpticalModem: VLAN 621(上网) + VLAN 835(IPTV)
        OpticalModem->>Router: VLAN 621 + 835 各自桥接到路由器(单线复用)
        Router->>Internet: 使用 VLAN 621 拨号上网
        Router->>Terminal: 解 tag,转发上网数据
        Router->>IPTVBox: 解 tag,转发 IPTV 数据(VLAN 835)
    end
    rect rgb(220, 255, 220)
    note over Internet, Router: Trunk 模式:光猫配置 trunk,所有 VLAN 打标签透传
        Internet->>OpticalModem: VLAN 621 + VLAN 835(802.1Q)
        OpticalModem->>Router: trunk 输出,打标签透传(单线复用)
        Router->>Internet: 使用 VLAN 621 拨号上网
        Router->>Terminal: 解 tag,转发上网数据
        Router->>IPTVBox: 解 tag,转发 IPTV 数据(VLAN 835)
    end
    rect rgb(255, 240, 220)
    note over Internet, Router: 混合模式:光猫拨号并解 VLAN,路由器中继交换
        Internet->>OpticalModem: VLAN 621 + VLAN 835
        OpticalModem->>Internet: 光猫使用 VLAN 621 拨号上网
        OpticalModem->>Router: 解 tag 后下发无标签数据(上网+IPTV)
        Router->>Terminal: 上网数据透传
        Router->>IPTVBox: IPTV 数据透传(无 VLAN 识别)
    end

名词解释

  • 物理卷(PV):物理卷是 LVM 管理的最底层单元,可以是整个磁盘,也可以是某个分区。初始化 PV 时,LVM 会在设备上写入元数据,以便后续跟踪和管理

  • 卷组(VG):卷组是将多个 PV 聚合成一个逻辑存储池的结构。VG 的总容量等于所有 PV 容量之和,后续可在此池内划分逻辑卷

  • 逻辑卷(LV):逻辑卷是最终呈现给操作系统和应用程序的“分区”单元。在 VG 中可以创建多个 LV,并可按需动态调整大小

创建步骤

磁盘分区

[root@192-168-10-206 ~]# fdisk -l
Disk /dev/vdb: 100 GiB, 107374182400 bytes, 209715200 sectors #找到需要分区的磁盘
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[root@192-168-10-206 ~]# fdisk /dev/vdb #分区的磁盘
Welcome to fdisk (util-linux 2.37.4).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0xf274d173.
Command (m for help): n #新建
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p #主分区p
Partition number (1-4, default 1): 1 #分区号默认
First sector (2048-209715199, default 2048): #默认
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-209715199, default 209715199): #默认
Created a new partition 1 of type 'Linux' and of size 100 GiB.
Command (m for help): t #改变toggle类型
Selected partition 1
Hex code or alias (type L to list all): 8e # LVM 的分区代码8e
Changed type of partition 'Linux' to 'Linux LVM'.
Command (m for help): w #写入保存
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
[root@192-168-10-206 ~]# fdisk -l #查看分区成功后的信息
Disk /dev/vdb: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xf274d173
Device Boot Start End Sectors Size Id Type
/dev/vdb1 2048 209715199 209713152 100G 8e Linux LVM

创建物理卷(PV)

[root@192-168-10-206 ~]# pvcreate /dev/vdb1 #对vdb1进行物理卷创建
Physical volume "/dev/vdb1" successfully created.
[root@192-168-10-206 ~]# pvs #查看所有pv
PV VG Fmt Attr PSize PFree
/dev/vda3 cs_localhost-live lvm2 a-- 48.41g 0 #这是系统盘的
/dev/vdb1 lvm2 --- <100.00g <100.00g #这是创建的
[root@192-168-10-206 ~]# pvdisplay /dev/vdb1 #查看pv详情
"/dev/vdb1" is a new physical volume of "<100.00 GiB"
--- NEW Physical volume ---
PV Name /dev/vdb1
VG Name
PV Size <100.00 GiB
Allocatable NO
PE Size 0
Total PE 0
Free PE 0
Allocated PE 0
PV UUID RPEJye-2Ifc-4T45-g9wU-Rdii-wMwb-q3dZON

创建卷组(VG)

[root@192-168-10-206 ~]# vgcreate vg_data /dev/vdb1 #创建vg,vg_data名字自定义,在/dev/vdb1上创建
Volume group "vg_data" successfully created
[root@192-168-10-206 ~]# vgs #查看所有vg
VG #PV #LV #SN Attr VSize VFree
cs_localhost-live 1 2 0 wz--n- 48.41g 0
vg_data 1 0 0 wz--n- <100.00g <100.00g
[root@192-168-10-206 ~]# pvs #查看所有pv
PV VG Fmt Attr PSize PFree
/dev/vda3 cs_localhost-live lvm2 a-- 48.41g 0
/dev/vdb1 vg_data lvm2 a-- <100.00g <100.00g #这里已经绑定vg了
[root@192-168-10-206 ~]# vgdisplay vg_data #显示vg详情
--- Volume group ---
VG Name vg_data
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 2
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size <100.00 GiB
PE Size 4.00 MiB
Total PE 25599
Alloc PE / Size 0 / 0
Free PE / Size 25599 / <100.00 GiB
VG UUID jp9G9F-WLig-GXUy-wytn-ahcC-3Q1F-72Utl1

逻辑卷管理(LV)

# 创建:lvcreate -L <大小> -n <逻辑卷名> <卷组名>
[root@192-168-10-206 ~]# lvcreate -L 10G -n lv_test vg_data
[root@192-168-10-206 ~]# lvs #查看
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv_test vg_data -wi-a----- 10.00g
# 格式化:mkfs.ext4 /dev/<卷组名>/<逻辑卷名>
[root@192-168-10-206 ~]# mkfs.ext4 /dev/vg_data/lv_test
# 临时挂载:mount /dev/<卷组名>/<逻辑卷名> /<挂载目录>
[root@192-168-10-206 ~]# mkdir /mnt/test
[root@192-168-10-206 ~]# mount /dev/vg_data/lv_test /mnt/test
[root@192-168-10-206 ~]# cd /mnt/test/
[root@192-168-10-206 ~]# touch /mnt/test/aa.test
[root@192-168-10-206 ~]# ls /mnt/test/aa.test
/mnt/test/aa.test
# 永久挂载:解决重启失效的问题
[root@192-168-10-206 ~]# UUID=$(blkid -s UUID -o value /dev/vg_data/lv_test) && echo "UUID=$UUID /mnt/test ext4 defaults 0 0" | sudo tee -a /etc/fstab && sudo mount -a
# 扩容:lvextend -L +<扩容大小> -r /dev/<卷组名>/<逻辑卷名>
# 说明:-r 参数就是自动调用 resize2fs 的意思
[root@192-168-10-206 ~]# lvextend -L +1G -r /dev/vg_data/lv_test
Size of logical volume vg_data/lv_test changed from 10.00 GiB (2560 extents) to 11.00 GiB (2816 extents).
File system ext4 found on vg_data/lv_test mounted at /mnt/test.
Extending file system ext4 to 11.00 GiB (11811160064 bytes) on vg_data/lv_test...
resize2fs /dev/vg_data/lv_test
resize2fs 1.46.5 (30-Dec-2021)
Filesystem at /dev/vg_data/lv_test is mounted on /mnt/test; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 2
The filesystem on /dev/vg_data/lv_test is now 2883584 (4k) blocks long.

resize2fs done
Extended file system ext4 on vg_data/lv_test.
Logical volume vg_data/lv_test successfully resized.
# (上一步加了-r参数,这一步就不需要执行了)扩展 ext 文件系统,增加实际大小,df -h可以查看
[root@192-168-10-206 ~]# resize2fs /dev/vg_data/lv_test
resize2fs 1.46.5 (30-Dec-2021)
The filesystem is already 2883584 (4k) blocks long. Nothing to do!
[root@192-168-10-206 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv_test vg_data -wi-ao---- 11.00g
[root@192-168-10-206 ~]# df -h
/dev/mapper/vg_data-lv_test 11G 24K 11G 1% /mnt/test
# 缩容:不支持在线缩容
umount /mnt/test # 1.先卸载挂载文件系统
e2fsck -f /dev/vg_data/lv_test # 2.先检查文件系统完整性
resize2fs /dev/vg_data/lv_test 5G # 3.缩小文件系统到目标大小,如5G (不支持在线)
lvreduce -L 5G /dev/vg_data/lv_test # 4.缩小逻辑卷到同样的5G,LV不能小于文件系统大小 (支持在线)
e2fsck -f /dev/vg_data/lv_test # 5.再次检查文件系统完整性
mount /mnt/test # 6.重新挂载


环境说明

类型 | 版本/说明
k8s(3节点) | 1.26.6
节点规格 | 初始 8c16g(安装时即使把各组件限制到 1c1g,总资源仍不足),后变配到 16c24g

在线安装

  1. 部署etcd

    [root@node3 greptime]# helm upgrade \
    --install etcd oci://greptime-registry.cn-hangzhou.cr.aliyuncs.com/charts/etcd \
    --create-namespace \
    --version 11.3.4 \
    -n etcd-cluster --values etcd-values.yaml

    [root@node3 greptime]# kubectl get pod -n etcd-cluster
    NAME READY STATUS RESTARTS AGE
    etcd-0 1/1 Running 0 79s
    etcd-1 1/1 Running 0 79s
    etcd-2 1/1 Running 0 79s

    [root@node3 greptime]# kubectl -n etcd-cluster exec etcd-0 -- etcdctl \
    --endpoints=etcd-0.etcd-headless.etcd-cluster:2379,etcd-1.etcd-headless.etcd-cluster:2379,etcd-2.etcd-headless.etcd-cluster:2379 \
    endpoint status -w table
    +--------------------------------------------------------------------------------+
    | Endpoint | ID | Ver | DB | Ldr | Lrn | Trm | Idx | App | Err |
    |----------------|-----------|-------|-------|-----|-----|-----|-----|-----|-----|
    | etcd-0...:2379 | 680910... | 3.5.21 | 20kB | ✘ | ✘ | 2 | 38 | 38 | |
    | etcd-1...:2379 | d6980d... | 3.5.21 | 20kB | ✔ | ✘ | 2 | 38 | 38 | |
    | etcd-2...:2379 | 12664f... | 3.5.21 | 20kB | ✘ | ✘ | 2 | 38 | 38 | |
    +--------------------------------------------------------------------------------+
  2. 部署minio

    [root@node3 greptime]# helm upgrade \
    --install minio oci://greptime-registry.cn-hangzhou.cr.aliyuncs.com/charts/minio \
    --create-namespace \
    --version 16.0.10 \
    -n minio \
    --values minio-values.yaml

    [root@node3 greptime]# kubectl get pod -n minio
    NAME READY STATUS RESTARTS AGE
    minio-0 1/1 Running 0 44s
    minio-1 1/1 Running 0 44s
    minio-2 1/1 Running 0 44s
    minio-3 1/1 Running 0 44s
    # 暴露端口
    [root@node3 greptime]# kubectl apply -f minio-NodePort.yaml
  3. 登录http://192.168.10.213:30393/login,账号密码:greptimedbadmin/greptimedbadmin

  4. 创建bucket名称为:greptimedb-bucket

  5. 创建Create Access Keys得到key=JffCiPPRg1CfcI4li582,sec=IippU4XmqqIQBBPcROUi2paeABFbNwhfl6UXgOIM

  6. 部署GreptimeDB Operator

    [root@node3 greptime]# helm upgrade --install \
    greptimedb-operator oci://greptime-registry.cn-hangzhou.cr.aliyuncs.com/charts/greptimedb-operator \
    --version 0.2.21 \
    --namespace greptimedb-admin \
    --create-namespace \
    --values greptimedb-operator-values.yaml
    [root@node3 greptime]# kubectl get pod -n greptimedb-admin
    NAME READY STATUS RESTARTS AGE
    greptimedb-operator-d656cb5c-zkdq9 1/1 Running 0 43s
  7. 部署db集群

    [root@node3 greptime]# kubectl create namespace greptimedb
    namespace/greptimedb created
    # 创建一个secrets配置Secret: greptimedb / image-pull-secret
    [root@node3 greptime]# kubectl create secret docker-registry image-pull-secret \
    -n greptimedb \
    --docker-server=https://greptime-registry.cn-hangzhou.cr.aliyuncs.com \
    --docker-username=fxkjtkjj@1826388469660986 \
    --docker-password=1P7a9v7M22
    #安装greptimedb,注意k8s资源,8c16g跑不起,缺cpu,变配虚拟机到16c24g
    [root@node3 greptime]# helm upgrade --install greptimedb \
    --create-namespace \
    oci://greptime-registry.cn-hangzhou.cr.aliyuncs.com/charts/greptimedb-cluster \
    --version 0.3.18 \
    -n greptimedb \
    --values greptimedb-cluster-values.yaml
    [root@node3 greptime]# kubectl get po -n greptimedb
    NAME READY STATUS RESTARTS AGE
    greptimedb-datanode-0 2/2 Running 0 2m18s
    greptimedb-datanode-1 2/2 Running 0 2m18s
    greptimedb-datanode-2 2/2 Running 0 2m18s
    greptimedb-flownode-0 2/2 Running 0 2m1s
    greptimedb-frontend-5468844cd-br4sc 2/2 Running 0 2m7s
    greptimedb-frontend-5468844cd-nrhf6 2/2 Running 0 2m7s
    greptimedb-frontend-5468844cd-swbxv 2/2 Running 0 2m7s
    greptimedb-meta-5b74964-6m7nt 2/2 Running 3 (2m54s ago) 20m
    greptimedb-meta-5b74964-7sg9p 2/2 Running 2 (2m41s ago) 20m
    greptimedb-meta-5b74964-gxqn8 2/2 Running 4 (2m50s ago) 20m
    greptimedb-monitor-standalone-0 1/1 Running 1 (9m44s ago) 20m
  8. 部署Dashboard

    [root@node3 greptime]# helm upgrade --install enterprise-dashboard \
    --create-namespace \
    oci://greptime-registry.cn-hangzhou.cr.aliyuncs.com/charts/enterprise-dashboard \
    -n dashboard \
    --version 0.1.1 \
    --values dashboard-values.yaml
    [root@node3 greptime]# kubectl get pod -n dashboard
    NAME READY STATUS RESTARTS AGE
    enterprise-dashboard-7d75cbff97-q7t87 1/1 Running 0 104s
    # 暴露端口
    [root@node3 greptime]# kubectl apply -f dashboard-NodePort.yaml
  9. 访问http://192.168.10.213:31905/

  10. 验证,也可以在dashboard里面查询;若需在集群外长期访问,可参考本列表之后的 NodePort 示例

    # 将 Kubernetes 集群中服务的 4002 端口映射到了本地机器的 4002 端口
    [root@node3 greptime]# kubectl port-forward svc/greptimedb-frontend 4002:4002 -n greptimedb > connections.out &
    [2] 38905
    [1] Exit 127 connections.out
    [root@node3 greptime]# cat connections.out
    Forwarding from 127.0.0.1:4002 -> 4002
    Forwarding from [::1]:4002 -> 4002
    # 安装工具
    [root@node3 greptime]# yum install -y mysql
    # 测试连接
    [root@node3 greptime]# mysql -h 127.0.0.1 -P 4002
    mysql>
    CREATE TABLE monitor (
    host STRING,
    ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP() TIME INDEX,
    cpu FLOAT64 DEFAULT 0,
    memory FLOAT64,
    PRIMARY KEY(host)
    );
    Query OK, 0 rows affected (0.29 sec)

    mysql>
    INSERT INTO monitor
    VALUES
    ("127.0.0.1", 1702433141000, 0.5, 0.2),
    ("127.0.0.2", 1702433141000, 0.3, 0.1),
    ("127.0.0.1", 1702433146000, 0.3, 0.2),
    ("127.0.0.2", 1702433146000, 0.2, 0.4),
    ("127.0.0.1", 1702433151000, 0.4, 0.3),
    ("127.0.0.2", 1702433151000, 0.2, 0.4);
    6 rows in set (0.01 sec)

    mysql> show tables;
    +---------+
    | Tables |
    +---------+
    | monitor |
    | numbers |
    +---------+
    2 rows in set (0.05 sec)
    mysql> select * from monitor;
    +-----------+---------------------+------+--------+
    | host | ts | cpu | memory |
    +-----------+---------------------+------+--------+
    | 127.0.0.1 | 2023-12-13 02:05:41 | 0.5 | 0.2 |
    | 127.0.0.1 | 2023-12-13 02:05:46 | 0.3 | 0.2 |
    | 127.0.0.1 | 2023-12-13 02:05:51 | 0.4 | 0.3 |
    | 127.0.0.2 | 2023-12-13 02:05:41 | 0.3 | 0.1 |
    | 127.0.0.2 | 2023-12-13 02:05:46 | 0.2 | 0.4 |
    | 127.0.0.2 | 2023-12-13 02:05:51 | 0.2 | 0.4 |
    +-----------+---------------------+------+--------+
    6 rows in set (0.04 sec)
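
除了 port-forward,也可以参照 minio-NodePort.yaml 的做法给 frontend 的 MySQL 端口(4002)暴露一个 NodePort,便于在集群外长期访问。以下为示意:selector 标签取自上文 values 中的 app.greptime.io/component,nodePort 为假设值,若与实际 Pod 标签不符,请以 kubectl get pod -n greptimedb --show-labels 为准:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: greptimedb-frontend-mysql
  namespace: greptimedb
spec:
  type: NodePort
  ports:
  - port: 4002        # MySQL 协议端口
    targetPort: 4002
    nodePort: 30402   # 假设值,范围 30000-32767
  selector:
    app.greptime.io/component: greptimedb-frontend
EOF
# 之后即可在集群外连接:mysql -h <任一节点IP> -P 30402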

相关配置文件

etcd-values.yaml

global:
security:
allowInsecureImages: true

image:
registry: greptime-registry.cn-hangzhou.cr.aliyuncs.com
repository: bitnami/etcd
tag: 3.5.21-debian-12-r5

replicaCount: 3

resources:
requests:
cpu: '2'
memory: 2Gi
limits:
cpu: '4'
memory: 4Gi

persistence:
storageClass: nfscs
size: 8Gi

auth:
rbac:
create: false
token:
enabled: false

autoCompactionMode: "periodic"
autoCompactionRetention: "1h"

extraEnvVars:
- name: ETCD_QUOTA_BACKEND_BYTES
value: "8589934592"
- name: ETCD_ELECTION_TIMEOUT
value: "2000"

minio-values.yaml

global:
security:
allowInsecureImages: true

image:
registry: greptime-registry.cn-hangzhou.cr.aliyuncs.com
repository: bitnami/minio
tag: 2025.4.22-debian-12-r1

auth:
rootUser: greptimedbadmin
rootPassword: "greptimedbadmin"

resources:
requests:
cpu: "2"
memory: 2Gi
limits:
cpu: "5"
memory: 5Gi

affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
podAffinityTerm:
topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
app.kubernetes.io/instance: minio

extraEnvVars:
- name: MINIO_REGION
value: "ap-southeast-1"

statefulset:
replicaCount: 4

mode: distributed

persistence:
storageClass: nfssc
size: 200Gi

minio-NodePort.yaml

apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: minio
spec:
  type: NodePort
  ports:
  - port: 9000        # Service 端口
    targetPort: 9000  # Pod 端口
    nodePort: 30090   # 映射在 Node 上的端口
  selector:
    app.kubernetes.io/name: minio

greptimedb-operator-values.yaml

image:
# 镜像仓库
registry: greptime-registry.cn-hangzhou.cr.aliyuncs.com
# 镜像名称
repository: greptime/greptimedb-operator
# 镜像拉取策略
imagePullPolicy: IfNotPresent
# 镜像标签
tag: v0.2.2
# 镜像拉取密钥(如需认证)
pullSecrets: []

# 附加标签
additionalLabels: {}

serviceAccount:
# 是否创建服务账号
create: true
# 服务账号注解
annotations: {}
# 指定服务账号名称(为空则自动生成)
name: ""

crds:
# 安装 CRDs
install: true
# 卸载时保留 CRDs
keep: true
# CRD 注解
annotations: {}
# CRD 附加标签
additionalLabels: {}

# 副本数
replicas: 1

resources:
# 资源限制
limits:
cpu: "2"
memory: 2Gi
# 资源请求
requests:
cpu: 500m
memory: 512Mi

rbac:
# 启用 RBAC
create: true

apiServer:
# 启用 API Server
enabled: true
# API Server 端口
port: 8081
# 启用 metrics-server 的 PodMetrics 获取
podMetrics:
enabled: true

# 命名覆盖
nameOverride: ""
fullnameOverride: ""

# 调度相关配置
nodeSelector: {}
tolerations: []
affinity: {}

greptimedb-cluster-values.yaml

缩减了资源,并按主机对各组件做了反亲和编排

image:
registry: greptime-registry.cn-hangzhou.cr.aliyuncs.com
repository: fxkjtkjj/greptimedb-enterprise
tag: ent-20250427-1745751842-e45e60bf
pullSecrets:
- image-pull-secret #这个参数没生效,需在下面各组件里单独配置
additionalLabels: {}

initializer:
registry: greptime-registry.cn-hangzhou.cr.aliyuncs.com
repository: greptime/greptimedb-initializer
tag: v0.2.2
image:
pullSecrets:
- image-pull-secret #在后面每个节点单独加,不然拉取镜像会没权限

datanode:
image:
pullSecrets:
- image-pull-secret #在后面每个节点单独加,不然拉取镜像会没权限
replicas: 3
podTemplate:
main:
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app.greptime.io/component: greptimedb-datanode
topologyKey: kubernetes.io/hostname
weight: 1
storage:
storageClassName: nfssc
storageSize: 30Gi
storageRetainPolicy: Retain

frontend:
image:
pullSecrets:
- image-pull-secret #在后面每个节点单独加,不然拉取镜像会没权限
enabled: true
replicas: 3
podTemplate:
main:
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app.greptime.io/component: greptimedb-frontend
topologyKey: kubernetes.io/hostname
weight: 1

meta:
image:
pullSecrets:
- image-pull-secret #在后面每个节点单独加,不然拉取镜像会没权限
replicas: 3
etcdEndpoints: "etcd.etcd-cluster.svc.cluster.local:2379"
podTemplate:
main:
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app.greptime.io/component: greptimedb-meta
topologyKey: kubernetes.io/hostname
weight: 1

objectStorage:
credentials:
accessKeyId: "JffCiPPRg1CfcI4li582"
secretAccessKey: "IippU4XmqqIQBBPcROUi2paeABFbNwhfl6UXgOIM"
s3:
bucket: "greptimedb-bucket"
region: "ap-southeast-1"
root: "greptimedb-data"
endpoint: "http://minio.minio:9000"

monitoring:
image:
pullSecrets:
- image-pull-secret #在后面每个节点单独加,不然拉取镜像会没权限
enabled: true
standalone:
base:
main:
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
datanodeStorage:
fs:
storageClassName: nfssc
storageSize: 100Gi
vector:
registry: greptime-registry.cn-hangzhou.cr.aliyuncs.com
repository: timberio/vector
tag: 0.46.1-debian
pullSecrets:
- image-pull-secret
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: "1Gi"

dashboard-values.yaml

replicaCount: 1

image:
repository: greptime-registry.cn-hangzhou.cr.aliyuncs.com/greptime/dashboard-apiserver
pullPolicy: IfNotPresent
tag: "770ed916-20250409110716"

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

logLevel: debug

operatorAddr: greptimedb-operator.greptimedb-admin.svc.cluster.local:8081
servicePort: 19095

serviceAccount:
create: true
annotations: {}
name: ""

podAnnotations: {}

podSecurityContext: {}
# fsGroup: 2000

securityContext: {}
# capabilities:
# drop:
# - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000

service:
type: ClusterIP
port: 19095
annotations: {}

resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi

nodeSelector: {}

tolerations: []

affinity: {}

dashboard-NodePort.yaml

apiVersion: v1
kind: Service
metadata:
  name: enterprise-dashboard
  namespace: dashboard # 请替换为实际命名空间
spec:
  type: NodePort
  ports:
  - port: 19095        # Service 内部访问端口
    targetPort: 19095  # Pod 监听端口
    nodePort: 31905    # 映射到 Node 上的端口,范围30000-32767
  selector:
    app.kubernetes.io/name: enterprise-dashboard

环境说明

类型 | 版本
物理机 | ARM(aarch64)
基础系统 | Rocky-9.5-aarch64-minimal.iso
kk(kubekey) | 3.1.9
k8s | 1.26.6

在线安装

整合步骤

k8s在线安装

# 配置DNS(可选,视网络情况)
nmcli connection modify enp3s0 ipv4.dns "192.168.10.200 8.8.8.8"
nmcli connection up enp3s0
# 允许ssh(安装系统时可以勾选允许root用户ssh),及远程env
echo 'AcceptEnv LANG LC_*' | tee -a /etc/ssh/sshd_config.d/01-permitrootlogin.conf
systemctl restart sshd
# 安装必须要的依赖
dnf install -y conntrack socat tar
# 生成配置文件config-sample.yaml,然后进行修改,文件说明放文尾
./kk create config
# 安装harbor私有证书(可选)
sudo cp ghspace-ca.crt /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract
sudo mkdir -p /etc/containerd/certs.d/harbor.ghspace.cn
sudo cp ghspace-ca.crt /etc/containerd/certs.d/harbor.ghspace.cn/ca.crt
# 国内环境
export KKZONE=cn
# 安装集群
./kk create cluster -f config-sample.yaml
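
集群装完后可以做一个最小化的验证(在任一 control-plane 节点执行):

# 所有节点应为 Ready,系统组件(calico、coredns 等)应为 Running
kubectl get nodes -o wide
kubectl get pods -A
# 确认版本与 config-sample.yaml 中的 v1.26.6 一致
kubectl version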

nfs在线安装

# 安装nfs
dnf install -y nfs-utils
# 创建挂载目录
mkdir -p /data/nfs
# lsblk查看硬盘,格式化sda
mkfs.ext4 /dev/sda
# 挂载
mount /dev/sda /data/nfs
# 永久挂载
echo "/dev/sda /data/nfs ext4 defaults 0 0" | sudo tee -a /etc/fstab
# 配置nfs,允许192.168.10.0网段访问
echo "/data/nfs 192.168.10.0/24(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
# 启用nfs
systemctl enable --now nfs-server
# 放行防火墙
firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload
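
配置完成后可以先在服务端和客户端各验证一次(示意;<NFS服务器IP> 为占位,客户端需已安装 nfs-utils):

# NFS 服务器上:确认导出目录与参数
exportfs -v
showmount -e localhost
# 任一 k8s 节点上:手动挂载测试后再卸载
mount -t nfs <NFS服务器IP>:/data/nfs /mnt && ls /mnt && umount /mnt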

k8s安装 NFS CSI 驱动

helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm repo update
# 国内大概率下载镜像失败
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system
# 查看是否安装成功
kubectl --namespace=kube-system get pods --selector="app.kubernetes.io/instance=csi-driver-nfs" --watch
# 查看pod启动详情
kubectl -n kube-system describe pod csi-nfs-node-xxx
# 卸载,重新安装
helm uninstall csi-driver-nfs -n kube-system
# 下载模板
helm template csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system > nfs-driver.yaml
# 查看支持哪些参数和变量
helm show values csi-driver-nfs/csi-driver-nfs
# 查看模板里面的镜像
grep 'image:' nfs-driver.yaml
image: "registry.k8s.io/sig-storage/livenessprobe:v2.15.0"
image: "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.13.0"
image: "registry.k8s.io/sig-storage/nfsplugin:v4.11.0"
image: "registry.k8s.io/sig-storage/csi-provisioner:v5.2.0"
image: "registry.k8s.io/sig-storage/csi-resizer:v1.13.1"
image: "registry.k8s.io/sig-storage/csi-snapshotter:v8.2.0"
image: "registry.k8s.io/sig-storage/livenessprobe:v2.15.0"
image: "registry.k8s.io/sig-storage/nfsplugin:v4.11.0"
# 解决镜像拉取不下来
# 先从其他地方拉取到镜像,然后上传到k8s的私有仓库
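
csi-driver-nfs 装好后还需要手动创建 StorageClass,后文 values 里引用的 nfssc 即指它(注意上文 etcd-values.yaml 里写的是 nfscs,两处拼写需与实际创建的名称保持一致)。下面是一个最小示意,server/share 请按实际 NFS 服务器填写:

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfssc
provisioner: nfs.csi.k8s.io
parameters:
  server: 192.168.10.x      # NFS 服务器地址,占位值
  share: /data/nfs          # 与 /etc/exports 中导出的目录一致
EOF
kubectl get storageclass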

离线安装

离线制作

k8s artifact制作

  1. 拷贝$HOME/.kube/config到kk的机器上(kk本来就在集群节点上面忽略这个步骤)
  2. 添加域名映射,因为config里面是用的安装时候配置的域名,因此需要执行echo "192.168.10.219 lb.k8s.local" >> /etc/hosts,具体是什么值可以在k8s的节点的hosts文件看到(kk本来就在集群节点上面忽略这个步骤)
  3. 生成 manifest-sample.yaml,执行./kk create manifest --kubeconfig config
  4. 导出kubekey-artifact.tar.gz,执行export KKZONE=cn && ./kk artifact export -m manifest-sample.yaml

k8s 系统依赖制作

  1. 在k8s节点执行,生成离线包kk-rpms.tar.gz,带repodata/元数据和rpm包。

    # 安装 createrepo 工具(生成本地 yum 仓库时需要),dnf-plugins-core 下载包用,可能系统已经自带
    dnf install -y createrepo dnf-plugins-core
    # 创建目录保存依赖包
    mkdir -p ~/kk-rpms
    cd ~/kk-rpms
    # 下载 conntrack socat tar 及其依赖
    dnf download --resolve --alldeps conntrack socat tar
    # 不建议,容易丢失依赖。如果不想连带下载依赖,可以用下面的方式;注意这里是用reinstall而不是install,因为机器上已经安装了这些包
    # dnf reinstall --downloadonly --downloaddir=. conntrack socat tar
    ll # 检查下载的 rpm 文件
    # 生成本地 yum 仓库元数据(方便离线使用)
    createrepo .
    ll # 检查生成的repodata目录
    # 打包 rpm 包和 repo 元数据
    cd ..
    tar -czf kk-rpms.tar.gz kk-rpms

自定义整合资源包

  1. 最后将文件整合到如下目录结构

    kk-offline-install
    ├── config-sample.yaml # 集群安装配置文件,按需修改集群ip及主从结构
    ├── kk # kk命令可执行文件
    ├── kk-rpms.tar.gz # rpm离线包
    ├── kubekey-artifact.tar.gz # k8s离线制品文件
    └── manifest-sample.yaml # 离线制品配置清单(用不到留着,用于查看安装包的环境)
  2. 压缩得到最终的安装离线包,执行tar -czf kk-offline-install.tar.gz kk-offline-install

离线安装

  1. 找个跳板机,或者规划安装k8s的其中一台机器,上传kk-offline-install.tar.gz,然后执行tar -xzf kk-offline-install.tar.gz解压;注意rocky 9.5迷你版本没有tar命令,可以在外部解压后再上传。

  2. 在所有的k8s节点配置ssh的远程env

    # 允许ssh(安装系统时可以勾选允许root用户ssh),及远程env
    echo 'AcceptEnv LANG LC_*' | tee -a /etc/ssh/sshd_config.d/01-permitrootlogin.conf
    systemctl restart sshd
  3. 安装系统依赖:将kk-rpms.tar.gz上传到要安装的各k8s节点,放在root家目录(~root)下并解压tar -xzf kk-rpms.tar.gz,然后执行

    # 在要安装的k8s节点上都要执行
    dnf config-manager --add-repo file:///root/kk-rpms
    echo "gpgcheck=0" >> /etc/yum.repos.d/root_kk-rpms.repo
    dnf install -y conntrack socat tar --disablerepo="*" --enablerepo="root_kk-rpms"
  4. 修改仓库配置:如果有外部仓库(支持docker registry和harbor),按实际的填;如果没有,则修改vi config-sample.yaml如下

    spec:
    roleGroups:
    registry: #指定私有仓库节点
    - node3
    registry: #离线部署时,仓库必须要配置本地仓库或者外部的镜像仓库,用于存放和拉取镜像
    privateRegistry: "dockerhub.k8s.local" #不要加http
    auths:
    "dockerhub.k8s.local":
    username: admin
    password: Harbor12345
    skipTLSVerify: true # 如果是自签证书,开启跳过验证,或者自己拷贝私有证书,安装到本地机器
    namespaceOverride: kubesphereio #这里必须要覆写,不然会拉取不到镜像
    registryMirrors: []
    insecureRegistries: [] #不要想着用http,kk的默认部署只会暴露https端口
  5. 初始化私有镜像仓库,执行./kk init registry -f config-sample.yaml -a kubekey-artifact.tar.gz

  6. (可选步骤)kk不在k8s其中的一个节点才执行这一步,添加域名映射echo "192.168.10.213 dockerhub.k8s.local" >> /etc/hosts,具体是什么内容可以在其他k8s节点的hosts文件查看。

  7. 推送镜像到私有仓库,执行./kk artifact image push -f config-sample.yaml -a kubekey-artifact.tar.gz

  8. 执行./kk create cluster -f config-sample.yaml -a kubekey-artifact.tar.gz 进行离线集群安装

离线安装优化(自动化脚本)
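
该小节暂未展开。结合后文踩坑记录(manifest 的 preInstall 并不会执行),一个朴素的做法是先用脚本把依赖分发到各节点再跑 kk。下面是一个 shell 示意脚本(假设已配置到各节点的免密 SSH,节点 IP 为示例值,仅作思路参考,正式使用可按文中建议改为 go 实现):

#!/usr/bin/env bash
set -e
NODES="192.168.10.219 192.168.10.226 192.168.10.213"
for node in $NODES; do
  echo "===> 处理节点 $node"
  scp kk-rpms.tar.gz root@"$node":/root/
  ssh root@"$node" 'set -e
    tar -xzf /root/kk-rpms.tar.gz -C /root
    dnf config-manager --add-repo file:///root/kk-rpms
    grep -q "^gpgcheck=0" /etc/yum.repos.d/root_kk-rpms.repo || echo "gpgcheck=0" >> /etc/yum.repos.d/root_kk-rpms.repo
    dnf install -y conntrack socat tar --disablerepo="*" --enablerepo="root_kk-rpms"
    grep -q "AcceptEnv LANG" /etc/ssh/sshd_config.d/01-permitrootlogin.conf 2>/dev/null || echo "AcceptEnv LANG LC_*" >> /etc/ssh/sshd_config.d/01-permitrootlogin.conf
    systemctl restart sshd'
done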

配置文件附录

config-sample.yaml

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
name: sample
spec:
hosts:
- {name: node1, address: 192.168.10.219, internalAddress: 192.168.10.219, user: root, password: "gh@2025", arch: arm64}
- {name: node2, address: 192.168.10.226, internalAddress: 192.168.10.226, user: root, password: "gh@2025", arch: arm64}
- {name: node3, address: 192.168.10.213, internalAddress: 192.168.10.213, user: root, password: "gh@2025", arch: arm64}
roleGroups:
etcd:
- node1
- node2
- node3
control-plane:
- node1
- node2
- node3
worker:
- node1
- node2
- node3
registry:
- node3
controlPlaneEndpoint:
## Internal loadbalancer for apiservers
# internalLoadbalancer: haproxy
domain: lb.k8s.local
address: "192.168.10.219"
port: 6443
kubernetes:
version: v1.26.6
clusterName: cluster.local
autoRenewCerts: true
containerManager: containerd
etcd:
type: kubekey
network:
plugin: calico
kubePodsCIDR: 10.233.64.0/18
kubeServiceCIDR: 10.233.0.0/18
## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
multusCNI:
enabled: false
registry: #离线部署时,仓库必须要配置本地仓库或者外部的镜像仓库,用于存放和拉取镜像
privateRegistry: "dockerhub.k8s.local"
auths:
"dockerhub.k8s.local":
username: admin
password: Harbor12345
skipTLSVerify: true # 如果是自签证书,开启跳过验证
namespaceOverride: kubesphereio #这里必须要覆写,不然会拉取不到镜像
registryMirrors: []
insecureRegistries: []
addons: []

踩坑过程

  1. 无法访问网络,ping www.baidu.com提示找不到。

    原因:DNS的问题

    解决:执行

    nmcli connection modify enp3s0 ipv4.dns "8.8.8.8 183.221.253.100"
    nmcli connection up enp3s0
  2. kk执行安装时提示该错误:failed to get SSH session: ssh: setenv failed

    原因:Rocky有更细的权限控制

    解决:开启允许setenv,执行

    echo 'AcceptEnv LANG LC_*' | tee -a /etc/ssh/sshd_config.d/01-permitrootlogin.conf
    systemctl restart sshd
  3. kk执行安装时提示如下错误:

    04:18:13 EDT [ERRO] node1: conntrack is required.
    04:18:13 EDT [ERRO] node1: socat is required.

    原因:Rocky缺少conntrack和socat

    解决:安装依赖,执行dnf install -y conntrack socat

  4. kk执行安装时提示该错误:/bin/bash: line 1: tar: command not found: Process exited with status 127

    原因:Rocky缺少tar

    解决:安装依赖,执行dnf install -y tar

  5. kk执行离线安装时,提示如下错误

    FATA[0000] pulling image: failed to pull and unpack image "dockerhub.k8s.local/kubesphere/pause:3.9": failed to resolve reference "dockerhub.k8s.local/kubesphere/pause:3.9": failed to do request: Head "https://dockerhub.k8s.local/v2/kubesphere/pause/manifests/3.9": tls: failed to verify certificate: x509: certificate signed by unknown authority: Process exited with status 1

    原因:自建的仓库用https不安全,没证书导致

    解决:在 config-sample.yaml中添加

    spec:
    registry:
    insecureRegistries: ["dockerhub.k8s.local"]
  6. 执行推送镜像命令./kk artifact image push -f config-sample.yaml -a kubekey-artifact.tar.gz错误如下:

    Getting image source signatures
    trying to reuse blob sha256:ad5042aba4ea93ceb67882c49eb3fb8b806ffa201c5c6f0f90071702f09a9192 at destination: pinging container registry dockerhub.k8s.local: Get "https://dockerhub.k8s.local/v2/": x509: certificate signed by unknown authority
    20:51:14 EDT success: [LocalHost]
    20:51:14 EDT [CopyImagesToRegistryModule] Push multi-arch manifest to private registry
    20:51:14 EDT message: [LocalHost]
    get manifest list failed by module cache
    20:51:14 EDT failed: [LocalHost]
    error: Pipeline[ArtifactImagesPushPipeline] execute failed: Module[CopyImagesToRegistryModule] exec failed:
    failed: [LocalHost] [PushManifest] exec failed after 1 retries: get manifest list failed by module cache
    # 测试有没有开放80,没有开放80
    [root@192-168-10-87 kk-offline-install]# curl http://dockerhub.k8s.local/v2/_catalog
    curl: (7) Failed to connect to dockerhub.k8s.local port 80: Connection refused
    # 测试https
    [root@192-168-10-87 kk-offline-install]# curl https://dockerhub.k8s.local/v2/_catalog
    curl: (60) SSL certificate problem: unable to get local issuer certificate
    More details here: https://curl.se/docs/sslcerts.html

    curl failed to verify the legitimacy of the server and therefore could not
    establish a secure connection to it. To learn more about this situation and
    how to fix it, please visit the web page mentioned above.

    原因:执行kk的机器没有仓库的私有证书

    解决:

    方案一:将仓库registry节点的/etc/docker/certs.d/dockerhub.k8s.local/ca.crt拷贝到执行kk的机器的/etc/pki/ca-trust/source/anchors/这个目录,然后执行:

    [root@192-168-10-87 anchors]# update-ca-trust extract
    # 验证,没有提示证书问题了
    [root@192-168-10-87 kk-offline-install]# curl https://dockerhub.k8s.local/v2/_catalog
    {"repositories":[]}

    方案二:开启跳过证书校验

    spec:
    registry:
    privateRegistry: "dockerhub.k8s.local"
    auths:
    "dockerhub.k8s.local":
    username: admin
    password: Harbor12345
    skipTLSVerify: true # 如果是自签证书,开启跳过验证
  7. 配置registry启用http仓库时,根据官方文档示例用下面的配置

    registry:
    privateRegistry: "http://dockerhub.k8s.local"
    auths:
    "dockerhub.k8s.local":
    username: admin
    password: Harbor12345
    plainHTTP: true #启用http
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: ["dockerhub.k8s.local"]

    原因:发现并没有用http端口部署

    [root@node3 ~]# ss -tlnp | grep -E '80|443'
    LISTEN 0 32768 *:443 *:* users:(("registry",pid=5664,fd=3))
    [root@node3 ~]# ^C
    [root@node3 ~]# ps aux | grep registry
    root 5664 0.0 0.1 121796 25028 ? Ssl 21:29 0:00 /usr/local/bin/registry serve /etc/kubekey/registry/config.yaml
    root 6251 0.0 0.0 6116 1920 pts/0 S+ 21:34 0:00 grep --color=auto registry
    [root@node3 ~]# cat /etc/kubekey/registry/config.yaml
    version: 0.1
    log:
    fields:
    service: registry
    storage:
    cache:
    layerinfo: inmemory
    filesystem:
    rootdirectory: /mnt/registry
    http:
    addr: :443
    tls:
    certificate: /etc/ssl/registry/ssl/http:.pem
    key: /etc/ssl/registry/ssl/http:-key.pem

    解决:改回https部署

  8. 运行kk离线安装命令时,提示如下错误

    FATA[0000] pulling image: rpc error: code = NotFound desc = failed to pull and unpack image "dockerhub.k8s.local/kubesphere/pause:3.9": failed to resolve reference "dockerhub.k8s.local/kubesphere/pause:3.9": dockerhub.k8s.local/kubesphere/pause:3.9: not found: Process exited with status 1

    原因:dockerhub.k8s.local/kubesphere/pause:3.9找不到,因为离线镜像时,里面存的实际镜像名是dockerhub.k8s.local/kubesphereio/pause:3.9,多了个io

    解决:修改config-sample.yaml,然后重新执行kk离线安装命令

    spec:
    registry:
    namespaceOverride: kubesphereio #添加名称覆写
  9. 修改manifest-sample.yaml文件,在spec:下添加如下

    spec:
    preInstall:
    - name: install-local-rpms
    commands:
    - echo "===> 解压本地 RPM 离线包..."
    - mkdir -p /opt/kk-rpms
    - if [ ! -d /opt/kk-rpms/repodata ]; then
    echo "===> 第一次解压 kk-rpms.tar.gz";
    tar -xzf ./kk-rpms.tar.gz -C /opt/kk-rpms;
    else
    echo "===> RPM 仓库已存在,跳过解压";
    fi

    - echo "===> 配置本地 yum 仓库..."
    - if ! dnf repolist | grep -q "kk-local"; then
    echo -e "[kk-local]\nname=KK Offline Repo\nbaseurl=file:///opt/kk-rpms\nenabled=1\ngpgcheck=0" > /etc/yum.repos.d/kk-local.repo;
    else
    echo "===> KK 本地仓库已配置,跳过";
    fi

    - echo "===> 安装 conntrack socat tar(若未安装)..."
    - dnf install -y conntrack socat tar || echo "===> 忽略已安装组件"

    原因:经测试preInstall没用,根本不会执行

    解决:可能需要自己写脚本手动操作安装。用shell脚本需要手动输入密码及确认host key,因此可以采用go写脚本,直接编译成可执行文件,达到一键部署。

  10. 运行kk安装时报错

    #kk安装错误信息
    failed: [node3] [RestartETCD] exec failed after 3 retries: start etcd failed: Failed to exec command: sudo -E /bin/bash -c "systemctl daemon-reload && systemctl restart etcd && systemctl enable etcd"
    Job for etcd.service failed because a timeout was exceeded.
    See "systemctl status etcd.service" and "journalctl -xeu etcd.service" for details.: Process exited with status 1
    #查看etcd详细信息的错误
    [root@192-168-10-30 ~]# journalctl -xeu etcd | tail -50
    Jul 04 10:34:38 node1 etcd[41618]: {"level":"warn","ts":"2025-07-04T10:34:38.880411+0800","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"192.168.10.31:41466","server-name":"","error":"remote error: tls: bad certificate"}
    Jul 04 10:34:38 node1 etcd[41618]: {"level":"warn","ts":"2025-07-04T10:34:38.886267+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"6808450bb7874830","rtt":"0s","error":"tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, ::1, 192.168.10.219, 192.168.10.226, 192.168.10.213, not 192.168.10.31"}
    Jul 04 10:34:38 node1 etcd[41618]: {"level":"warn","ts":"2025-07-04T10:34:38.915113+0800","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"192.168.10.32:51858","server-name":"","error":"remote error: tls: bad certificate"}

    原因:kk曾经安装过其他集群,加--debug可以看到,它是从本地复制的证书,本地还是之前集群的安装证书信息

    [root@192-168-10-87 ~]# ./kk create cluster -f config-sample.yaml --debug
    22:42:41 EDT scp local file /root/kubekey/pki/etcd/node-node1-key.pem to remote /tmp/kubekey/etc/ssl/etcd/ssl/node-node1-key.pem success
    [root@192-168-10-87 ~]# ll /root/kubekey/pki/etcd/
    total 80
    -rw-------. 1 root root 1679 Jul 3 22:42 admin-node1-key.pem
    -rw-r--r--. 1 root root 1375 Jul 3 22:42 admin-node1.pem

    解决:清除和kk脚本的同级目录kubekey

CosyVoice 是通义实验室语音团队推出的生成式语音大模型,提供舒适自然的语音合成能力。

安装

  1. PyCharm IDEA中拉取代码git@github.com:FunAudioLLM/CosyVoice.git

  2. PyCharm IDEA中的终端里面拉取子项目,在终端里面执行git submodule update --init --recursive

  3. 下载安装Miniconda,选择下载Miniconda Installers,默认安装即可。

  4. PyCharm IDEA配置Conda,在Settings->Project:CosyVoice->Python Interpreter中点击Add Interpreter->Add Local Interpreter界面里面填入

    Type: Conda
    Python: 3.12
    Name: CosyVoice
    Path to conda: C:\ProgramData\Miniconda3\condabin\conda.bat
  5. PyCharm IDEA中点击左下角的Terminal,然后在窗口标签的+旁点击下拉箭头,选择Command Prompt终端;注意提示符括号里要是CosyVoice,不然会将包安装到系统环境。例如:(CosyVoice) D:\workspace\CosyVoice>

  6. 安装库,在Command Prompt终端中执行pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

  7. 安装FFmpeg,在Command Prompt终端中执行conda install ffmpeg -c conda-forge

    #额外非必要步骤,只是这个工具可以从视频提取音频,可以拿到后面的素材模板
    ffmpeg -i input.mkv -q:a 0 -map a output.mp3
  8. 下载大模型,在PyCharm IDEA中新建download_model.py,然后在文件上右键运行RUN 'download_model'

    from modelscope import snapshot_download

    model_dir = snapshot_download(
    'iic/CosyVoice2-0.5B',
    local_dir='pretrained_models/CosyVoice2-0.5B'
    )

    print(f'Model downloaded to: {model_dir}')
  9. PyCharm IDEA打开项目根目录中的webui.py文件,在代码中找到if __name__ == '__main__':这一行,然后点击前面的运行符号。

  10. 运行成功后,访问http://localhost:8000/,然后设置如下参数,点击`生成音频`

    输入合成文本:要转换音频的文字
    选择推理模式:3s极速复刻
    选择prompt音频文件:上传要模仿的样本音频
    输入prompt文本:样本音频对应的文字

部署

# 创建python环境
conda create -n CosyVoice630 python=3.10 -y
# 激活
conda activate CosyVoice630
# 下载依赖
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# 克隆
conda create -n CosyVoice630_vllm --clone CosyVoice630
# 激活新的
conda activate CosyVoice630_vllm
# 安装vllm
pip install vllm==v0.9.0 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py

环境准备

  1. 禁用 nouveau 驱动(NVIDIA 官方推荐)

    # 查看 nouveau 是否加载
    lsmod | grep nouveau
    lsmod | grep nvidia #检查是否有老版本
    # 有卸载
    sudo /usr/bin/nvidia-uninstall
    sudo rm -rf /usr/local/cuda*
    sudo rm -rf /lib/modules/$(uname -r)/kernel/drivers/video/nvidia*
    sudo rm -rf /etc/modprobe.d/nvidia*
    # 黑名单禁用 nouveau 驱动(两条配置)
    sudo bash -c "echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nouveau.conf"
    sudo bash -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
    # 查看确认配置
    cat /etc/modprobe.d/blacklist-nouveau.conf
    # 重建 initramfs
    sudo dracut --force
    # 再次检查 nouveau 是否加载
    lsmod | grep nouveau
    # 重启系统生效
    sudo reboot
  2. 安装 CUDA 12.1

    # 下载 CUDA 12.1 安装包
    wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
    # 运行安装包(根据提示完成安装)
    sudo sh cuda_12.1.1_530.30.02_linux.run
    # 验证 CUDA 安装
    nvidia-smi
    # 添加 CUDA bin 到 PATH
    echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
    # 添加 CUDA lib64 到 LD_LIBRARY_PATH
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    # 使配置立即生效
    source ~/.bashrc
    # 验证 nvcc
    nvcc --version
  3. 安装 Miniconda(Python 发行版)

    # 下载 Miniconda 安装脚本
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # 执行安装脚本(按提示操作)
    bash Miniconda3-latest-Linux-x86_64.sh
    # 初始化 conda 环境变量
    source /root/miniconda3/etc/profile.d/conda.sh
    # 配置envs目录,防止根目录爆满
    vi /etc/bashrc
    # 末尾添加
    export CONDA_ENVS_PATH=/data/envs
    # 立即生效
    source /etc/bashrc
  4. 安装环境依赖

    dnf update   #apt-get update
    dnf install sox sox-devel -y #apt-get install sox libsox-dev -y

常见问题

  1. 安装依赖时WeTextProcessing安装失败提示pynini错误时

    解决:执行conda install -c conda-forge pynini=2.1.5,版本号可以在错误日志里面查看

    原因:用 conda 安装的这种包一般是编译好的二进制包,不需要自己编译,避免了你之前 pip 安装时遇到的编译错误

    详细错误信息:

    (ai-tts) D:\workspace\ai-tts>pip install WeTextProcessing==1.0.3
    Collecting WeTextProcessing==1.0.3
    Downloading WeTextProcessing-1.0.3-py3-none-any.whl.metadata (7.2 kB)
    Collecting pynini==2.1.5 (from WeTextProcessing==1.0.3)
    Downloading pynini-2.1.5.tar.gz (627 kB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 627.6/627.6 kB 106.3 kB/s eta 0:00:00
    Preparing metadata (setup.py) ... done
    Collecting importlib-resources (from WeTextProcessing==1.0.3)
    Downloading importlib_resources-6.5.2-py3-none-any.whl.metadata (3.9 kB)
    Requirement already satisfied: Cython>=0.29 in c:\users\lenovo\.conda\envs\ai-tts\lib\site-packages (from pynini==2.1.5->WeTextProcessing==1.0.3) (3.1.2)
    Downloading WeTextProcessing-1.0.3-py3-none-any.whl (2.0 MB)
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 58.2 kB/s eta 0:00:00
    Downloading importlib_resources-6.5.2-py3-none-any.whl (37 kB)
    Building wheels for collected packages: pynini
    DEPRECATION: Building 'pynini' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour
    change. A possible replacement is to use the standardized build interface by setting the --use-pep517 option, (possibly combined with --no-build-isolation), or adding a pyproject.toml file to the source tree of 'pynini'. Discussion can be found at https://github.com/pypa/pip/issues/6334
    Building wheel for pynini (setup.py) ... error
    error: subprocess-exited-with-error

    × python setup.py bdist_wheel did not run successfully.
    exit code: 1
    ╰─> [59 lines of output]
    C:\Users\Lenovo\.conda\envs\ai-tts\lib\site-packages\setuptools\dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
    !!

    ********************************************************************************
    Please consider removing the following classifiers in favor of a SPDX license expression:

    License :: OSI Approved :: Apache Software License

    See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
    ********************************************************************************

    !!
    self._finalize_license_expression()
    running bdist_wheel
    running build
    running build_py
    creating build\lib.win-amd64-cpython-310\pynini
    copying pynini\__init__.py -> build\lib.win-amd64-cpython-310\pynini
    creating build\lib.win-amd64-cpython-310\pywrapfst
    copying pywrapfst\__init__.py -> build\lib.win-amd64-cpython-310\pywrapfst
  2. 因网络原因安装conda install -c conda-forge pynini=2.1.5失败

    # ---有外网的环境-----------
    # 模拟相同的python环境
    conda create -n ai-tts python=3.10 -y
    conda activate ai-tts
    # 执行无外网环境安装不了的包(只区分linux x86还是arm,不区分linux具体发行版及版本)
    conda install -c conda-forge pynini=2.1.5
    # 创建离线包的目录
    mkdir /mnt/d/ai/offline-env
    # 导出离线包的下载地址
    conda list --explicit > /mnt/d/ai/offline-env/ai-tts.txt
    cd offline-env/
    # 下载
    wget -i ai-tts.txt
    # 压缩离线包
    tar czvf offline-env.tar.gz offline-env/
    # --------------无外网环境------------------
    # 解压离线包
    tar -xzf offline-env.tar.gz
    cd offline-env/
    # 进入离线目录,离线安装
    conda install --offline --use-local *.tar.bz2

参考:

https://doupoa.site/archives/581

环境

Kunpeng 920服务器,cpu架构是aarch64(arm)。

安装

  1. zstack下载地址找到对应系统的镜像,我这里是鲲鹏920的,cpu架构是aarch64的,因此选择4.8.24版本的ky10sp3 aarch64 iso

  2. 然后通过Rufus(windows)或balenaetcher(Mac)烧录上面下载的镜像到u盘。

  3. 服务器启动时,按[.]键进入BIOS,开启CPU VT和超线程HT选项(该cpu没找到该配置),设置电源策略为性能模式(非节能模式)

  4. 插入u盘到服务器上,重启按F2进入系统盘选择,需要输入密码,密码在服务器标签上,选择U盘启动。

  5. 选择zstack安装

  6. 在安装界面,配置自动定义分区,分区信息如下,可以删除自动分区的信息,新建下面分区

    # 只有一个磁盘的情况:
    /boot/efi 1024MiB
    /boot 1024MiB
    / 65.5TiB #剩余所有空间(根分区)
    # 安装好之后磁盘的输出信息如下
    [root@exxk ~]# lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sda 8:0 0 65.5T 0 disk
    ├─sda1 8:1 0 1G 0 part /boot/efi
    ├─sda2 8:2 0 1G 0 part /boot
    └─sda3 8:3 0 65.5T 0 part
    └─zstack-root 252:0 0 65.5T 0 lvm /
  7. 在安装界面,关闭网卡,设置hostname。(这里其实可以界面添加bond0,但是不添加,后面在命令行添加)

  8. 在安装界面,选择计算节点安装模式。(不选择安装管理节点,后面通过命令进行升级管理节点)

  9. 在安装界面,设置root密码,然后点击安装即可。

  10. 安装完等待重启成功,拔掉u盘

  11. 输入密码,登录进系统,配置bond,具体命令如下:

    #创建bond0
    zs-bond-ab -c bond0
    # zs-nic-to-bond -a bond0 [网卡名] #将插入网线的网卡添加至bond0
    zs-nic-to-bond -a bond0 enp125s0f0
    zs-vlan-c bond0 10
    #zs-network-setting -b bond0.10 [ip] [掩码] [网关] #创建网桥br_bond0指定网络IP、掩码和网关
    zs-network-setting -b bond0.10 192.168.10.254 255.255.255.0 192.168.10.1
  12. 检查网络是否正常:是否能ping通网关等,局域网内是否能通过ssh连接到服务器;如果不能,检查交换机配置是否正常,交换机需设置trunk类型,出入目标端口是否设置正确等。相关命令参考:port trunk permit vlan 10

  13. 然后执行bash /opt/zstack-installer.bin -E安装升级为管理节点。

  14. 因为cpu是arm的,没有免费许可证(免费许可证只支持一台x86物理机),因此需要先申请证书:在右上角我的头像->许可证管理->下载请求码,根据请求码去申请证书,然后在该页面点击上传许可证

  15. 然后根据引导,依次创建区域、集群、添加物理机、添加镜像服务器、添加主存储、创建计算规格、添加镜像、创建二层网络、创建三层网络。

nginx代理配置https域名访问

  1. 在运营管理->访问控制->控制台代理界面点击设置控制台代理地址,可以修改为域名zstack.exxkspace.cn(和zstack访问平台设置一致);端口可以改也可以不改,但一定要和nginx配置的端口一致。

  2. nginx部署配置文件,采用docker-compose部署,配置如下

    services:
      nginx:
        container_name: nginx
        image: nginx:latest
        restart: always
        network_mode: bridge
        ports:
          - '80:80'
          - '443:443'
          - '4900:4900' #这句很关键,要和控制台配置结合用
        volumes:
          - /etc/localtime:/etc/localtime
          - ./nginx:/etc/nginx
  3. zstack的nginx域名关键配置部分

    server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name zstack.exxkspace.cn; #zstack的代理地址

    ssl_certificate certs/exxkspace.cn.crt;
    ssl_certificate_key certs/exxkspace.cn.key.nopass;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 3600s;
    ssl_session_tickets off;

    location / {
    include conf.d/proxy.default;
    proxy_pass http://192.168.10.254:5000;
    }

    # web控制台用443
    location /websockify {
    proxy_pass http://192.168.10.254:443; #这里的端口也要和控制台的端口配合使用
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $http_connection;
    proxy_set_header Host $host;
    proxy_read_timeout 86400;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
    root /usr/share/nginx/html;
    }
    }

    # web控制台用4900
    server {
    listen 4900 ssl;
    server_name zstack.exxkspace.cn; #控制台设置的域名地址

    ssl_certificate certs/exxkspace.cn.crt;
    ssl_certificate_key certs/exxkspace.cn.key.nopass;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 3600s;
    ssl_session_tickets off;

    location / {
    proxy_pass http://192.168.10.254:4900; #这里的端口也要和控制台的端口配合使用
    include conf.d/proxy.default;
    }
    }
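
    修改配置后建议先在容器里校验语法再热加载(容器名 nginx 与上面 docker-compose 中的 container_name 一致):

    # 校验 nginx 配置语法
    docker exec nginx nginx -t
    # 语法无误后热加载
    docker exec nginx nginx -s reload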

参考:

ZStack Cloud用户手册

PromQL 基础用法

  • 指标名(metric name): Prometheus 采集的数据都有一个指标名,比如:node_cpu_seconds_total
  • 标签过滤(label filter):用 {key="value"} 过滤标签,例如:node_cpu_seconds_total{mode="idle", instance="host1"}
  • 时间范围选择器:时间序列可以按区间查询,例如:rate(node_cpu_seconds_total[5m]),这里 rate() 是对 5 分钟内的增长速率求平均。
  • 其他就是一些常见的函数和聚合:max、min、count、stddev、stdvar、irate()、avg by(...) 等。
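
这些用法可以直接在 Prometheus 的 HTTP API 上快速验证。下面是两个示意查询(假设 Prometheus 监听在 9090 端口,如后文 docker-compose 模拟环境;地址与标签按实际替换):

# 即时查询:按 instance 统计 CPU 使用率(irate + avg by)
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
# 区间查询:最近 30 分钟内存可用百分比,步长 60s
curl -s 'http://127.0.0.1:9090/api/v1/query_range' \
  --data-urlencode 'query=node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100' \
  --data-urlencode "start=$(date -d '-30 min' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'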

zstack prometheus配置分析

zstack的监控告警设计图:

Collectd Exporter安装地址: https://github.com/prometheus/collectd_exporter

查看普罗的配置

[root@exxk ~]# cat /usr/local/zstack/prometheus/conf.yaml 
global:
scrape_interval: 20s #每隔 20秒 从所有配置的目标(targets)抓取一次指标数据,实际就是每20秒调一次/metrics接口
scrape_timeout: 5s #抓取每个目标时的超时时间,最多等 5 秒。
evaluation_interval: 10s # 每 10秒 重新计算一次告警规则(alert rules)和记录规则(recording rules)
rule_files: #里面可以是记录规则也可以是告警规则
- /usr/local/zstack/prometheus/rules/zwatch.rule.yml #规则文件,主要用于表达式聚合
- /usr/local/zstack/prometheus/rules/hosts/collectd.rule.yml #规则文件,主要用于表达式聚合
scrape_configs:
- job_name: management-server-exporter
scrape_interval: 10s
scrape_timeout: 5s
file_sd_configs:
- files:
- /usr/local/zstack/prometheus/discovery/management-node/*.json
refresh_interval: 10s
- job_name: baremetal-pxeserver-exporter
scrape_interval: 10s
scrape_timeout: 5s
file_sd_configs:
- files:
- /usr/local/zstack/prometheus/discovery/pxeserver/*.json
refresh_interval: 10s
- job_name: backup-storage-exporter
scrape_interval: 10s
scrape_timeout: 5s
file_sd_configs:
- files:
- /usr/local/zstack/prometheus/discovery/backupStorage/*.json
refresh_interval: 10s
- job_name: vrouter-exporter
scrape_interval: 10s
scrape_timeout: 5s
file_sd_configs:
- files:
- /usr/local/zstack/prometheus/discovery/vrouter/*.json
refresh_interval: 10s
- job_name: custom-metrics-pushgateway
scrape_interval: 10s
scrape_timeout: 5s
honor_labels: true
static_configs:
- targets:
- 192.168.10.2:9091
- job_name: collectd
scrape_interval: 10s
scrape_timeout: 5s
file_sd_configs:
- files:
- /usr/local/zstack/prometheus/discovery/hosts/*.json #配置动态采集设备的ip和端口信息
refresh_interval: 10s #10秒刷新一次配置

查看普罗的采集指标

[root@exxk ~]# cat /usr/local/zstack/prometheus/discovery/hosts/8623bf76e14c4509abd4202ee717e9ad-192-168-10-254.json 
[{"targets":["192.168.10.2:9103","192.168.10.2:9100","192.168.10.2:7069"],"labels":{"hostUuid":"8623bf76e14c4509abd4202ee717e9ad"}},{"targets":["192.168.10.2:9092"],"labels":{}}]
#具体指标内容可以通过ip加端口加/metrics
http://192.168.10.2:9100/metrics #node_exporter 采集的
http://192.168.10.2:9103/metrics #collectd 采集的
http://192.168.10.2:7069/metrics #应该是zstack自己写的一个python采集的
http://192.168.10.2:9092/metrics #pushgateway 的

查看记录规则文件

定义一个 Recording Rule(记录规则),它的目的是将一个复杂的表达式的结果保存成一个新的时间序列(即指标名),方便后续查询和告警使用。

[root@exxk ~]# cat /usr/local/zstack/prometheus/rules/zwatch.rule.yml
groups:
- name: zwatch.rule ## 告警或记录规则组的名称,里面部分指标会依赖基础指标collectd.rule.yml
rules:
- record: ZStack:BaremetalVM::OperatingSystemNetworkOutPackets #定义一个新的时间序列名称(指标)
expr: irate(bm_node_network_transmit_packets{vmUuid!=""}[10m]) #表达式
- record: ZStack:BaremetalVM::DiskFreeCapacityInPercent
expr: ((bm_node_filesystem_avail{vmUuid!=""} + 1) / (bm_node_filesystem_size{vmUuid!=""} + 1)) * 100
.....
[root@exxk ~]# cat /usr/local/zstack/prometheus/rules/hosts/collectd.rule.yml
groups:
- name: collectd ## 记录一些基础指标的转换
rules:
- record: collectd:collectd_virt_virt_cpu_total
expr: irate(collectd_virt_virt_cpu_total[10m]) / 1e+07
- record: collectd:collectd_virt_virt_vcpu
expr: irate(collectd_virt_virt_vcpu[10m]) / 1e+07
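
修改或新增规则文件后,可以先用 promtool 做一次离线校验再让 Prometheus 重新加载(promtool 随 prometheus 发行包提供;若该机器上没有,可在任意装有 prometheus 的机器上执行):

promtool check rules /usr/local/zstack/prometheus/rules/zwatch.rule.yml
promtool check rules /usr/local/zstack/prometheus/rules/hosts/collectd.rule.yml
# 顺带校验主配置
promtool check config /usr/local/zstack/prometheus/conf.yaml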

zStack报警器

页面位置:平台运维->云平台监控->报警器,查看列表的接口

http://192.168.10.2:5000/graphql?gql=zwatchAlarmList

{
"data": {
"zwatchAlarmList": {
"list": [
{
"uuid": "5z6gsgkc5kccpylj9ocgbd647p2700b7",
"name": "Average CPU Utilization of Hosts",
"zhName": "物理机平均CPU使用率",
"description": null,
"period": 300,
"namespace": "ZStack/Host",
"metricName": "CPUAverageUsedUtilization", #这里可以得到指标的名称,对应zwatch.rule.yml
"threshold": 80, #阈值
"repeatCount": -1,
"repeatInterval": 1800,
"enableRecovery": false,
"emergencyLevel": "Important",
"comparisonOperator": "GreaterThanOrEqualTo", #条件,<>=
"eventName": null,
"thirdpartyPlatformName": "-",
"state": "Enabled",
"status": "OK",
"topicNum": 1,
"createDate": "Apr 2, 2025 12:49:41 PM",
"lastOpDate": "Apr 2, 2025 12:49:41 PM",
"actions": [
{
"alarmUuid": "5z6gsgkc5kccpylj9ocgbd647p2700b7",
"subscriptionUuid": null,
"actionUuid": "e7d6f5e23bb74e99a2777126078b551c",
"actionType": "sns",
"__typename": "AlarmActions"
}
],
"labels": [],
"userTag": null,
"owner": {
"uuid": "36c27e8ff05c4780bf6d2fa65700f22e",
"name": "admin",
"__typename": "BasicOwner"
},
"platform": null,
"__typename": "ZWatchAlarmVO"
},
......
],
"total": 4,
"__typename": "ZWatchAlarmVoResp"
}
}
}

得到物理机报警器的具体表达式

1. 物理机平均CPU使用率:  
# ≥ 80% , 并持续 5分钟  严重
record:ZStack:Host::CPUAverageUsedUtilization
复杂expr:avg by(hostUuid) ((sum by(hostUuid) (100 - collectd_cpu_percent{hostUuid!="",type="idle"}) / sum by(hostUuid) (collectd_cpu_percent{hostUuid!=""})) * 100)

#collectd_cpu_percent是由collectd提供的指标,如果不想安装collectd,需要将collectd转换成node_exporter里面的指标
复杂expr: 100 - avg by(hostUuid) (irate(node_cpu_seconds_total{mode="idle",hostUuid!=""}[5m])) * 100

2. 物理机内存已用百分比:
# ≥ 80% , 并持续 5分钟  严重
record:ZStack:Host::MemoryUsedInPercent
复杂expr:100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

3. 物理机内存已用百分比 kvm
# ≥ 80% , 并持续 5分钟  严重
record:ZStack:Host::MemoryUsedCapacityPerHostInPercent

4. 物理机根盘使用率报警器
# ≥ 80% , 并持续 10分钟 紧急
record:ZStack:Host::DiskRootUsedCapacityInPercent
复杂expr:(sum by(hostUuid) (node_filesystem_size{fstype!="rootfs",hostUuid!="",mountpoint="/"} - node_filesystem_avail{fstype!="rootfs",hostUuid!="",mountpoint="/"}) / sum by(hostUuid) (node_filesystem_size{fstype!="rootfs",hostUuid!="",mountpoint="/"})) * 100

后续可以直接通过record进行promql查询。

根据zstack的配置,模拟搭建一个prometheus

docker-compose.yml

services:
  prometheus:
    image: prom/prometheus:v2.37.1
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/config/prometheus.yml
    volumes:
      - ./config:/etc/prometheus/config
      - /etc/localtime:/etc/localtime:ro

配置目录结构如下

.
├── config
│   ├── collectd.rule.yml #基础规则文件
│   ├── hosts
│   │   ├── PM_exxk_192.168.10.2.json
│   │   ├── PM_xxx_xx.xx.xx.xx.json #其他节点配置文件,文件命名规则,PM_主机名_ip.json
│   │   └── .... #更多
│   ├── prometheus.yml #普罗配置文件
│   └── zwatch.rule.yml #普罗规则聚合文件
└── docker-compose.yml

PM_exxk_192.168.10.2.json

[{"targets":["192.168.10.2:9100"],"labels":{"hostUuid":"PM_exxk_192.168.10.2"}}]

collectd.rule.yml

groups:
- name: collectd
rules:
- record: collectd:collectd_virt_virt_cpu_total
expr: irate(collectd_virt_virt_cpu_total[10m]) / 1e+07
- record: collectd:collectd_virt_virt_vcpu
expr: irate(collectd_virt_virt_vcpu[10m]) / 1e+07
- record: collectd:collectd_virt_memory
expr: collectd_virt_memory
- record: collectd:collectd_disk_disk_octets_read
expr: irate(collectd_disk_disk_octets_0[10m])
- record: collectd:collectd_disk_disk_octets_write
expr: irate(collectd_disk_disk_octets_1[10m])
- record: collectd:collectd_disk_disk_ops_read
expr: irate(collectd_disk_disk_ops_0[10m])
- record: collectd:collectd_disk_disk_ops_write
expr: irate(collectd_disk_disk_ops_1[10m])
- record: collectd:collectd_disk_disk_time_read
expr: irate(collectd_disk_disk_time_0[10m])
- record: collectd:collectd_disk_disk_time_write
expr: irate(collectd_disk_disk_time_1[10m])
- record: collectd:collectd_interface_if_errors_rx
expr: irate(collectd_interface_if_errors_0[10m])
- record: collectd:collectd_interface_if_errors_tx
expr: irate(collectd_interface_if_errors_1[10m])
- record: collectd:collectd_interface_if_octets_rx
expr: irate(collectd_interface_if_octets_0[10m])
- record: collectd:collectd_interface_if_octets_tx
expr: irate(collectd_interface_if_octets_1[10m])
- record: collectd:collectd_interface_if_packets_rx
expr: irate(collectd_interface_if_packets_0[10m])
- record: collectd:collectd_interface_if_packets_tx
expr: irate(collectd_interface_if_packets_1[10m])
- record: collectd:collectd_memory
expr: collectd_memory
- record: collectd:collectd_cpu_percent
expr: collectd_cpu_percent
- record: collectd:wmi_cpu_time_total
expr: wmi_cpu_time_total
- record: collectd:collectd_virt_disk_octets_read
expr: irate(collectd_virt_disk_octets_0[10m])
- record: collectd:collectd_virt_disk_octets_write
expr: irate(collectd_virt_disk_octets_1[10m])
- record: collectd:collectd_virt_disk_ops_read
expr: irate(collectd_virt_disk_ops_0[10m])
- record: collectd:collectd_virt_disk_ops_write
expr: irate(collectd_virt_disk_ops_1[10m])
- record: collectd:collectd_virt_if_dropped_read
expr: irate(collectd_virt_if_dropped_0[10m])
- record: collectd:collectd_virt_if_dropped_write
expr: irate(collectd_virt_if_dropped_1[10m])
- record: collectd:collectd_virt_if_errors_rx
expr: irate(collectd_virt_if_errors_0[10m])
- record: collectd:collectd_virt_if_errors_tx
expr: irate(collectd_virt_if_errors_1[10m])
- record: collectd:collectd_virt_if_octets_rx
expr: irate(collectd_virt_if_octets_0[10m])
- record: collectd:collectd_virt_if_octets_tx
expr: irate(collectd_virt_if_octets_1[10m])
- record: collectd:collectd_virt_if_packets_rx
expr: irate(collectd_virt_if_packets_0[10m])
- record: collectd:collectd_virt_if_packets_tx
expr: irate(collectd_virt_if_packets_1[10m])
- expr: node_filesystem_free_bytes
record: node_filesystem_free
- expr: node_filesystem_avail_bytes
record: node_filesystem_avail
- expr: node_filesystem_size_bytes
record: node_filesystem_size

zwatch.rule.yml

groups:
- name: zwatch.rule
rules:
- record: ZStack:Host::ReclaimedMemoryInBytes
expr: clamp_min(sum(collectd_virt_memory{hostUuid!="", type="max_balloon"}) by (hostUuid) - on(hostUuid) sum(collectd_virt_memory{hostUuid!="", type="actual_balloon"}) by (hostUuid), 0)
- record: ZStack:Host::DiskCapacityInBytes
expr: node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}
- record: ZStack:Host::DiskAllReadBytes
expr: sum(irate(collectd_disk_disk_octets_0{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::CPUAverageWaitUtilization
expr: avg((sum(collectd_cpu_percent{type="wait", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100) by (hostUuid)
- record: ZStack:Host::DiskAllWriteOps
expr: sum(irate(collectd_disk_disk_ops_1{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::NetworkAllOutBytesByServiceType
expr: sum(irate(host_network_all_out_bytes_by_service_type{hostUuid!=""}[10m])) by(hostUuid, service_type)
- record: ZStack:Host::DiskFreeCapacityInPercent
expr: ((node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} + 1) / (node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} + 1)) * 100
- record: ZStack:Host::CPUAverageUsedUtilization
# 需要安装collectd才能用 collectd_cpu_percent这个指标
# expr: avg((sum(100 - collectd_cpu_percent{type="idle", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100) by (hostUuid)
# 这里采用node_exporter采集的指标
expr: avg by(hostUuid) ((sum by(hostUuid, cpu) (node_cpu_seconds_total{mode!="idle", hostUuid!=""})/sum by(hostUuid, cpu) (node_cpu_seconds_total{hostUuid!=""})) * 100)
- record: ZStack:Host::DiskRootUsedCapacityInBytes
expr: sum(node_filesystem_size{hostUuid!="", fstype!="rootfs",mountpoint="/"} - node_filesystem_avail{hostUuid!="", fstype!="rootfs",mountpoint="/"}) by(hostUuid)
- record: ZStack:Host::NetworkOutPackets
expr: irate(collectd_interface_if_packets_1{hostUuid!=""}[10m])
- record: ZStack:Host::NetworkOutDropped
expr: irate(collectd_interface_if_dropped_1{hostUuid!=""}[10m])
- record: ZStack:Host::DiskWriteBytesWwid
expr: irate(collectd_disk_disk_octets_1[10m:]) * on (disk, hostUuid) group_left(wwid) node_disk_wwid
- record: ZStack:Host::CPUAllIdleUtilization
expr: (sum(collectd_cpu_percent{type="idle", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100
- record: ZStack:Host::NetworkAllInBytes
expr: sum(irate(host_network_all_in_bytes{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::NetworkInDropped
expr: irate(collectd_interface_if_dropped_0{hostUuid!=""}[10m])
- record: ZStack:Host::DiskUsedCapacityInPercent
expr: (((node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} - node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) + 1) / (node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} + 1)) * 100
- record: ZStack:Host::DiskWriteBytes
expr: irate(collectd_disk_disk_octets_1{hostUuid!=""}[10m])
- record: ZStack:Host::DiskZStackUsedCapacityInPercent
expr: (sum(zstack_used_capacity_in_bytes) by(hostUuid) / sum(node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid)) * 100
- record: ZStack:Host::NetworkInErrors
expr: irate(collectd_interface_if_errors_0{hostUuid!=""}[10m])
- record: ZStack:Host::DiskRootUsedCapacityInPercent
expr: (sum(node_filesystem_size{hostUuid!="", fstype!="rootfs",mountpoint="/"} - node_filesystem_avail{hostUuid!="", fstype!="rootfs",mountpoint="/"}) by(hostUuid) / sum(node_filesystem_size{hostUuid!="", fstype!="rootfs",mountpoint="/"}) by(hostUuid)) * 100
- record: ZStack:Host::NetworkAllInPackets
expr: sum(irate(host_network_all_in_packages{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::DiskAllUsedCapacityInBytes
expr: sum(node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} - node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid)
- record: ZStack:Host::DiskWriteOps
expr: irate(collectd_disk_disk_ops_1{hostUuid!=""}[10m])
- record: ZStack:Host::NetworkAllOutBytes
expr: sum(irate(host_network_all_out_bytes{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::VolumeGroupUsedCapacityInPercent
expr: ((vg_size - vg_avail + 1) / (vg_size + 1)) * 100
- record: ZStack:Host::NetworkAllInErrors
expr: sum(irate(host_network_all_in_errors{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::NetworkAllInBytesByServiceType
expr: sum(irate(host_network_all_in_bytes_by_service_type{hostUuid!=""}[10m])) by(hostUuid, service_type)
- record: ZStack:Host::NetworkAllOutErrorsByServiceType
expr: sum(irate(host_network_all_out_errors_by_service_type{hostUuid!=""}[10m])) by(hostUuid, service_type)
- record: ZStack:Host::NetworkOutErrors
expr: irate(collectd_interface_if_errors_1{hostUuid!=""}[10m])
- record: ZStack:Host::NetworkAllInPacketsByServiceType
expr: sum(irate(host_network_all_in_packages_by_service_type{hostUuid!=""}[10m])) by(hostUuid, service_type)
- record: ZStack:Host::DiskTransUsedCapacityInBytes
expr: sum(node_filesystem_size{hostUuid!="", fstype!="rootfs"} - node_filesystem_avail{hostUuid!="", fstype!="rootfs"}) by(hostUuid) - sum(zstack_used_capacity_in_bytes) by(hostUuid)
- record: ZStack:Host::NetworkAllInErrorsByServiceType
expr: sum(irate(host_network_all_in_bytes_by_service_type{hostUuid!=""}[10m])) by(hostUuid, service_type)
- record: ZStack:Host::DiskUsedCapacityInBytes
expr: node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} - node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}
- record: ZStack:Host::DiskReadOpsWwid
expr: irate(collectd_disk_disk_ops_0[10m:]) * on (disk, hostUuid) group_left(wwid) node_disk_wwid
- record: ZStack:Host::DiskLatencyWwid
expr: (delta(collectd_disk_disk_io_time_0[1m]) + delta(collectd_disk_disk_io_time_1[1m])+1) / (delta(collectd_disk_disk_ops_0[1m]) + delta(collectd_disk_disk_ops_1[1m])+1) * on (disk, hostUuid) group_left(wwid) node_disk_wwid
- record: ZStack:Host::CPUAllUsedUtilization
expr: clamp((sum(100 - collectd_cpu_percent{type="idle", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100, 0, 100)
- record: ZStack:Host::DiskReadBytesWwid
expr: irate(collectd_disk_disk_octets_0[10m:]) * on (disk, hostUuid) group_left(wwid) node_disk_wwid
- record: ZStack:Host::CPUAverageIdleUtilization
expr: avg((sum(collectd_cpu_percent{type="idle", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100) by (hostUuid)
- record: ZStack:Host::CPUAverageUserUtilization
expr: avg((sum(collectd_cpu_percent{type="user", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100) by (hostUuid)
- record: ZStack:Host::MemoryUsedBytes
expr: node_memory_MemTotal_bytes-node_memory_MemAvailable_bytes
- record: ZStack:Host::MemoryFreeInPercent
expr: 100 * (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)
- record: ZStack:Host::NetworkOutBytes
expr: irate(collectd_interface_if_octets_1{hostUuid!=""}[10m])
- record: ZStack:Host::NetworkAllOutPacketsByServiceType
expr: sum(irate(host_network_all_out_packages_by_service_type{hostUuid!=""}[10m])) by(hostUuid, service_type)
- record: ZStack:Host::DiskAllWriteBytes
expr: sum(irate(collectd_disk_disk_octets_1{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::DiskTotalCapacityInBytes
expr: sum(node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid)
- record: ZStack:Host::MemoryUsedInPercent
expr: 100 * (1 - (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes))
- record: ZStack:Host::DiskLatency
expr: (delta(collectd_disk_disk_io_time_0[1m]) + delta(collectd_disk_disk_io_time_1[1m])+1) / (delta(collectd_disk_disk_ops_0[1m]) + delta(collectd_disk_disk_ops_1[1m])+1)
- record: ZStack:Host::DiskReadBytes
expr: irate(collectd_disk_disk_octets_0{hostUuid!=""}[10m])
- record: ZStack:Host::DiskWriteOpsWwid
expr: irate(collectd_disk_disk_ops_1[10m]) * on (disk, hostUuid) group_left(wwid) node_disk_wwid
- record: ZStack:Host::NetworkInPackets
expr: irate(collectd_interface_if_packets_0{hostUuid!=""}[10m])
- record: ZStack:Host::CPUUsedUtilization
expr: abs(100 - collectd_cpu_percent{type="idle", hostUuid!=""})
- record: ZStack:Host::DiskAllFreeCapacityInPercent
expr: (sum(node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid) / sum(node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid)) * 100
- record: ZStack:Host::DiskTransUsedCapacityInPercent
expr: (sum(node_filesystem_size{hostUuid!="", fstype!="rootfs"} - node_filesystem_avail{hostUuid!="", fstype!="rootfs"}) by(hostUuid) - sum(zstack_used_capacity_in_bytes) by(hostUuid)) / sum(node_filesystem_size{hostUuid!="", fstype!="rootfs"}) by(hostUuid) * 100
- record: ZStack:Host::DiskAllReadOps
expr: sum(irate(collectd_disk_disk_ops_0{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::NetworkAllOutPackets
expr: sum(irate(host_network_all_out_packages{hostUuid!=""}[10m])) by(hostUuid)
- record: ZStack:Host::NetworkInBytes
expr: irate(collectd_interface_if_octets_0{hostUuid!=""}[10m])
- record: ZStack:Host::DiskReadOps
expr: irate(collectd_disk_disk_ops_0{hostUuid!=""}[10m])
- record: ZStack:Host::CPUAverageSystemUtilization
expr: avg((sum(collectd_cpu_percent{type="system", hostUuid!=""}) by(hostUuid) / sum(collectd_cpu_percent{hostUuid!=""}) by(hostUuid)) * 100) by (hostUuid)
- record: ZStack:Host::DiskAllUsedCapacityInPercent
expr: (sum(node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"} - node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid) / sum(node_filesystem_size{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid)) * 100
- record: ZStack:Host::DiskAllFreeCapacityInBytes
expr: sum(node_filesystem_avail{hostUuid!="", fstype!~"proc|tmpfs|rootfs|ramfs|iso9660|rpc_pipefs", mountpoint!~"/tmp/zs-.*"}) by(hostUuid)
- record: ZStack:Host::NetworkAllOutErrors
expr: sum(irate(host_network_all_out_errors{hostUuid!=""}[10m])) by(hostUuid)
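
目录结构中的 prometheus.yml 原文没有贴出,下面是一个按上述目录结构推测的最小写法(采集间隔、job 名称均为假设值):加载两个规则文件,并通过 file_sd 读取 hosts 目录下的节点列表。

# 生成一个最小的 prometheus.yml(路径对应容器内的挂载目录)
cat > config/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/config/collectd.rule.yml
  - /etc/prometheus/config/zwatch.rule.yml

scrape_configs:
  - job_name: zstack-hosts
    file_sd_configs:
      - files:
          - /etc/prometheus/config/hosts/*.json
EOF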

启动普罗:docker-compose up -d
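
启动后可以先确认规则与采集目标是否加载成功(9090 端口对应上面 docker-compose 的映射):

# 查看已加载的规则组,应包含 collectd 和 zwatch.rule 两个 group
curl -s 'http://127.0.0.1:9090/api/v1/rules'
# 查看 file_sd 发现的 node_exporter 采集目标
curl -s 'http://127.0.0.1:9090/api/v1/targets'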

写个代码简单模拟下zStack的告警

规则配置枚举类:AlarmRuleEnum.java

package com.example.prometheus_demo;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public enum AlarmRuleEnum {

CPU_UTILIZATION(
"CPUAverageUsedUtilization",
"ZStack/Host",
"Important",
"Average CPU Utilization of Hosts",
"GreaterThanOrEqualTo",
80, //测试时可以把阈值调小才能触发
300,
1800 //测试时可以把这个告警周期调小,才能更频繁触发
), //物理机平均CPU使用率

MEMORY_UTILIZATION(
"MemoryUsedInPercent",
"ZStack/Host",
"Important",
"Host Memory Utilization",
"GreaterThanOrEqualTo",
80,
300,
1800
), //物理机内存已用百分比

DISK_USAGE(
"DiskRootUsedCapacityInPercent",
"ZStack/Host",
"Emergent",
"Host Root Volume Utilization",
"GreaterThanOrEqualTo",
80,
600,
1800
); //物理机根盘使用率报警器

private final String metricName; //指标名称
private final String namespace; //命名空间
private final String emergencyLevel; //报警级别
private final String name; //名称
private final String comparisonOperator; //条件,大于小于等等
private final int threshold; //阈值
private final int period; //持续时间
private final int repeatInterval; //报警间隔

AlarmRuleEnum(String metricName, String namespace, String emergencyLevel,
String name, String comparisonOperator, int threshold, int period, int repeatInterval) {
this.metricName = metricName;
this.namespace = namespace;
this.emergencyLevel = emergencyLevel;
this.name = name;
this.comparisonOperator = comparisonOperator;
this.threshold = threshold;
this.period = period;
this.repeatInterval = repeatInterval;
}

//todo 省略了get方法

public static AlarmRuleEnum fromMetricName(String metricName) {
for (AlarmRuleEnum rule : values()) {
if (rule.getMetricName().equals(metricName)) {
return rule;
}
}
return null;
}

//GreaterThan > ,GreaterThanOrEqualTo >= ,LessThan< ,LessThanOrEqualTo<=
private static final Map<String, String> OPERATOR_MAP;
static {
Map<String, String> map = new HashMap<>();
map.put("GreaterThan", ">");
map.put("GreaterThanOrEqualTo", ">=");
map.put("LessThan", "<");
map.put("LessThanOrEqualTo", "<=");
OPERATOR_MAP = Collections.unmodifiableMap(map);
}


// ✅ 生成 PromQL 表达式 avg_over_time(ZStack:Host::CPUAverageUsedUtilization[5m]) > 80
public String getPromQl() {
String prefix = namespace.replace("/", ":") + "::";
int minutes = period / 60;
// 映射判断符号
String operatorSymbol = OPERATOR_MAP.getOrDefault(comparisonOperator, ">");
return String.format("avg_over_time(%s%s[%dm]) %s %d", prefix, metricName, minutes,operatorSymbol, threshold);
}

//获取完整的指标名称,唯一
public String getFullMetricName(){
return namespace.replace("/", ":") + "::"+metricName;
}

}
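
调度类会把上面 getPromQl() 生成的表达式(如 avg_over_time(ZStack:Host::CPUAverageUsedUtilization[5m]) >= 80)发给 Prometheus 的 /api/v1/query 接口,再从返回 JSON 的 data.result[].metric.hostUuid 中取出 PM_主机名_ip。实现前可以先用 curl 看一眼返回结构(返回内容仅为示意):

curl -G 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(ZStack:Host::CPUAverageUsedUtilization[5m]) >= 80'
# 返回结构大致如下(数值为示意):
# {"status":"success","data":{"resultType":"vector","result":[
#   {"metric":{"hostUuid":"PM_exxk_192.168.10.2"},"value":[1743580800,"86.5"]}]}}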

告警调度类:AlarmScheduler.java

package com.example.prometheus_demo;

import cn.hutool.http.HttpUtil;
import cn.hutool.json.JSONArray;
import cn.hutool.json.JSONObject;
import cn.hutool.json.JSONUtil;
import com.baomidou.mybatisplus.core.conditions.query.LambdaQueryWrapper;
import com.baomidou.mybatisplus.core.toolkit.Wrappers;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.UUID;
import java.util.concurrent.ScheduledFuture;


@Component
public class AlarmScheduler {

private static final Logger log = LoggerFactory.getLogger(AlarmScheduler.class);

@Resource
private SysAlarmLogService alarmLogService;

//todo 这个需要删除,有定义好的配置文件--------------------------------------------------------------
private static final String PROMETHEUS_URL = "http://127.0.0.1:9090/api/v1/query?query=";

/**
* Spring 提供的线程调度器,支持定时和固定频率执行任务。
*/
private final TaskScheduler scheduler;

public AlarmScheduler() {
// 初始化线程池调度器(用于执行多个定时任务)
ThreadPoolTaskScheduler taskScheduler = new ThreadPoolTaskScheduler();
taskScheduler.setPoolSize(5); // 设置线程池大小
taskScheduler.setThreadNamePrefix("alarm-task-");
taskScheduler.initialize();
this.scheduler = taskScheduler;
}

/**
* 初始化时自动调度所有枚举中定义的规则。
* 多节点部署时建议在这里加分布式锁,避免重复调度。
*/
@PostConstruct
public void scheduleAll() {
log.debug("开始初始化告警规则调度...");
for (AlarmRuleEnum rule : AlarmRuleEnum.values()) {
scheduleRule(rule);
}
log.debug("所有告警规则已初始化调度。");
}

/**
* 为某个告警规则启动调度任务。
* @param rule 告警规则
*/
private void scheduleRule(AlarmRuleEnum rule) {
long intervalMs = rule.getRepeatInterval() * 1000L; //转换成毫秒

// 🚨 多节点部署建议此处加分布式锁控制,如:Redis + Redisson tryLock
// if (!DistributedLock.tryLock(rule.getMetricName())) return;
log.debug("调度告警规则: {},执行周期:{} 毫秒", rule.getMetricName(), intervalMs);
ScheduledFuture<?> future = scheduler.scheduleAtFixedRate(() -> check(rule), intervalMs);
//如果需要取消或者停止任务,可以将future存到变量里面,提供方法就可以进行其他操作了
}

private void check(AlarmRuleEnum rule) {
log.info("正在检查告警规则: {}", rule.getMetricName());
try {
//todo 这部分需要替换,代码里面已经有封装的工具类,参考物理机的监控统计接口---------------------------------------------
String encodedQuery = URLEncoder.encode(rule.getPromQl(), String.valueOf(StandardCharsets.UTF_8));
String url = PROMETHEUS_URL + encodedQuery;
String response = HttpUtil.get(url);
JSONObject json = JSONUtil.parseObj(response);
//todo ---------------替换结束-------------------------------------------------------------------------------
log.debug("告警promql: {} ", rule.getPromQl());
JSONArray results = json.getJSONObject("data").getJSONArray("result");
if (results != null && !results.isEmpty()) {
for (Object obj : results) {
JSONObject item = (JSONObject) obj;
String hostUuid = item.getJSONObject("metric").getStr("hostUuid"); //hostUuid里面数据格式为:PM_主机名_ip
if (hostUuid!=null&&hostUuid.startsWith("PM_")) { //hostUuid以PM_开头的才是额外的物理机,其他物理机由zStack自己的告警进行触发,防止告警重复
String[] parts= hostUuid.split("_");
String hostName=parts[1];
String ip=parts[2]; //获取ip
// 判断是否已有相同 instance、alarmUuid 且状态为 Alarm 的记录
LambdaQueryWrapper<SysAlarmLog> query = Wrappers.<SysAlarmLog>lambdaQuery()
.eq(SysAlarmLog::getSourceUuid, ip)
.eq(SysAlarmLog::getAlarmUuid, rule.getFullMetricName())
.eq(SysAlarmLog::getAlarmStatus, "Alarm");

SysAlarmLog existing = alarmLogService.getOne(query, false);
if (existing != null) {
existing.setTimes(existing.getTimes() == null ? 2 : existing.getTimes() + 1);
// existing.setReadStatus(0); //1 未读取,0已读取,todo 是否需要重置为未读
existing.setFirstTime(new Date()); //更新时间
alarmLogService.updateById(existing);
} else {
SysAlarmLog log = new SysAlarmLog();
log.setUuid(UUID.randomUUID().toString());
log.setSource(2); // 其他物理机
log.setSourceUuid(ip); //来源数据uuid
log.setAlarmName(rule.getName());
log.setAlarmStatus("Alarm"); //报警器状态 Alarm 已告警 OK 监控中
log.setAlarmUuid(rule.getFullMetricName()); //报警器uuid取报警器指标全名,相对唯一
log.setComparisonOperator(rule.getComparisonOperator());
log.setContext(item.toString()); //资源信息,直接放报文内容
log.setEmergencyLevel(rule.getEmergencyLevel());
log.setMetricName(rule.getMetricName());
log.setNamespace(rule.getNamespace());
log.setPeriod(rule.getPeriod());
log.setReadStatus(1); //1 未读取,0已读取
log.setThreshold(rule.getThreshold());
log.setCreateTime(new Date());
log.setFirstTime(new Date());
log.setResourceName(hostName);
log.setResourceUuid(ip);
log.setTimes(1); //报警次数
log.setType("alarm"); //告警消息类型 event 事件报警器 alarm 资源报警器
alarmLogService.save(log);
}
}
}
}
log.info("告警规则 {} 检查完成", rule.getMetricName());
} catch (Exception e) {
// 建议使用日志组件代替
e.printStackTrace();
log.error("告警规则 {} 检查中断", rule.getMetricName(), e);
}
}
}

简介

WSO2 API Manager 是一个 API 管理平台,主要用于帮助企业发布、管理、监控、保护和分析其 API。简单来说,它是一个用来集中管理对外开放接口(API)的系统。

主要作用和用途:

✅ 1. API 网关功能

充当所有 API 请求的入口,提供统一接入点,做鉴权、限流、日志、安全控制等。

✅ 2. API 发布与文档

让后端开发者可以通过控制台把服务注册为 API,并给每个 API 添加说明文档、版本控制等。

✅ 3. 开发者门户(Dev Portal)

开发者可以登录门户网站,浏览、订阅、测试你提供的 API,就像逛 API 商店一样。

✅ 4. 流量控制(限流)

比如可以设置某个用户每分钟只能请求多少次,避免系统被滥用。

✅ 5. 安全控制

支持 OAuth2、JWT、Basic Auth 等认证方式,保证 API 安全。

✅ 6. API 分析与监控

可集成 Elasticsearch、Prometheus、Grafana 等,实现接口调用量、失败率、延迟等数据的可视化分析。

安装

docker run -it -p 8280:8280 -p 8243:8243 -p 9443:9443 --name api-manager wso2/wso2am:4.5.0-alpine

使用

默认用户名密码是admin/admin

| 名称 | 地址 | 作用 |
| --- | --- | --- |
| 发布者门户 | https://localhost:9443/publisher | 发布api到网关 |
| 系统配置门户 | https://localhost:9443/carbon | 管理配置界面 |
| 开发者门户 | https://localhost:9443/devportal | api的发现与订阅 |

参考

docker hub官方镜像地址

官方文档

容器化部署比较:

| 特性 | TensorFlow | PyTorch | PaddlePaddle |
| --- | --- | --- | --- |
| 镜像大小 | CPU 版本:1.2 GB - 1.5 GB;GPU 版本:2.5 GB - 3 GB | CPU 版本:1 GB - 1.5 GB;GPU 版本:2 GB - 2.5 GB | CPU 版本:1.5 GB - 2 GB;GPU 版本:3.5 GB - 4 GB |
| 镜像完善性 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 部署工具支持 | TensorFlow Serving,TFLite 等 | TorchServe,ONNX Runtime | Paddle Serving,Paddle Inference |
| 硬件适配 | GPU/TPU/CPU,多硬件兼容 | GPU/CPU,部分支持 TensorRT | GPU/CPU,国产硬件适配强 |
| 生态成熟度 | 全球广泛使用,企业支持完善 | 学术界主流,生产支持逐渐完善 | 国内企业支持度高,中文友好 |
| 学习曲线 | 较陡峭 | 较平缓 | 平缓,中文文档友好 |
| docker hub | tensorflow/tensorflow ⭐️2.7K ⏬50M+ | pytorch/pytorch ⭐️1.3K ⏬10M+ | paddlepaddle/paddle ⭐️126 ⏬500K+ |
#开发测试用
apt-get update
apt-get install -y --no-install-recommends git
apt-get install -y --no-install-recommends gcc g++ #运行时要用,不加执行python会报错
python3 -m pip install paddlepaddle #cpu
python3 -m pip install paddlepaddle-gpu #gpu 包含cpu

#python3 -m pip install -U pip setuptools

apt-get install -y software-properties-common
apt-get install -y nvidia-utils-560

git clone https://github.com/PaddlePaddle/PaddleDetection.git
cd PaddleDetection/
python3 -m pip install -r requirements.txt
python3 -m pip install scikit-learn

python3 deploy/pipeline/pipeline.py --config deploy/pipeline/config/infer_cfg_pphuman.yml --image_file=demo/000000014439.jpg

export CUDA_VISIBLE_DEVICES=0
python3 deploy/pipeline/pipeline.py --config deploy/pipeline/config/infer_cfg_pphuman.yml --device=GPU --image_file=demo/000000014439.jpg


python3 deploy/pipeline/pipeline.py --config deploy/pipeline/config/examples/infer_cfg_human_attr.yml --device=GPU --rtsp rtsp://admin:hcytech@2020@172.16.80.138:554/video1

python3 deploy/pipeline/pipeline.py --config deploy/pipeline/config/examples/infer_cfg_human_attr.yml --device=GPU --rtsp rtsp://admin:hcytech@2020@172.16.80.138:554/video1 --pushurl rtsp://172.16.10.202:8554/video1


python3 deploy/pipeline/pipeline.py --config deploy/pipeline/config/examples/infer_cfg_human_attr.yml --rtsp rtsp://admin:hcytech@2020@172.16.80.138:554/video1 --pushurl rtmp://172.16.10.102:30521/live/a

python3 deploy/pipeline/pipeline.py --config deploy/pipeline/config/examples/infer_cfg_human_attr.yml --rtsp rtsp://admin:hcytech@2020@172.16.80.138:554/video1 --pushurl rtsp://172.16.10.202:8554/video1

rtmp://172.16.10.102:30521/live/a

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
echo $LD_LIBRARY_PATH


ls /usr/lib | grep lib

root@cudadiy-5c4556d558-22lx2:/home/PaddleDetection# find / -name libcudnn.so*
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn.so.9
root@cudadiy-5c4556d558-22lx2:/home/PaddleDetection# find / -name libcublas.so*
/usr/local/cuda-12.6/targets/x86_64-linux/lib/libcublas.so.12
/usr/local/cuda-12.6/targets/x86_64-linux/lib/libcublas.so.12.6.4.1

ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0 /usr/lib/libcudnn.so
ln -s /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcublas.so.12.0.2.224 /usr/lib/libcublas.so
ldconfig


ExternalError: CUDNN error(3000), CUDNN_STATUS_NOT_SUPPORTED.

python3 tools/export_model.py -c configs/rec/PP-OCRv4/ch_PP-OCRv4_rec_hgnet.yml -o Global.pretrained_model=./ch_PP-OCRv4_rec_server_train Global.save_inference_dir=./inference/ch_PP-OCRv4_server_rec/


export PATH=/usr/local/cuda-12.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
RuntimeError: (PreconditionNotMet) Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion.
[Hint: cudnn_dso_handle should not be null.] (at /paddle/paddle/phi/backends/dynload/cudnn.cc:64)

https://developer.nvidia.com/rdp/cudnn-archive

ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:2.6.2-gpu-cuda12.0-cudnn8.9-trt8.6

harbor.hcytech.dev/ai/paddle:gpu

# FROM nvidia/cuda:11.8.0-base-ubuntu20.04 这个没有cudnn,用不了gpu
# FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 这个cudnn报错
# FROM nvidia/cuda:12.6.3-cudnn-runtime-ubuntu20.04
FROM nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu20.04

WORKDIR /home/PaddleDetection

# demo用于测试,非必要,后期可以删除
COPY demo /home/PaddleDetection/demo
# python代码目录
COPY deploy /home/PaddleDetection/deploy
COPY requirements.txt /home/PaddleDetection/

# 设置时区,避免安装过程中的交互
ENV TZ=Asia/Shanghai
# gcc g++ 在运行推理时需要,ffmpeg推流用
RUN apt-get update && apt-get install -y --no-install-recommends \
tzdata python3.10 python3-pip gcc g++ ffmpeg && \
ln -sf /usr/share/zoneinfo/$TZ /etc/localtime && \
echo $TZ > /etc/timezone && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# paddle环境
RUN python3 -m pip install --no-cache-dir paddlepaddle-gpu && \
python3 -m pip install --no-cache-dir -r requirements.txt && \
rm -rf /root/.cache
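
镜像构建与验证的一种做法如下(镜像名 paddledet:gpu 为假设值;--gpus all 需要宿主机已装好 NVIDIA 驱动和 nvidia-container-toolkit):

# 在 PaddleDetection 源码根目录下构建镜像
docker build -t paddledet:gpu .
# 用仓库自带的测试图验证 GPU 推理是否可用
docker run --rm --gpus all paddledet:gpu \
  python3 deploy/pipeline/pipeline.py \
  --config deploy/pipeline/config/infer_cfg_pphuman.yml \
  --device=GPU --image_file=demo/000000014439.jpg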

https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist

https://docs.nvidia.com/cuda/archive/12.0.1/cuda-quick-start-guide/index.html#id8

CUDA 12.0.1 :https://developer.nvidia.com/cuda-toolkit-archive

cuDNN 8.9.1 :https://developer.nvidia.com/rdp/cudnn-archive

FROM ubuntu:20.04

# 设置时区,避免安装tzdata过程中的交互
ENV TZ=Asia/Shanghai

RUN apt-get update && apt-get install -y --no-install-recommends \
        tzdata libxml2 && \
    sh cuda_12.0.1_525.85.12_linux.run --silent
  1. cuda_12.0.1_525.85.12_linux.run依赖libxml2,libxml2依赖tzdata,tzdata需要设置ENV TZ=Asia/Shanghai

    ./cuda-installer: error while loading shared libraries: libxml2.so.2: cannot open shared object file: No such file or directory
  2. gcc g++

    Failed to verify gcc version. See log at /var/log/cuda-installer.log for details.
  3. Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory

    root@cudadiy-59b97d6588-rpb2m:/usr/local/cuda/lib64# ls
    libcudart.so.12 libcudart.so.12.0.146
    root@cudadiy-59b97d6588-rpb2m:/usr/local/cuda/lib64# apt-get update && apt-get install -y libcublas-12-0 libcublas-dev-12-0
    root@cudadiy-59b97d6588-rpb2m:/usr/local/cuda/lib64# ls
    libcublas.so libcublas.so.12.0.2.224 libcublasLt.so.12 libcublasLt_static.a libcudart.so.12 libnvblas.so libnvblas.so.12.0.2.224
    libcublas.so.12 libcublasLt.so libcublasLt.so.12.0.2.224 libcublas_static.a libcudart.so.12.0.146 libnvblas.so.12 stubs

硬件进行隔离(硬隔离)

需要显卡支持 GPU 虚拟化(如 vGPU)或 MIG(Multi-Instance GPU)功能。

软件进行隔离和共享(软隔离)

硬件不支持可以采用。如:NVIDIA GeForce RTX 3050

方式一:Aliyun gpushare-scheduler-extender

特点:允许多个 Pod 共享同一个 GPU 的资源,通过限制显存使用实现隔离。

方式二:Time-Slicing(时间分片)

时间分片是一种通过 GPU 驱动实现的调度机制,允许多个进程按时间片轮流使用 GPU。

特点:串行执行,简单易用,隔离性强,资源利用率低,延迟增加。

适合场景:边缘计算、轻量级推理任务。

安装:

前提是已经安装k8s-device-plugin

  1. 添加如下配置,在kuboard管理界面的nvidia-device-plugin空间内创建configmap(也可以通过k8s命令创建,命令示例见本节末尾),名称为gpu-share-configs,key为: time-slicing.yaml,值如下:

    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true #重命名gpu资源名称,为了区分是不是共享gpu,eg:nvidia.com/gpu将被重命名为nvidia.com/gpu.shared,限制limits的时候就要用nvidia.com/gpu.shared
        failRequestsGreaterThanOne: true #限制一个容器只能请求一个gpu,这样所有容器都是公平的时间分配
        resources:
        - name: nvidia.com/gpu
          replicas: 10 #表示GPU资源被分成n*10个逻辑单元,n代表gpu个数
  2. 应用配置,在原来安装插件的命令上增加 --set config.default=time-slicing.yaml 和 --set config.name=gpu-share-configs,详细命令如下

    exxk@exxk:~$ sudo helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=time-slicing.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
    exxk@exxk:~$ sudo helm upgrade -i nvidia-device-discovery nvdp/gpu-feature-discovery \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=time-slicing.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
    # 验证是否成功
    exxk@exxk:~$ sudo kubectl describe node | grep nvidia.com
    nvidia.com/cuda.driver-version.full=560.35.03
    nvidia.com/cuda.driver-version.major=560
    nvidia.com/cuda.driver-version.minor=35
    nvidia.com/cuda.driver-version.revision=03
    nvidia.com/cuda.driver.major=560
    nvidia.com/cuda.driver.minor=35
    nvidia.com/cuda.driver.rev=03
    nvidia.com/cuda.runtime-version.full=12.6
    nvidia.com/cuda.runtime-version.major=12
    nvidia.com/cuda.runtime-version.minor=6
    nvidia.com/cuda.runtime.major=12
    nvidia.com/cuda.runtime.minor=6
    nvidia.com/gfd.timestamp=1734514101
    nvidia.com/gpu.compute.major=8
    nvidia.com/gpu.compute.minor=6
    nvidia.com/gpu.count=1
    nvidia.com/gpu.family=ampere
    nvidia.com/gpu.machine=Standard-PC-i440FX-PIIX-1996
    nvidia.com/gpu.memory=4096
    nvidia.com/gpu.mode=graphics
    nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3050-Laptop-GPU
    nvidia.com/gpu.replicas=10 #这里变成了10了
    nvidia.com/gpu.sharing-strategy=time-slicing #这里用时间分片了
    nvidia.com/mig.capable=false
    nvidia.com/mps.capable=false
    nvidia.com/vgpu.present=false
    nvidia.com/gpu: 0
    nvidia.com/gpu.shared: 10
    nvidia.com/gpu: 0
    nvidia.com/gpu.shared: 10
    nvidia.com/gpu 0 0
    nvidia.com/gpu.shared 0 0
  3. 使用,在部署容器时,如果要限制gpu使用,需要将nvidia.com/gpu修改为nvidia.com/gpu.shared

    spec:
      containers:
      - command:
        - tail
        - '-f'
        - /dev/null
        image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
        imagePullPolicy: IfNotPresent
        name: cuda
        resources:
          limits:
            nvidia.com/gpu.shared: '1' #默认为nvidia.com/gpu需改为nvidia.com/gpu.shared
  4. 经测试,配置了nvidia.com/gpu.shared限制的Pod总数超过10个后就无法再创建;不配置该限制则不受10个的数量限制。
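
上面第 1 步提到的“也可以通过 k8s 命令创建 configmap”,一种等价做法如下(假设 time-slicing.yaml 已保存到当前目录,kubeconfig 路径与前面 helm 命令一致):

sudo kubectl create configmap gpu-share-configs \
  -n nvidia-device-plugin \
  --from-file=time-slicing.yaml=./time-slicing.yaml \
  --kubeconfig /etc/rancher/k3s/k3s.yaml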

方式三:MPS

存在问题:运行业务时会提示gpu忙碌

MPS 是 NVIDIA 提供的 GPU 多进程共享服务,允许多个 CUDA 进程同时在 GPU 上执行。

特点:并行执行,高资源利用率,性能提升,可控分配,复杂性增加,隔离性较弱,兼容不是很强(使用前要验证CUDA内核是否支持MPS)。

适合场景:多任务训练、大规模高效推理。

安装:

前提是已经安装k8s-device-plugin

  1. 添加如下配置,在kuboard管理界面的nvidia-device-plugin空间内创建configmap(也可以通过k8s命令创建),名称为gpu-share-configs,key为: mps.yaml,值如下:

    version: v1
    sharing:
      mps:
        renameByDefault: true #重命名gpu资源名称,为了区分是不是共享gpu,eg:nvidia.com/gpu将被重命名nvidia.com/gpu.shared,限制limits的时候就要用nvidia.com/gpu.shared
        resources:
        - name: nvidia.com/gpu
          replicas: 10 #表示GPU总内存被分成n*10份,n代表gpu个数
  2. 应用配置,在原来安装插件的命令上增加 --set config.default=mps.yaml 和 --set config.name=gpu-share-configs,详细命令如下

    exxk@exxk:~$ sudo helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=mps.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
    exxk@exxk:~$ sudo helm upgrade -i nvidia-device-discovery nvdp/gpu-feature-discovery \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=mps.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
    # 验证是否成功
    exxk@exxk:~$ sudo kubectl describe node | grep nvidia.com
    nvidia.com/cuda.driver-version.full=560.35.03
    nvidia.com/cuda.driver-version.major=560
    nvidia.com/cuda.driver-version.minor=35
    nvidia.com/cuda.driver-version.revision=03
    nvidia.com/cuda.driver.major=560
    nvidia.com/cuda.driver.minor=35
    nvidia.com/cuda.driver.rev=03
    nvidia.com/cuda.runtime-version.full=12.6
    nvidia.com/cuda.runtime-version.major=12
    nvidia.com/cuda.runtime-version.minor=6
    nvidia.com/cuda.runtime.major=12
    nvidia.com/cuda.runtime.minor=6
    nvidia.com/gfd.timestamp=1734573325
    nvidia.com/gpu.compute.major=8
    nvidia.com/gpu.compute.minor=6
    nvidia.com/gpu.count=1
    nvidia.com/gpu.family=ampere
    nvidia.com/gpu.machine=Standard-PC-i440FX-PIIX-1996
    nvidia.com/gpu.memory=4096
    nvidia.com/gpu.mode=graphics
    nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3050-Laptop-GPU
    nvidia.com/gpu.replicas=10 #这里变成了10了
    nvidia.com/gpu.sharing-strategy=mps #这里用mps了
    nvidia.com/mig.capable=false
    nvidia.com/mps.capable=true
    nvidia.com/vgpu.present=false
    nvidia.com/gpu: 0
    nvidia.com/gpu.shared: 10
    nvidia.com/gpu: 0
    nvidia.com/gpu.shared: 10
    nvidia.com/gpu 0 0
    nvidia.com/gpu.shared 0 0
  3. 使用,在部署容器时,如果要限制gpu使用,需要将nvidia.com/gpu修改为nvidia.com/gpu.shared

    spec:
      containers:
      - command:
        - tail
        - '-f'
        - /dev/null
        image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
        imagePullPolicy: IfNotPresent
        name: cuda
        resources:
          limits:
            nvidia.com/gpu.shared: '1' #默认为nvidia.com/gpu需改为nvidia.com/gpu.shared
  4. 经测试,如果配置nvidia.com/gpu.shared,超过10个,就无法创建pod了。

方式四:IMEX

GPU 虚拟化方案,需要 GPU 硬件支持虚拟化,目前使用的显卡不支持,暂不考虑。