One-click Kubernetes setup on Alibaba Cloud with KubeKey (kk)

Multi-node installation with kk

Preparation

  1. Prepare an Auto Scaling group to manage the ECS virtual machines.

  2. Configure the ECS instance settings in the scaling group (for learning purposes, preemptible instances save cost).

    E.g. ecs.hfc6.large (ecs.c7a.large, an AMD type, also works): preemptible, 2 vCPU + 4 GiB, CentOS 7.9 64-bit. The instances can mount the same shared data disk for storing configuration data.

    Configure a certificate and log in with the .cer key file.

  3. Port requirements: open these in the security group. Since all nodes share one security group and its intra-group policy is set to allow internal traffic, no extra rules are actually needed.

    Service                                  Port         Protocol
    ssh                                      22           TCP
    etcd                                     2379-2380    TCP
    apiserver                                6443         TCP
    calico                                   9099-9100    TCP
    bgp                                      179          TCP
    nodeport                                 30000-32767  TCP
    master                                   10250-10258  TCP
    dns                                      53           TCP/UDP
    local-registry (offline installs only)   5000         TCP
    local-apt (offline installs only)        5080         TCP
    rpcbind (required when using NFS)        111          TCP
    ipip (Calico uses the IPIP protocol)     -            IPENCAP / IPIP
    metrics-server                           8443         TCP
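Before installing, the ports in the table can be spot-checked from another node. The sketch below is my own helper (not part of kk or the original guide); it uses bash's /dev/tcp and assumes the guide's example node IP 192.168.14.16.

```shell
#!/usr/bin/env bash
# check_port HOST PORT -> exit 0 if a TCP connection succeeds within 1 second.
check_port() {
  local host=$1 port=$2
  timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Spot-check a few control-plane ports against one node (IP is the guide's example).
for p in 22 2379 2380 6443 179; do
  if check_port 192.168.14.16 "$p"; then
    echo "port $p reachable"
  else
    echo "port $p NOT reachable"
  fi
done
```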

Installing Kubernetes only

yum update -y
yum install -y socat conntrack ebtables ipset
export KKZONE=cn
# Download kk manually and upload it to the server (downloading directly from the VM may fail)
./kk create config # generates the config file config-sample.yaml
# Edit config-sample.yaml: fill in the node info, switch from password to certificate login, and set address
# Upload the certificate and restrict it to read-only (mode 400)
chmod 400 k8s.cer
export KKZONE=cn
# Create the Kubernetes cluster
./kk create cluster -f config-kubernetes-1.23.7.yaml | tee kk.log

Installing Kubernetes + KubeSphere

yum update -y
./kk create config --with-kubesphere # generates the config file config-sample.yaml
export KKZONE=cn && ./kk create cluster -f config-kubesphere3.3.0-kubernetes1.23.7.yaml | tee kk.log

Adding/removing nodes

# Without the original deployment file, generate one from the running cluster;
# the default name is sample.yaml, renamed here to add-node.yaml
./kk create config --from-cluster
# Edit and complete the file (mainly node IPs, load balancer, etcd) and add the new node's entry
# Install dependencies on the new node
yum install -y socat conntrack ebtables ipset
# Add the node
./kk add nodes -f add-node.yaml
# Delete the node named node4
./kk delete node node4 -f add-node.yaml
# High availability scenario:
# 1. The node3 VM was deleted
# 2. A new node4 VM was added
# 3. In add-node.yaml, change the node list to node1, node2, node4 (remove node3's entry);
#    remember to update the etcd and master sections with the latest node names as well
# 4. Remove the failed member from etcd
export ETCDCTL_API=2;
export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-node1.pem';
export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-node1-key.pem';
export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';
# List members
/usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379,https://192.168.14.17:2379 member list
# Remove the member by its ID
/usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379,https://192.168.14.17:2379 member remove cab52e8fded2eefe
# 5. Re-run the add-nodes command
./kk add nodes -f add-node.yaml
# 6. Clean up the unreachable node; since it cannot be contacted, add
#    --delete-emptydir-data --ignore-daemonsets
kubectl cordon node3
kubectl drain node3 --force --ignore-daemonsets --delete-emptydir-data
kubectl delete node node3
# Verify the failed node is gone
kubectl get nodes
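The "remove the member by its ID" step can be scripted. The helper below is my own sketch (not part of kk); it assumes the etcdctl v2 `member list` line format `<id>: name=<node> peerURLs=... clientURLs=... isLeader=...`.

```shell
#!/usr/bin/env bash
# member_id NAME reads `etcdctl member list` output on stdin and prints the
# member ID whose name= field matches NAME.
member_id() {
  local name=$1
  awk -v n="name=${name}" '$2 == n { sub(":", "", $1); print $1 }'
}

# Usage against the live cluster (endpoints as in step 4 above):
#   /usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379 member list | member_id node3
```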

Adding/removing taints

# Add a taint to node1 with key key1, value value1, and effect NoSchedule. Only Pods
# with a matching toleration can then be scheduled onto node1.
kubectl taint nodes node1 key1=value1:NoSchedule
# Remove the taint
kubectl taint nodes node1 key1=value1:NoSchedule-
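For reference, a Pod that should still be schedulable onto node1 needs a matching toleration in its spec; a minimal fragment (illustrative, matching the taint above):

```yaml
# Pod spec fragment tolerating the key1=value1:NoSchedule taint
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
```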

Notes

  • Prefer editing the Deployment whenever possible; only edit the Pod spec directly when there is no Deployment to edit, because changes made to a Pod disappear when it restarts.

Common issues

  1. When installing a multi-node cluster, kk reports the following error:

    [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s [kubelet-check] Initial timeout of 40s passed.

    Fix: the AMD host was the problem; switching to a host with an Intel chip worked. A newer fix is to run yum update -y before installing.

  2. After a node went down and a replacement was added, the kk add-nodes command reports the following error:

    etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-node1.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-node1-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379,https://192.168.14.17:2379,https://192.168.14.20:2379 cluster-health | grep -q 'cluster is healthy'

    Fix: remove the failed member from etcd first, then re-run kk to add the etcd and master node.

  3. Running kubectl drain node3 --force --ignore-daemonsets --delete-emptydir-data to remove a node reports the following error:

    I0705 16:41:20.004286 18301 request.go:665] Waited for 1.14877279s due to client-side throttling, not priority and fairness, request: GET:https://lb.kubesphere.local:6443/api/v1/namespaces/kubesphere-monitoring-system/pods/alertmanager-main-1

    Fix: cancel the drain and just run kubectl delete node node3.

  4. After rebuilding a node, Pods cannot be scheduled onto the new node, with the following error:

    0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Insufficient cpu.

    Fix: inspect the taints on the cluster nodes, then remove the master taint with kubectl taint nodes node4 node-role.kubernetes.io/master=:NoSchedule-

  5. After rebuilding a node, the prometheus-k8s monitoring component's container events show the following error:

    MountVolume.NewMounter initialization failed for volume "pvc-60891ee0-ba6c-4df4-b381-6e542b27d3a7" : path "/var/openebs/local/pvc-60891ee0-ba6c-4df4-b381-6e542b27d3a7" does not exist

    Attempted fix (run on the master node). Note: the method below did not actually resolve the issue; whether the storage volume is distributed remains to be verified:

    # In /etc/kubernetes/manifests/kube-apiserver.yaml:
    # spec:
    #   containers:
    #   - command:
    #     - kube-apiserver
    #     - --feature-gates=RemoveSelfLink=false   # add this line
    vim /etc/kubernetes/manifests/kube-apiserver.yaml
    # Apply the configuration
    kubectl apply -f /etc/kubernetes/manifests/kube-apiserver.yaml
  6. Installing KubeSphere on an AMD host hangs at Please wait for the installation to complete:. Checking pod logs shows the calico-node-4hgbb pod reporting the following error:

    Type     Reason     Age                    From     Message
    ----     ------     ----                   ----     -------
    Warning  Unhealthy  3m18s (x440 over 66m)  kubelet  (combined from similar events): Readiness probe failed: 2022-07-06 02:39:53.164 [INFO][4974] confd/health.go 180: Number of node(s) with BGP peering established = 2
    calico/node is not ready: felix is not ready: readiness probe reporting 503

    Reference: kubernetes v1.24.0 install failed with calico node not ready #1282

    Fix: resolved by changing the calico version; calico should probably be updated from v3.20.0 to v3.23.0:

    # Delete all calico-related resources
    kubectl -n kube-system get all | grep calico | awk '{print $1}' | xargs kubectl -n kube-system delete
    # Fetch the newer v3.23 manifest
    wget https://docs.projectcalico.org/archive/v3.23/manifests/calico.yaml
    # Reinstall calico
    kubectl apply -f calico.yaml
    # Calico recovers, but re-running kk afterwards breaks it again. Do not edit the Pod
    # spec directly; edit the Deployment instead.
    # The final fix was to run this before installing the cluster:
    yum update -y
  7. System Components -> Monitoring -> prometheus-k8s -> Events shows the error: 0/3 nodes are available: 3 Insufficient cpu.

    Fix: edit Workloads -> StatefulSets -> prometheus-k8s.

    Summary: requests.cpu: 0.5 means half of one CPU; 0.5 is equivalent to 500m, read as "500 millicpu" (five hundred millicores).

    Official docs: Resource units in Kubernetes

    # These values must be re-applied after each restart
    containers:
    - name: prometheus
      resources:
        limits:
          cpu: '4'
          memory: 16Gi
        requests:
          cpu: 200m     # change to 20m
          memory: 400Mi # change to 40Mi
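The cores-to-millicores relationship described above can be sketched as a tiny shell helper (my own, purely illustrative):

```shell
#!/usr/bin/env bash
# to_millicores CORES prints the equivalent Kubernetes millicpu value.
to_millicores() { awk -v c="$1" 'BEGIN { printf "%.0fm\n", c * 1000 }'; }

to_millicores 0.5   # 500m, i.e. half a CPU
to_millicores 0.02  # 20m, the reduced request used above
```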
  8. kubectl top node reports error: Metrics API not available.

    Fix: if KubeSphere is not yet installed, edit its deployment config file; if already installed, log in to KubeSphere and edit CRDs -> ClusterConfiguration -> ks-installer:

    metrics_server:
      enabled: false  # set to true
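Once metrics-server is enabled, `kubectl top node` works again. As an illustration, a small helper (my own sketch, assuming kubectl's standard column layout `NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%`) can flag nodes above a CPU threshold:

```shell
#!/usr/bin/env bash
# high_cpu_nodes THRESHOLD reads `kubectl top node` output on stdin and prints
# the names of nodes whose CPU% exceeds THRESHOLD.
high_cpu_nodes() {
  awk -v t="$1" 'NR > 1 { sub(/%/, "", $3); if ($3 + 0 > t) print $1 }'
}

# Usage: kubectl top node | high_cpu_nodes 80
```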
  9. calico/node is not ready: felix is not ready: readiness probe reporting 503

    Trying again, the error becomes: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused

  10. A case of Error from server (BadRequest): container "calico-node" in pod "calico-node-bfmgs" is waiting to start: PodInitializing

    Fix and troubleshooting process (reference: Kubernetes Installation Tutorial: Kubespray):

    [root@master ~]# kubectl get nodes
    master Ready control-plane 158d v1.26.4
    node-102 NotReady control-plane 42h v1.26.4
    ....
    [root@master ~]# kubectl get pods -o wide -n kube-system
    calico-node-2zbtg 0/1 Init:CrashLoopBackOff
    [root@master ~]# kubectl describe pod -n kube-system calico-node-2zbtg
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning BackOff 15s (x7 over 79s) kubelet Back-off restarting failed container install-cni in pod calico-node-2zbtg_kube-system(337004cf-9136-48ac-bc6b-eb897bd2c806)
    [root@master ~]# kubectl logs -n kube-system calico-node-2zbtg
    Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), flexvol-driver (init)
    Error from server (BadRequest): container "calico-node" in pod "calico-node-2zbtg" is waiting to start: PodInitializing
    # --------- deleting the pod just makes a replacement start immediately --------------------
    [root@master ~]# kubectl delete pod -n kube-system calico-node-2zbtg
    # --------- on the affected node, locate the dead container by its timestamp ---------------
    [root@node-102 ~]# crictl ps -a
    fc19864603510 628dd70880410 About a minute ago Exited install-cni
    [root@node-102 ~]# crictl logs fc19864603510
    time="2024-01-14T12:26:57Z" level=info msg="Running as a Kubernetes pod" source="install.go:145"
    2024-01-14 12:26:57.761 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
    .....
    2024-01-14 12:26:57.964 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.24.5

    2024-01-14 12:26:57.964 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
    W0114 12:26:57.964140 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
    time="2024-01-14T12:26:57Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG_FILE" source="install.go:340"
    time="2024-01-14T12:26:57Z" level=fatal msg="open /host/etc/cni/net.d/calico.conflist.template: no such file or directory" source="install.go:344"
    # ---- the precise error: open /host/etc/cni/net.d/calico.conflist.template: no such file or directory
    # ---- the file is missing, so copy one over from the master node
    [root@node-102 ~]# cd /etc/cni/net.d/
    [root@node-102 net.d]# ls
    calico-kubeconfig
    # --------- after copying the files to the broken node, delete the pod to speed up the restart ----
    [root@master net.d]# kubectl delete pod calico-node-bfmgs -n kube-system
    # --------- on the next check the files are present, the node joined, and all pods are healthy ----
    [root@node-102 net.d]# ls
    10-calico.conflist calico.conflist.template calico-kubeconfig
  11. Starting a workload reports Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5bc537d4925604f98d12ec576b90eeee0534402c6fb32fc31920a763051e6589": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized

    Fix: the server clock was wrong, which broke authorization. Find the node whose time is off, resync its clock, then restart that node's calico pod:

    [root@node-102 ~]# timedatectl
          Local time: Mon 2024-01-15 20:33:23 CST
      Universal time: Mon 2024-01-15 12:33:23 UTC
            RTC time: Mon 2024-01-15 06:43:54
           Time zone: Asia/Shanghai (CST, +0800)
         NTP enabled: yes
    NTP synchronized: no
     RTC in local TZ: no
          DST active: n/a
    [root@node-102 ~]# chronyc makestep
    200 OK
    [root@master ~]# kubectl get pods -o wide -n kube-system
    calico-node-rlxg5 1/1 Running 0 26h 172.16.10.192 node-192
    [root@master ~]# kubectl delete pod -n kube-system calico-node-rlxg5
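The root cause above (clock skew breaking certificate-based auth) can be checked automatically. This is my own sketch, not from the original notes; it compares a remote node's Unix epoch against the local clock.

```shell
#!/usr/bin/env bash
# skew_ok REMOTE_EPOCH [TOLERANCE_SECONDS] -> exit 0 if the remote clock is
# within TOLERANCE_SECONDS (default 5) of the local clock.
skew_ok() {
  local remote=$1 tol=${2:-5} now diff
  now=$(date -u +%s)
  diff=$(( now - remote ))
  diff=${diff#-}          # absolute value
  [ "$diff" -le "$tol" ]
}

# Usage (hypothetical host): skew_ok "$(ssh root@node-102 date -u +%s)" || echo "node-102 clock is off"
```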

Appendix

config-kubernetes-1.23.7.yaml deployment file

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: node1, address: 192.168.14.1, internalAddress: 192.168.14.1, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node2, address: 192.168.14.2, internalAddress: 192.168.14.2, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node3, address: 192.168.14.3, internalAddress: 192.168.14.3, user: root, privateKeyPath: "/exxk/k8s.cer"}
  roleGroups:
    etcd:
    - node[1:3]
    master:
    - node[1:3]
    worker:
    - node[1:3]
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers
    # internalLoadbalancer: haproxy

    domain: lb.kubesphere.local
    address: "192.168.14.1"
    port: 6443
  kubernetes:
    version: v1.23.7
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []

config-kubesphere3.3.0-kubernetes1.23.7.yaml deployment file

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: node1, address: 192.168.14.16, internalAddress: 192.168.14.16, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node2, address: 192.168.14.17, internalAddress: 192.168.14.17, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node3, address: 192.168.14.18, internalAddress: 192.168.14.18, user: root, privateKeyPath: "/exxk/k8s.cer"}
  roleGroups:
    etcd:
    - node[1:3]
    control-plane:
    - node[1:3]
    worker:
    - node[1:3]
  controlPlaneEndpoint:
    # Internal loadbalancer for apiservers
    # internalLoadbalancer: haproxy
    domain: lb.kubesphere.local
    address: "192.168.14.16"
    port: 6443
  kubernetes:
    version: v1.23.7
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []

---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.3.0
spec:
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  zone: ""
  local_registry: ""
  namespace_override: ""
  # dev_tag: ""
  etcd:
    monitoring: false
    endpointIps: localhost
    port: 2379
    tlsEnable: true
  common:
    core:
      console:
        enableMultiLogin: true
        port: 30880
        type: NodePort
    # apiserver:
    #   resources: {}
    # controllerManager:
    #   resources: {}
    redis:
      enabled: false
      volumeSize: 2Gi
    openldap:
      enabled: false
      volumeSize: 2Gi
    minio:
      volumeSize: 20Gi
    monitoring:
      # type: external
      endpoint: http://prometheus-operated.kubesphere-monitoring-system.svc:9090
      GPUMonitoring:
        enabled: false
    gpu:
      kinds:
      - resourceName: "nvidia.com/gpu"
        resourceType: "GPU"
        default: true
    es:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      logMaxAge: 7
      elkPrefix: logstash
      basicAuth:
        enabled: false
        username: ""
        password: ""
      externalElasticsearchHost: ""
      externalElasticsearchPort: ""
  alerting:
    enabled: false
    # thanosruler:
    #   replicas: 1
    #   resources: {}
  auditing:
    enabled: false
    # operator:
    #   resources: {}
    # webhook:
    #   resources: {}
  devops:
    enabled: false
    # resources: {}
    jenkinsMemoryLim: 2Gi
    jenkinsMemoryReq: 1500Mi
    jenkinsVolumeSize: 8Gi
    jenkinsJavaOpts_Xms: 1200m
    jenkinsJavaOpts_Xmx: 1600m
    jenkinsJavaOpts_MaxRAM: 2g
  events:
    enabled: false
    # operator:
    #   resources: {}
    # exporter:
    #   resources: {}
    # ruler:
    #   enabled: true
    #   replicas: 2
    #   resources: {}
  logging:
    enabled: false
    logsidecar:
      enabled: true
      replicas: 2
      # resources: {}
  metrics_server:
    enabled: false
  monitoring:
    storageClass: ""
    node_exporter:
      port: 9100
      # resources: {}
    # kube_rbac_proxy:
    #   resources: {}
    # kube_state_metrics:
    #   resources: {}
    # prometheus:
    #   replicas: 1
    #   volumeSize: 20Gi
    #   resources: {}
    # operator:
    #   resources: {}
    # alertmanager:
    #   replicas: 1
    #   resources: {}
    # notification_manager:
    #   resources: {}
    # operator:
    #   resources: {}
    # proxy:
    #   resources: {}
    gpu:
      nvidia_dcgm_exporter:
        enabled: false
        # resources: {}
  multicluster:
    clusterRole: none
  network:
    networkpolicy:
      enabled: false
    ippool:
      type: none
    topology:
      type: none
  openpitrix:
    store:
      enabled: false
  servicemesh:
    enabled: false
    istio:
      components:
        ingressGateways:
        - name: istio-ingressgateway
          enabled: false
        cni:
          enabled: false
  edgeruntime:
    enabled: false
    kubeedge:
      enabled: false
      cloudCore:
        cloudHub:
          advertiseAddress:
          - ""
        service:
          cloudhubNodePort: "30000"
          cloudhubQuicNodePort: "30001"
          cloudhubHttpsNodePort: "30002"
          cloudstreamNodePort: "30003"
          tunnelNodePort: "30004"
        # resources: {}
        # hostNetWork: false
      iptables-manager:
        enabled: true
        mode: "external"
        # resources: {}
      # edgeService:
      #   resources: {}
  terminal:
    timeout: 600

add-node.yaml

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  ## You should complete the ssh information of the hosts
  - {name: node1, address: 192.168.14.16, internalAddress: 192.168.14.16, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node2, address: 192.168.14.17, internalAddress: 192.168.14.17, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node3, address: 192.168.14.18, internalAddress: 192.168.14.18, user: root, privateKeyPath: "/exxk/k8s.cer"}
  - {name: node4, address: 192.168.14.19, internalAddress: 192.168.14.19, user: root, privateKeyPath: "/exxk/k8s.cer"}
  roleGroups:
    etcd:
    - node[1:3]
    master:
    - node1
    - node2
    - node3
    worker:
    - node1
    - node2
    - node3
    - node4
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers
    # internalLoadbalancer: haproxy

    ## If the external loadbalancer was used, 'address' should be set to loadbalancer's ip.
    domain: lb.kubesphere.local
    address: "192.168.14.16"
    port: 6443
  kubernetes:
    version: v1.23.7
    clusterName: cluster.local
    proxyMode: ipvs
    masqueradeAll: false
    maxPods: 110
    nodeCidrMaskSize: 24
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
  registry:
    privateRegistry: ""