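Problem description

A strange problem: yesterday the k8s cluster was running fine, but today I suddenly found CoreDNS in CrashLoopBackOff, which seemed to come out of nowhere.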

[root@k8s-master ~]# kubectl get pod --all-namespaces
NAMESPACE     NAME                                 READY   STATUS             RESTARTS   AGE
demo          echo-8467949b65-mt4vk                1/1     Running            1          26h
demo          mysql-kllxq                          1/1     Running            0          24h
demo          myweb-skz97                          1/1     Running            0          25h
demo          myweb-wkh5v                          1/1     Running            0          25h
kong          ingress-kong-6c5ccb454d-2bdwd        2/2     Running            2          26h
kube-system   coredns-6d56c8448f-fnrhj             0/1     CrashLoopBackOff   277        10d
kube-system   coredns-6d56c8448f-rdp7k             0/1     CrashLoopBackOff   291        10d
kube-system   etcd-k8s-master                      1/1     Running            1          10d
kube-system   kube-apiserver-k8s-master            1/1     Running            1          10d
kube-system   kube-controller-manager-k8s-master   1/1     Running            1          10d
kube-system   kube-proxy-qjpz9                     1/1     Running            1          10d
kube-system   kube-proxy-zjfct                     1/1     Running            1          10d
kube-system   kube-scheduler-k8s-master            1/1     Running            1          10d
kube-system   weave-net-4drlg                      2/2     Running            4          10d
kube-system   weave-net-p4ssv                      2/2     Running            3          10d

I checked the system log and found errors.

[root@k8s-master ~]# journalctl -f
-- Logs begin at Mon 2020-09-28 04:26:34 EDT. --
Oct 09 05:49:37 k8s-master kubelet[9525]: E1009 05:49:37.981916    9525 pod_workers.go:191] Error syncing pod 26d77ac4-b3ae-44a7-985b-0e09d1f56db1 ("coredns-6d56c8448f-fnrhj_kube-system(26d77ac4-b3ae-44a7-985b-0e09d1f56db1)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "back-off 5m0s restarting failed container=coredns pod=coredns-6d56c8448f-fnrhj_kube-system(26d77ac4-b3ae-44a7-985b-0e09d1f56db1)"
Oct 09 05:49:45 k8s-master kubelet[9525]: E1009 05:49:45.160995    9525 summary_sys_containers.go:47] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Oct 09 05:49:46 k8s-master kubelet[9525]: I1009 05:49:46.980815    9525 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: d88f62784988c941e36dcf2560d419dac25affb60cf4ec0b7685aaab25ccffb6
Oct 09 05:49:46 k8s-master kubelet[9525]: E1009 05:49:46.981215    9525 pod_workers.go:191] Error syncing pod d57c2e1a-9218-41db-94ea-a9138f9aece9 ("coredns-6d56c8448f-rdp7k_kube-system(d57c2e1a-9218-41db-94ea-a9138f9aece9)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "back-off 5m0s restarting failed container=coredns pod=coredns-6d56c8448f-rdp7k_kube-system(d57c2e1a-9218-41db-94ea-a9138f9aece9)"
Oct 09 05:49:48 k8s-master kubelet[9525]: I1009 05:49:48.981132    9525 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: cde9c6695350926627609e9b551d1030e6e0fe0db5a757ff3505a0ae1c204909
Oct 09 05:49:48 k8s-master kubelet[9525]: E1009 05:49:48.981908    9525 pod_workers.go:191] Error syncing pod 26d77ac4-b3ae-44a7-985b-0e09d1f56db1 ("coredns-6d56c8448f-fnrhj_kube-system(26d77ac4-b3ae-44a7-985b-0e09d1f56db1)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "back-off 5m0s restarting failed container=coredns pod=coredns-6d56c8448f-fnrhj_kube-system(26d77ac4-b3ae-44a7-985b-0e09d1f56db1)"
Oct 09 05:49:55 k8s-master kubelet[9525]: E1009 05:49:55.172786    9525 summary_sys_containers.go:47] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Oct 09 05:49:57 k8s-master kubelet[9525]: I1009 05:49:57.980694    9525 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: d88f62784988c941e36dcf2560d419dac25affb60cf4ec0b7685aaab25ccffb6
Oct 09 05:49:57 k8s-master kubelet[9525]: E1009 05:49:57.981070    9525 pod_workers.go:191] Error syncing pod d57c2e1a-9218-41db-94ea-a9138f9aece9 ("coredns-6d56c8448f-rdp7k_kube-system(d57c2e1a-9218-41db-94ea-a9138f9aece9)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "back-off 5m0s restarting failed container=coredns pod=coredns-6d56c8448f-rdp7k_kube-system(d57c2e1a-9218-41db-94ea-a9138f9aece9)"
Oct 09 05:49:59 k8s-master kubelet[9525]: I1009 05:49:59.981047    9525 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: cde9c6695350926627609e9b551d1030e6e0fe0db5a757ff3505a0ae1c204909
Oct 09 05:49:59 k8s-master kubelet[9525]: E1009 05:49:59.981823    9525 pod_workers.go:191] Error syncing pod 26d77ac4-b3ae-44a7-985b-0e09d1f56db1 ("coredns-6d56c8448f-fnrhj_kube-system(26d77ac4-b3ae-44a7-985b-0e09d1f56db1)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "back-off 5m0s restarting failed container=coredns pod=coredns-6d56c8448f-fnrhj_kube-system(26d77ac4-b3ae-44a7-985b-0e09d1f56db1)"
Oct 09 05:50:05 k8s-master kubelet[9525]: E1009 05:50:05.182903    9525 summary_sys_containers.go:47] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"

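Huh, what is this? After some googling, it turns out that when kubelet starts it collects node resource statistics, and this requires the corresponding accounting options to be enabled in systemd:

  • CPUAccounting: whether CPU usage accounting is enabled for the unit; a boolean, set to true or false.
  • MemoryAccounting: whether memory usage accounting is enabled for the unit; a boolean, set to true or false.

If these two options are not set, kubelet cannot collect those statistics and keeps logging the "Failed to get system container stats" error shown above.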
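
To check whether accounting is currently enabled for a unit, the properties can be read with systemctl show, which prints booleans as yes or no. The following is a minimal sketch rather than a definitive check: the helper name accounting_enabled is hypothetical, and it only parses Key=value lines passed on stdin.

```shell
#!/bin/sh
# accounting_enabled: hypothetical helper, not part of systemd or kubelet.
# Reads "Key=value" lines on stdin (the format printed by `systemctl show`)
# and succeeds only if both CPUAccounting and MemoryAccounting are "yes".
accounting_enabled() {
    awk -F= '
        $1 == "CPUAccounting"    && $2 == "yes" { cpu = 1 }
        $1 == "MemoryAccounting" && $2 == "yes" { mem = 1 }
        END { exit !(cpu && mem) }
    '
}

# Usage on a node (requires systemd):
#   systemctl show kubelet -p CPUAccounting -p MemoryAccounting | accounting_enabled \
#       && echo "accounting enabled" || echo "accounting disabled"
```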

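Solution

The fix is simple: edit the systemd configuration file for the kubelet service and add the CPU and memory accounting settings, following the steps below.

(1) Edit the configuration file

Enable the corresponding options, CPUAccounting and MemoryAccounting.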

[root@k8s-master kubelet.service.d]# vim /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
[root@k8s-master kubelet.service.d]# cat /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
# tianxiaoyong@baosight.com 2020-10-09 Fixed
# Add CPUAccounting=true to enable systemd CPU accounting
CPUAccounting=true
# Add MemoryAccounting=true to enable systemd memory accounting
MemoryAccounting=true
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/sysconfig/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
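
As an alternative to editing the packaged 10-kubeadm.conf directly (a file that a package upgrade may overwrite), the same two settings could live in a separate drop-in file. This is only a sketch under assumptions: the helper name write_accounting_dropin and the file name 20-accounting.conf are hypothetical, and /etc/systemd/system/kubelet.service.d is assumed to be the administrator drop-in directory on this host.

```shell
#!/bin/sh
# write_accounting_dropin: hypothetical helper that writes a systemd drop-in
# enabling CPU and memory accounting into the directory given as $1.
# The file name 20-accounting.conf is an arbitrary choice for illustration.
write_accounting_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    # systemd only treats "#" as a comment at the start of a line,
    # so the settings are written without trailing comments.
    cat > "$dropin_dir/20-accounting.conf" <<'EOF'
[Service]
CPUAccounting=true
MemoryAccounting=true
EOF
}

# Usage on a node (as root), followed by a daemon-reload and kubelet restart:
#   write_accounting_dropin /etc/systemd/system/kubelet.service.d
```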

(2) Restart the service

Restart the kubelet service so that kubelet picks up the new configuration.

[root@k8s-master kubelet.service.d]# systemctl daemon-reload
[root@k8s-master kubelet.service.d]# systemctl restart kubelet

(3) Verify the change

[root@k8s-master kubelet.service.d]# journalctl -u kubelet -n 10
-- Logs begin at Mon 2020-09-28 04:26:34 EDT, end at Fri 2020-10-09 06:32:55 EDT. --
Oct 09 06:32:54 k8s-master kubelet[9095]: I1009 06:32:54.534617    9095 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "config-volume" (UniqueName: "kubernetes.io/co
Oct 09 06:32:54 k8s-master kubelet[9095]: I1009 06:32:54.534640    9095 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "coredns-token-8ppmx" (UniqueName: "kubernetes
Oct 09 06:32:54 k8s-master kubelet[9095]: W1009 06:32:54.534705    9095 empty_dir.go:453] Warning: Failed to clear quota on /var/lib/kubelet/pods/d57c2e1a-9218-41db-94ea-a9138f9aece9/volumes/kubernetes.io~
Oct 09 06:32:54 k8s-master kubelet[9095]: I1009 06:32:54.534892    9095 operation_generator.go:788] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/d57c2e1a-9218-41db-94ea-a9138f9aece9
Oct 09 06:32:54 k8s-master kubelet[9095]: I1009 06:32:54.544869    9095 operation_generator.go:788] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/d57c2e1a-9218-41db-94ea-a9138f9aece9-co
Oct 09 06:32:54 k8s-master kubelet[9095]: I1009 06:32:54.634838    9095 reconciler.go:319] Volume detached for volume "coredns-token-8ppmx" (UniqueName: "kubernetes.io/secret/d57c2e1a-9218-41db-94ea-a9138f
Oct 09 06:32:54 k8s-master kubelet[9095]: I1009 06:32:54.634860    9095 reconciler.go:319] Volume detached for volume "config-volume" (UniqueName: "kubernetes.io/configmap/d57c2e1a-9218-41db-94ea-a9138f9ae
Oct 09 06:32:54 k8s-master kubelet[9095]: W1009 06:32:54.987640    9095 pod_container_deletor.go:79] Container "5f7afdf7f0996eba93abe2c01aa5709a08bb502499e091732bdaf41adab3d804" not found in pod's containe
Oct 09 06:32:55 k8s-master kubelet[9095]: weave-cni: error removing interface "eth0": no such file or directory
Oct 09 06:32:55 k8s-master kubelet[9095]: I1009 06:32:55.996961    9095 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: b88f3f0e593d537ad341670468d80fed6616eb83f693ee77fecb3ba00e

As you can see, the "Failed to get system container stats" error no longer appears in the log.

[root@k8s-master kubelet.service.d]# kubectl get pod -n kube-system
NAME                                 READY   STATUS             RESTARTS   AGE
coredns-6d56c8448f-fnrhj             0/1     CrashLoopBackOff   289        10d
coredns-6d56c8448f-rdp7k             0/1     CrashLoopBackOff   303        10d
etcd-k8s-master                      1/1     Running            1          10d
kube-apiserver-k8s-master            1/1     Running            1          10d
kube-controller-manager-k8s-master   1/1     Running            1          10d
kube-proxy-qjpz9                     1/1     Running            1          10d
kube-proxy-zjfct                     1/1     Running            1          10d
kube-scheduler-k8s-master            1/1     Running            1          10d
weave-net-4drlg                      2/2     Running            4          10d
weave-net-p4ssv                      2/2     Running            3          10d

Damn, why is it still CrashLoopBackOff? Let's look at the logs.

[root@k8s-master ~]# kubectl logs coredns-8686dcc4fd-8xh55 -n kube-system
.:53
2020-10-09T03:01:48.526Z [INFO] CoreDNS-1.3.1
2020-10-09T03:01:48.526Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2020-10-09T03:01:48.526Z [INFO] plugin/reload: Running configuration MD5 = 599b9eb76b8c147408aed6a0bbe0f669
2020-10-09T03:01:54.529Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:42377->223.5.5.5:53: i/o timeout
2020-10-09T03:01:57.529Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:60887->114.114.114.114:53: i/o timeout
2020-10-09T03:01:59.530Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:56568->114.114.114.114:53: i/o timeout
2020-10-09T03:02:00.531Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:53437->114.114.114.114:53: i/o timeout
2020-10-09T03:02:02.531Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:51351->114.114.114.114:53: i/o timeout
2020-10-09T03:02:05.533Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:56134->114.114.114.114:53: i/o timeout
2020-10-09T03:02:08.535Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:36700->114.114.114.114:53: i/o timeout
2020-10-09T03:02:11.537Z [ERROR] plugin/errors: 2 6215460374710620394.9107374617267187049. HINFO: read udp 10.32.1.2:60315->223.5.5.5:53: i/o timeout
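
The i/o timeout lines above show CoreDNS failing to get answers from the upstream resolvers it forwards to (here 223.5.5.5 and 114.114.114.114). To see at a glance which upstreams are timing out, the log can be summarized per address. A rough sketch, assuming the log line format shown above; the helper name count_upstream_timeouts is hypothetical.

```shell
#!/bin/sh
# count_upstream_timeouts: hypothetical helper, not part of CoreDNS or kubectl.
# Reads CoreDNS log lines on stdin and prints, for each upstream address that
# appears in a "->ADDR:PORT: i/o timeout" message, how many timeouts it had,
# most frequent first.
count_upstream_timeouts() {
    grep 'i/o timeout' \
        | sed -n 's/.*->\([0-9.]*\):[0-9]*: i\/o timeout.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (pod name is a placeholder):
#   kubectl logs <coredns-pod> -n kube-system | count_upstream_timeouts
```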

Let's delete the affected Pods so that kube-controller-manager recreates them, and see what happens.

[root@k8s-master kubelet.service.d]# kubectl delete pod coredns-6d56c8448f-fnrhj -n kube-system
pod "coredns-6d56c8448f-fnrhj" deleted
[root@k8s-master kubelet.service.d]# kubectl delete pod coredns-6d56c8448f-rdp7k -n kube-system
pod "coredns-6d56c8448f-rdp7k" deleted
[root@k8s-master kubelet.service.d]# kubectl get pod -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE
coredns-6d56c8448f-cnnfm             1/1     Running   0          25s
coredns-6d56c8448f-s9nbh             1/1     Running   0          11s
etcd-k8s-master                      1/1     Running   1          10d
kube-apiserver-k8s-master            1/1     Running   1          10d
kube-controller-manager-k8s-master   1/1     Running   1          10d
kube-proxy-qjpz9                     1/1     Running   1          10d
kube-proxy-zjfct                     1/1     Running   1          10d
kube-scheduler-k8s-master            1/1     Running   1          10d
weave-net-4drlg                      2/2     Running   4          10d
weave-net-p4ssv                      2/2     Running   3          10d
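
Instead of deleting each Pod by its generated name, the CoreDNS Pods can be selected by label. A small sketch, assuming the Pods carry the k8s-app=kube-dns label that kubeadm-based installs normally apply (worth confirming with kubectl get pod -n kube-system --show-labels); the helper name restart_coredns is hypothetical.

```shell
#!/bin/sh
# restart_coredns: hypothetical helper.
# Deletes every Pod in kube-system labelled k8s-app=kube-dns so that the
# Deployment's controller recreates them. The label is an assumption about
# how CoreDNS was deployed on this cluster.
restart_coredns() {
    kubectl -n kube-system delete pod -l k8s-app=kube-dns
}

# Usage:
#   restart_coredns
#   kubectl get pod -n kube-system -l k8s-app=kube-dns
```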

Done.

Last modified: November 8, 2020, 12:28 PM