To run stateful services, we deployed a dedicated Ceph block storage cluster for Kubernetes. This post documents how we connected the k8s cluster to that external Ceph cluster and the problems we hit along the way. There were quite a few pitfalls, but fortunately all of them were resolved.

Environment Preparation

The k8s and Ceph environments we use are described in:
https://blog.51cto.com/leejia/2495558
https://blog.51cto.com/leejia/2499684

Static Persistent Volumes

With static provisioning, every time storage space is needed a storage administrator first has to create the corresponding image on the Ceph side manually before k8s can use it.

Create the Ceph secret

We need to add a secret to k8s for accessing Ceph, which is mainly used when k8s maps RBD images.
1. On the Ceph master node, run the following command to get the admin key, base64-encoded (in production you may want to create a dedicated user for k8s instead):

# ceph auth get-key client.admin | base64
QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==

2. Create the secret in k8s via a manifest:

# vim ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==

# kubectl apply -f ceph-secret.yaml

Create an image

By default, Ceph ships with a default pool named rbd. Run the following commands on a host with the Ceph client installed, or directly on the Ceph master node, to create an image:

# rbd create image1 -s 1024
# rbd info rbd/image1
rbd image 'image1':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.374d6b8b4567
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:

Create a persistent volume

Create it on k8s via a manifest:

# vim pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  rbd:
    monitors:
      - 172.18.2.172:6789
      - 172.18.2.178:6789
      - 172.18.2.189:6789
    pool: rbd
    image: image1
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
  persistentVolumeReclaimPolicy: Retain

# kubectl apply -f pv.yaml
persistentvolume/ceph-pv created

# kubectl get pv
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
ceph-pv   1Gi        RWO,ROX        Retain           Available                                   76s

The main fields are explained below:
1. accessModes:

RWO (ReadWriteOnce): the volume can be mounted read-write by a single node;
ROX (ReadOnlyMany): the volume can be mounted read-only by many nodes;
RWX (ReadWriteMany): the volume can be mounted read-write by many nodes.

2. fsType

If the PersistentVolume's volumeMode is Filesystem, this field specifies the filesystem to use when mounting the volume. If the volume has not been formatted yet and formatting is supported, this value is used to format it.

3. persistentVolumeReclaimPolicy:

There are three reclaim policies (an example of changing the policy on an existing PV follows this list):
Delete: the default reclaim policy for dynamically provisioned PersistentVolumes. When the user deletes the corresponding PersistentVolumeClaim, the dynamically provisioned volume is deleted automatically.

Retain: appropriate when the volume holds important data. With Retain, deleting the PersistentVolumeClaim does not delete the corresponding PersistentVolume; instead it becomes Released, and its data can be recovered manually.

Recycle: when the user deletes the PersistentVolumeClaim, the data on the volume is deleted but the volume itself is kept.
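
For example, the reclaim policy of an existing PV can be switched with kubectl patch (a minimal sketch; ceph-pv is the PV created above):

# kubectl patch pv ceph-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'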

Create a persistent volume claim

Create it on k8s via a manifest:

# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi

# kubectl apply -f pvc.yaml

Once the claim is created, k8s matches it against the available PVs and binds the most suitable one to the claim: the PV's capacity must satisfy the claim's request, and the PV's access modes must include those requested by the claim. So the PVC above gets bound to the PV we just created.

Check the PVC binding:

# kubectl get pvc
NAME         STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-claim   Bound    ceph-pv   1Gi        RWO,ROX                       13m

Using the persistent volume in a pod

Create it on k8s via a manifest:

# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod
spec:
  containers:
  - name: ceph-ubuntu
    image: phusion/baseimage
    command: ["sh", "/sbin/my_init"]
    volumeMounts:
    - name: ceph-mnt
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-mnt
    persistentVolumeClaim:
      claimName: ceph-claim

# kubectl apply -f ubuntu.yaml
pod/ceph-pod created

Checking the pod status, we find it stuck in the ContainerCreating phase, and the describe output shows the following errors:

# kubectl get pods
NAME                     READY   STATUS              RESTARTS   AGE
ceph-pod                 0/1     ContainerCreating   0          75s

# kubectl describe pods ceph-pod
Events:
  Type     Reason       Age                   From            Message
  ----     ------       ----                  ----            -------
  Warning  FailedMount  48m (x6 over 75m)     kubelet, work3  Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[default-token-tlsjd ceph-mnt]: timed out waiting for the condition
  Warning  FailedMount  8m59s (x45 over 84m)  kubelet, work3  MountVolume.WaitForAttach failed for volume "ceph-pv" : fail to check rbd image status with: (executable file not found in $PATH), rbd output: ()
  Warning  FailedMount  3m13s (x23 over 82m)  kubelet, work3  Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[ceph-mnt default-token-tlsjd]: timed out waiting for the condition

This happens because k8s relies on the kubelet to attach (rbd map) and detach (rbd unmap) RBD images, and the kubelet runs on every k8s node. Therefore every k8s node needs the ceph-common package installed so the kubelet can find the rbd command.
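
A minimal sketch of installing it from the Alibaba Cloud Ceph mirror follows (the repo file is an assumption; pick the Ceph release and OS version matching your cluster, and enable GPG checking for production use):

# cat > /etc/yum.repos.d/ceph.repo <<'EOF'
[ceph]
name=Ceph packages (Aliyun mirror)
baseurl=https://mirrors.aliyun.com/ceph/rpm-luminous/el7/x86_64/
enabled=1
gpgcheck=0
EOF
# yum install -y ceph-common

After installing ceph-common on every node this way, a new error appeared: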

# kubectl describe pods ceph-pod
Events:
  Type     Reason       Age                   From            Message
  ----     ------       ----                  ----            -------
MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:18.575338 7f0171c3ed80 -1 did not load config file, using default settings.
2020-06-02 17:12:18.603861 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
rbd: sysfs write failed
2020-06-02 17:12:18.620447 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (6) No such device or address
  Warning  FailedMount  15s  kubelet, work3  MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:19.257006 7fc330e14d80 -1 did not load config file, using default settings.

We had to keep digging, and found two problems to solve:
1) The k8s nodes and the Ceph cluster run different kernel versions; the k8s nodes have an older kernel that does not support some RBD image features, so those features need to be disabled. Disable them as follows:

# rbd info rbd/image1
rbd image 'image1':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.374d6b8b4567
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:

# rbd  feature disable rbd/image1 exclusive-lock object-map fast-diff deep-flatten  
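
To avoid having to disable features on every image afterwards, new images can also be created with only the layering feature enabled (a sketch; image2 is just an example name, and the --image-feature flag exists on recent Ceph releases). Setting rbd_default_features = 1 in the client's ceph.conf achieves the same thing cluster-wide.

# rbd create image2 -s 1024 --image-feature layering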

2) The missing-keyring error occurs because each k8s node has to authenticate to Ceph when mapping an image onto itself, so the ceph.client.admin.keyring file must be present under /etc/ceph on every k8s node. We created the /etc/ceph directory on each node and copied the keyring over with a small script (a sketch of it follows the scp command below).

# scp /etc/ceph/ceph.client.admin.keyring root@k8s-node:/etc/ceph
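
A rough version of that distribution script might look like this (the node names are placeholders for your actual k8s nodes, and it assumes password-less SSH from the Ceph master):

#!/bin/bash
# copy the Ceph admin keyring to every k8s node so kubelet can authenticate when running "rbd map"
for node in k8s-work1 k8s-work2 k8s-work3; do
    ssh root@${node} "mkdir -p /etc/ceph"
    scp /etc/ceph/ceph.client.admin.keyring root@${node}:/etc/ceph/
done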

Check the pod status again; it is finally running:

# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
ceph-pod                 1/1     Running   0          29s

Entering the Ubuntu container and checking the mounts, we can see the image has been mapped and formatted:

# kubectl exec ceph-pod -it sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -hT
Filesystem              Type     Size  Used Avail Use% Mounted on
overlay                 overlay   50G  3.6G   47G   8% /
tmpfs                   tmpfs     64M     0   64M   0% /dev
tmpfs                   tmpfs    2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/rbd0               ext4     976M  2.6M  958M   1% /mnt
/dev/mapper/centos-root xfs       50G  3.6G   47G   8% /etc/hosts
shm                     tmpfs     64M     0   64M   0% /dev/shm
tmpfs                   tmpfs    2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                   tmpfs    2.9G     0  2.9G   0% /proc/acpi
tmpfs                   tmpfs    2.9G     0  2.9G   0% /proc/scsi
tmpfs                   tmpfs    2.9G     0  2.9G   0% /sys/firmware

On the node where ceph-pod is running, df also shows the rbd mount:

# df -hT|grep rbd
/dev/rbd0               ext4      976M  2.6M  958M   1% /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/rbd-image-image2

Dynamic Persistent Volumes

With dynamic provisioning, no storage administrator intervention is needed: the images that back k8s volumes are created automatically, so storage can be requested on demand. We first define one or more StorageClasses, each of which must be configured with a provisioner that decides which volume plugin provisions the PVs. When a PersistentVolumeClaim requests a StorageClass, that StorageClass's provisioner creates the persistent volume on the corresponding storage backend.

The volume plugins supported upstream are listed here: https://kubernetes.io/zh/docs/concepts/storage/storage-classes/

Create a regular user for k8s to map RBD images

Create a k8s-dedicated pool and user in the Ceph cluster:

# ceph osd pool create kube 8192
# ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring
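
To verify the new user's capabilities, and on Luminous or later to initialize the pool for RBD use, something like the following can be run (rbd pool init is an assumption that only exists on newer Ceph releases):

# ceph auth get client.kube
# rbd pool init kube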

Create a secret for the kube user in the k8s cluster:

# ceph auth get-key client.kube|base64
QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==

# vim ceph-kube-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-kube-secret
  namespace: default
data:
  key: QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==
type: kubernetes.io/rbd

# kubectl create -f ceph-kube-secret.yaml
# kubectl get secret
NAME                  TYPE                                  DATA   AGE
ceph-kube-secret      kubernetes.io/rbd                     1      68s

Create a StorageClass, or reuse an existing one

# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: default
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"

# kubectl apply -f sc.yaml
# kubectl get storageclass
NAME                 PROVISIONER         RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd (default)   kubernetes.io/rbd   Delete          Immediate           false                  6s

The main fields are explained below:
1. storageclass.beta.kubernetes.io/is-default-class: if set to true, this becomes the default StorageClass. A PVC that requests storage without specifying a StorageClass is served from the default one (a PVC can also name the StorageClass explicitly; see the example after this list).
2. adminId: the Ceph client ID used to create images in the pool. Defaults to "admin".
3. userId: the Ceph client ID used to map RBD images. Defaults to the same value as adminId.
4. imageFormat: the Ceph RBD image format, "1" or "2". Defaults to "1".
5. imageFeatures: optional, and only used when imageFormat is set to "2". Currently the only supported feature is layering. Defaults to "", i.e. no features enabled.
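
For reference, a PVC can select this StorageClass explicitly instead of relying on the default; a minimal sketch (ceph-explicit-claim is just an example name):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-explicit-claim
spec:
  storageClassName: ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi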

Create a persistent volume claim

Since a default StorageClass has been designated, we can create the PVC directly. With the Immediate volume binding mode, the provisioner is triggered as soon as the PVC is created; the PVC stays in the Pending state until a PV has been provisioned and bound:

# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-sc-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Mi

# kubectl apply -f pvc.yaml
persistentvolumeclaim/ceph-sc-claim created

# kubectl get pvc
NAME            STATUS    VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-sc-claim   Pending                                        ceph-rbd       50s

After creating the PVC, we find it never binds to a PV and stays Pending, so we check the PVC's events and see the following error:

# kubectl describe pvc  ceph-sc-claim
Name:          ceph-sc-claim
Namespace:     default
StorageClass:  ceph-rbd
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    <none>
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  5s (x7 over 103s)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-rbd": failed to get admin secret from ["default"/"ceph-secret"]: failed to get secret from ["default"/"ceph-secret"]: Cannot get secret of type kubernetes.io/rbd

From the error above we can tell that the k8s controller failed to get the Ceph admin secret: it looks for a secret of type kubernetes.io/rbd, while the ceph-secret we created earlier had no type set and therefore defaulted to Opaque. We recreate ceph-secret with type kubernetes.io/rbd, this time in the kube-system namespace where the controller-manager runs, then delete the PVC and StorageClass, update the StorageClass configuration, and create them again:

# cat ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
  namespace: kube-system
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
type: kubernetes.io/rbd

# kubectl apply -f ceph-secret.yaml
# kubectl get secret ceph-secret -n kube-system
NAME          TYPE                DATA   AGE
ceph-secret   kubernetes.io/rbd   1      19m

# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml

# kubectl describe  pvc ceph-sc-claim
Name:          ceph-sc-claim
Namespace:     default
StorageClass:  ceph-rbd
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    <none>
Events:
  Type     Reason              Age                  From                         Message
  ----     ------              ----                 ----                         -------
  Warning  ProvisioningFailed  33s (x59 over 116m)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-rbd": failed to create rbd image: executable file not found in $PATH, command output:

The PVC still fails to bind, so we keep looking. We installed ceph-common on every node in the k8s cluster, so why is the rbd command still not found? After more digging, the cause turned out to be the following:
when k8s dynamically provisions Ceph storage through a StorageClass, the controller-manager is what runs the rbd command against the Ceph cluster, and the default k8s.gcr.io/kube-controller-manager image does not ship the Ceph rbd client. Upstream k8s recommends solving this with an external provisioner, an independent program that follows the specification defined by k8s.
Following that recommendation, we deploy the external rbd-provisioner; run the following on the k8s master:

# git clone https://github.com/kubernetes-incubator/external-storage.git
# cd external-storage/ceph/rbd/deploy
# sed -r -i "s/namespace: [^ ]+/namespace: kube-system/g" ./rbac/clusterrolebinding.yaml ./rbac/rolebinding.yaml
# kubectl -n kube-system apply -f ./rbac
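
Once the RBAC objects and the deployment from the repo have been applied, we can quickly check that the provisioner pod is up (the app=rbd-provisioner label comes from the repo's deployment manifest):

# kubectl -n kube-system get pods -l app=rbd-provisioner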

# kubectl describe deployments.apps -n kube-system rbd-provisioner
Name:               rbd-provisioner
Namespace:          kube-system
CreationTimestamp:  Wed, 03 Jun 2020 18:59:14 +0800
Labels:             <none>
Annotations:        deployment.kubernetes.io/revision: 1
Selector:           app=rbd-provisioner
Replicas:           1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:       Recreate
MinReadySeconds:    0
Pod Template:
  Labels:           app=rbd-provisioner
  Service Account:  rbd-provisioner
  Containers:
   rbd-provisioner:
    Image:      quay.io/external_storage/rbd-provisioner:latest
    Port:       <none>
    Host Port:  <none>
    Environment:
      PROVISIONER_NAME:  ceph.com/rbd
    Mounts:              <none>
  Volumes:               <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   rbd-provisioner-c968dcb4b (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  6m5s  deployment-controller  Scaled up replica set rbd-provisioner-c968dcb4b to 1

Change the StorageClass's provisioner to the one we just deployed:

# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: ceph.com/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"

# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml

Wait for the provisioner to allocate storage and for the PVC to bind to a PV; after roughly three minutes the binding finally succeeded:

# kubectl get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-sc-claim   Bound    pvc-0b92a433-adb0-46d9-a0c8-5fbef28eff5f   2Gi        RWO            ceph-rbd       7m49s

Using the persistent volume in a pod

Create the pod and check the mount:

# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-sc-pod
spec:
  containers:
  - name: ceph-sc-ubuntu
    image: phusion/baseimage
    command: ["/sbin/my_init"]
    volumeMounts:
    - name: ceph-sc-mnt
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-sc-mnt
    persistentVolumeClaim:
      claimName: ceph-sc-claim

# kubectl apply -f ubuntu.yaml
# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
ceph-sc-pod              1/1     Running   0          24s

# kubectl exec ceph-sc-pod -it  sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   50G  3.8G   47G   8% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/rbd0                2.0G  6.0M  1.9G   1% /mnt
/dev/mapper/centos-root   50G  3.8G   47G   8% /etc/hosts
shm                       64M     0   64M   0% /dev/shm
tmpfs                    2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    2.9G     0  2.9G   0% /proc/acpi
tmpfs                    2.9G     0  2.9G   0% /proc/scsi
tmpfs                    2.9G     0  2.9G   0% /sys/firmware

After all this troubleshooting, k8s is finally connected to the external Ceph cluster.

Summary

1. k8s relies on the kubelet to attach (rbd map) and detach (rbd unmap) RBD images, and the kubelet runs on every k8s node, so every k8s node must have the ceph-common package installed to provide the rbd command to the kubelet.
2. When k8s dynamically creates Ceph storage through a StorageClass, the controller-manager needs the rbd command to talk to the Ceph cluster, but the default k8s.gcr.io/kube-controller-manager image does not include the Ceph rbd client. Upstream k8s recommends using an external provisioner for this; these are independent programs that follow a specification defined by k8s.

References

https://kubernetes.io/zh/docs/concepts/storage/storage-classes/
https://kubernetes.io/zh/docs/concepts/storage/volumes/
https://groups.google.com/forum/#!topic/kubernetes-sig-storage-bugs/4w42QZxboIA

Source: https://blog.51cto.com/leejia/2501080

