Trying Out Ceph Repair
퐁스
2020. 7. 28. 17:01
While testing, Ceph started behaving oddly, so I checked its status. The cluster was in an error state showing HEALTH_ERR, so here is a write-up of how I found and ran the repair.
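For context, `k` in the commands below looks like an alias for kubectl, and every ceph command is run inside the Rook toolbox pod. A minimal setup sketch, assuming the standard Rook toolbox deployment (the pod name hash will differ per cluster):
alias k=kubectl
# find the toolbox pod; the label and deployment name follow the stock Rook toolbox manifest
k get pods -n rook-ceph -l app=rook-ceph-tools
# or open a shell in it instead of prefixing every ceph command with `k exec`
k exec -it -n rook-ceph deploy/rook-ceph-tools -- bash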
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph -s
  cluster:
    id:     1f5b2a70-e56d-4feb-8def-4e9aeff2f58b
    health: HEALTH_ERR
            506 scrub errors
            Possible data damage: 4 pgs inconsistent
            1/3 mons down, quorum b,e

  services:
    mon: 3 daemons, quorum b,e (age 13m), out of quorum: d
    mgr: a(active, since 12m)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 13m), 3 in (since 13m)

  data:
    pools:   3 pools, 48 pgs
    objects: 36.44k objects, 135 GiB
    usage:   339 GiB used, 1.0 TiB / 1.4 TiB avail
    pgs:     44 active+clean
             4  active+clean+inconsistent

  io:
    client:   2.9 KiB/s rd, 28 KiB/s wr, 2 op/s rd, 3 op/s wr
One of the monitors is down, and four PGs are flagged as possibly damaged. Let's look at the details.
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph health detail
HEALTH_ERR 506 scrub errors; Possible data damage: 4 pgs inconsistent; 1/3 mons down, quorum b,e
OSD_SCRUB_ERRORS 506 scrub errors
PG_DAMAGED Possible data damage: 4 pgs inconsistent
    pg 3.0 is active+clean+inconsistent, acting [1,2]
    pg 3.1 is active+clean+inconsistent, acting [1,0]
    pg 3.5 is active+clean+inconsistent, acting [2,1]
    pg 3.6 is active+clean+inconsistent, acting [1,2]
MON_DOWN 1/3 mons down, quorum b,e
    mon.d (rank 1) addr [v2:[IP]:3300/0,v1:[IP]:6789/0] is down (out of quorum)
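Before repairing, it can be worth checking exactly what is inconsistent. A sketch from the toolbox pod; `replicapool` is just an example pool name, and the PG ID matches the output above:
# list the PGs with scrub inconsistencies in a pool (pool name is an example)
rados list-inconsistent-pg replicapool
# dump the inconsistent objects and the error types for one of those PGs
rados list-inconsistent-obj 3.0 --format=json-pretty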
It tells us the PG IDs, so we repair each of those PGs.
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.0
instructing pg 3.0 on osd.1 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.1
instructing pg 3.1 on osd.1 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.5
instructing pg 3.5 on osd.2 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.6
instructing pg 3.6 on osd.1 to repair
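Four PGs are easy enough to repair by hand; with many more, a small loop does the same thing. A sketch, assuming the example pool name `replicapool` and `jq` available in the toolbox:
for pg in $(rados list-inconsistent-pg replicapool | jq -r '.[]'); do
  ceph pg repair "$pg"
done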
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph health detail
HEALTH_ERR 506 scrub errors; Possible data damage: 4 pgs inconsistent; 1/3 mons down, quorum b,e
OSD_SCRUB_ERRORS 506 scrub errors
PG_DAMAGED Possible data damage: 4 pgs inconsistent
    pg 3.0 is active+clean+scrubbing+deep+inconsistent+repair, acting [1,2]
    pg 3.1 is active+clean+inconsistent, acting [1,0]
    pg 3.5 is active+clean+inconsistent, acting [2,1]
    pg 3.6 is active+clean+inconsistent, acting [1,2]
MON_DOWN 1/3 mons down, quorum b,e
    mon.d (rank 1) addr [v2:[IP]:3300/0,v1:[IP]:6789/0] is down (out of quorum)
Give it a little time to finish; you can keep an eye on the progress in the meantime.
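A couple of ways to watch the progress from the toolbox, as a sketch (whether `watch` is installed in the pod depends on the image):
# stream cluster log messages, including scrub/repair results
ceph -w
# or poll the health detail every few seconds
watch -n 5 ceph health detail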
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph health detail
HEALTH_WARN 1/3 mons down, quorum b,e
MON_DOWN 1/3 mons down, quorum b,e
    mon.d (rank 1) addr [v2:[IP]:3300/0,v1:[IP]:6789/0] is down (out of quorum)
As for the monitor that stayed down, the cause is usually NTP (time synchronization) configuration. I checked the time sync on that node, and that resolved it.
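As a rough way to confirm a clock problem: ask Ceph about monitor clock skew, then check the time daemon on the node hosting the failing mon (which of these commands exists depends on the node image):
# per-monitor clock skew as measured by the cluster
ceph time-sync-status
# on the node running mon.d
timedatectl status
chronyc tracking    # if chrony is the time daemon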
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph -s
  cluster:
    id:     1f5b2a70-e56d-4feb-8def-4e9aeff2f58b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,e (age 2s)
    mgr: a(active, since 9s)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 20m), 3 in (since 20m)

  data:
    pools:   3 pools, 48 pgs
    objects: 36.43k objects, 135 GiB
    usage:   341 GiB used, 1.0 TiB / 1.4 TiB avail
    pgs:     48 active+clean

  io:
    client:   732 B/s rd, 16 KiB/s wr, 1 op/s rd, 1 op/s wr
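As a final check, the previously inconsistent PGs can be deep-scrubbed once more to make sure the errors stay gone, for example:
for pg in 3.0 3.1 3.5 3.6; do ceph pg deep-scrub "$pg"; done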