Trying Out a Ceph Repair

During testing, Ceph started behaving strangely, so I checked its status. The cluster was reporting HEALTH_ERR, so I looked up how to repair it and am writing the steps down here.

root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph -s
  cluster:
    id:     1f5b2a70-e56d-4feb-8def-4e9aeff2f58b
    health: HEALTH_ERR
            506 scrub errors
            Possible data damage: 4 pgs inconsistent
            1/3 mons down, quorum b,e
 
  services:
    mon: 3 daemons, quorum b,e (age 13m), out of quorum: d
    mgr: a(active, since 12m)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 13m), 3 in (since 13m)
 
  data:
    pools:   3 pools, 48 pgs
    objects: 36.44k objects, 135 GiB
    usage:   339 GiB used, 1.0 TiB / 1.4 TiB avail
    pgs:     44 active+clean
             4  active+clean+inconsistent
 
  io:
    client:   2.9 KiB/s rd, 28 KiB/s wr, 2 op/s rd, 3 op/s wr

One monitor is down, and 4 PGs (placement groups) are reported as inconsistent, with possible data damage. Let's look at the details.

root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph health detail
HEALTH_ERR 506 scrub errors; Possible data damage: 4 pgs inconsistent; 1/3 mons down, quorum b,e
OSD_SCRUB_ERRORS 506 scrub errors
PG_DAMAGED Possible data damage: 4 pgs inconsistent
    pg 3.0 is active+clean+inconsistent, acting [1,2]
    pg 3.1 is active+clean+inconsistent, acting [1,0]
    pg 3.5 is active+clean+inconsistent, acting [2,1]
    pg 3.6 is active+clean+inconsistent, acting [1,2]
MON_DOWN 1/3 mons down, quorum b,e
    mon.d (rank 1) addr [v2:[IP]:3300/0,v1:[IP]:6789/0] is down (out of quorum)

The output lists the IDs of the affected PGs. Run a repair on each of them.

root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.0
instructing pg 3.0 on osd.1 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.1
instructing pg 3.1 on osd.1 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.5
instructing pg 3.5 on osd.2 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph pg repair 3.6
instructing pg 3.6 on osd.1 to repair
root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph health detail
HEALTH_ERR 506 scrub errors; Possible data damage: 4 pgs inconsistent; 1/3 mons down, quorum b,e
OSD_SCRUB_ERRORS 506 scrub errors
PG_DAMAGED Possible data damage: 4 pgs inconsistent
    pg 3.0 is active+clean+scrubbing+deep+inconsistent+repair, acting [1,2]
    pg 3.1 is active+clean+inconsistent, acting [1,0]
    pg 3.5 is active+clean+inconsistent, acting [2,1]
    pg 3.6 is active+clean+inconsistent, acting [1,2]
MON_DOWN 1/3 mons down, quorum b,e
    mon.d (rank 1) addr [v2:[IP]:3300/0,v1:[IP]:6789/0] is down (out of quorum)
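
The repair commands above follow directly from the PG lines in `ceph health detail`, so this step could be scripted. The following is only a minimal sketch, not what was actually run here: the function names and the `CEPH` variable are made up for illustration, and `deploy/rook-ceph-tools` assumes the toolbox runs as a Deployment with that name. `-it` is left out on purpose so that no TTY carriage returns end up in the piped output.

```shell
# Print the ID of every PG that `ceph health detail` (read from stdin)
# lists as inconsistent, e.g. "    pg 3.0 is active+clean+inconsistent, ...".
list_inconsistent_pgs() {
  awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}

# Hypothetical wrapper: CEPH holds the command prefix used to reach the ceph CLI.
# Assumed example:
#   CEPH="kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph"
repair_inconsistent_pgs() {
  $CEPH health detail | list_inconsistent_pgs | while read -r pg; do
    # </dev/null keeps the inner command from consuming the PG list on stdin.
    $CEPH pg repair "$pg" </dev/null
  done
}
```

With `CEPH` set as in the comment, calling `repair_inconsistent_pgs` would issue one `ceph pg repair` per inconsistent PG, the same as typing the commands by hand.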

Give the repairs some time to finish.

root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph health detail
HEALTH_WARN 1/3 mons down, quorum b,e
MON_DOWN 1/3 mons down, quorum b,e
    mon.d (rank 1) addr [v2:[IP]:3300/0,v1:[IP]:6789/0] is down (out of quorum)
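
Instead of re-running `ceph health detail` by hand while waiting, the check could be polled. This is a rough sketch under assumptions: the helper names and the `POLL_INTERVAL` variable are hypothetical, and the health command to run is passed in as arguments.

```shell
# Succeeds (exit 0) while the health text on stdin still mentions
# inconsistent PGs.
has_inconsistent_pgs() {
  grep -q 'inconsistent'
}

# Re-run the given health command until no PG is reported as inconsistent.
# Assumed usage:
#   wait_for_repair kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph health detail
# Caveat: if the command itself fails and prints nothing, the loop also
# exits, so the final state should still be confirmed manually.
wait_for_repair() {
  while "$@" | has_inconsistent_pgs; do
    sleep "${POLL_INTERVAL:-30}"
  done
}
```

The loop only watches for the word `inconsistent`; it says nothing about other warnings such as `MON_DOWN`, which still show up after the PGs are clean.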

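When a monitor is the one causing trouble, the cause is usually the NTP configuration, since monitors are sensitive to clock skew between nodes. I checked it, and that resolved the problem.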

root@c2:~# k exec -it -n rook-ceph rook-ceph-tools-776f7b4dbd-zlzjr -- ceph -s
  cluster:
    id:     1f5b2a70-e56d-4feb-8def-4e9aeff2f58b
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum b,d,e (age 2s)
    mgr: a(active, since 9s)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 20m), 3 in (since 20m)
 
  data:
    pools:   3 pools, 48 pgs
    objects: 36.43k objects, 135 GiB
    usage:   341 GiB used, 1.0 TiB / 1.4 TiB avail
    pgs:     48 active+clean
 
  io:
    client:   732 B/s rd, 16 KiB/s wr, 1 op/s rd, 1 op/s wr
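
The exact NTP check used for the monitor is not shown above. As one possible way to do it, here is a small sketch that assumes the monitor nodes run systemd with a `timedatectl` recent enough to support `show`; the function name is made up for illustration.

```shell
# Succeeds if `timedatectl show` output (read from stdin) reports that the
# system clock is synchronized.
clock_is_synchronized() {
  grep -qx 'NTPSynchronized=yes'
}

# Assumed usage on each monitor node:
#   timedatectl show | clock_is_synchronized || echo "clock is not synchronized"
#
# From the toolbox pod, the monitors' own view of clock skew can be printed with:
#   ceph time-sync-status
```

If a node reports that it is not synchronized, fixing its time service (for example chrony or systemd-timesyncd, depending on the host) would be the next thing to look at.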