lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Sat, 28 Nov 2020 13:25:22 +0100
From:   Donald Buczek <buczek@...gen.mpg.de>
To:     Song Liu <song@...nel.org>, linux-raid@...r.kernel.org,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        it+raid@...gen.mpg.de
Subject: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
 transition

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with

     echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/stack of this process:

     root@...e:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
     UID        PID  PPID  C STIME TTY          TIME CMD
     root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
     root@...e:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
     [<0>] kthread_stop+0x6e/0x150
     [<0>] md_unregister_thread+0x3e/0x70
     [<0>] md_reap_sync_thread+0x1f/0x1e0
     [<0>] action_store+0x141/0x2b0
     [<0>] md_attr_store+0x71/0xb0
     [<0>] kernfs_fop_write+0x113/0x1a0
     [<0>] vfs_write+0xbc/0x250
     [<0>] ksys_write+0xa1/0xe0
     [<0>] do_syscall_64+0x33/0x40
     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note, that md0 has been paused successfully just before.

     2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
     2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
     2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
     
     2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
     2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
     2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
     [ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6

Also, all processes doing I/O do the md device block.

This is /proc/mdstat:

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process: https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea what might be going on?

Best
   Donald

-- 
Donald Buczek
buczek@...gen.mpg.de
Tel: +49 30 8413 1433

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ