Date: Thu, 3 Dec 2020 02:55:15 +0100
From: Guoqing Jiang <guoqing.jiang@...ud.ionos.com>
To: Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
	linux-raid@...r.kernel.org,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> Unfortunately, the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  			goto unlock;
>  		}

I think you can add the check of RECOVERY_CHECK to the above part
instead of adding a new one.

> +		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> +		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
> +			/* resync/recovery still happening */
> +			pr_info("md: XXX BUGFIX applied\n");
> +			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +			goto unlock;
> +		}
>  		if (mddev->sync_thread) {
>  			md_reap_sync_thread(mddev);
>  			goto unlock;
>
> By pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1
> is still ongoing:

Without the patch, md0 was good while md1 was locked up. So the patch
switches the status of the two arrays, which is a little weird ...

What is the stack of the process? I guess it is the same as the stack
of 23333 in your previous mail, but just to confirm.
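Concretely, the suggestion above would amount to something like the
following: fold the MD_RECOVERY_CHECK test into the existing condition
in md_check_recovery() instead of duplicating it in a second block.
This is only a sketch built from the two hunks quoted above; it is
untested, and the surrounding context near line 9305 may differ
between kernel versions:

		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
			/* resync/recovery (or a paused check) still happening */
			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
			goto unlock;
		}

i.e. one condition, one clear_bit/goto, rather than two back-to-back
blocks with the same body.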
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>
> after 1 minute:
>
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>
> A data point I didn't mention in my previous mail is that the
> mdX_resync thread is not gone when the problem occurs:
>
> buczek@...e:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
> root       134     2  0 Nov30 ?        00:00:00 [md]
> root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
> root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
> root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
> root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
> buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
> buczek@...e:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
> [<0>] md_bitmap_cond_end_sync+0x12d/0x170
> [<0>] raid5_sync_request+0x24b/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x1f/0x30
>
> I guess md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait,
>             atomic_read(&bitmap->mddev->recovery_active) == 0);`
> in md-bitmap.c.

Could be. If so, then I think md_done_sync was not triggered by the
path md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks of md0 and md1 to find the
difference.

Thanks,
Guoqing
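For reference, the handshake being discussed works roughly as follows
(abridged and simplified from drivers/md/md.c and drivers/md/md-bitmap.c
around v5.10; treat this as a sketch of the mechanism, not the exact
upstream code):

	/* md_do_sync() counts outstanding sync I/O in mddev->recovery_active
	 * after handing a range to the personality (raid5_sync_request() here). */
	atomic_add(sectors, &mddev->recovery_active);

	/* The completion path is expected to call md_done_sync(), which drops
	 * the count and wakes waiters on recovery_wait. */
	void md_done_sync(struct mddev *mddev, int blocks, int ok)
	{
		atomic_sub(blocks, &mddev->recovery_active);
		wake_up(&mddev->recovery_wait);
		/* ... error handling if !ok ... */
	}

	/* md_bitmap_cond_end_sync() blocks until that count reaches zero. */
	wait_event(bitmap->mddev->recovery_wait,
		   atomic_read(&bitmap->mddev->recovery_active) == 0);

If md_done_sync() is never called for some outstanding sectors (the
"not triggered by md0_raid6 -> ... -> handle_stripe" case above),
recovery_active never returns to zero and the mdX_resync thread stays
stuck in md_bitmap_cond_end_sync(), which matches the stack quoted
above.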