Date:   Tue, 8 Aug 2017 09:10:08 +0200
From:   Dominik Brodowski <linux@...inikbrodowski.net>
To:     Shaohua Li <shli@...nel.org>
Cc:     NeilBrown <neilb@...e.com>, Shaohua Li <shli@...com>,
        David R <david@...olicited.net>, linux-kernel@...r.kernel.org,
        linux-raid@...r.kernel.org, tj@...nel.org
Subject: Re: [MD] Crash with 4.12+ kernel and high disk load -- bisected to
 4ad23a976413: MD: use per-cpu counter for writes_pending

Shaohua,

(really CC'ing Tejun now)

On Mon, Aug 07, 2017 at 09:51:03PM -0700, Shaohua Li wrote:
> On Mon, Aug 07, 2017 at 01:20:25PM +0200, Dominik Brodowski wrote:
> > Neil, Shaohua,
> > 
> > following up on David R's bug message: I have observed something similar
> > on v4.12.[345] and v4.13-rc4, but not on v4.11. This is a RAID1 (on bare
> > metal partitions, /dev/sdaX and /dev/sdbY linked together). In case it
> > matters: Further upwards are cryptsetup, a DM volume group, then logical
> > volumes, and then filesystems (ext4, but also happened with xfs).
> > 
> > In a tedious bisect (the bug wasn't as quickly reproducible as I would like,
> > but happened when I repeatedly created large lvs and filled them with some
> > content, while compiling kernels in parallel), I was able to track this
> > down to:
> > 
> > 
> > commit 4ad23a976413aa57fe5ba7a25953dc35ccca5b71
> > Author: NeilBrown <neilb@...e.com>
> > Date:   Wed Mar 15 14:05:14 2017 +1100
> > 
> >     MD: use per-cpu counter for writes_pending
> >     
> >     The 'writes_pending' counter is used to determine when the
> >     array is stable so that it can be marked in the superblock
> >     as "Clean".  Consequently it needs to be updated frequently
> >     but only checked for zero occasionally.  Recent changes to
> >     raid5 cause the count to be updated even more often - once
> >     per 4K rather than once per bio.  This provided
> >     justification for making the updates more efficient.
> > 
> >     ...
> > 
> > 
> > CC'ing tj@...nel.org, as 4ad23a976413 is the first (and only?) user
> > of percpu_ref_switch_to_atomic_sync() introduced in 210f7cdcf088.
> > 
> > Applying a415c0f10627 on top of 4ad23a976413 does *not* fix the issue, but
> > reverting all of a2bfc6753065, a415c0f10627 and 4ad23a976413 seems to fix
> > the issue for v4.12.5.
> 
> Spent some time to check this one,

Thanks for taking a look at this issue!

> unfortunately I can't find how that patch makes rcu stall. the percpu part
> looks good to me too. Can you double check if reverting 4ad23a976413aa57
> makes the issue go away?

v4.12.5 with the three patches reverted survived many hours of testing
yesterday. Rebooted into 4ad23a976413aa57 (rebuilt with lockup detection),
and got backtraces for mdX_raid1 after a few minutes. The following test
was done with plain v4.12.5, with the first stall after ~1 minute:

> When the rcu stall happens, what is /sys/block/md/md0/array_state?

First, a "ps axf | grep 'md'" snippet:

  498 ?        S<     0:00  \_ [md]
 1163 ?        D      0:01  \_ [md0_raid1]
 1172 ?        D      0:07  \_ [md1_raid1]
 1182 ?        D      0:00  \_ [md2_raid1]
 1192 ?        R      1:35  \_ [md3_raid1]

(why are md[012]_raid1 continuously in "D" state?)

array_state for md0: active
array_state for md1: active-idle
array_state for md2: active-idle
array_state for md3: clean

> please also attach /proc/mdstat.

Personalities : [raid1] 
md3 : active raid1 sda8[0] sdb5[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md2 : active raid1 sda7[0] sdb6[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda6[0] sdb7[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda5[0] sdb8[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>

> When you say the mdx_raid1 threads are in 'R' state, can you double check
> if the /proc/pid/stack is always 0xffffffffff?

root@...est ~ # cat /proc/1192/stack # md3_raid1
[<ffffffffffffffff>] 0xffffffffffffffff

root@...est ~ # cat /proc/1182/stack 
[<ffffffff8143b9ff>] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[<ffffffff815d03a8>] set_in_sync+0x68/0xe0
[<ffffffff815db314>] md_check_recovery+0x284/0x4c0
[<ffffffff815cc45c>] raid1d+0x4c/0x910
[<ffffffff815d030d>] md_thread+0x12d/0x160
[<ffffffff81141bb1>] kthread+0x131/0x170
[<ffffffff81793dd7>] ret_from_fork+0x27/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

root@...est ~ # cat /proc/1172/stack 
[<ffffffff8143b9ff>] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[<ffffffff815d03a8>] set_in_sync+0x68/0xe0
[<ffffffff815db314>] md_check_recovery+0x284/0x4c0
[<ffffffff815cc45c>] raid1d+0x4c/0x910
[<ffffffff815d030d>] md_thread+0x12d/0x160
[<ffffffff81141bb1>] kthread+0x131/0x170
[<ffffffff81793dd7>] ret_from_fork+0x27/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

root@...est ~ # cat /proc/1163/stack 
[<ffffffff8143b9ff>] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[<ffffffff815d03a8>] set_in_sync+0x68/0xe0
[<ffffffff815db314>] md_check_recovery+0x284/0x4c0
[<ffffffff815cc45c>] raid1d+0x4c/0x910
[<ffffffff815d030d>] md_thread+0x12d/0x160
[<ffffffff81141bb1>] kthread+0x131/0x170
[<ffffffff81793dd7>] ret_from_fork+0x27/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
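
For reference, the wait these stacks show happens inside
percpu_ref_switch_to_atomic_sync(). A simplified sketch of the ~v4.12
helper introduced in 210f7cdcf088 (illustrative, not a verbatim copy of
lib/percpu-refcount.c):

/* Kick off the percpu -> atomic switch, then sleep until the switch has
 * been confirmed.  The confirmation only runs from an RCU callback after
 * a grace period, and the sleep is a plain wait_event(), i.e.
 * TASK_UNINTERRUPTIBLE, which would also explain the "D" state of
 * md[012]_raid1 above.
 */
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
{
	percpu_ref_switch_to_atomic(ref, NULL);	/* queues the RCU callback */
	wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
}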

So the other md[012]_raid1 threads are waiting in
percpu_ref_switch_to_atomic_sync(). And md3_raid1 is, according to
"NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [md3_raid1:1192]",
stuck somewhere in raid1d:

[   92.564717] task: ffff880234442e80 task.stack: ffffc9000102c000
[   92.564781] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[   92.564843] RSP: 0018:ffffc9000102fdb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[   92.564913] RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff8802321a1e00
[   92.564995] RDX: ffffffff815cca08 RSI: 0000000000000001 RDI: ffffffff81793291
[   92.565077] RBP: ffffc9000102fdc0 R08: 0000000000000000 R09: 0000000000000000
[   92.565159] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802321a1e18
[   92.565241] R13: ffff8802321a1e70 R14: ffff880234442e80 R15: 0000000000000000
[   92.565323] FS:  0000000000000000(0000) GS:ffff880236e00000(0000) knlGS:0000000000000000
[   92.565425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   92.565502] CR2: 00000083b78e4e30 CR3: 0000000001c23000 CR4: 00000000000006e0
[   92.565585] Call Trace:
[   92.565657]  raid1d+0x5f8/0x910
[   92.565729]  ? retint_kernel+0x10/0x10
[   92.565802]  md_thread+0x12d/0x160
[   92.565874]  ? md_thread+0x12d/0x160
[   92.565947]  ? __wake_up_common+0x80/0x80
[   92.566022]  kthread+0x131/0x170
[   92.566093]  ? find_pers+0x70/0x70
[   92.566165]  ? __kthread_create_on_node+0x220/0x220
[   92.566241]  ret_from_fork+0x27/0x40
[   92.566312] Code: fc 48 8b 55 08 53 48 8d 7f 18 48 89 f3 be 01 00 00 00 e8 b1 ce 9d ff 4c 89 e7 e8 19 0c 9e ff f6 c7 02 74 13 e8 ff 90 9d ff 53 9d <65> ff 0d 3e 94 87 7e 5b 41 5c 5d c3 53 9d e8 4a 93 9d ff eb eb 
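
For context, the split the commit message above describes boils down to a
cheap per-cpu fast path for every write plus a rare "is it zero?" check
before marking the array Clean. A rough illustration (the names below are
made up for the sketch; the real drivers/md/md.c counterparts are
md_write_start()/md_write_end()/set_in_sync()):

/* Illustrative only, simplified signatures.  Fast path: every write bumps
 * the per-cpu ref, which is cheap and touches no shared cacheline.
 */
static void write_begin(struct mddev *mddev)
{
	percpu_ref_get(&mddev->writes_pending);
}

static void write_done(struct mddev *mddev)
{
	percpu_ref_put(&mddev->writes_pending);
}

/* Slow path: only when md wants to mark the array "Clean".  The per-cpu
 * counters cannot be summed race-free in percpu mode, so the ref is first
 * forced into atomic mode, i.e. the synchronous switch the threads above
 * are stuck in.
 */
static bool can_mark_clean(struct mddev *mddev)
{
	percpu_ref_switch_to_atomic_sync(&mddev->writes_pending);
	return percpu_ref_is_zero(&mddev->writes_pending);
}

Since that switch is only confirmed from an RCU callback, a CPU that never
leaves the kernel (like md3_raid1 apparently spinning in raid1d above)
could hold up the grace period, and with it every thread sleeping in
percpu_ref_switch_to_atomic_sync(); that would fit both the soft-lockup
and the RCU-stall reports.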

Thanks,
	Dominik
