linux-kernel - Re: [syzbot] [mm?] INFO: rcu detected stall in validate

Message-ID: <q7omtpah3byvo5p3szra7kln63gtas35ml3kksltgj525pyezl@cn7v2o6qf2vc>
Date: Sun, 12 May 2024 13:28:31 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: syzbot <syzbot+a941018a091f1a1f9546@...kaller.appspotmail.com>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, lstoakes@...il.com,
        syzkaller-bugs@...glegroups.com, vbabka@...e.cz
Subject: Re: [syzbot] [mm?] INFO: rcu detected stall in validate_mm (3)

* syzbot <syzbot+a941018a091f1a1f9546@...kaller.appspotmail.com> [240512 05:19]:
> Hello,
> 
> syzbot found the following issue on:

First, excellent timing of this report - Sunday on an -rc7 release the
day before LSF/MM/BPF.

> 
> HEAD commit:    dccb07f2914c Merge tag 'for-6.9-rc7-tag' of git://git.kern..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=13f6734c980000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7144b4fe7fbf5900
> dashboard link: https://syzkaller.appspot.com/bug?extid=a941018a091f1a1f9546
> compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10306760980000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=138c8970980000
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/e1fea5a49470/disk-dccb07f2.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/5f7d53577fef/vmlinux-dccb07f2.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/430b18473a18/bzImage-dccb07f2.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+a941018a091f1a1f9546@...kaller.appspotmail.com
> 
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-1): P17678/1:b..l
> rcu: 	(detected by 1, t=10502 jiffies, g=36541, q=38 ncpus=2)
> task:syz-executor952 state:R  running task     stack:28968 pid:17678 tgid:17678 ppid:5114   flags:0x00000002
> Call Trace:
>  <TASK>
>  context_switch kernel/sched/core.c:5409 [inline]
>  __schedule+0xf15/0x5d00 kernel/sched/core.c:6746
>  preempt_schedule_irq+0x51/0x90 kernel/sched/core.c:7068
>  irqentry_exit+0x36/0x90 kernel/entry/common.c:354
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
> RIP: 0010:bytes_is_nonzero mm/kasan/generic.c:88 [inline]
> RIP: 0010:memory_is_nonzero mm/kasan/generic.c:122 [inline]
> RIP: 0010:memory_is_poisoned_n mm/kasan/generic.c:129 [inline]
> RIP: 0010:memory_is_poisoned mm/kasan/generic.c:161 [inline]
> RIP: 0010:check_region_inline mm/kasan/generic.c:180 [inline]
> RIP: 0010:kasan_check_range+0xc7/0x1a0 mm/kasan/generic.c:189
> Code: 83 c0 08 48 39 d0 0f 84 be 00 00 00 48 83 38 00 74 ed 48 8d 50 08 eb 0d 48 83 c0 01 48 39 c2 0f 84 8d 00 00 00 80 38 00 74 ee <48> 89 c2 b8 01 00 00 00 48 85 d2 74 1e 41 83 e2 07 49 39 d1 75 0a
> RSP: 0018:ffffc900031ef850 EFLAGS: 00000202
> RAX: fffffbfff2949b78 RBX: fffffbfff2949b79 RCX: ffffffff8ac92249
> RDX: fffffbfff2949b79 RSI: 0000000000000004 RDI: ffffffff94a4dbc0
> RBP: fffffbfff2949b78 R08: 0000000000000001 R09: fffffbfff2949b78
> R10: ffffffff94a4dbc3 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000001 R14: 0000000000000300 R15: 0000000000000000
>  instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
>  atomic_inc include/linux/atomic/atomic-instrumented.h:435 [inline]
>  mt_validate_nulls+0x5e9/0x9e0 lib/maple_tree.c:7550
>  mt_validate+0x3148/0x4390 lib/maple_tree.c:7599
>  validate_mm+0x9c/0x4b0 mm/mmap.c:288
>  mmap_region+0x1478/0x2760 mm/mmap.c:2934
>  do_mmap+0x8ae/0xf10 mm/mmap.c:1385
>  vm_mmap_pgoff+0x1ab/0x3c0 mm/util.c:573
>  ksys_mmap_pgoff+0x7d/0x5b0 mm/mmap.c:1431

..

I was concerned that we had somehow constructed a broken tree, but I
believe the information below rules that situation out. It appears that
the verification of a task's maple tree has exceeded the timeout allotted
to do so.  This call stack indicates it is all happening while holding
the mmap lock, so there is no locking or RCU issue there.

This trace suggests we are stuck checking the tree for sequential
NULLs, but not in the tree operation itself.  This would indicate the
issue isn't here at all - or we have a broken tree which causes the
iteration to never advance.
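
To make the two possibilities concrete, here is a minimal userspace
sketch of the kind of checks involved.  It is not the code in
lib/maple_tree.c: struct range_node, no_adjacent_nulls() and
pivots_advance() are hypothetical names for illustration only, and the
flat pivot/slot layout is an assumption.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical flat view of one node: slot i covers indices up to pivots[i]. */
struct range_node {
	size_t count;                 /* number of slots in use */
	const unsigned long *pivots;  /* inclusive upper bound of each slot */
	void *const *slots;           /* entry stored in each slot, NULL if empty */
};

/* Returns true when no two neighbouring slots are both NULL. */
bool no_adjacent_nulls(const struct range_node *n)
{
	for (size_t i = 1; i < n->count; i++) {
		if (n->slots[i - 1] == NULL && n->slots[i] == NULL)
			return false;
	}
	return true;
}

/* Returns true when pivots strictly increase, so a walk always advances. */
bool pivots_advance(const struct range_node *n)
{
	for (size_t i = 1; i < n->count; i++) {
		if (n->pivots[i] <= n->pivots[i - 1])
			return false;
	}
	return true;
}
```

In this sketch a healthy node passes both checks.  A node whose pivots
do not strictly increase is the sort of damage that could keep a range
walk from ever moving forward.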

The timeout adjustments do seem to be sufficient, and my VM running the
C reproducer has not hung, yet.  I am not using the bot's config, yet.

I also noticed that the git bisect is very odd and inconsistent, often
ending in "crashed: INFO: rcu detected stall in corrupted".  I also
noticed that KASAN is disabled in this report?
"disabling configs for [UBSAN BUG KASAN LOCKDEP ATOMIC_SLEEP LEAK], they
are not needed"

It seems like it would be wise to keep KASAN enabled, as there appear
to be corrupted stack traces, at least?  I noticed that the .config DOES
have KASAN enabled, so I guess it was dropped because it didn't pick up
an issue on the initial run?

There is only one report (the initial report) that detects the hung
state in the validate_mm() test function.  This is actually the least
concerning of all the places - because this validate function is
generally disabled on production systems.

The last change to lib/maple_tree.c went in through
mm-stable-2024-03-13-20-04.

I cannot say that this isn't the maple tree in an infinite loop, but I
don't think it is given the information above.  An infinite loop would
produce the same crash on every reproduction, but that is not what
syzbot sees during the git bisect.  I think it is not an issue in the
tree but an issue somewhere else - and probably a corruption issue that
wasn't detected by KASAN (is this possible?).

Thanks,
Liam