linux-kernel - Re: [syzbot] [kernel?] WARNING in flush_cpu_slab

Message-ID: <457e2827-119e-446c-90b3-8e9cc7cd3e5d@suse.cz>
Date: Fri, 24 May 2024 10:02:58 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Thomas Gleixner <tglx@...utronix.de>,
 syzbot <syzbot+50e25cfa4f917d41749f@...kaller.appspotmail.com>,
 bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 linux-kernel@...r.kernel.org, mingo@...hat.com,
 syzkaller-bugs@...glegroups.com, x86@...nel.org, linux-mm@...ck.org,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Tejun Heo
 <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>,
 Dennis Zhou <dennis@...nel.org>
Subject: Re: [syzbot] [kernel?] WARNING in flush_cpu_slab

On 5/24/24 12:32 AM, Thomas Gleixner wrote:
> On Thu, May 23 2024 at 23:03, Vlastimil Babka wrote:
>> On 5/23/24 12:36 PM, Thomas Gleixner wrote:
>>>> ------------[ cut here ]------------
>>>> DEBUG_LOCKS_WARN_ON(l->owner)
>>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 local_lock_acquire include/linux/local_lock_internal.h:30 [inline]
>>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_slab mm/slub.c:3088 [inline]
>>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_cpu_slab+0x37f/0x410 mm/slub.c:3146
>>
>> I'm puzzled by this. We use local_lock_irqsave() on !PREEMPT_RT everywhere.
>> IIUC this warning says we did the irqsave() and then found out somebody else
>> already set the owner? But that means they also did that irqsave() and set
>> themselves as l->owner. Does that mean there would be a spurious irq enable
>> that didn't go through local_unlock_irqrestore()?
>>
>> Also this particular stack is from the work, which is scheduled by
>> queue_work_on() in flush_all_cpus_locked(), which also has a
>> lockdep_assert_cpus_held() so it should fulfill the "the caller must ensure
>> the cpu doesn't go away" property. But I think even if this ended up on the
>> wrong cpu (for the full duration or migrated while processing the work item)
>> somehow, it wouldn't be able to cause such a warning, but would rather
>> corrupt something else.
> 
> Indeed. There is another report which makes no sense either:
> 
>  https://lore.kernel.org/lkml/000000000000fa09d906191c3ee5@google.com

That looks like slab->next, which should contain a valid pointer or NULL,
contains 0x13 instead.
slab->next is initialized in put_cpu_partial() from s->cpu_slab->partial.

Here we have corruption inside s->cpu_slab->lock.
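
To make the failure mode concrete, here is a rough userspace model of the
debug check behind DEBUG_LOCKS_WARN_ON(l->owner). It is only a sketch:
struct model_lock and the model_lock_*() helpers are hypothetical stand-ins,
not the kernel definitions, and the real acquire path also involves lockdep
and irq state. The point is just that the check fires whenever the owner
field is non-NULL on acquire, no matter who wrote it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a local lock carrying a debug owner field. */
struct model_lock {
	const void *owner;	/* NULL while the lock is not held */
};

/*
 * Model of the acquire-side debug check. Returns true when the check
 * would have warned, i.e. the owner field was already non-NULL.
 */
static bool model_lock_acquire(struct model_lock *l, const void *task)
{
	bool warned = (l->owner != NULL);

	l->owner = task;
	return warned;
}

/* Model of the release side: the owner field is cleared again. */
static void model_lock_release(struct model_lock *l)
{
	l->owner = NULL;
}
```

With balanced acquire/release calls the model never warns. Writing any
stray non-zero value into owner from outside makes the next acquire warn,
which would match a corrupted lock rather than a real double acquisition.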

> Both look like data corruption issues caused by whatever...

s->cpu_slab is a percpu allocation, so possibly another percpu alloc user has
a buffer overflow?
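
A rough userspace sketch of that hypothesis, under stated assumptions: two
unrelated users get adjacent chunks of one CPU's area, and the first writes
past the end of its chunk. The layout, offsets, and all names below
(struct cpu_area, neighbour_write(), victim_owner()) are made up for
illustration and say nothing about how the percpu allocator actually placed
these objects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	BUF_OFF   = 0,			/* neighbour's buffer starts here */
	BUF_LEN   = 8,			/* and is 8 bytes long */
	OWNER_OFF = BUF_OFF + BUF_LEN,	/* victim's lock owner follows directly */
	AREA_LEN  = OWNER_OFF + sizeof(uintptr_t),
};

/* One CPU's area, modelled as raw bytes shared by both users. */
struct cpu_area {
	unsigned char bytes[AREA_LEN];
};

/*
 * The neighbour fills len bytes of its buffer without checking len
 * against BUF_LEN. Only the area limit is enforced, so that the model
 * itself stays within defined behaviour.
 */
static void neighbour_write(struct cpu_area *a, unsigned char fill, size_t len)
{
	if (len > (size_t)(AREA_LEN - BUF_OFF))
		len = (size_t)(AREA_LEN - BUF_OFF);
	memset(a->bytes + BUF_OFF, fill, len);
}

/* Read back the victim's owner field as the victim would see it. */
static uintptr_t victim_owner(const struct cpu_area *a)
{
	uintptr_t owner;

	memcpy(&owner, a->bytes + OWNER_OFF, sizeof(owner));
	return owner;
}
```

Starting from a zeroed area, neighbour_write(&a, fill, BUF_LEN) leaves
victim_owner() at zero, while a single extra byte (BUF_LEN + 1) with a
non-zero fill makes it non-zero. An off-by-one in an unrelated percpu user
would be enough to trip the owner check or leave a small bogus value in a
pointer field.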

> Thanks,
> 
>         tglx