Message-ID: <459beb1c-defd-4836-952c-589203b7005c@meta.com>
Date: Wed, 18 Sep 2024 11:28:52 +0200
From: Chris Mason <clm@...a.com>
To: Jens Axboe <axboe@...nel.dk>, Matthew Wilcox <willy@...radead.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Dave Chinner <david@...morbit.com>,
Christian Theune <ct@...ingcircus.io>, linux-mm@...ck.org,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
folios since Dec 2021 (any kernel from 6.1 upwards)
One or more of the originally attached files triggered the rule module.access.rule.exestrip_notify
The following attachments were deleted from the original message.
radixcheck.py
Original Message:
On 9/18/24 2:37 AM, Jens Axboe wrote:
> On 9/17/24 7:25 AM, Matthew Wilcox wrote:
>> On Tue, Sep 17, 2024 at 01:13:05PM +0200, Chris Mason wrote:
>>> On 9/17/24 5:32 AM, Matthew Wilcox wrote:
>>>> On Mon, Sep 16, 2024 at 10:47:10AM +0200, Chris Mason wrote:
>>>>> I've got a bunch of assertions around incorrect folio->mapping and I'm
>>>>> trying to bash on the ENOMEM for readahead case. There's a GFP_NOWARN
>>>>> on those, and our systems do run pretty short on ram, so it feels right
>>>>> at least. We'll see.
>>>>
>>>> I've been running with some variant of this patch the whole way across
>>>> the Atlantic, and not hit any problems. But maybe with the right
>>>> workload ...?
>>>>
>>>> There are two things being tested here. One is whether we have a
>>>> cross-linked node (ie a node that's in two trees at the same time).
>>>> The other is whether the slab allocator is giving us a node that already
>>>> contains non-NULL entries.
>>>>
>>>> If you could throw this on top of your kernel, we might stand a chance
>>>> of catching the problem sooner. If it is one of these problems and not
>>>> something weirder.
>>>>
>>>
>>> This fires in roughly 10 seconds for me on top of v6.11. Since array seems
>>> to always be 1, I'm not sure if the assertion is right, but hopefully you
>>> can trigger it yourself.
>>
>> Whoops.
>>
>> $ git grep XA_RCU_FREE
>> lib/xarray.c:#define XA_RCU_FREE ((struct xarray *)1)
>> lib/xarray.c: node->array = XA_RCU_FREE;
>>
>> so you walked into a node which is currently being freed by RCU. Which
>> isn't a problem, of course. I don't know why I do that; it doesn't seem
>> like anyone tests it. The jetlag is seriously kicking in right now,
>> so I'm going to refrain from saying anything more because it probably
>> won't be coherent.
>
> Based on a modified reproducer from Chris (N threads reading from a
> file, M threads dropping pages), I can pretty quickly reproduce the
> xas_descend() spin on 6.9 in a vm with 128 cpus. Here's some debugging
> output with a modified version of your patch too, that ignores
> XA_RCU_FREE:
Jens and I are running slightly different versions of reader.c, but we're
seeing the same thing. v6.11 lasts all night long, and reverting those
two commits falls over in about 5 minutes or less.
I switched from a VM to bare metal, and managed to hit an assertion I'd
added to filemap_get_read_batch() (should look familiar):
{
	struct address_space *fmapping = READ_ONCE(folio->mapping);

	BUG_ON(fmapping && fmapping != mapping);
}
Walking the xarray in the crashdump shows that it's probably the same
corruption I saw in 5.19. drgn is printing like so:
print("0x%x mapping 0x%x radix index %d page index %d flags 0x%x (%s) size %d" % (page.address_of_(), page.mapping.value_(), index, page.index, page.flags, decode_page_flags(page), folio._folio_nr_pages))
And I attached radixcheck.py if you want to see the full script.
These are all from the correct mapping:
0xffffea0088b17200 mapping 0xffff88a22a9614e8 radix index 53 page index 53 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 59472
0xffffea008773e940 mapping 0xffff88a22a9614e8 radix index 54 page index 54 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4244589144
0xffffea0084ad1d00 mapping 0xffff88a22a9614e8 radix index 55 page index 55 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4040059330
0xffffea0088c9d840 mapping 0xffff88a22a9614e8 radix index 56 page index 56 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 5958
0xffffea00879c6300 mapping 0xffff88a22a9614e8 radix index 57 page index 57 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 112
0xffffea0086630980 mapping 0xffff88a22a9614e8 radix index 58 page index 58 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4025236287
0xffffea0008eb6580 mapping 0xffff88a22a9614e8 radix index 59 page index 59 flags 0x5ffff000000012c (PG_referenced|PG_uptodate|PG_lru|PG_active|PG_reported) size 269
0xffffea00072db000 mapping 0xffff88a22a9614e8 radix index 60 page index 60 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4
0xffffea000919b600 mapping 0xffff88a22a9614e8 radix index 64 page index 64 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4
These last 3 are not:
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 208 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 224 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 240 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
I think the bug was in __filemap_add_folio()'s usage of xas_split_alloc()
and the tree changing before taking the lock. It's just a guess, but that
was always my biggest suspect.
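Roughly, the window I have in mind looks like this (paraphrased from memory,
not a verbatim quote of the kernel source):

	/* in __filemap_add_folio(), before i_pages is locked */
	unsigned int order = xa_get_order(xas.xa, xas.xa_index);

	if (order > folio_order(folio))
		xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
				order, gfp);
	xas_lock_irq(&xas);
	/*
	 * Another task can split or replace the entry between the unlocked
	 * lookup above and taking the lock here, so the order we allocated
	 * split nodes for may no longer match what's in the tree.
	 */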
To reproduce, I used:
mkfs.xfs -f <some device>
mount <some device> /xfs

for x in `seq 1 8` ; do
	fallocate -l100m /xfs/file$x
	./reader /xfs/file$x &
done
New reader.c attached. Jens changed his so that every
reader thread was using its own offset in the file,
and he found that it reproduced more consistently.
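The shape of it is roughly: a handful of reader threads doing pread() in a
loop, each against its own region of the file, while another thread keeps
kicking the pages back out of the page cache. A simplified sketch of that
shape (not the actual reader.c; the real page dropping may be done
differently than the posix_fadvise() call used here):

/*
 * Simplified sketch of the reproducer shape, not the attached reader.c:
 * NR_READERS threads pread() from their own offsets while one thread
 * repeatedly drops the file's pages with posix_fadvise(DONTNEED).
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_READERS	8
#define BUF_SIZE	(64 * 1024)

static int fd;
static off_t file_size;

static void *reader(void *arg)
{
	long id = (long)arg;
	char *buf = malloc(BUF_SIZE);
	/* each reader hammers its own region of the file */
	off_t off = (file_size / NR_READERS) * id;

	for (;;) {
		if (pread(fd, buf, BUF_SIZE, off) < 0) {
			perror("pread");
			exit(1);
		}
	}
	return NULL;
}

static void *dropper(void *arg)
{
	for (;;) {
		/* kick the whole file out of the page cache */
		posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		usleep(1000);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid;
	long i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	file_size = lseek(fd, 0, SEEK_END);

	for (i = 0; i < NR_READERS; i++)
		pthread_create(&tid, NULL, reader, (void *)i);
	pthread_create(&tid, NULL, dropper, NULL);
	pthread_join(tid, NULL);
	return 0;
}

Build with -pthread; the shell loop above starts one instance per file.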
-chris