linux-kernel - Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840

Message-ID: <ff29f723-32de-421b-a65e-7b7d2e03162d@redhat.com>
Date: Wed, 29 May 2024 08:57:48 +0200
From: David Hildenbrand <david@...hat.com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>, Chris Mason
 <clm@...com>, Josef Bacik <josef@...icpanda.com>,
 David Sterba <dsterba@...e.com>
Cc: Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
 Linux Memory Management List <linux-mm@...ck.org>,
 Matthew Wilcox <willy@...radead.org>,
 linux-btrfs <linux-btrfs@...r.kernel.org>
Subject: Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840

On 28.05.24 16:24, David Hildenbrand wrote:
> On 28.05.24 15:57, David Hildenbrand wrote:
>> On 28.05.24 08:05, Mikhail Gavrilov wrote:
>>> On Thu, May 23, 2024 at 12:05 PM Mikhail Gavrilov
>>> <mikhail.v.gavrilov@...il.com> wrote:
>>>>
>>>> On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <david@...hat.com> wrote:
>>>>
>>>> The only known workload that causes this is updating a large
>>>> container. Unfortunately, not every container update reproduces the
>>>> problem.
>>>
>>> Is it possible to add more debugging information to make it clearer
>>> what's going on?
>>
>> If we knew who originally allocated that problematic page, that might help.
>> Maybe page_owner could give some hints?
>>
>>>
>>> BUG: Bad page state in process kcompactd0  pfn:605811
>>> page: refcount:0 mapcount:0 mapping:0000000082d91e3e index:0x1045efc4f pfn:0x605811
>>> aops:btree_aops ino:1
>>> flags: 0x17ffffc600020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
>>> raw: 0017ffffc600020c dead000000000100 dead000000000122 ffff888159075220
>>> raw: 00000001045efc4f 0000000000000000 00000000ffffffff 0000000000000000
>>> page dumped because: non-NULL mapping
>>
>> Seems to be an order-0 page, otherwise we would have another "head: ..." report.
>>
>> It's not an anon/ksm/non-lru migration folio, because we clear the page->mapping
>> field for them manually on the page freeing path. Likely it's a pagecache folio.
>>
>> So one option is that something seems to not properly set folio->mapping to
>> NULL. But that problem would then also show up without page migration? Hmm.
>>
>>> Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI,
>>> BIOS 2611 04/07/2024
>>> Call Trace:
>>>    <TASK>
>>>    dump_stack_lvl+0x84/0xd0
>>>    bad_page.cold+0xbe/0xe0
>>>    ? __pfx_bad_page+0x10/0x10
>>>    ? page_bad_reason+0x9d/0x1f0
>>>    free_unref_page+0x838/0x10e0
>>>    __folio_put+0x1ba/0x2b0
>>>    ? __pfx___folio_put+0x10/0x10
>>>    ? __pfx___might_resched+0x10/0x10
>>
>> I suspect we come via
>>       migrate_pages_batch()->migrate_folio_unmap()->migrate_folio_done().
>>
>> Maybe this is the "Folio was freed from under us. So we are done." path
>> when "folio_ref_count(src) == 1".
>>
>> Alternatively, we might come via
>>       migrate_pages_batch()->migrate_folio_move()->migrate_folio_done().
>>
>> For ordinary migration, move_to_new_folio() will clear src->mapping if
>> the folio was migrated successfully. That's the very first thing that
>> migrate_folio_move() does, so I doubt that is the problem.
>>
>> So I suspect we are in the migrate_folio_unmap() path. But for
>> a !anon folio, who should be freeing the folio concurrently (and not clearing
>> folio->mapping?)? After all, we have to hold the folio lock while migrating.
>>
>> In khugepaged:collapse_file() we manually set folio->mapping = NULL, before
>> dropping the reference.
>>
>> Something to try, to see whether the problem goes away:
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index dd04f578c19c..45e92e14c904 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1124,6 +1124,13 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>                   /* Folio was freed from under us. So we are done. */
>>                   folio_clear_active(src);
>>                   folio_clear_unevictable(src);
>> +               /*
>> +                * Anonymous and movable src->mapping will be cleared by
>> +                * free_pages_prepare so don't reset it here for keeping
>> +                * the type to work PageAnon, for example.
>> +                */
>> +               if (!folio_mapping_flags(src))
>> +                       src->mapping = NULL;
>>                   /* free_pages_prepare() will clear PG_isolated. */
>>                   list_del(&src->lru);
>>                   migrate_folio_done(src, reason);
>>
>> But it does feel weird: who freed the page concurrently and didn't clear
>> folio->mapping ...
>>
>> We don't hold the folio lock of src, though, but have the only reference. So
>> another possible thing might be folio refcount mis-counting: folio_ref_count()
>> == 1 but there are other references (e.g., from the pagecache).
> 
> Hmm, your original report mentions kswapd, so I'm getting the feeling someone
> does one folio_put() too many and we are freeing a pagecache folio that is still
> in the pagecache and, therefore, has folio->mapping set ... bisecting would
> really help.
> 

A little bird just told me that I missed an important piece in the dmesg
output: "aops:btree_aops ino:1", printed by dump_mapping().

So this is btrfs, i_ino is 1, and the inode has no dentry. Is that
BTRFS_BTREE_INODE_OBJECTID?

Summarizing what we know so far:
(1) Freeing an order-0 btrfs folio where folio->mapping
     is still set
(2) Triggered by kswapd and kcompactd; not triggered by other means of
     page freeing so far

Possible theories:
(A) folio->mapping not cleared when freeing the folio. But shouldn't
     this also happen on other freeing paths? Or are we simply lucky to
     never trigger that for that folio?
(B) Messed-up refcounting: freeing a folio that is still in use (and
     therefore has folio->mapping still set)

I was briefly wondering if large folio splitting could be involved.

CCing btrfs maintainers.

-- 
Cheers,

David / dhildenb