Message-ID: <ff29f723-32de-421b-a65e-7b7d2e03162d@redhat.com>
Date: Wed, 29 May 2024 08:57:48 +0200
From: David Hildenbrand <david@...hat.com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>, Chris Mason
<clm@...com>, Josef Bacik <josef@...icpanda.com>,
David Sterba <dsterba@...e.com>
Cc: Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Linux Memory Management List <linux-mm@...ck.org>,
Matthew Wilcox <willy@...radead.org>,
linux-btrfs <linux-btrfs@...r.kernel.org>
Subject: Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840
On 28.05.24 16:24, David Hildenbrand wrote:
> On 28.05.24 at 15:57, David Hildenbrand wrote:
>> On 28.05.24 at 08:05, Mikhail Gavrilov wrote:
>>> On Thu, May 23, 2024 at 12:05 PM Mikhail Gavrilov
>>> <mikhail.v.gavrilov@...il.com> wrote:
>>>>
>>>> On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <david@...hat.com> wrote:
>>>>
>>>> The only known workload that causes this is updating a large
>>>> container. Unfortunately, not every container update reproduces the
>>>> problem.
>>>
>>> Is it possible to add more debugging information to make it clearer
>>> what's going on?
>>
>> If we knew who originally allocated that problematic page, that might help.
>> Maybe page_owner could give some hints?
>>
>>>
>>> BUG: Bad page state in process kcompactd0 pfn:605811
>>> page: refcount:0 mapcount:0 mapping:0000000082d91e3e index:0x1045efc4f
>>> pfn:0x605811
>>> aops:btree_aops ino:1
>>> flags:
>>> 0x17ffffc600020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
>>> raw: 0017ffffc600020c dead000000000100 dead000000000122 ffff888159075220
>>> raw: 00000001045efc4f 0000000000000000 00000000ffffffff 0000000000000000
>>> page dumped because: non-NULL mapping
>>
>> Seems to be an order-0 page, otherwise we would have another "head: ..." report.
>>
>> It's not an anon/ksm/non-lru migration folio, because we clear the page->mapping
>> field for them manually on the page freeing path. Likely it's a pagecache folio.
>>
>> So one option is that something seems to not properly set folio->mapping to
>> NULL. But that problem would then also show up without page migration? Hmm.
>>
>>> Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI,
>>> BIOS 2611 04/07/2024
>>> Call Trace:
>>> <TASK>
>>> dump_stack_lvl+0x84/0xd0
>>> bad_page.cold+0xbe/0xe0
>>> ? __pfx_bad_page+0x10/0x10
>>> ? page_bad_reason+0x9d/0x1f0
>>> free_unref_page+0x838/0x10e0
>>> __folio_put+0x1ba/0x2b0
>>> ? __pfx___folio_put+0x10/0x10
>>> ? __pfx___might_resched+0x10/0x10
>>
>> I suspect we come via
>> migrate_pages_batch()->migrate_folio_unmap()->migrate_folio_done().
>>
>> Maybe this is the "Folio was freed from under us. So we are done." path
>> when "folio_ref_count(src) == 1".
>>
>> Alternatively, we might come via
>> migrate_pages_batch()->migrate_folio_move()->migrate_folio_done().
>>
>> For ordinary migration, move_to_new_folio() will clear src->mapping if
>> the folio was migrated successfully. That's the very first thing that
>> migrate_folio_move() does, so I doubt that is the problem.
>>
>> So I suspect we are in the migrate_folio_unmap() path. But for
>> a !anon folio, who should be freeing the folio concurrently (and not clearing
>> folio->mapping?)? After all, we have to hold the folio lock while migrating.
>>
>> In khugepaged:collapse_file() we manually set folio->mapping = NULL, before
>> dropping the reference.
>>
>> Something to try might be (to see if the problem goes away).
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index dd04f578c19c..45e92e14c904 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1124,6 +1124,13 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>  		/* Folio was freed from under us. So we are done. */
>>  		folio_clear_active(src);
>>  		folio_clear_unevictable(src);
>> +		/*
>> +		 * Anonymous and movable src->mapping will be cleared by
>> +		 * free_pages_prepare so don't reset it here for keeping
>> +		 * the type to work PageAnon, for example.
>> +		 */
>> +		if (!folio_mapping_flags(src))
>> +			src->mapping = NULL;
>>  		/* free_pages_prepare() will clear PG_isolated. */
>>  		list_del(&src->lru);
>>  		migrate_folio_done(src, reason);
>>
>> But it does feel weird: who freed the page concurrently and didn't clear
>> folio->mapping ...
>>
>> We don't hold the folio lock of src, though, but have the only reference. So
>> another possible thing might be folio refcount mis-counting: folio_ref_count()
>> == 1 but there are other references (e.g., from the pagecache).
>
> Hmm, your original report mentions kswapd, so I'm getting the feeling someone
> does one folio_put() too much and we are freeing a pagecache folio that is still
> in the pagecache and, therefore, has folio->mapping set ... bisecting would
> really help.
>

A little bird just told me that I missed an important piece in the dmesg
output: "aops:btree_aops ino:1" from dump_mapping():

This is btrfs, i_ino is 1, and we don't have a dentry. Is that
BTRFS_BTREE_INODE_OBJECTID?

Summarizing what we know so far:

(1) Freeing an order-0 btrfs folio where folio->mapping
    is still set
(2) Triggered by kswapd and kcompactd; not triggered by other means of
    page freeing so far

Possible theories:

(A) folio->mapping not cleared when freeing the folio. But shouldn't
    this also happen on other freeing paths? Or are we simply lucky to
    never trigger that for that folio?
(B) Messed-up refcounting: freeing a folio that is still in use (and
    therefore has folio->mapping still set)

I was briefly wondering if large folio splitting could be involved.

CCing btrfs maintainers.

--
Cheers,
David / dhildenb