Message-ID: <94b2028d-f42c-41c0-90a3-4938472a6f3f@oracle.com>
Date: Mon, 20 Jan 2025 21:22:07 -0800
From: jane.chu@...cle.com
To: Jiaqi Yan <jiaqiyan@...gle.com>
Cc: David Hildenbrand <david@...hat.com>, nao.horiguchi@...il.com,
linmiaohe@...wei.com, sidhartha.kumar@...cle.com,
muchun.song@...ux.dev, akpm@...ux-foundation.org, osalvador@...e.de,
rientjes@...gle.com, jthoughton@...gle.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/2] How HugeTLB handle HWPoison page at truncation
On 1/20/2025 9:08 PM, Jiaqi Yan wrote:
> On Mon, Jan 20, 2025 at 9:01 PM <jane.chu@...cle.com> wrote:
>>
>> On 1/20/2025 5:21 PM, Jiaqi Yan wrote:
>>> On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@...hat.com> wrote:
>>>> On 19.01.25 19:06, Jiaqi Yan wrote:
>>>>> While I was working on userspace MFR via memfd [1], I spent some time
>>>>> understanding what the current kernel does when a HugeTLB-backed memfd
>>>>> is truncated. My expectation is: if there is a HWPoison HugeTLB folio
>>>>> mapped via the memfd to userspace, it will be unmapped right away but
>>>>> still kept in the page cache [2]; however, when the memfd is truncated
>>>>> to zero or after the memfd is closed, the kernel should dissolve the
>>>>> HWPoison folio in the page cache, and free only the clean raw pages to
>>>>> the buddy allocator, excluding the poisoned raw page.
>>>>>
>>>>> So I wrote a hugetlb-mfr-base.c selftest (a minimal sketch of the
>>>>> flow follows the list below) and expected:
>>>>> 0. say nr_hugepages is initially 64 as the system configuration.
>>>>> 1. after MADV_HWPOISON, nr_hugepages should still be 64, as even the
>>>>> HWPoison huge folio is kept in the page cache. free_hugepages should
>>>>> be nr_hugepages minus whatever amount is in use.
>>>>> 2. after truncating the memfd to zero, nr_hugepages should be reduced
>>>>> to 63, as the kernel dissolves and frees the HWPoison huge folio.
>>>>> free_hugepages should also be 63.
>>>>>
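>>>>> A minimal sketch of that test flow (assuming a 2MB default hugepage
>>>>> size and CAP_SYS_ADMIN for MADV_HWPOISON; the real hugetlb-mfr-base.c
>>>>> differs in its details):
>>>>>
>>>>>   #define _GNU_SOURCE
>>>>>   #include <stdio.h>
>>>>>   #include <stdlib.h>
>>>>>   #include <string.h>
>>>>>   #include <sys/mman.h>
>>>>>   #include <unistd.h>
>>>>>
>>>>>   #define HPAGE_SIZE (2UL << 20)
>>>>>
>>>>>   static unsigned long read_ul(const char *path)
>>>>>   {
>>>>>           unsigned long val;
>>>>>           FILE *f = fopen(path, "r");
>>>>>
>>>>>           if (!f || fscanf(f, "%lu", &val) != 1)
>>>>>                   exit(1);
>>>>>           fclose(f);
>>>>>           return val;
>>>>>   }
>>>>>
>>>>>   int main(void)
>>>>>   {
>>>>>           const char *nr =
>>>>>               "/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages";
>>>>>           int fd = memfd_create("hugetlb-mfr", MFD_HUGETLB);
>>>>>           char *map;
>>>>>
>>>>>           if (fd < 0 || ftruncate(fd, HPAGE_SIZE))
>>>>>                   exit(1);
>>>>>           map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
>>>>>                      MAP_SHARED, fd, 0);
>>>>>           if (map == MAP_FAILED)
>>>>>                   exit(1);
>>>>>           memset(map, 0xab, HPAGE_SIZE);  /* fault the huge folio in */
>>>>>
>>>>>           /* step 1: poison one raw page; nr_hugepages should stay 64 */
>>>>>           if (madvise(map, 4096, MADV_HWPOISON))
>>>>>                   exit(1);
>>>>>           printf("after poison:   nr_hugepages=%lu\n", read_ul(nr));
>>>>>
>>>>>           /* step 2: truncate to zero; nr_hugepages should drop to 63 */
>>>>>           munmap(map, HPAGE_SIZE);
>>>>>           if (ftruncate(fd, 0))
>>>>>                   exit(1);
>>>>>           printf("after truncate: nr_hugepages=%lu\n", read_ul(nr));
>>>>>           close(fd);
>>>>>           return 0;
>>>>>   }
>>>>>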
>>>>> However, when testing at the head of mm-stable commit 2877a83e4a0a
>>>>> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I found
>>>>> that although free_hugepages is reduced to 63, nr_hugepages is not
>>>>> reduced and stays at 64.
>>>>>
>>>>> Is my expectation outdated? Or is this some kind of bug?
>>>>>
>>>>> I assumed this was a bug and dug a little bit more. It seems there
>>>>> are two issues, or two things I don't really understand.
>>>>>
>>>>> 1. During try_memory_failure_hugetlb, we increase the target in-use
>>>>>     folio's refcount via get_hwpoison_hugetlb_folio. However, even at
>>>>>     the end of try_memory_failure_hugetlb, this refcount is not put.
>>>>>     I can make sense of this, given that we keep the in-use huge folio
>>>>>     in the page cache.
>>>> Isn't the general rule that hwpoisoned folios have a raised refcount
>>>> such that they won't get freed + reused? At least that's how the buddy
>>>> deals with them, and I suspect also hugetlb?
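>>>>
>>>> As a rough illustration of that rule (a simplified sketch, not the
>>>> actual mm/memory-failure.c code):
>>>>
>>>>   /*
>>>>    * Simplified: memory-failure pins the hwpoisoned folio with an
>>>>    * extra reference that is deliberately never dropped, so the
>>>>    * refcount can never reach 0 and the memory is never reallocated.
>>>>    */
>>>>   if (folio_try_get(folio))       /* take the "forever" reference */
>>>>           folio_set_hwpoison(folio);
>>>>   /* no matching folio_put(): freeing + reuse are blocked for good */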
>>> Thanks, David.
>>>
>>> I see, so it is expected that the _entire_ huge folio will always have
>>> at least a refcount of 1, even when the folio would otherwise become "free".
>>>
>>> For a *free* huge folio, try_memory_failure_hugetlb dissolves it and
>>> frees the clean pages (a lot of them) to the buddy allocator. This made
>>> me think the same thing would happen for an *in-use* huge folio
>>> _eventually_ (i.e. that somehow the refcount due to HWPoison could be
>>> put). I feel this is a little unfortunate for the clean pages, but if
>>> that is how it is, that's fair, as it is not a bug.
>> Agreed with David. For *in-use* hugetlb pages, including unused shmget
>> pages, hugetlb shouldn't dissolve the page, not until an explicit
>> freeing action is taken, like IPC_RMID and echo 0 > nr_hugepages
>> (see the sketch below).
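>>
>> For concreteness, a minimal sketch of that explicit freeing action for
>> the shmget case (a hypothetical example, not from the selftest):
>>
>>   #include <sys/ipc.h>
>>   #include <sys/shm.h>
>>
>>   int main(void)
>>   {
>>           /* 2MB hugetlb-backed SysV shm segment: its huge page stays
>>            * held by the segment even when no process has it attached */
>>           int id = shmget(IPC_PRIVATE, 2UL << 20,
>>                           SHM_HUGETLB | IPC_CREAT | 0600);
>>
>>           if (id < 0)
>>                   return 1;
>>           /* the explicit freeing action: destroy the segment, after
>>            * which "echo 0 > nr_hugepages" can shrink the pool */
>>           return shmctl(id, IPC_RMID, NULL) ? 1 : 0;
>>   }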
> To clarify, I am not asking memory-failure.c to dissolve the
> hugepage while it is in use, but rather when it becomes free
> (truncated, or the process exited).
Understood, a free hugetlb page in the pool should have refcount 1 though.
-jane
>
>> -jane
>>
>>>>> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
>>>>> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
>>>>> [ 1069.320982] page_type: f4(hugetlb)
>>>>> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>>>> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>>>> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
>>>>> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
>>>>> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
>>>>> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
>>>>> [ 1069.320992] page dumped because: track hwpoison folio's ref
>>>>>
>>>>> 2. Even if the folio's refcount does drop to zero and we get into
>>>>>     free_huge_folio, it is not clear to me which part of free_huge_folio
>>>>>     handles the case where the folio is HWPoison. In my test, what I
>>>>>     observed is that eventually the folio is enqueue_hugetlb_folio()-ed.
>>>> How would we get a refcount of 0 if we assume the raised refcount on a
>>>> hwpoisoned hugetlb folio?
>>>>
>>>> I'm probably missing something: are you saying that you can trigger a
>>>> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
>>> No, I think it is just my misunderstanding. From what you said, the
>>> expectation for a HWPoison hugetlb folio is just that it won't get
>>> reallocated again, which is true.
>>>
>>> My (wrong) expectation was that, in addition to the "won't be
>>> reallocated again" part, some (large) portion of the huge folio would
>>> be freed to the buddy allocator. On the other hand, is that something
>>> worth having / improving? (1G - some_single_digit * 4KB) seems
>>> valuable to the system, even though it comes back as 4K pages. #1 and
>>> #2 above are then what would need to be done, if the improvement is
>>> worth chasing.
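>>>
>>> (Concretely: a 1G huge folio spans 262144 raw 4KB pages, so dissolving
>>> it after a single-page poison would return 262143 clean 4KB pages,
>>> i.e. 1GB - 4KB, to the buddy allocator.)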
>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>>