linux-kernel - Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia.com>
Date:   Sat, 13 Feb 2021 10:44:01 -0500
From:   Zi Yan <ziy@...dia.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
CC:     Joao Martins <joao.m.martins@...cle.com>, <linux-mm@...ck.org>,
        <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording

On 11 Feb 2021, at 18:44, Mike Kravetz wrote:

> On 2/11/21 12:47 PM, Zi Yan wrote:
>> On 28 Jan 2021, at 16:53, Mike Kravetz wrote:
>>
>>> On 1/28/21 10:26 AM, Joao Martins wrote:
>>>> For a given hugepage backing a VA, there's a rather ineficient
>>>> loop which is solely responsible for storing subpages in GUP
>>>> @pages/@...s array. For each subpage we check whether it's within
>>>> range or size of @pages and keep increment @pfn_offset and a couple
>>>> other variables per subpage iteration.
>>>>
>>>> Simplify this logic and minimize the cost of each iteration to just
>>>> store the output page/vma. Instead of incrementing number of @refs
>>>> iteratively, we do it through pre-calculation of @refs and only
>>>> with a tight loop for storing pinned subpages/vmas.
>>>>
>>>> Additionally, retain existing behaviour with using mem_map_offset()
>>>> when recording the subpages for configurations that don't have a
>>>> contiguous mem_map.
>>>>
>>>> pinning consequently improves bringing us close to
>>>> {pin,get}_user_pages_fast:
>>>>
>>>>   - 16G with 1G huge page size
>>>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>>>
>>>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>>>> PIN_FAST_BENCHMARK: ~3.7k us
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@...cle.com>
>>>> ---
>>>>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>>
>>> Thanks for updating this.
>>>
>>> Reviewed-by: Mike Kravetz <mike.kravetz@...cle.com>
>>>
>>> I think there still is an open general question about whether we can always
>>> assume page structs are contiguous for really big pages.  That is outside
>>
>> I do not think page structs need to be contiguous, but PFNs within a big page
>> need to be contiguous, at least based on existing code like mem_map_offset() we have.
>
> Thanks for looking Zi,
> Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
> to be contiguous.  The issue is that there is code that assumes page structs
> are contiguous for gigantic pages.  hugetlb code does not make this assumption
> and does a pfn_to_page() when looping through page structs for gigantic pages.
>
> I do not believe this to be a huge issue.  In most cases CONFIG_VIRTUAL_MEM_MAP
> is defined and struct pages can be accessed contiguously.  I 'think' we could
> run into problems with CONFIG_SPARSEMEM and without CONFIG_VIRTUAL_MEM_MAP
> and doing hotplug operations.  However, I still need to look into more.

Yeah, you are right about this. The combination of CONFIG_SPARSEMEM,
!CONFIG_SPARSEMEM_VMEMMAP and doing hotplug does cause errors, as simple as
dynamically reserving gigantic hugetlb pages then freeing them in a system
with CONFIG_SPARSEMEM_VMEMMAP not set and some hotplug memory.

Here are the steps to reproduce:
0. Configure a kernel with CONFIG_SPARSEMEM_VMEMMAP not set.
1. Create a VM using qemu with “-m size=8g,slots=16,maxmem=16g” to enable hotplug.
2. After boot the machine, add large enough memory using
   “object_add memory-backend-ram,id=mem1,size=7g” and
   “device_add pc-dimm,id=dimm1,memdev=mem1”.
3. In the guest OS, online all hot-plugged memory. My VM has 128MB memory block size.
If you have larger memory block size, I think you will need to plug in more memory.
4. Reserve gigantic hugetlb pages so that hot-plugged memory will be used. I reserved
12GB, like “echo 12 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
5. Free all hugetlb gigantic pages,
“echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
6. You will get “BUG: Bad page state in process …” errors.

The patch below can fix the error, but I suspect there might be other places missing
the necessary mem_map_next() too.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..aae99c6984f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1319,7 +1319,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
        h->nr_huge_pages--;
        h->nr_huge_pages_node[page_to_nid(page)]--;
        for (i = 0; i < pages_per_huge_page(h); i++) {
-               page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
+               struct page *subpage = mem_map_next(subpage, page, i);
+               subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
                                1 << PG_referenced | 1 << PG_dirty |
                                1 << PG_active | 1 << PG_private |
                                1 << PG_writeback);


—
Best Regards,
Yan Zi

Download attachment "signature.asc" of type "application/pgp-signature" (855 bytes)