linux-kernel - Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb

Message-ID: <941f0f8f-a2c2-0021-0773-6cfaa81aabd7@redhat.com>
Date:   Wed, 18 Jan 2023 19:21:43 +0100
From:   David Hildenbrand <david@...hat.com>
To:     James Houghton <jthoughton@...gle.com>,
        Peter Xu <peterx@...hat.com>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        Muchun Song <songmuchun@...edance.com>,
        David Rientjes <rientjes@...gle.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Zach O'Keefe <zokeefe@...gle.com>,
        Manish Mishra <manish.mishra@...anix.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        "Dr . David Alan Gilbert" <dgilbert@...hat.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Yang Shi <shy828301@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for
 walk_hugetlb_range

>>> Once the last piece is unmapped (or simpler: once the complete subtree of
>>> page tables is gone), we decrement refcount+mapcount. Might require some
>>> brain power to do this tracking, but I wouldn't call it impossible right
>>> from the start.
>>>
>>> Would such a design violate other design aspects that are important?
> 
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].
> 
> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
> 
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
> 
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

If we can store metadata (directly or indirectly) in the highest pgtable
that HGM-maps a hugetlb page, I guess the following would be reasonable:

* hugetlb page pointer
* mapped size

Whenever mapping/unmapping sub-parts, we'd have to update that information.

Once "mapped size" drops to 0, we know that the hugetlb page was
completely unmapped and we can drop the refcount+mapcount and clear the
metadata (including the hugetlb page pointer) [+ remove the page tables?].

Similarly, once "mapped size" corresponds to the hugetlb size, we can 
immediately spot that everything is mapped.
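
To make the bookkeeping concrete, here is a rough userspace sketch in C. It
is not kernel code: struct hpage, struct hgm_meta and the hgm_*() helpers are
hypothetical names made up for illustration, locking is omitted, and the
caller is assumed to never map or unmap the same piece twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugetlb page (not the kernel's struct page). */
struct hpage {
	size_t size;		/* hugetlb page size in bytes */
	int refcount;
	int mapcount;
};

/*
 * Hypothetical metadata stored with the highest page table that
 * HGM-maps the hugetlb page.
 */
struct hgm_meta {
	struct hpage *page;	/* hugetlb page pointer, NULL if unused */
	size_t mapped_size;	/* bytes of the page currently mapped */
};

/* Map one sub-part; refcount+mapcount are taken only for the first piece. */
void hgm_map_piece(struct hgm_meta *m, struct hpage *page, size_t len)
{
	if (m->mapped_size == 0) {
		m->page = page;
		page->refcount++;
		page->mapcount++;
	}
	assert(m->page == page && m->mapped_size + len <= page->size);
	m->mapped_size += len;
}

/*
 * Unmap one sub-part. Returns true once "mapped size" hits 0, i.e. the
 * page is completely unmapped and the page tables could be removed.
 */
bool hgm_unmap_piece(struct hgm_meta *m, size_t len)
{
	assert(m->page != NULL && len <= m->mapped_size);
	m->mapped_size -= len;
	if (m->mapped_size != 0)
		return false;

	/* Last piece gone: drop refcount+mapcount, clear the metadata. */
	m->page->mapcount--;
	m->page->refcount--;
	m->page = NULL;
	return true;
}

/* "mapped size" == hugetlb size means everything is mapped. */
bool hgm_fully_mapped(const struct hgm_meta *m)
{
	return m->page != NULL && m->mapped_size == m->page->size;
}
```

With this, neither the "completely unmapped" nor the "everything mapped"
check has to look at the individual PTEs; both are a single comparison on
the metadata.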

Again, just a high-level idea.

-- 
Thanks,

David / dhildenb