linux-kernel - Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Zo8LTaP-6-zIcc9v@arm.com>
Date: Wed, 10 Jul 2024 23:29:33 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Nanyong Sun <sunnanyong@...wei.com>, will@...nel.org,
	mike.kravetz@...cle.com, muchun.song@...ux.dev,
	akpm@...ux-foundation.org, anshuman.khandual@....com,
	willy@...radead.org, wangkefeng.wang@...wei.com,
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

On Wed, Jul 10, 2024 at 11:12:01AM -0600, Yu Zhao wrote:
> On Wed, Jul 10, 2024 at 10:51 AM Catalin Marinas
> <catalin.marinas@....com> wrote:
> > On Fri, Jul 05, 2024 at 11:41:34AM -0600, Yu Zhao wrote:
> > > On Fri, Jul 5, 2024 at 9:49 AM Catalin Marinas <catalin.marinas@....com> wrote:
> > > > If I did the maths right, for a 2MB hugetlb page, we have about 8
> > > > vmemmap pages (32K). Once we split a 2MB vmemap range,
> > >
> > > Correct.
> > >
> > > > whatever else
> > > > needs to be touched in this range won't require a stop_machine().
> > >
> > > There might be some misunderstandings here.
> > >
> > > To do HVO:
> > > 1. we split a PMD into 512 PTEs;
> > > 2. for every 8 PTEs:
> > >   2a. we allocate an order-0 page for PTE #0;
> > >   2b. we remap PTE #0 *RW* to this page;
> > >   2c. we remap PTEs #1-7 *RO* to this page;
> > >   2d. we free the original order-3 page.
> >
> > Thanks. I now remember why we reverted such support in 060a2c92d1b6
> > ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP"). The main
> > problem is that point 2c also changes the output address of the PTE
> > (and the content of the page slightly). The architecture requires a
> > break-before-make in such scenario, though it would have been nice if it
> > was more specific on what could go wrong.
> >
> > We can do point 1 safely if we have FEAT_BBM level 2. For point 2, I
> > assume these 8 vmemmap pages may be accessed and that's why we can't do
> > a break-before-make safely.
> 
> Correct
> 
> > I was wondering whether we could make the
> > PTEs RO first and then change the output address but we have another
> > rule that the content of the page should be the same. I don't think
> > entries 1-7 are identical to entry 0 (though we could ask the architects
> > for clarification here). Also, can we guarantee that nothing writes to
> > entry 0 while we would do such remapping?
> 
> Yes, it's already guaranteed.
> 
> > We know entries 1-7 won't be
> > written as we mapped them as RO but entry 0 contains the head page.
> > Maybe it's ok to map it RO temporarily until the newly allocated hugetlb
> > page is returned.
> 
> We can do that. I don't understand how this could elide BBM. After the
> above, we would still need to do:
> 3. remap entry 0 from RO to RW, mapping the `struct page` page that
> will be shared with entry 1-7.
> 4. remap entry 1-7 from their respective `struct page` pages to that
> of entry 0, while they remain RO.

The Arm ARM states that we need a BBM if we change the output address
and: the old or new mappings are RW *or* the content of the page
changes. Ignoring the latter (page content), we can turn the PTEs RO
first without changing the pfn followed by changing the pfn while they
are RO. Once that's done, we make entry 0 RW and, of course, with
additional TLBIs between all these steps. Can we leave entry 0 RO? This
would save an additional TLBI.

Now, I wonder if all this is worth it. What are the scenarios where the
8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
hugetlb page for example is pretty well defined - 8 x 4K pages, aligned.

> > If we could get the above work, it would be a lot simpler than thinking
> > of stop_machine() or other locks to wait for such remapping.
> 
> Steps 3/4 would not require BBM somehow?

If we ignore the 'content' requirement, I think we could skip the BBM
but we need to make sure we don't change the permission and pfn at the
same time.

> > > To do de-HVO:
> > > 1. for every 8 PTEs:
> > >   1a. we allocate 7 order-0 pages.
> > >   1b. we remap PTEs #1-7 *RW* to those pages, respectively.
> >
> > Similar problem in 1.b, changing the output address. Here we could force
> > the content to be the same
> 
> I don't follow the "the content to be the same" part. After HVO, we have:
> 
> Entry 0 -> `struct page` page A, RW
> Entry 1 -> `struct page` page A, RO
> ...
> Entry 7 -> `struct page` page A, RO
> 
> To de-HVO, we need to make them:
> 
> Entry 0 -> `struct page` page A, RW
> Entry 1 -> `struct page` page B, RW
> ...
> Entry 7 -> `struct page` page H, RW
> 
> I assume the same content means PTE_0 == PTE_1/.../7?

That's the content of the page at the corresponding pfn before and after
the pte change. I'm pretty sure the Arm ARM states this in case the
hardware starts a load (e.g. unaligned) from one page and completes it
from another, the software should not see any difference. But for the
fields we care about in struct page, I assume they'd be the same (or
that we just don't care about inconsistencies during this transient
period).

-- 
Catalin