linux-kernel - Re: [RFC PATCH 0/3] support large folio for mlock

Message-ID: <967ccf33-0982-6042-e4ce-b0c859b4c3b1@redhat.com>
Date:   Mon, 10 Jul 2023 11:57:50 +0200
From:   David Hildenbrand <david@...hat.com>
To:     "Yin, Fengwei" <fengwei.yin@...el.com>,
        Matthew Wilcox <willy@...radead.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        yuzhao@...gle.com, ryan.roberts@....com, shy828301@...il.com,
        akpm@...ux-foundation.org
Subject: Re: [RFC PATCH 0/3] support large folio for mlock

On 10.07.23 11:43, Yin, Fengwei wrote:
> Hi David,
> 
> On 7/10/2023 5:32 PM, David Hildenbrand wrote:
>> On 09.07.23 15:25, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>>>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>>>> The old folio will go on the deferred_list and be split later, or its
>>>> valid parts will be written to swap and then it can be freed.
>>> If the large folio splitting failure is because of GUP pages, can we
>>> do copy here?
>>>
>>> Let's say, if the GUP page is target of DMA operation and DMA operation
>>> is ongoing. We allocated a new page and copy GUP page content to the
>>> new page, the data in the new page can be corrupted.
>>
>> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing the exclusive flag will fail if the page may be pinned.
> Thanks a lot for clarification.
> 
> So my understanding is that if large folio splitting fails, it's not always
> true that we can allocate new folios, copy original large folio content to
> new folios, remove original large folio from VMA and map the new folios to
> VMA (like it's only true if original large folio is marked as maybe shared).
> 

While it might work in many cases, there are some corner cases where it 
won't work.

So to summarize:

(1) THP are transparent and should not result in arbitrary syscall
     failures.
(2) Splitting a THP might fail at random points in time either due to
     GUP pins or due to speculative page references (including
     speculative GUP pins).
(3) Replacing an exclusive anon page that may be pinned will result in
     memory corruption.

So we can try to split any THP that crosses VMA borders on VMA 
modifications (split due to munmap, mremap, madvise, mprotect, mlock, 
...), but it's not guaranteed to work due to (2). And we can try to 
replace such pages, but it's not guaranteed to be allowed due to (3).

And as it's all transparent, we cannot fail the syscall, per (1).
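
To make the ordering of those constraints concrete, here is a toy
userspace model in Python. It is only a sketch: ToyFolio, try_split(),
try_mark_maybe_shared() and handle_vma_modification() are hypothetical
names invented for illustration, not kernel structures or functions, and
the "defer" fallback stands in for the deferred split queue.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    SPLIT = auto()      # folio was split at the VMA border
    REPLACED = auto()   # contents copied into freshly allocated pages
    DEFERRED = auto()   # left intact, queued for a later split attempt


@dataclass
class ToyFolio:
    """Hypothetical stand-in for a large anon folio (not a kernel type)."""
    extra_refs: int = 0           # GUP pins or speculative references
    anon_exclusive: bool = True   # models PageAnonExclusive
    maybe_pinned: bool = False    # models "the page may be pinned"


def try_split(folio):
    # Point (2): any unexpected reference makes the split fail.
    return folio.extra_refs == 0


def try_mark_maybe_shared(folio):
    # Clearing the exclusive marker must fail while the page may be pinned.
    if folio.anon_exclusive and folio.maybe_pinned:
        return False
    folio.anon_exclusive = False
    return True


def handle_vma_modification(folio):
    """Never raises: per point (1) the syscall must not fail due to THP."""
    if try_split(folio):
        return Outcome.SPLIT
    # Point (3): only pages that are maybe shared may be replaced.
    if try_mark_maybe_shared(folio):
        return Outcome.REPLACED
    return Outcome.DEFERRED
```

In this model an exclusive folio that may be pinned ends up DEFERRED,
while a folio that only has a speculative reference can still be
replaced once it has been marked maybe shared.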

For the other cases that Willy and I discussed (split on VMA 
modifications after fork()), we can at least always replace the anon page.

<details>

What always works is putting the THP on the deferred split queue to see 
if we can split it later. The deferred split queue is a bit suboptimal 
right now, because it requires the (sub)page mapcounts to detect whether 
the folio is partially mapped vs. fully mapped. If we want to get rid of 
that, we have to come up with something reasonable.

I was wondering if we could have an optimized deferred split queue 
that only conditionally splits: do an rmap walk and detect if (a) each 
page of the folio is still mapped and (b) the folio does not cross a 
VMA. If both are met, one could skip the deferred split. That needs a 
bit of thought, but we're already doing an rmap walk when splitting, so 
scanning which parts are actually mapped does not sound too weird.
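
As a rough illustration of that check, here is a toy Python model of
what the rmap walk would have to decide. Mapping and
can_skip_deferred_split() are hypothetical names, and the per-VMA data
is assumed to be whatever the walk collects; this is not how the kernel
represents any of it.

```python
from typing import FrozenSet, NamedTuple


class Mapping(NamedTuple):
    """One hypothetical rmap walk result for a VMA that maps the folio."""
    vma_covers_folio: bool        # VMA spans the folio's whole range
    mapped_pages: FrozenSet[int]  # folio page indices mapped in this VMA


def can_skip_deferred_split(folio_nr_pages, mappings):
    """True if (a) every page is still mapped and (b) no VMA is crossed."""
    if not mappings:
        return False  # nothing maps the folio any more
    mapped = set()
    for m in mappings:
        mapped |= m.mapped_pages
    # (a) each page of the folio is mapped by at least one VMA
    all_pages_mapped = mapped >= set(range(folio_nr_pages))
    # (b) no VMA covers only part of the folio's range
    crosses_vma = any(not m.vma_covers_folio for m in mappings)
    return all_pages_mapped and not crosses_vma
```

A folio fully mapped by a single covering VMA can skip the split; one
whose pages are spread over two VMAs, or that has unmapped pages, cannot.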

</details>

-- 
Cheers,

David / dhildenb