linux-kernel - Re: [RFC PATCH 0/3] support large folio for mlock

Message-ID: <af1ee9a7-2c6f-0450-e44e-59e5eeb50d6b@intel.com>
Date:   Mon, 10 Jul 2023 18:19:37 +0800
From:   "Yin, Fengwei" <fengwei.yin@...el.com>
To:     David Hildenbrand <david@...hat.com>,
        "Yin, Fengwei" <fengwei.yin@...el.com>,
        Matthew Wilcox <willy@...radead.org>
CC:     <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
        <yuzhao@...gle.com>, <ryan.roberts@....com>, <shy828301@...il.com>,
        <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH 0/3] support large folio for mlock



On 7/10/2023 5:57 PM, David Hildenbrand wrote:
> On 10.07.23 11:43, Yin, Fengwei wrote:
>> Hi David,
>>
>> On 7/10/2023 5:32 PM, David Hildenbrand wrote:
>>> On 09.07.23 15:25, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>>>>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>>>>> The old folio will go on the deferred_list and be split later, or its
>>>>> valid parts will be written to swap and then it can be freed.
>>>> If large folio splitting fails because of GUP pages, can we do the
>>>> copy here?
>>>>
>>>> Say the GUP page is the target of an ongoing DMA operation. If we
>>>> allocate a new page and copy the GUP page's content to it, the data
>>>> in the new page can be corrupted.
>>>
>>> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page may be pinned.
>> Thanks a lot for the clarification.
>>
>> So my understanding is that if large folio splitting fails, it's not always
>> true that we can allocate new folios, copy original large folio content to
>> new folios, remove original large folio from VMA and map the new folios to
>> VMA (like it's only true if original large folio is marked as maybe shared).
>>
> 
> While it might work in many cases, there are some corner cases where it won't work.
> 
> So to summarize
> 
> (1) THP are transparent and should not result in arbitrary syscall
>     failures.
> (2) Splitting a THP might fail at random points in time either due to
>     GUP pins or due to speculative page references (including
>     speculative GUP pins).
> (3) Replacing an exclusive anon page that may be pinned will result in
>     memory corruptions.
> 
> So we can try to split any THP that crosses VMA borders on VMA modifications (split due to munmap, mremap, madvise, mprotect, mlock, ...), but it's not guaranteed to work due to (2). And we can try to replace such pages, but it's not guaranteed to be allowed due to (3).
> 
> And as it's all transparent, we cannot fail (1).
Very clear to me now.
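
The replacement rule described above (replace only maybe-shared anon pages; an exclusive page must first be marked maybe shared, which fails if it may be pinned) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct anon_page_model`, `model_try_clear_exclusive()` and `model_may_replace()` are hypothetical names invented here, not kernel APIs, and the two booleans merely stand in for PageAnonExclusive and a "may be GUP-pinned" check.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct anon_page_model {
	bool anon_exclusive;	/* stands in for PageAnonExclusive */
	bool maybe_pinned;	/* stands in for "page may be GUP-pinned" */
};

/*
 * Try to mark an exclusive page as maybe shared.
 * Fails, leaving the page untouched, if the page may be pinned.
 */
static bool model_try_clear_exclusive(struct anon_page_model *page)
{
	if (page->maybe_pinned)
		return false;
	page->anon_exclusive = false;
	return true;
}

/*
 * A page may be replaced (allocate + copy + remap) only if it is
 * maybe shared, or if it can first be turned into maybe shared.
 */
static bool model_may_replace(struct anon_page_model *page)
{
	if (!page->anon_exclusive)
		return true;
	return model_try_clear_exclusive(page);
}
```

In this model an exclusive page that may be pinned is never replaced, which corresponds to point (3): copying it while DMA may still be writing to the old page would lose data.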

> 
> For the other cases that Willy and I discussed (split on VMA modifications after fork()), we can at least always replace the anon page.
> 
> <details>
> 
> What always works, is putting the THP on the deferred split queue to see if we can split it later. The deferred split queue is a bit suboptimal right now, because it requires the (sub)page mapcounts to detect whether the folio is partially mapped vs. fully mapped. If we want to get rid of that, we have to come up with something reasonable.
> 
> I was wondering if we could have an optimized deferred split queue, that only conditionally splits: do an rmap walk and detect if (a) each page of the folio is still mapped (b) the folio does not cross a VMA. If both are met, one could skip the deferred split. But that needs a bit of thought -- but we're already doing an rmap walk when splitting, so scanning which parts are actually mapped does not sound too weird.
> 
> </details>
> 
Thanks a lot for the extra information, which helps me understand more of
the background. Really appreciate it.
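
To check my understanding of the conditional deferred split idea, here is a minimal userspace sketch of the two conditions (a) and (b). It is a simplified model, not kernel code: `struct folio_model` and `model_can_skip_deferred_split()` are hypothetical names, the per-page arrays stand in for what an rmap walk would discover, and the model assumes a single mapping process with VMAs identified by plain integer ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what an rmap walk would report for one folio. */
struct folio_model {
	size_t nr_pages;		/* number of pages in the folio */
	const bool *page_mapped;	/* page_mapped[i]: page i is still mapped */
	const int *page_vma;		/* page_vma[i]: id of the VMA mapping page i */
};

/*
 * Return true if the deferred split could be skipped:
 * (a) every page of the folio is still mapped, and
 * (b) all pages are mapped by the same VMA (no VMA border is crossed).
 */
static bool model_can_skip_deferred_split(const struct folio_model *folio)
{
	assert(folio->nr_pages > 0);

	for (size_t i = 0; i < folio->nr_pages; i++) {
		if (!folio->page_mapped[i])
			return false;	/* (a) violated: partially mapped */
		if (folio->page_vma[i] != folio->page_vma[0])
			return false;	/* (b) violated: crosses a VMA */
	}
	return true;
}
```

A folio that is fully mapped inside one VMA would stay intact here, while a partially unmapped folio or one spanning two VMAs would remain a split candidate.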


Regards
Yin, Fengwei