lists.openwall.net - Open Source and information security mailing list archives
Date:   Sat, 8 Jul 2023 11:34:53 +0800
From:   "Yin, Fengwei" <fengwei.yin@...el.com>
To:     David Hildenbrand <david@...hat.com>,
        Matthew Wilcox <willy@...radead.org>
CC:     <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
        <yuzhao@...gle.com>, <ryan.roberts@....com>, <shy828301@...il.com>,
        <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH 0/3] support large folio for mlock



On 7/8/2023 2:54 AM, David Hildenbrand wrote:
> On 07.07.23 19:26, Matthew Wilcox wrote:
>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>>> This series identified the large folio for mlock to two types:
>>>    - The large folio is in VM_LOCKED VMA range
>>>    - The large folio cross VM_LOCKED VMA boundary
>>
>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
>> has led us astray.  Today when the arguments to mlock() cross a folio
>> boundary, we split the PMD entry but leave the folio intact.  That means
>> that we continue to manage the folio as a single entry on the LRU list.
>> But userspace may have no idea that we're doing this.  It may have made
>> several 256kB mmap() calls, all of which were coalesced into a single
>> VMA, and khugepaged has come along behind its back and created a 2MB
>> THP.  Now userspace calls mlock() and instead of treating that as
>> a hint that oops, maybe we shouldn't've done that, we do our utmost to
>> preserve the 2MB folio.
>>
>> I think this whole approach needs rethinking.  IMO, anonymous folios
>> should not cross VMA boundaries.  Tell me why I'm wrong.
> 
> I think we touched upon that a couple of times already, and the main issue is that while it sounds nice in theory, it's impossible in practice.
> 
> THP are supposed to be transparent, that is, we should not let arbitrary operations fail.
> 
> But nothing stops user space from
> 
> (a) mmap'ing a 2 MiB region
> (b) GUP-pinning the whole range
> (c) GUP-pinning the first half
> (d) unpinning the whole-range pin taken in (b)
> (e) munmap'ing the second half
> 
> 
> And that's just one out of many examples I can think of, not even considering temporary/speculative references that can prevent a split at random points in time -- especially when splitting a VMA.
> 
Yes. The case where the folio can't be split successfully is the only
reason I tried to avoid splitting the folio in the mlock() syscall. I'd
like to postpone the split to the page reclaim phase.


Regards
Yin, Fengwei

> Sure, any time we PTE-map a THP we might just say "let's put that on the deferred split queue" and cross fingers that we can eventually split it later. (I was recently thinking about that in the context of the mapcount ...)
> 
> It's all a big mess ...
> 
