Message-ID: <022e1c15-7988-9975-acbc-e661e989ca4a@suse.cz>
Date:   Mon, 27 Mar 2023 14:41:00 +0200
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Matthew Wilcox <willy@...radead.org>,
        Yang Shi <shy828301@...il.com>,
        Ryan Roberts <ryan.roberts@....com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: What size anonymous folios should we allocate?

On 2/22/23 04:52, Matthew Wilcox wrote:
> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
> 
>> > C. We add a new wrinkle to the LRU handling code.  When our scan of the
>> >    active list examines a folio, we look to see how many of the PTEs
>> >    mapping the folio have been accessed.  If it is fewer than half, and
>> >    those half are all in either the first or last half of the folio, we
>> >    split it.  The active half stays on the active list and the inactive
>> >    half is moved to the inactive list.
>> 
>> With contiguous PTE, every PTE still maintains its own access bit (but
>> it is implementation defined, some implementations may just set access
>> bit once for one PTE in the contiguous region, per the Arm ARM, IIUC). But
>> anyway this is definitely feasible.
> 
> If a CPU doesn't have separate access bits for PTEs, then we should just
> not use the contiguous bits.  Knowing which parts of the folio are
> unused is more important than using the larger TLB entries.

Hm, but AFAIK the AMD aggregation is transparent; there are no bits. And IIUC
the "Hardware Page Aggregation (HPA)" Ryan was talking about elsewhere in
the thread sounds similar. So IIUC there will be a larger TLB entry
transparently, and then I don't expect the CPU to update individual bits, as
that would defeat the purpose. So I'd expect it will either set them all to
active when forming the larger TLB entry, or set them on a single subpage
and leave the rest in whatever state they were. Hm, I wonder if the exact
behavior is defined anywhere.
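
To make the concern concrete, here is a minimal sketch in C of the split rule
from point C quoted above, applied to a bitmap of per-subpage access bits. It
is only an illustration: folio_split_decision(), the enum values and the
bitmap representation are all hypothetical and do not correspond to any
existing kernel interface. Treating "no bits set" as "do not split" is also an
assumption, since the proposal does not cover that case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decision codes; none of these names exist in the kernel. */
enum split_decision {
	KEEP_WHOLE,       /* leave the folio as it is */
	SPLIT_KEEP_FIRST, /* first half stays active, second half goes inactive */
	SPLIT_KEEP_LAST,  /* last half stays active, first half goes inactive */
};

/* Count the set bits in v by clearing the lowest set bit each iteration. */
static unsigned int count_bits(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * accessed: bit i set means the PTE mapping subpage i had its access bit set.
 * nr_pages: number of subpages in the folio (even, between 2 and 64).
 */
static enum split_decision folio_split_decision(uint64_t accessed,
						unsigned int nr_pages)
{
	assert(nr_pages >= 2 && nr_pages <= 64 && nr_pages % 2 == 0);

	unsigned int half = nr_pages / 2;
	uint64_t low_mask = (UINT64_C(1) << half) - 1; /* subpages 0..half-1 */
	uint64_t high_mask = low_mask << half;         /* subpages half..nr_pages-1 */
	unsigned int count;

	accessed &= low_mask | high_mask;
	count = count_bits(accessed);

	/* Assumption: a folio with no accessed subpages is not split here. */
	if (count == 0 || count >= half)
		return KEEP_WHOLE;
	if (!(accessed & high_mask))
		return SPLIT_KEEP_FIRST;
	if (!(accessed & low_mask))
		return SPLIT_KEEP_LAST;
	return KEEP_WHOLE; /* accesses are spread over both halves */
}
```

With a 16-page folio, the two hardware behaviors guessed at above give very
different answers for a folio that is hot everywhere: if all bits get set
(0xFFFF), the rule keeps the folio whole, but if only one subpage's bit gets
set (say 0x0001), the rule would split it and send the other half to the
inactive list.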

>> > For the third case, in contrast, the parent had already established
>> > an appropriate size folio to use for this VMA before calling fork().
>> > Whether it is the parent or the child causing the COW, it should probably
>> > inherit that choice and we should default to the same size folio that
>> > was already found.
>> 
>> Actually this is not what THP does now. The current THP behavior is to
>> split the PMD and then fall back to an order-0 page fault. For smaller orders,
>> we may consider allocating a large folio.
> 
> I know it's not what THP does now.  I think that's because the gap
> between PMD and PAGE size is too large and we end up wasting too much
> memory.  We also have very crude mechanisms for determining when to
> use THPs.  With the adaptive mechanism I described above, I think it's
> time to change that.
> 
> 
