lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201005193746.GO20115@casper.infradead.org>
Date:   Mon, 5 Oct 2020 20:37:46 +0100
From:   Matthew Wilcox <willy@...radead.org>
To:     Zi Yan <ziy@...dia.com>
Cc:     David Hildenbrand <david@...hat.com>,
        Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Yang Shi <shy828301@...il.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        William Kucharski <william.kucharski@...cle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        John Hubbard <jhubbard@...dia.com>,
        David Nellans <dnellans@...dia.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

On Mon, Oct 05, 2020 at 03:12:55PM -0400, Zi Yan wrote:
> On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:
> > One of the longer-term todo items is to support variable sized THPs for
> > anonymous memory, just like I've done for the pagecache.  With that in
> > place, I think scaling up from PMD sized pages to PUD sized pages starts
> > to look more natural.  Itanium and PA-RISC (two architectures that will
> > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> > The RiscV spec you pointed me at the other day confines itself to adding
> > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> > sizes would be possible additions in the future.
> 
> Just to understand the todo items clearly. With your pagecache patchset,
> kernel should be able to understand variable sized THPs no matter they
> are anonymous or not, right?

... yes ... modulo bugs and places I didn't fix because only anonymous
pages can get there ;-)  There are still quite a few references to
HPAGE_PMD_MASK / SIZE / NR and I couldn't swear that they're all related
to things which are actually PMD sized.  I did fix a couple of places
where the anonymous path assumed that pages were PMD sized because I
thought we'd probably want to do that sooner rather than later.

> For anonymous memory, we need kernel policies
> to decide what THP sizes to use at allocation, what to do when under
> memory pressure, and so on. In terms of implementation, THP split function
> needs to support from any order to any lower order. Anything I am missing here?

I think that's the bulk of the work.  The swap code also needs work so we
don't have to split pages to swap them out.

> > I think I'm leaning towards not merging this patchset yet.  I'm in
> > agreement with the goals (allowing systems to use PUD-sized pages
> > automatically), but I think we need to improve the infrastructure to
> > make it work well automatically.  Does that make sense?
> 
> I agree that this patchset should not be merged in the current form.
> I think PUD THP support is a part of variable sized THP support, but
> current form of the patchset does not have the “variable sized THP”
> spirit yet and is more like a special PUD case support. I guess some
> changes to existing THP code to make PUD THP less a special case would
> make the whole patchset more acceptable?
> 
> Can you elaborate more on the infrastructure part? Thanks.

Oh, this paragraph was just summarising the above.  We need to
be consistently using thp_size() instead of HPAGE_PMD_SIZE, etc.
I haven't put much effort yet into supporting pages which are larger than
PMD-size -- that is, if a page is mapped with a PMD entry, we assume
it's PMD-sized.  Once we can allocate a larger-than-PMD sized page,
that's off.  I assume a lot of that is dealt with in your patchset,
although I haven't audited it to check for that.

> > (*) It would be nice if hardware provided a way to track D/A on a sub-PTE
> > level when using PMD/PUD sized mappings.  I don't know of any that does
> > that today.
> 
> I agree it would be a nice hardware feature, but it also has a high cost.
> Each TLB would support this with 1024 bits, which is about 16 TLB entry size,
> assuming each entry takes 8B space. Now it becomes why not having a bigger
> TLB. ;)

Oh, we don't have to track at the individual-page level for this to be
useful.  Let's take the RISC-V Sv39 page table entry format as an example:

63-54 attributes
53-28 PPN2
27-19 PPN1
18-10 PPN0
9-8 RSW
7-0 DAGUXWRV

For a 2MB page, we currently insist that 18-10 are zero.  If we repurpose
eight of those nine bits as A/D bits, we can track at 512kB granularity.
For 1GB pages, we can use 16 of the 18 bits to track A/D at 128MB
granularity.  It's not great, but it is quite cheap!

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ