Message-ID: <CAHbLzkog9B_NWhvYmb7=n3Fnb0oER-sXhE3=Nyx_8Kc3-dggcQ@mail.gmail.com>
Date:   Thu, 31 Aug 2023 10:15:09 -0700
From:   Yang Shi <shy828301@...il.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     "Huang, Ying" <ying.huang@...el.com>,
        Ryan Roberts <ryan.roberts@....com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Yin Fengwei <fengwei.yin@...el.com>,
        Yu Zhao <yuzhao@...gle.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Zi Yan <ziy@...dia.com>, Luis Chamberlain <mcgrof@...nel.org>,
        Itaru Kitayama <itaru.kitayama@...il.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

On Thu, Aug 31, 2023 at 12:57 AM David Hildenbrand <david@...hat.com> wrote:
>
> On 31.08.23 03:40, Huang, Ying wrote:
> > Ryan Roberts <ryan.roberts@....com> writes:
> >
> >> On 15/08/2023 22:32, Huang, Ying wrote:
> >>> Hi, Ryan,
> >>>
> >>> Ryan Roberts <ryan.roberts@....com> writes:
> >>>
> >>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> >>>> allocated in large folios of a determined order. All pages of the large
> >>>> folio are pte-mapped during the same page fault, significantly reducing
> >>>> the number of page faults. The number of per-page operations (e.g. ref
> >>>> counting, rmap management, lru list management) is also significantly
> >>>> reduced since those ops now become per-folio.
> >>>>
> >>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> >>>> which defaults to disabled for now; the long term aim is for this to
> >>>> default to enabled, but there are some risks around internal
> >>>> fragmentation that need to be better understood first.
> >>>>
> >>>> Large anonymous folio (LAF) allocation is integrated with the existing
> >>>> (PMD-order) THP and single (S) page allocation according to this policy,
> >>>> where fallback (>) is performed for various reasons, such as the
> >>>> proposed folio order not fitting within the bounds of the VMA, etc:
> >>>>
> >>>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>>> ----------------|-----------|-------------|---------------|-------------
> >>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
> >>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
> >>>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> IMHO, we should use the following semantics as you have suggested
> >>> before.
> >>>
> >>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | LAF>S         | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> Or even,
> >>>
> >>>                  | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                  | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | S             | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> From the implementation point of view, a PTE-mapped PMD-sized THP is
> >>> almost no different from a LAF (which is just a small-sized THP).  It
> >>> would be confusing to distinguish them from the interface point of view.
> >>>
> >>> So, IMHO, the real difference is the policy.  For example, prefer
> >>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
> >>> interface is used to specify system global policy.  In the long term, it
> >>> can be something like below,
> >>>
> >>> never:      S               # disable all THP
> >>> madvise:                    # never by default, control via madvise()
> >>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
> >>> small:      LAF>S           # prefer small sized THP
> >>> auto:                       # use in-kernel heuristics for THP size
> >>>
> >>> But we may not be ready to add new policies now.  So, before the new
> >>> policies are ready, we can add a debugfs interface to override the
> >>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
> >>> we have tuned enough workloads, collected enough data, we can add new
> >>> policies to the sysfs interface.
> >>
> >> I think we can all imagine many policy options. But we don't really have much
> >> evidence yet for what is best. The policy I'm currently using is intended to
> >> give some flexibility for testing (use LAF without THP by setting sysfs=never,
> >> use THP without LAF by compiling without LAF) without adding any new knobs at
> >> all. Given that, surely we can defer these decisions until we have more data?
> >>
> >> In the absence of data, your proposed solution sounds very sensible to me. But
> >> for the purposes of scaling up perf testing, I don't think it's essential given
> >> the current policy will also produce the same options.
> >>
> >> If we were going to add a debugfs knob, I think the higher priority would be a
> >> knob to specify the folio order. (But again, I would rather avoid it if possible.)
> >
> > I totally understand we need some way to control PMD-sized THP and LAF
> > to tune the workload, and nobody likes debugfs knobs.
> >
> > My concern about the interface is that we have no way to disable LAF
> > system-wide without rebuilding the kernel.  In the future, should we add
> > a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
> > stricter than "never"?  "really_never"?
>
> Let's talk about that in a bi-weekly MM session. (I proposed it as a
> topic for next week).
>
> As raised in another mail, we can then discuss
> * what we want to call this feature (transparent large pages? there is
>    the concern that "THP" might confuse users. Maybe we can consider
>    "large" the more generic version and "huge" only PMD-size, TBD)

I tend to agree. "Huge" means PMD-mappable (transparent or HugeTLB),
"Large" means any order less than the PMD-mappable order, and "Gigantic"
means PUD-mappable. This should cause the least confusion IMHO.
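
To make the naming concrete, here is a rough userspace sketch (not kernel
code). The enum, the function names and the SKETCH_* constants are made up
for illustration, and the order values assume 4K base pages on x86-64
(PMD = 2M, PUD = 1G). Treating order 0 as a separate "single page" class,
and leaving the remaining orders unnamed, are my assumptions; the naming
above does not cover them.

```c
#include <assert.h>

/* Assumed orders for 4K base pages on x86-64; other configs differ. */
#define SKETCH_PMD_ORDER 9   /* 2M / 4K = 2^9 pages  */
#define SKETCH_PUD_ORDER 18  /* 1G / 4K = 2^18 pages */

/* Hypothetical names, only used to illustrate the terminology. */
enum folio_class {
    FOLIO_SINGLE,    /* order 0: a single base page (assumption)   */
    FOLIO_LARGE,     /* any order below the PMD-mappable order     */
    FOLIO_HUGE,      /* PMD-mappable                               */
    FOLIO_GIGANTIC,  /* PUD-mappable                               */
    FOLIO_UNNAMED    /* orders the proposed naming does not cover  */
};

enum folio_class classify_order(unsigned int order)
{
    if (order == 0)
        return FOLIO_SINGLE;
    if (order < SKETCH_PMD_ORDER)
        return FOLIO_LARGE;
    if (order == SKETCH_PMD_ORDER)
        return FOLIO_HUGE;
    if (order == SKETCH_PUD_ORDER)
        return FOLIO_GIGANTIC;
    return FOLIO_UNNAMED;
}

/* Small usage example: a 64K folio is "large", a 2M folio is "huge". */
void classify_order_selfcheck(void)
{
    assert(classify_order(4) == FOLIO_LARGE);
    assert(classify_order(SKETCH_PMD_ORDER) == FOLIO_HUGE);
    assert(classify_order(SKETCH_PUD_ORDER) == FOLIO_GIGANTIC);
}
```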

> * how to expose it in stats towards the user (e.g., /proc/meminfo)

I recall suggesting new per-order statistics, but that was NAK'ed.

> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running
> system, otherwise no distro will dare to pull it in, even after we
> figured out the other stuff.

TBH I really don't like tying large folios to the THP toggles. THP
(PMD-mappable) is just a special case of LAF. Ideally, large folios
should be tried whenever possible. But I do agree we may not be able
to achieve the ideal case for the time being, and I also understand
the concern about regressions in early adoption, so a knob that can
disable large folios may be needed for now. But it should be just a
simple binary knob (on/off), and should not be part of the kernel ABI
(temporary and debugging only) IMHO.

One more thing we may discuss is whether the huge page madvise APIs
should take effect for large folios or not.
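
For reference, this is how the madvise hints interact with the toggles in
the v5 policy table quoted at the top of this mail, written as a small
lookup. It is only a userspace sketch of that table, not code from the
patch; all enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they only mirror the rows/columns of the table. */
enum madv_hint  { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };
enum sysfs_mode { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };

/* Fallback chains as written in the table: S, LAF>S, THP>LAF>S. */
enum fallback_chain { CHAIN_S, CHAIN_LAF_S, CHAIN_THP_LAF_S };

enum fallback_chain v5_policy(bool prctl_enabled, enum sysfs_mode sysfs,
                              enum madv_hint hint)
{
    assert(sysfs == SYSFS_NEVER || sysfs == SYSFS_MADVISE ||
           sysfs == SYSFS_ALWAYS);

    /* prctl=dis column and the MADV_NOHUGEPAGE row: single pages only. */
    if (!prctl_enabled || hint == HINT_NOHUGEPAGE)
        return CHAIN_S;

    /* sysfs=always: PMD-order THP is tried first regardless of hint. */
    if (sysfs == SYSFS_ALWAYS)
        return CHAIN_THP_LAF_S;

    /* sysfs=madvise: THP is tried first only with MADV_HUGEPAGE. */
    if (sysfs == SYSFS_MADVISE && hint == HINT_HUGEPAGE)
        return CHAIN_THP_LAF_S;

    /* Everything else in the v5 table: LAF with fallback to single. */
    return CHAIN_LAF_S;
}
```

Note that in this table MADV_HUGEPAGE only affects whether PMD-order THP is
attempted, while MADV_NOHUGEPAGE also turns off LAF.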

>
> Note that for the pagecache, large folios can be disabled and
> distributions are actively making use of that.
>
> --
> Cheers,
>
> David / dhildenb
>
