Message-ID: <87cyz43s63.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date:   Thu, 31 Aug 2023 09:40:52 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     Ryan Roberts <ryan.roberts@....com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Yin Fengwei <fengwei.yin@...el.com>,
        David Hildenbrand <david@...hat.com>,
        Yu Zhao <yuzhao@...gle.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Yang Shi <shy828301@...il.com>, Zi Yan <ziy@...dia.com>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Itaru Kitayama <itaru.kitayama@...il.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
        <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

Ryan Roberts <ryan.roberts@....com> writes:

> On 15/08/2023 22:32, Huang, Ying wrote:
>> Hi, Ryan,
>> 
>> Ryan Roberts <ryan.roberts@....com> writes:
>> 
>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>> allocated in large folios of a determined order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management, lru list management) is also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>> which defaults to disabled for now. The long term aim is for this to
>>> default to enabled, but there are some risks around internal
>>> fragmentation that need to be better understood first.
>>>
>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>> where fallback (>) is performed for various reasons, such as the
>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>
>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>> ----------------|-----------|-------------|---------------|-------------
>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>> 
>> IMHO, we should use the following semantics as you have suggested
>> before.
>> 
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>> 
>> Or even,
>> 
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | S           | S             | THP>LAF>S
>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>> 
>> From the implementation point of view, a PTE-mapped PMD-sized THP is
>> almost no different from a LAF (which is just a smaller-sized THP).  It
>> will be confusing to distinguish them from the interface point of view.
>> 
>> So, IMHO, the real difference is the policy.  For example, prefer
>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
>> interface is used to specify system global policy.  In the long term, it
>> can be something like below,
>> 
>> never:      S               # disable all THP
>> madvise:                    # never by default, control via madvise()
>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>> small:      LAF>S           # prefer small sized THP
>> auto:                       # use in-kernel heuristics for THP size
>> 
>> But we may not be ready to add new policies now.  So, before the new
>> policies are ready, we can add a debugfs interface to override the
>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
>> we have tuned enough workloads, collected enough data, we can add new
>> policies to the sysfs interface.
>
> I think we can all imagine many policy options. But we don't really have much
> evidence yet for what is best. The policy I'm currently using is intended to
> give some flexibility for testing (use LAF without THP by setting sysfs=never,
> use THP without LAF by compiling without LAF) without adding any new knobs at
> all. Given that, surely we can defer these decisions until we have more data?
>
> In the absence of data, your proposed solution sounds very sensible to me. But
> for the purposes of scaling up perf testing, I don't think it's essential given
> the current policy will also produce the same options.
>
> If we were going to add a debugfs knob, I think the higher priority would be a
> knob to specify the folio order. (but again, I would rather avoid if possible).

I totally understand that we need some way to control PMD-sized THP and
LAF to tune workloads, and that nobody likes a debugfs knob.

My concern about the interface is that we have no way to disable LAF
system-wide without rebuilding the kernel.  In the future, should we add
a new policy to /sys/kernel/mm/transparent_hugepage/enabled that is
stricter than "never"?  "really_never"?
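
To illustrate the concern, here is a minimal sketch in Python (not kernel
code) that encodes the table from the patch description and the first
alternative table quoted above, with prctl enabled.  The names
PATCH_POLICY, ALT_POLICY, fallback_chain() and settings_without_laf() are
hypothetical and exist only for this illustration.

```python
# Fallback chains as written in the tables: THP = PMD-sized THP,
# LAF = large anonymous folio, S = single page.
S = ("S",)
LAF_S = ("LAF", "S")
THP_LAF_S = ("THP", "LAF", "S")

# Table from the patch description, keyed by sysfs setting, then hint.
PATCH_POLICY = {
    "never":   {"none": LAF_S,     "hugepage": LAF_S},
    "madvise": {"none": LAF_S,     "hugepage": THP_LAF_S},
    "always":  {"none": THP_LAF_S, "hugepage": THP_LAF_S},
}

# First alternative table from the reply: "never" means single pages only.
ALT_POLICY = {
    "never":   {"none": S,         "hugepage": S},
    "madvise": {"none": LAF_S,     "hugepage": THP_LAF_S},
    "always":  {"none": THP_LAF_S, "hugepage": THP_LAF_S},
}


def fallback_chain(policy, sysfs, hint="none", prctl_enabled=True):
    """Return the allocation fallback chain for one table cell."""
    # Both tables agree: prctl=dis and MADV_NOHUGEPAGE always give S.
    if not prctl_enabled or hint == "nohugepage":
        return S
    return policy[sysfs][hint]


def settings_without_laf(policy):
    """List the sysfs settings under which LAF is never attempted."""
    return [
        sysfs for sysfs in policy
        if all("LAF" not in fallback_chain(policy, sysfs, hint)
               for hint in ("none", "hugepage"))
    ]


print(settings_without_laf(PATCH_POLICY))
print(settings_without_laf(ALT_POLICY))
```

Under the table from the patch description, no sysfs setting turns LAF off
for the whole system, while under the alternative table "never" does.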

--
Best Regards,
Huang, Ying