linux-kernel - Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Ygxt4iR9ZMYEbV78@google.com>
Date:   Tue, 15 Feb 2022 20:22:10 -0700
From:   Yu Zhao <yuzhao@...gle.com>
To:     Mike Rapoport <rppt@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Mel Gorman <mgorman@...e.de>, Michal Hocko <mhocko@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
        Barry Song <21cnbao@...il.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Jesse Barnes <jsbarnes@...gle.com>,
        Jonathan Corbet <corbet@....net>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Michael Larabel <Michael@...haellarabel.com>,
        Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Ying Huang <ying.huang@...el.com>,
        linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        page-reclaim@...gle.com, x86@...nel.org,
        Brian Geffon <bgeffon@...gle.com>,
        Jan Alexander Steffens <heftig@...hlinux.org>,
        Oleksandr Natalenko <oleksandr@...alenko.name>,
        Steven Barrett <steven@...uorix.net>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Daniel Byrne <djbyrne@....edu>,
        Donald Carr <d@...os-reins.com>,
        Holger Hoffstätte 
        <holger@...lied-asynchrony.com>,
        Konstantin Kharlamov <Hi-Angel@...dex.ru>,
        Shuang Zhai <szhai2@...rochester.edu>,
        Sofia Trinh <sofia.trinh@....works>
Subject: Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation

On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:

Thanks for reviewing.

> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
> >  Documentation/vm/index.rst                    |   1 +
> >  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
> 
> Please consider splitting this patch into Documentation/admin-guide and
> Documentation/vm parts.

Will do.

> > +=====================
> > +Multigenerational LRU
> > +=====================
> +
> > +Quick start
> > +===========
> 
> There is no explanation why one would want to use multigenerational LRU
> until the next section.
> 
> I think there should be an overview that explains why users would want to
> enable multigenerational LRU. 

Will do.

> > +Build configurations
> > +--------------------
> > +:Required: Set ``CONFIG_LRU_GEN=y``.
> 
> Maybe 
> 
> 	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

Will do.

> > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> > + multigenerational LRU by default.
> > +
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> > + ``CONFIG_LRU_GEN_ENABLED=n``.
> > +
> > +This file accepts different values to enabled or disabled the
> > +following features:
> 
> Maybe
> 
>   After multigenerational LRU is enabled, this file accepts different
>   values to enable or disable the following feaures:

Will do.

> > +====== ========
> > +Values Features
> > +====== ========
> > +0x0001 the multigenerational LRU
> 
> The multigenerational LRU what?

Itself? This depends on the POV, and I'm trying to determine what would
be the natural way to present it.

MGLRU itself could be seen as an add-on atop the existing page reclaim
or an alternative in parallel. The latter would be similar to sl[aou]b,
and that's how I personally see it.

But here I presented it more like the former because I feel this way is
more natural to users because they are like switches on a single panel.

> What will happen if I write 0x2 to this file?

Just like turning on a branch breaker while leaving the main breaker
off in a circuit breaker box. This is how I see it, and I'm totally
fine with changing it to whatever you'd recommend.

> Please consider splitting "enable" and "features" attributes.

How about s/Features/Components/?

> > +0x0002 clear the accessed bit in leaf page table entries **in large
> > +       batches**, when MMU sets it (e.g., on x86)
> 
> Is extra markup really needed here...
> 
> > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > +       well**, when MMU sets it (e.g., on x86)
> 
> ... and here?

Will do.

> As for the descriptions, what is the user-visible effect of these features?
> How different modes of clearing the access bit are reflected in, say, GUI
> responsiveness, database TPS, or probability of OOM?

These remain to be seen :) I just added these switches in v7, per Mel's
request from the meeting we had. These were never tested in the field.

> > +[yYnN] apply to all the features above
> > +====== ========
> > +
> > +E.g.,
> > +::
> > +
> > +    echo y >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0007
> > +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0005
> > +
> > +Most users should enable or disable all the features unless some of
> > +them have unforeseen side effects.
> > +
> > +Recipes
> > +=======
> > +Personal computers
> > +------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multigenerational LRU offers thrashing prevention to
> > +the majority of laptop and desktop users who don't have oomd.
> 
> I'd expect something like this paragraph in overview.
> 
> > +
> > +:Thrashing prevention: Write ``N`` to
> > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> > + if this working set can't be kept in memory. Based on the average
> > + human detectable lag (~100ms), ``N=1000`` usually eliminates
> > + intolerable janks due to thrashing. Larger values like ``N=3000``
> > + make janks less noticeable at the risk of premature OOM kills.
> 
> > +
> > +Data centers
> > +------------
> > +Data centers want to optimize job scheduling (bin packing) to improve
> > +memory utilizations. Job schedulers need to estimate whether a server
> > +can allocate a certain amount of memory for a new job, and this step
> > +is known as working set estimation, which doesn't impact the existing
> > +jobs running on this server. They also want to attempt freeing some
> > +cold memory from the existing jobs, and this step is known as proactive
> > +reclaim, which improves the chance of landing a new job successfully.
> 
> This paragraph also fits overview.

Will do.

> > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> > + for working set estimation and proactive reclaim.
> 
> Please add a note that this is build time option.

Will do.

> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> 
> Is debugfs interface relevant only for datacenters? 

For the moment, yes.

> > + format:
> > + ::
> > +
> > +   memcg  memcg_id  memcg_path
> > +     node  node_id
> > +       min_gen  birth_time  anon_size  file_size
> > +       ...
> > +       max_gen  birth_time  anon_size  file_size
> > +
> > + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > + youngest generation number. ``birth_time`` is in milliseconds.
> 
> It's unclear what is birth_time reference point. Is it milliseconds from
> the system start or it is measured some other way?

Good point. Will clarify.

> > + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> > + represents the group of the MRU pages and the oldest generation
> > + represents the group of the LRU pages. For working set estimation, a
> 
> Please spell out MRU and LRU fully.

Will do.

> > + job scheduler writes to this file at a certain time interval to
> > + create new generations, and it ranks available servers based on the
> > + sizes of their cold memory defined by this time interval. For
> > + proactive reclaim, a job scheduler writes to this file before it
> > + tries to land a new job, and if it fails to materialize the cold
> > + memory without impacting the existing jobs, it retries on the next
> > + server according to the ranking result.
> 
> Is this knob only relevant for a job scheduler? Or it can be used in other
> use-cases as well?

There are other concrete use cases but I'm not ready to discuss them
yet.

> > + This file accepts commands in the following subsections. Multiple
> 
>                               ^ described

Will do.