linux-kernel - Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Ygou6Gq79XY3mFK7@kernel.org>
Date:   Mon, 14 Feb 2022 12:28:56 +0200
From:   Mike Rapoport <rppt@...nel.org>
To:     Yu Zhao <yuzhao@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Mel Gorman <mgorman@...e.de>, Michal Hocko <mhocko@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
        Barry Song <21cnbao@...il.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Jesse Barnes <jsbarnes@...gle.com>,
        Jonathan Corbet <corbet@....net>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Michael Larabel <Michael@...haellarabel.com>,
        Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Ying Huang <ying.huang@...el.com>,
        linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        page-reclaim@...gle.com, x86@...nel.org,
        Brian Geffon <bgeffon@...gle.com>,
        Jan Alexander Steffens <heftig@...hlinux.org>,
        Oleksandr Natalenko <oleksandr@...alenko.name>,
        Steven Barrett <steven@...uorix.net>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Daniel Byrne <djbyrne@....edu>,
        Donald Carr <d@...os-reins.com>,
        Holger Hoffstätte 
        <holger@...lied-asynchrony.com>,
        Konstantin Kharlamov <Hi-Angel@...dex.ru>,
        Shuang Zhai <szhai2@...rochester.edu>,
        Sofia Trinh <sofia.trinh@....works>
Subject: Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation

Hi,

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@...gle.com>
> Acked-by: Brian Geffon <bgeffon@...gle.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@...hlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@...alenko.name>
> Acked-by: Steven Barrett <steven@...uorix.net>
> Acked-by: Suleiman Souhlal <suleiman@...gle.com>
> Tested-by: Daniel Byrne <djbyrne@....edu>
> Tested-by: Donald Carr <d@...os-reins.com>
> Tested-by: Holger Hoffstätte <holger@...lied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@...dex.ru>
> Tested-by: Shuang Zhai <szhai2@...rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@....works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++

Please consider splitting this patch into Documentation/admin-guide and
Documentation/vm parts.

For now I only had time to review the admin-guide part.

>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
> 
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..16a543c8b886
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,121 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
+
> +Quick start
> +===========

There is no explanation why one would want to use multigenerational LRU
until the next section.

I think there should be an overview that explains why users would want to
enable multigenerational LRU. 

> +Build configurations
> +--------------------
> +:Required: Set ``CONFIG_LRU_GEN=y``.

Maybe 

	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

> +
> +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> + multigenerational LRU by default.
> +
> +Runtime configurations
> +----------------------
> +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> + ``CONFIG_LRU_GEN_ENABLED=n``.
> +
> +This file accepts different values to enabled or disabled the
> +following features:

Maybe

  After multigenerational LRU is enabled, this file accepts different
  values to enable or disable the following feaures:

> +====== ========
> +Values Features
> +====== ========
> +0x0001 the multigenerational LRU

The multigenerational LRU what?

What will happen if I write 0x2 to this file?
Please consider splitting "enable" and "features" attributes.

> +0x0002 clear the accessed bit in leaf page table entries **in large
> +       batches**, when MMU sets it (e.g., on x86)

Is extra markup really needed here...

> +0x0004 clear the accessed bit in non-leaf page table entries **as
> +       well**, when MMU sets it (e.g., on x86)

... and here?

As for the descriptions, what is the user-visible effect of these features?
How different modes of clearing the access bit are reflected in, say, GUI
responsiveness, database TPS, or probability of OOM?

> +[yYnN] apply to all the features above
> +====== ========
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Most users should enable or disable all the features unless some of
> +them have unforeseen side effects.
> +
> +Recipes
> +=======
> +Personal computers
> +------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multigenerational LRU offers thrashing prevention to
> +the majority of laptop and desktop users who don't have oomd.

I'd expect something like this paragraph in overview.

> +
> +:Thrashing prevention: Write ``N`` to
> + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> + if this working set can't be kept in memory. Based on the average
> + human detectable lag (~100ms), ``N=1000`` usually eliminates
> + intolerable janks due to thrashing. Larger values like ``N=3000``
> + make janks less noticeable at the risk of premature OOM kills.

> +
> +Data centers
> +------------
> +Data centers want to optimize job scheduling (bin packing) to improve
> +memory utilizations. Job schedulers need to estimate whether a server
> +can allocate a certain amount of memory for a new job, and this step
> +is known as working set estimation, which doesn't impact the existing
> +jobs running on this server. They also want to attempt freeing some
> +cold memory from the existing jobs, and this step is known as proactive
> +reclaim, which improves the chance of landing a new job successfully.

This paragraph also fits overview.

> +
> +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> + for working set estimation and proactive reclaim.

Please add a note that this is build time option.

> +
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following

Is debugfs interface relevant only for datacenters? 

> + format:
> + ::
> +
> +   memcg  memcg_id  memcg_path
> +     node  node_id
> +       min_gen  birth_time  anon_size  file_size
> +       ...
> +       max_gen  birth_time  anon_size  file_size
> +
> + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> + youngest generation number. ``birth_time`` is in milliseconds.

It's unclear what is birth_time reference point. Is it milliseconds from
the system start or it is measured some other way?

> + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> + represents the group of the MRU pages and the oldest generation
> + represents the group of the LRU pages. For working set estimation, a

Please spell out MRU and LRU fully.

> + job scheduler writes to this file at a certain time interval to
> + create new generations, and it ranks available servers based on the
> + sizes of their cold memory defined by this time interval. For
> + proactive reclaim, a job scheduler writes to this file before it
> + tries to land a new job, and if it fails to materialize the cold
> + memory without impacting the existing jobs, it retries on the next
> + server according to the ranking result.

Is this knob only relevant for a job scheduler? Or it can be used in other
use-cases as well?

> +
> + This file accepts commands in the following subsections. Multiple

                              ^ described

> + command lines are supported, so does concatenation with delimiters
> + ``,`` and ``;``.
> +
> + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> + debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
> + the aging. It scans PTEs for hot pages and promotes them to the
> + youngest generation ``max_gen``. Then it creates a new generation
> + ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
> + when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
> + as well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
> + eviction. It evicts generations less than or equal to ``min_gen``.
> + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> + ``nr_to_reclaim`` to limit the number of pages to evict.

I feel that /sys/kernel/debug/lru_gen is too overloaded.

> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration

-- 
Sincerely yours,
Mike.