Message-ID: <CAGsJ_4z8kh4Pn-TUrVq6FALR1J5j4fpvQkef2xPFYPWdWfXdxA@mail.gmail.com>
Date: Thu, 19 Sep 2024 20:20:51 +1200
From: Barry Song <baohua@...nel.org>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Hugh Dickins <hughd@...gle.com>, 
	Jonathan Corbet <corbet@....net>, "Matthew Wilcox (Oracle)" <willy@...radead.org>, 
	David Hildenbrand <david@...hat.com>, Lance Yang <ioworker0@...il.com>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, Gavin Shan <gshan@...hat.com>, 
	Pankaj Raghav <kernel@...kajraghav.com>, Daniel Gomez <da.gomez@...sung.com>, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH v1 0/4] Control folio sizes used for page cache memory

On Thu, Aug 8, 2024 at 10:27 PM Ryan Roberts <ryan.roberts@....com> wrote:
>
> On 17/07/2024 08:12, Ryan Roberts wrote:
> > Hi All,
> >
> > This series is an RFC that adds sysfs and kernel cmdline controls to configure
> > the set of allowed large folio sizes that can be used when allocating
> > file-memory for the page cache. As part of the control mechanism, it provides
> > for a special-case "preferred folio size for executable mappings" marker.
> >
> > I'm trying to solve 2 separate problems with this series:
> >
> > 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> > approach for the change at [1]. Instead of hardcoding the preferred executable
> > folio size into the arch, user space can now select it. This decouples the arch
> > code and also makes the mechanism more generic; it can be bypassed (the default)
> > or any folio size can be set. For my use case, 64K is preferred, but I've also
> > heard from Willy of a use case where putting all text into 2M PMD-sized folios
> > is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> > therefore faulting in all text ahead of time) to achieve that.
>
> Just a polite bump on this; I'd really like to get something like this merged to
> help reduce iTLB pressure. We had a discussion at the THP Cabal meeting a few
> weeks back without solid conclusion. I haven't heard any concrete objections
> yet, but also only a lukewarm reception. How can I move this forwards?

Hi Ryan,

These requirements seem to apply to anon, swap, pagecache, and shmem to
some extent. While the swapin_enabled knob was rejected, the shmem_enabled
option is already in place.
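
For reference, both per-size knobs already sit side by side today
(illustrative only; which hugepages-*kB directories exist depends on the
architecture and base page size):

  # existing per-size anon control
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  # existing per-size shmem control
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled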

I wonder if it's possible to use the existing 'enabled' setting across
all cases, since from an architectural perspective with cont-pte,
pagecache may not differ from anon. The demand for reducing page faults,
LRU overhead, etc., also seems quite similar.

I imagine that once Android's file systems support mTHP, we’ll uniformly enable
64KB for anon, swap, shmem, and page cache. It should then be sufficient to
enable all of them using a single knob:
'/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled'.
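
In other words (a sketch of the idea; 'always' here is just one of the
possible values):

  # today this enables 64KB mTHP for anon only; under a unified scheme,
  # the same write would cover swapin, shmem and the page cache too
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled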

Is there anything that makes pagecache and shmem significantly different
from anon? In my Android case, they all seem the same. However, I assume
there might be other use cases where differentiating them is necessary?
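
Judging purely by the patch titles quoted below, the series instead adds
a separate per-size 'file_enabled' control, presumably driven along these
lines (a guess from the titles only; the exact syntax is whatever the
series' documentation patch defines):

  # hypothetical: allow 64K folios for the page cache, and mark 64K as
  # the preferred size for executable mappings
  echo always+exec > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/file_enabled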

>
> Thanks,
> Ryan
>
>
> >
> > 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> > Android): The theory goes that if all folios are 64K, then failure to allocate a
> > 64K folio should become unlikely. But if the page cache is allocating lots of
> > different orders, with most allocations having an order below 64K (as is the
> > case today) then the ability to allocate 64K folios diminishes. By providing control
> > over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> > allocation failure. Additionally, I've heard (second hand) of the need to disable
> > large folios in the page cache entirely due to latency concerns in some
> > settings. These controls allow all of this without kernel changes.
> >
> > The value of (1) is clear and the performance improvements are documented in
> > patch 2. I don't yet have any data demonstrating the theory for (2) since I
> > can't reproduce the setup that Barry had at [2]. But my view is that by adding
> > these controls we will enable the community to explore further, in the same way
> > that the anon mTHP controls helped harden the understanding for anonymous
> > memory.
> >
> > ---
> > This series depends on the "mTHP allocation stats for file-backed memory" series
> > at [3], which itself applies on top of yesterday's mm-unstable (650b6752c8a3). All
> > mm selftests have been run; no regressions were observed.
> >
> > [1] https://lore.kernel.org/linux-mm/20240215154059.2863126-1-ryan.roberts@arm.com/
> > [2] https://www.youtube.com/watch?v=ht7eGWqwmNs&list=PLbzoR-pLrL6oj1rVTXLnV7cOuetvjKn9q&index=4
> > [3] https://lore.kernel.org/linux-mm/20240716135907.4047689-1-ryan.roberts@arm.com/
> >
> > Thanks,
> > Ryan
> >
> > Ryan Roberts (4):
> >   mm: mTHP user controls to configure pagecache large folio sizes
> >   mm: Introduce "always+exec" for mTHP file_enabled control
> >   mm: Override mTHP "enabled" defaults at kernel cmdline
> >   mm: Override mTHP "file_enabled" defaults at kernel cmdline
> >
> >  .../admin-guide/kernel-parameters.txt         |  16 ++
> >  Documentation/admin-guide/mm/transhuge.rst    |  66 +++++++-
> >  include/linux/huge_mm.h                       |  61 ++++---
> >  mm/filemap.c                                  |  26 ++-
> >  mm/huge_memory.c                              | 158 +++++++++++++++++-
> >  mm/readahead.c                                |  43 ++++-
> >  6 files changed, 329 insertions(+), 41 deletions(-)
> >
> > --
> > 2.43.0
> >
>

Thanks
Barry
