linux-kernel - Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

Message-ID: <CAJD7tkZ_xQHMoze_w3yBHgjPhQeDynJ+vWddbYKFzi2c63sT7w@mail.gmail.com>
Date: Thu, 31 Oct 2024 08:59:46 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Usama Arif <usamaarif642@...il.com>, Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, Kanchana P Sridhar <kanchana.p.sridhar@...el.com>, 
	David Hildenbrand <david@...hat.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Chris Li <chrisl@...nel.org>, "Huang, Ying" <ying.huang@...el.com>, 
	Kairui Song <kasong@...cent.com>, Ryan Roberts <ryan.roberts@....com>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg

On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@...il.com> wrote:
> > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@...il.com> wrote:
> > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > >>>> then the mitigation being tried here won't apply.
> > > >>>
> > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > >>> entire system is under memory pressure. I guess we would want the same
> > > >>> "mitigation" either way.
> > > >>>
> > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > >> in memcg?
> > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > >> in cgroup and limit memory.max.
> > > >
> > > > You mean a benchmark that puts the entire system under memory
> > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > you have, among other factors.
> > > >
> > > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > > size like a memcg, although you'd probably need to leave more room
> > > > > because the entire guest OS will also be subject to the same limit.
> > > >
> > >
> > > I had tried this, but the variance in time/zswap numbers was very high.
> > > > Much higher than the AMD numbers I posted in reply to Barry. So I found
> > > > it very difficult to make a comparison.
> >
> > Hmm yeah maybe more factors come into play with global memory
> > pressure. I am honestly not sure how to test this scenario, and I
> > suspect variance will be high anyway.
> >
> > We can just try to use whatever technique we use for the memcg limit
> > though, if possible, right?
>
> You can boot a physical machine with mem=1G on the commandline, which
> restricts the physical range of memory that will be initialized.
> Double check /proc/meminfo after boot, because part of that physical
> range might not be usable RAM.
>
> I do this quite often to test physical memory pressure with workloads
> that don't scale up easily, like kernel builds.
>
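
For anyone reproducing this, the /proc/meminfo check could be scripted.
A minimal sketch, assuming the usual "Key: value kB" layout; the helper
name and the sample text are made up for illustration and are not
output from a real mem=1G boot:

```python
def meminfo_kb(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not "Key: <number> [kB]".
        if not sep or not parts or not parts[0].isdigit():
            continue
        fields[key.strip()] = int(parts[0])
    return fields


# Illustrative sample only, not measured data.
SAMPLE = """MemTotal:         948312 kB
MemFree:          512000 kB
MemAvailable:     701440 kB
HugePages_Total:       0
"""

info = meminfo_kb(SAMPLE)
requested_kb = 1024 * 1024  # mem=1G on the command line
# MemTotal below the requested size is expected: part of the physical
# range may be reserved or not usable RAM.
shortfall_kb = requested_kb - info["MemTotal"]
print(info["MemTotal"], shortfall_kb)
```

On a live system the input would be open("/proc/meminfo").read()
instead of SAMPLE.
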
> > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> > > >>>> well.
>
> +1
>
> I also think there is too much focus on cgroup alone. The bigger issue
> seems to be how much optimistic volume we swap in when we're under
> pressure already. This applies to large folios and readahead; global
> memory availability and cgroup limits.

Agreed, although the characteristics of large folios and readahead are
different. But yeah, different flavors of the same problem.
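
To make "same problem" concrete, here is a purely hypothetical sketch
of a single policy covering both cases: cap the optimistic swapin size
by whatever headroom is left. This is not what the kernel or Barry's
patch does. The function name is invented, and "margin_pages" stands
for either the cgroup margin or the global free headroom, in pages:

```python
def clamp_swapin_pages(requested_pages, margin_pages):
    """Largest power-of-two page count <= min(requested, margin), at least 1."""
    limit = max(1, min(requested_pages, margin_pages))
    # Round down to a power of two so the result maps onto a folio
    # order or a readahead window size.
    return 1 << (limit.bit_length() - 1)


# Plenty of headroom: the full 8-page request goes through.
print(clamp_swapin_pages(8, 100))
# Nearly full: fall back to a smaller swapin.
print(clamp_swapin_pages(8, 5))
# No headroom at all: still swap in the single faulting page.
print(clamp_swapin_pages(8, 0))
```

The same clamp could feed either the large folio order or the readahead
window, which is the sense in which the two are flavors of one problem.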

>
> It happens to manifest with THP in cgroups because that's what you
> guys are testing. But IMO, any solution to this problem should
> consider the wider scope.

+1, and I really think this should be addressed separately, not just by
relying on large block compression/decompression to offset the cost.
It's probably not just a zswap/zram problem anyway; that just happens
to be what we support large folio swapin for.

>
> > > >>> I think large folio swapin would make the problem worse anyway. I am
> > > >>> also not sure if the readahead window adjusts on memory pressure or
> > > >>> not.
> > > >>>
> > > > >> readahead window doesn't look at memory pressure. So maybe the same thing is being
> > > >> seen here as there would be in swapin_readahead?
> > > >
> > > > Maybe readahead is not as aggressive in general as large folio
> > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > > > of the window is the smaller of page_cluster (2 or 3) and
> > > > SWAP_RA_ORDER_CEILING (5).
> > > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> > > be similar to enabling 32K mTHP?
> >
> > Not quite.
>
> Actually, I would expect it to be...
>
> > > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> > > > contiguous allocation like large folio swapin. So that could be
> > > > another factor why readahead may not reproduce the problem.
> >
> > Because of this ^.
>
> ...this matters for the physical allocation, which might require more
> reclaim and compaction to produce the 32k. But an earlier version of
> Barry's patch did the cgroup margin fallback after the THP was already
> physically allocated, and it still helped.
>
> So the issue in this test scenario seems to be mostly about cgroup
> volume. And then 8 4k charges should be equivalent to a singular 32k
> charge when it comes to cgroup pressure.

In this test scenario, yes, because it's only exercising cgroup
pressure. But if we want a general solution that also addresses global
pressure, I expect large folios to be worse because of the contiguity
and the size (compared to default readahead window sizes). So I think
we shouldn't only test with readahead, as it won't cover some of the
large folio cases.
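
To put numbers on the size part, here is a small sketch of the window
bound as described earlier in the thread (the smaller of page_cluster
and SWAP_RA_ORDER_CEILING), assuming 4k base pages. The function name
is mine, not the kernel's:

```python
PAGE_SIZE_KB = 4            # assumption: 4k base pages
SWAP_RA_ORDER_CEILING = 5   # value quoted earlier in this thread


def max_ra_window_pages(page_cluster):
    """Upper bound on the VMA readahead window, in pages."""
    return 1 << min(page_cluster, SWAP_RA_ORDER_CEILING)


for page_cluster in (2, 3):
    pages = max_ra_window_pages(page_cluster)
    print(f"page_cluster={page_cluster}: up to {pages} pages, "
          f"{pages * PAGE_SIZE_KB}k charged")
```

With page_cluster=3 the bound is 8 pages, i.e. the same 32k charge as
one 32k mTHP, but made of eight order-0 allocations rather than one
order-3 allocation. Any mTHP size above 32k is beyond what the default
readahead window reaches.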