Message-ID: <CAGsJ_4yTuQMH2MMUnXRiSMbstOuoC2-fvNBsmb2noK9Axte5Gg@mail.gmail.com>
Date: Fri, 1 Nov 2024 09:59:54 +1300
From: Barry Song <21cnbao@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Usama Arif <usamaarif642@...il.com>, 
	akpm@...ux-foundation.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, Kanchana P Sridhar <kanchana.p.sridhar@...el.com>, 
	David Hildenbrand <david@...hat.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Chris Li <chrisl@...nel.org>, "Huang, Ying" <ying.huang@...el.com>, 
	Kairui Song <kasong@...cent.com>, Ryan Roberts <ryan.roberts@....com>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg

On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@...xchg.org> wrote:
> >
> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@...il.com> wrote:
> > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@...il.com> wrote:
> > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > >>>> then the mitigation being tried here won't apply.
> > > > >>>
> > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > >>> "mitigation" either way.
> > > > >>>
> > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > >> in memcg?
> > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > >> in a cgroup and limit memory.max.
> > > > >
> > > > > You mean a benchmark that puts the entire system under memory
> > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > you have, among other factors.
> > > > >
> > > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > > size like a memcg, although you'd probably need to leave more room
> > > > > because the entire guest OS will also be subject to the same limit.
> > > > >
> > > >
> > > > I had tried this, but the variance in time/zswap numbers was very high,
> > > > much higher than the AMD numbers I posted in reply to Barry. So I found
> > > > it very difficult to make comparisons.
> > >
> > > Hmm yeah maybe more factors come into play with global memory
> > > pressure. I am honestly not sure how to test this scenario, and I
> > > suspect variance will be high anyway.
> > >
> > > We can just try to use whatever technique we use for the memcg limit
> > > though, if possible, right?
> >
> > You can boot a physical machine with mem=1G on the commandline, which
> > restricts the physical range of memory that will be initialized.
> > Double check /proc/meminfo after boot, because part of that physical
> > range might not be usable RAM.
> >
> > I do this quite often to test physical memory pressure with workloads
> > that don't scale up easily, like kernel builds.
> >
> > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > >>>> issue? If we use zswap (without the large folio swapin series) and change the
> > > > >>>> window to 1 in swap_vma_readahead, we might see an improvement in Linux kernel
> > > > >>>> build time when cgroup memory is limited, as readahead would probably cause
> > > > >>>> swap thrashing as well.
> >
> > +1
> >
> > I also think there is too much focus on cgroup alone. The bigger issue
> > seems to be how much optimistic volume we swap in when we're already
> > under pressure. This applies to large folios and readahead alike, under
> > both global memory pressure and cgroup limits.
>
> Agreed, although the characteristics of large folios and readahead are
> different. But yeah, different flavors of the same problem.
>
> >
> > It happens to manifest with THP in cgroups because that's what you
> > guys are testing. But IMO, any solution to this problem should
> > consider the wider scope.
>
> +1, and I really think this should be addressed separately rather than
> just relying on large block compression/decompression to offset the cost.
> It's probably not just a zswap/zram problem anyway; it just happens to be
> what we support large folio swapin for.

Agreed, these are two separate issues and both should be investigated,
though 2 can offset the cost of 1:
1. swap thrashing
2. large block compression/decompression

For point 1, we likely want to investigate the following:

1. Whether we can see the same thrashing if we always perform readahead
(rapidly filling the memcg back to full after reclamation); a rough sketch
of the default readahead window cap is included after this list.

2. Whether there are any issues with balancing file and anon memory
reclamation.

The 'refault feedback loop' in mglru compares refault rates between anon and
file pages to decide which type should be prioritized for reclamation.

type = get_type_to_scan(lruvec, swappiness, &tier);

static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
{
        ...
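        /* compare refault stats of anon (sp) vs file (pv): scan the type whose pages refault less */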
        read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
        read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
        type = positive_ctrl_err(&sp, &pv);

        read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
        for (tier = 1; tier < MAX_NR_TIERS; tier++) {
                read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
                if (!positive_ctrl_err(&sp, &pv))
                        break;
        }

        *tier_idx = tier - 1;
        return type;
}

In this case, we may want to investigate whether reclamation is primarily
targeting anonymous memory due to potential errors in the statistics path
once mTHP is involved; a rough sketch of this anon vs. file comparison is
also included after this list.

3. Determine whether this is a memcg-specific issue by booting with mem=1G
and running the same test without a memcg limit.
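
For reference on the readahead side, below is a rough user-space sketch
(illustrative only, not the kernel code; the function name max_ra_win is
made up, and the real swap_vma_ra_win() in mm/swap_state.c additionally
adjusts the window based on recent hits) of how the maximum VMA readahead
window is capped: the order is the smaller of page_cluster and
SWAP_RA_ORDER_CEILING, so default readahead brings in at most 8 pages
(32KB with 4KB pages) per fault.

#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5	/* hard ceiling on the window order */

/* maximum readahead window, in pages, for a given page_cluster value */
static unsigned int max_ra_win(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;
	return 1U << order;	/* the window is 2^order pages */
}

int main(void)
{
	/* the default /proc/sys/vm/page-cluster is 3 -> at most 8 pages */
	printf("max readahead window: %u pages\n", max_ra_win(3));
	return 0;
}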
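
And for the anon vs. file balance, here is a rough, self-contained sketch
of the kind of comparison the refault feedback loop makes. The struct and
helper names below are made up for illustration; the real read_ctrl_pos()
/ positive_ctrl_err() in mm/vmscan.c also handle small sample counts
(MIN_LRU_BATCH) and derive the gains from swappiness.

#include <stdio.h>

/* illustrative stand-in for struct ctrl_pos */
struct refault_stats {
	unsigned long refaulted;	/* evicted pages of this type that came back */
	unsigned long total;		/* pages of this type evicted */
	unsigned long gain;		/* weight derived from swappiness */
};

/*
 * Return 1 to scan file, 0 to scan anon: pick the type whose gain-weighted
 * refault rate, refaulted / (total * gain), is lower, i.e. cheaper to evict.
 */
static int pick_type_to_scan(const struct refault_stats *anon,
			     const struct refault_stats *file)
{
	/* cross-multiply to compare the two ratios without division */
	unsigned long long anon_score = (unsigned long long)anon->refaulted *
					file->total * file->gain;
	unsigned long long file_score = (unsigned long long)file->refaulted *
					anon->total * anon->gain;

	return file_score <= anon_score;
}

int main(void)
{
	/* anon refaults 50% of evictions, file only 10% -> scan file */
	struct refault_stats anon = { .refaulted = 500, .total = 1000, .gain = 60 };
	struct refault_stats file = { .refaulted = 100, .total = 1000, .gain = 140 };

	printf("scan %s\n", pick_type_to_scan(&anon, &file) ? "file" : "anon");
	return 0;
}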

Yosry, Johannes, Usama,
Is there anything else we should look into?

I'll get back to you after completing the investigation mentioned above.

>
> >
> > > > >>> I think large folio swapin would make the problem worse anyway. I am
> > > > >>> also not sure if the readahead window adjusts on memory pressure or
> > > > >>> not.
> > > > >>>
> > > > >> The readahead window doesn't look at memory pressure. So maybe the same thing
> > > > >> is being seen here as would be seen in swapin_readahead?
> > > > >
> > > > > Maybe readahead is not as aggressive in general as large folio
> > > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > > > > of the window is the smaller of page_cluster (2 or 3) and
> > > > > SWAP_RA_ORDER_CEILING (5).
> > > > Yes, I was seeing 8 pages swapped in (order 3) when testing. So it might
> > > > be similar to enabling 32K mTHP?
> > >
> > > Not quite.
> >
> > Actually, I would expect it to be...
> >
> > > > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> > > > > contiguous allocation like large folio swapin. So that could be
> > > > > another factor why readahead may not reproduce the problem.
> > >
> > > Because of this ^.
> >
> > ...this matters for the physical allocation, which might require more
> > reclaim and compaction to produce the 32k. But an earlier version of
> > Barry's patch did the cgroup margin fallback after the THP was already
> > physically allocated, and it still helped.
> >
> > So the issue in this test scenario seems to be mostly about cgroup
> > volume. And then 8 4k charges should be equivalent to a singular 32k
> > charge when it comes to cgroup pressure.
>
> In this test scenario, yes, because it's only exercising cgroup
> pressure. But if we want a general solution that also addresses global
> pressure, I expect large folios to be worse because of the contiguity
> and the size (compared to default readahead window sizes). So I think
> we shouldn't only test with readahead, as it won't cover some of the
> large folio cases.

Thanks
barry
