linux-kernel - Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGsJ_4zrgsgzHLWrcr8Fa9nMni7FSokgEkY6yk6Eu6R3O_wL2w@mail.gmail.com>
Date: Thu, 31 Oct 2024 10:21:27 +1300
From: Barry Song <21cnbao@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Usama Arif <usamaarif642@...il.com>, akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>, 
	Kanchana P Sridhar <kanchana.p.sridhar@...el.com>, David Hildenbrand <david@...hat.com>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, Chris Li <chrisl@...nel.org>, 
	"Huang, Ying" <ying.huang@...el.com>, Kairui Song <kasong@...cent.com>, 
	Ryan Roberts <ryan.roberts@....com>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg

On Thu, Oct 31, 2024 at 10:10 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> [..]
> > >>> A crucial component is still missing—managing the compression and decompression
> > >>> of multiple pages as a larger block. This could significantly reduce
> > >>> system time and
> > >>> potentially resolve the kernel build issue within a small memory
> > >>> cgroup, even with
> > >>> swap thrashing.
> > >>>
> > >>> I’ll send an update ASAP so you can rebase for zswap.
> > >>
> > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> > >> Thats wont benefit zswap, right?
> > >
> > > That's right. I assume we can also make it work with zswap?
> >
> > Hopefully yes. Thats mainly why I was looking at that series, to try and find
> > a way to do something similar for zswap.
>
> I would prefer for these things to be done separately. We still need
> to evaluate the compression/decompression of large blocks. I am mainly
> concerned about having to decompress a large chunk to fault in one
> page.
>
> The obvious problems are fault latency, and wasted work having to
> consistently decompress the large chunk to take one page from it. We
> also need to decide if we'd rather split it after decompression and
> compress the parts that we didn't swap in separately.
>
> This can cause problems beyond the fault latency. Imagine the case
> where the system is under memory pressure, so we fallback to order-0
> swapin to avoid reclaim. Now we want to decompress a chunk that used
> to be 64K.

Yes, this could be an issue.

We had actually tried to utilize several buffers for those partial
swap-in cases,
where the decompressed data was held in anticipation of the upcoming
swap-in. This approach could address the majority of partial swap-ins for
fallback scenarios.

>
> We need to allocate 64K of contiguous memory for a temporary
> allocation to be able to fault a 4K page. Now we either need to:
> - Go into reclaim, which we were trying to avoid to begin with.
> - Dip into reserves to allocate the 64K as it's a temporary
> allocation. This is probably risky because under memory pressure, many
> CPUs may be doing this concurrently.

This has been addressed by using contiguous memory that is prepared on
a per-CPU basis., search the below:
"alloc_pages() might fail, so we don't depend on allocation:"
https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/

Thanks
Barry