linux-kernel - Re: [PATCH 0/3] mm: zswap: trivial folio conversions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <9de0ce63-3815-4c1a-91a2-11cb3d526672@linux.dev>
Date: Mon, 3 Jun 2024 14:19:17 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Yosry Ahmed <yosryahmed@...gle.com>, Nhat Pham <nphamcs@...il.com>
Cc: Matthew Wilcox <willy@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Johannes Weiner <hannes@...xchg.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, David Hildenbrand <david@...hat.com>,
 Barry Song <21cnbao@...il.com>, Chris Li <chrisl@...nel.org>,
 Ryan Roberts <ryan.roberts@....com>, Kairui Song <kasong@...cent.com>
Subject: Re: [PATCH 0/3] mm: zswap: trivial folio conversions

On 2024/5/29 03:32, Yosry Ahmed wrote:
> On Tue, May 28, 2024 at 12:08 PM Nhat Pham <nphamcs@...il.com> wrote:
>>
>> On Fri, May 24, 2024 at 4:13 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>>>
>>> On Fri, May 24, 2024 at 12:53 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>>>>
>>>> On Thu, May 23, 2024 at 8:59 PM Matthew Wilcox <willy@...radead.org> wrote:
>>>>>
>>>>> On Fri, May 24, 2024 at 03:38:15AM +0000, Yosry Ahmed wrote:
>>>>>> Some trivial folio conversions in zswap code.
>>>>>
>>>>> The three patches themselves look good.
>>>>>
>>>>>> The mean reason I included a cover letter is that I wanted to get
>>>>>> feedback on what other trivial conversions can/should be done in
>>>>>> mm/zswap.c (keeping in mind that only order-0 folios are supported
>>>>>> anyway).  These are the things I came across while searching for 'page'
>>>>>> in mm/zswap.c, and chose not to do anything about for now:
>>>>>
>>>>> I think there's a deeper question to answer before answering these
>>>>> questions, which is what we intend to do with large folios and zswap in
>>>>> the future.  Do we intend to split them?  Compress them as a large
>>>>> folio?  Compress each page in a large folio separately?  I can see an
>>>>> argument for choices 2 and 3, but I think choice 1 is going to be
>>>>> increasingly untenable.
>>>>
>>>> Yeah I was kinda getting the small things out of the way so that zswap
>>>> is fully folio-ized, before we think about large folios. I haven't
>>>> given it a lot of thought, but here's what I have in mind.
>>>>
>>>> Right now, I think most configs enable zswap will disable
>>>> CONFIG_THP_SWAP (otherwise all THPs will go straight to disk), so
>>>> let's assume that today we are splitting large folios before they go
>>>> to zswap (i.e. choice 1).
>>>>
>>>> What we do next depends on how the core swap intends to deal with
>>>> large folios. My understanding based on recent developments is that we
>>>> intend to swapout large folios as a whole, but I saw some discussions
>>>> about splitting all large folios before swapping them out, or leaving
>>>> them whole but swapping them out in order-0 chunks.
>>>>
>>>> I assume the rationale is that there is little benefit to keeping the
>>>> folios whole because they will most likely be freed soon anyway, but I
>>>> understand not wanting to spend time on splitting them, so swapping
>>>> them out in order-0 chunks makes some sense to me. It also dodges the
>>>> whole fragmentation issue.
>>>>
>>>> If we do either of these things in the core swap code, then I think
>>>> zswap doesn't need to do anything to support large folios. If not,
>>>> then we need to make a choice between 2 (compress large folios) &
>>>> choice 3 (compress each page separately) as you mentioned.
>>>>
>>>> Compressing large folios as a whole means that we need to decompress
>>>> them as a whole to read a single page, which I think could be very
>>>> inefficient in some cases or force us to swapin large folios. Unless
>>>> of course we end up in a world where we mostly swapin the same large
>>>> folios that we swapped out. Although there can be additional
>>>> compression savings from compressing large folios as a whole.
>>>>
>>>> Hence, I think choice 3 is the most reasonable one, at least for the
>>>> short-term. I also think this is what zram does, but I haven't
>>>> checked. Even if we all agree on this, there are still questions that
>>>> we need to answer. For example, do we allocate zswap_entry's for each
>>>> order-0 chunk right away, or do we allocate a single zswap_entry for
>>>> the entire folio, and then "split" it during swapin if we only need to
>>>> read part of the folio?
>>>>
>>>> Wondering what others think here.
>>>
>>> More thoughts that came to mind here:
>>>
>>> - Whether we go with choice 2 or 3, we may face a latency issue. Zswap
>>> compression happens synchronously in the context of reclaim, so if we
>>> start handling large folios in zswap, it may be more efficient to do
>>> it asynchronously like swap to disk.
>>
>> We've been discussing this in private as well :)
>>
>> It doesn't have to be these two extremes right? I'm perfectly happy
>> with starting with compressing each subpage separately, but perhaps we
>> can consider managing larger folios in bigger chunks (say 64KB). That
>> way, on swap-in, we just have to bring a whole chunk in, not the
>> entire folio, and still take advantage of compression efficiencies on
>> bigger-than-one-page chunks. I'd also check with other filesystems
>> that leverage compression, to see what's their unit of compression is.
> 
> Right. But I think it will be a clearer win to start with compressing
> each subpage separately, and it avoids splitting folios during reclaim
> to zswap. It also doesn't depend on the zsmalloc work.
> 
> Once we have that, we can experiment with compressing folios in larger
> chunks. The tradeoffs become less clear at that point, and the number
> of variables you can tune goes up :)

Agree, it's a good approach! And it hasn't any decompression amplification
problem.

Thanks.