[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <m7xv5v46sowhm7xwutv5zqxpfygcg2r3asfd3npakmacap6wq2@prtxaiwzzgjn>
Date: Tue, 28 Jan 2025 14:29:56 +0900
From: Sergey Senozhatsky <senozhatsky@...omium.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>, Minchan Kim <minchan@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 5/6] zsmalloc: introduce handle mapping API
On (25/01/28 01:36), Yosry Ahmed wrote:
> > Yes, for (a) mentioned above.
> >
> > > I guess in the WO case the buffer is not needed and we can just pass
> > > NULL?
> >
> > Yes.
>
> Perhaps we want to document this and enforce it (make sure that the
> NULL-ness of the buffer matches the access type).
Right.
> > But, and it's a big but. And it's (b) from the above. I wasn't brave
> > enough to just drop (b) optimization and replace it with memcpy(),
> > especially when we work with relatively large objects (say size-class
> > 3600 bytes and above). This certainly would not make battery powered
> > devices happier. Maybe in zswap the page is only read once (is that
> > correct?), but in zram page can be read multiple times (e.g. when zram
> > is used as a raw block-dev, or has a mounted fs on it) which means
> > multiple extra memcpy()-s.
>
> In zswap, because we use the crypto_acomp API, when we cannot sleep with
> the object mapped (which is true for zsmalloc), we just copy the
> compressed object into a preallocated buffer anyway. So having a
> zs_obj_load() interface would move that copy inside zsmalloc.
Yeah, I saw zpool_can_sleep_mapped() and had the same thought. zram,
as of now, doesn't support algos that can/need schedule internally for
whatever reason - kmalloc, mutex, H/W wait, etc.
> With your series, zswap can drop the memcpy and save some cycles on the
> compress side. I didn't realize that zram does not perform any copies on the
> read/decompress side.
>
> Maybe the load interface can still provide a buffer to avoid the copy
> where possible? I suspect with that we don't need the state and can
> just pass a pointer. We'd need another call to potentially unmap, so
> maybe load_start/load_end, or read_start/read_end.
>
> Something like:
>
> zs_obj_read_start(.., buf)
> {
> if (contained in one page)
> return kmapped obj
> else
> memcpy to buf
> return buf
> }
>
> zs_obj_read_end(.., buf)
> {
> if (container in one page)
> kunmap
> }
>
> The interface is more straightforward and we can drop the map flags
> entirely, unless I missed something here. Unfortunately you'd still need
> the locking changes in zsmalloc to make zram reads fully preemptible.
Agreed, the interface part is less of a problem, the atomicity of zsmalloc
is a much bigger issue. We, technically, only need to mark zspage as "being
used, don't free" so that zsfree/compaction/migration don't mess with it,
but this is only "technically". In practice we then have
CPU0 CPU1
zs_map_object
set READ bit migrate
schedule pool rwlock
size class spin-lock
wait for READ bit to clear
... set WRITE bit
clear READ bit
and the whole thing collapses like a house of cards. I wasn't able
to trigger a watchdog on my tests, but the pattern is there and it's
enough. Maybe we can teach compaction and migration to try-WRITE and
bail out if the page is locked, but I don't know.
> I am not suggesting that we have to go this way, just throwing out
> ideas.
Sure, load+store is still an option. While that zs_map_object()
optimization is nice, it may have two sides [in zram case]. On
one hand, we safe memcpy() [but only for certain objects], on the
other hand, we keep the page locked for the entire decompression
duration, which can be quite a while (e.g. when algorithm is
configured with a very high compression level):
CPU0 CPU1
zs_map_object
read lock page rwlock write lock page rwlock
spin
decompress() ... spin a lot
read unlock page rwlock
Maybe copy-in is just an okay thing to do. Let me try to measure.
> BTW, are we positive that the locking changes made in this series are
> not introducing regressions?
Cannot claim that with confidence. Our workloads don't match, we don't
even use zsmalloc in the same way :) Here be dragons.
Powered by blists - more mailing lists