linux-kernel - Re: [RFC PATCH 5/6] zsmalloc: introduce handle mapping API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <m7xv5v46sowhm7xwutv5zqxpfygcg2r3asfd3npakmacap6wq2@prtxaiwzzgjn>
Date: Tue, 28 Jan 2025 14:29:56 +0900
From: Sergey Senozhatsky <senozhatsky@...omium.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>, 
	Andrew Morton <akpm@...ux-foundation.org>, Minchan Kim <minchan@...nel.org>, 
	Johannes Weiner <hannes@...xchg.org>, Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 5/6] zsmalloc: introduce handle mapping API

On (25/01/28 01:36), Yosry Ahmed wrote:
> > Yes, for (a) mentioned above.
> > 
> > > I guess in the WO case the buffer is not needed and we can just pass
> > > NULL?
> > 
> > Yes.
> 
> Perhaps we want to document this and enforce it (make sure that the
> NULL-ness of the buffer matches the access type).

Right.

> > But, and it's a big but.  And it's (b) from the above.  I wasn't brave
> > enough to just drop (b) optimization and replace it with memcpy(),
> > especially when we work with relatively large objects (say size-class
> > 3600 bytes and above).  This certainly would not make battery powered
> > devices happier.  Maybe in zswap the page is only read once (is that
> > correct?), but in zram page can be read multiple times (e.g. when zram
> > is used as a raw block-dev, or has a mounted fs on it) which means
> > multiple extra memcpy()-s.
> 
> In zswap, because we use the crypto_acomp API, when we cannot sleep with
> the object mapped (which is true for zsmalloc), we just copy the
> compressed object into a preallocated buffer anyway. So having a
> zs_obj_load() interface would move that copy inside zsmalloc.

Yeah, I saw zpool_can_sleep_mapped() and had the same thought.  zram,
as of now, doesn't support algos that can/need schedule internally for
whatever reason - kmalloc, mutex, H/W wait, etc.

> With your series, zswap can drop the memcpy and save some cycles on the
> compress side. I didn't realize that zram does not perform any copies on the
> read/decompress side.
> 
> Maybe the load interface can still provide a buffer to avoid the copy
> where possible? I suspect with that we don't need the state and can
> just pass a pointer. We'd need another call to potentially unmap, so
> maybe load_start/load_end, or read_start/read_end.
> 
> Something like:
> 
> zs_obj_read_start(.., buf)
> {
> 	if (contained in one page)
> 		return kmapped obj
> 	else
> 		memcpy to buf
> 		return buf
> }
> 
> zs_obj_read_end(.., buf)
> {
> 	if (container in one page)
> 		kunmap
> }
> 
> The interface is more straightforward and we can drop the map flags
> entirely, unless I missed something here. Unfortunately you'd still need
> the locking changes in zsmalloc to make zram reads fully preemptible.

Agreed, the interface part is less of a problem, the atomicity of zsmalloc
is a much bigger issue.  We, technically, only need to mark zspage as "being
used, don't free" so that zsfree/compaction/migration don't mess with it,
but this is only "technically".  In practice we then have

	CPU0							CPU1

	zs_map_object
	set READ bit					migrate
	schedule						pool rwlock
									size class spin-lock
									wait for READ bit to clear
	...								set WRITE bit
	clear READ bit

and the whole thing collapses like a house of cards.  I wasn't able
to trigger a watchdog on my tests, but the pattern is there and it's
enough.  Maybe we can teach compaction and migration to try-WRITE and
bail out if the page is locked, but I don't know.

> I am not suggesting that we have to go this way, just throwing out
> ideas.

Sure, load+store is still an option.  While that zs_map_object()
optimization is nice, it may have two sides [in zram case].  On
one hand, we safe memcpy() [but only for certain objects], on the
other hand, we keep the page locked for the entire decompression
duration, which can be quite a while (e.g. when algorithm is
configured with a very high compression level):

	CPU0							CPU1

	zs_map_object
	read lock page rwlock			write lock page rwlock
									spin
	decompress()					... spin a lot
	read unlock page rwlock

Maybe copy-in is just an okay thing to do.  Let me try to measure.

> BTW, are we positive that the locking changes made in this series are
> not introducing regressions?

Cannot claim that with confidence.  Our workloads don't match, we don't
even use zsmalloc in the same way :)  Here be dragons.