Message-ID: <qazrxmu2bdvsned3dp2fnoyagdsv4ueh57pghhjz5q7fmv5pz4@vzlp5kmcw354>
Date: Fri, 31 Jan 2025 12:34:47 +0900
From: Sergey Senozhatsky <senozhatsky@...omium.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>, Minchan Kim <minchan@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHv1 1/6] zsmalloc: factor out pool locking helpers
On (25/01/30 16:25), Yosry Ahmed wrote:
> On Thu, Jan 30, 2025 at 01:01:15PM +0900, Sergey Senozhatsky wrote:
> > On (25/01/29 16:59), Yosry Ahmed wrote:
> > > On Wed, Jan 29, 2025 at 03:43:47PM +0900, Sergey Senozhatsky wrote:
> > > > We currently have a mix of migrate_{read,write}_lock() helpers
> > > > that lock zspages, but it's zs_pool that actually has a ->migrate_lock,
> > > > access to which is open-coded. Factor out pool migrate locking
> > > > into helpers; the zspage migration locking API will be renamed to
> > > > reduce confusion.
> > >
> > > But why did we remove "migrate" from the helpers' names if this is the
> > > real migrate lock? It seems like the real problem is how the zspage lock
> > > helpers are named, which is addressed later.
> >
> > So this is deliberate. I guess, just like with page-faults in
> > zs_map_object(), this naming comes from the time when we had fewer
> > locks and less functionality. These days we have 3 zsmalloc locks
> > that can "prevent migration" at different points, even the size
> > class spin-lock. But it's not only about migration anymore. We also
> > now have compaction (defragmentation) that these locks synchronize.
> > So I want to have a reader-writer naming scheme.
>
> In this case shouldn't we also rename the lock itself?
Yeah, I guess. I thought about that for a second but figured that in
the grand scheme of things it wasn't all that important. I'm going to
resend the series (after merging it with the zram series), so I'll
rename it, to pool->lock maybe.
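
Just a sketch of what I have in mind (helper names are illustrative
and may still change in the resend):

static void pool_write_lock(struct zs_pool *pool)
{
	write_lock(&pool->lock);
}

static void pool_write_unlock(struct zs_pool *pool)
{
	write_unlock(&pool->lock);
}

static void pool_read_lock(struct zs_pool *pool)
{
	read_lock(&pool->lock);
}

static void pool_read_unlock(struct zs_pool *pool)
{
	read_unlock(&pool->lock);
}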
[..]
> > > I am generally scared to move hot members around unnecessarily
> > > due to cache-line sharing problems -- although I don't know if
> > > these are really hot.
> >
> > Size classes are basically static - once we init the array it
> > becomes RO. Having migrate_lock at the bottom forces us to
> > jump over a big RO zs_pool region; besides, we never use ->name
> > (unlike migrate_lock), so having it at offset 0 is unnecessary.
>
> What's special about offset 0 though?
>
> My understanding is that we want to put RW-mostly and RO-mostly members
> apart to land in different cachelines to avoid unnecessary bouncing.
> Also we want the elements that are usually accessed together next to one
> another.
>
> Placing the RW lock right next to the RO size classes seems like it will
> result in unnecessary cacheline bouncing. For example, if a CPU holds
> the lock and dirties the cacheline then all other CPUs have to drop it
> and fetch it again when accessing the size classes.
So the idea was that offset 0 sits very close to the low size classes,
which are a) not accessed that often and b) in some cases not accessed
at all. I don't think I've ever seen size-class-32 (the first size
class) being used. I think for some algorithms compressing a PAGE_SIZE
page to below 32 bytes is practically impossible, because of all the
internal data that the algorithm puts into the compression output
(checksums, flags, version numbers, etc.).
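
FWIW, if the sharing ever shows up in profiles we can always give the
lock its own cacheline, something like (purely illustrative, not the
actual struct layout):

struct zs_pool {
	const char *name;			/* RO, basically never read */
	struct size_class *size_class[ZS_SIZE_CLASSES];	/* RO after init */

	/* frequently written under contention */
	rwlock_t lock ____cacheline_aligned_in_smp;
};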
> > zs_pool_stats and compaction_in_progress are modified only
> > during compaction, which we do not run in parallel (only one
> > CPU can do compaction at a time), so we can do something like
>
> But other CPUs can read compaction_in_progress, even if they don't
> modify it.
Yes, well, that's the nature of that flag.
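
For the record, the flag is just a "one compactor at a time" guard,
roughly (from memory, modulo exact names):

	/* zs_compact() */
	if (atomic_xchg(&pool->compaction_in_progress, 1))
		return 0;	/* somebody else is already compacting */

	/* ... do the actual compaction ... */

	atomic_set(&pool->compaction_in_progress, 0);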
I'll drop the patch, I guess. It's not worth our time.