Message-ID: <Z6JMIKlO9U5P-5gE@google.com>
Date: Tue, 4 Feb 2025 17:19:28 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Minchan Kim <minchan@...nel.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

On Tue, Feb 04, 2025 at 03:59:42PM +0900, Sergey Senozhatsky wrote:
> On (25/02/03 21:11), Yosry Ahmed wrote:
> > > > We also lose some debugging capabilities as Hilf pointed out in another
> > > > patch.
> > > 
> > > So the zspage lock should not have been a lock in the first place, I think;
> > > it's a ref-counter, and it's being used as one:
> > > 
> > > map()
> > > {
> > > 	/* pin the zspage: migration must leave it alone while mapped */
> > > 	atomic_inc(&zspage->users);
> > > }
> > > 
> > > unmap()
> > > {
> > > 	atomic_dec(&zspage->users);
> > > }
> > > 
> > > migrate()
> > > {
> > > 	/* only migrate zspages that nobody has mapped */
> > > 	if (!atomic_read(&zspage->users))
> > > 		migrate_page();
> > > }
> > 
> > Hmm, but in this case we want migration to block new map/unmap
> > operations. So a vanilla refcount won't work.
> 
> Yeah, correct - migration needs to drive the counter negative so that map()
> would wait until it's non-negative again.
> 
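Something like the below, I suppose - just a sketch, where "users" is an
illustrative atomic_t in struct zspage and the function names are made up:

	/* map() side: wait while migration parks the counter negative */
	static void zspage_map(struct zspage *zspage)
	{
		while (!atomic_inc_unless_negative(&zspage->users))
			cpu_relax();
	}

	static void zspage_unmap(struct zspage *zspage)
	{
		atomic_dec(&zspage->users);
	}

	/* migration side: lock out new mappers iff none are active */
	static bool zspage_migrate_trylock(struct zspage *zspage)
	{
		return atomic_cmpxchg(&zspage->users, 0, INT_MIN) == 0;
	}

	static void zspage_migrate_unlock(struct zspage *zspage)
	{
		atomic_set(&zspage->users, 0);
	}
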
> > > > Just my 2c.
> > > 
> > > Perhaps we can sprinkle some lockdep on it.  For instance:
> > 
> > Honestly this looks like more reason to use existing lock primitives to
> > me. What are the candidates? I assume rw_semaphore, anything else?
> 
> Right, rwsem "was" the first choice.
> 
> > I guess the main reason you didn't use a rw_semaphore is the extra
> > memory usage.
> 
> The sizeof(struct zspage) change is one thing.  Another thing is that
> zspage->lock is taken from atomic sections pretty much everywhere.
> Compaction/migration write-lock it under the pool rwlock and the class
> spinlock, but both compaction and migration now bail out with -EAGAIN if
> the lock is already taken, so that part is sorted out.
> 
> The remaining problem is map(), which takes the zspage read-lock under the
> pool rwlock.  The RFC series (which you hated with passion :P) converted all
> zsmalloc locks into preemptible ones because of this - zspage->lock is a
> nested leaf-lock, so it cannot schedule unless the locks it's nested under
> permit it (and, needless to say, neither rwlock nor spinlock do).

Hmm, so we want the lock to be preemptible, but we don't want to use an
existing preemptible lock because it may be held from atomic context.

I think one problem here is that the lock you are introducing is a spinning
lock whose holder can be preempted.  This is exactly why spinning locks do
not allow preemption: other waiters can end up spinning on a task that has
been scheduled out.

For example, the compaction/migration code could be sleeping while holding
the write lock, and a map() call would spin waiting for that sleeping task.
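
Concretely, the bad interleaving would look something like this (pseudo-C,
not the actual code; the lock helpers are placeholders):

	/* CPU 0: migration, holding the write side of the zspage lock */
	zspage_write_lock(zspage);
	...
	schedule();			/* holder sleeps or is preempted */
	...
	zspage_write_unlock(zspage);

	/* CPU 1: map() on the same zspage */
	zspage_read_lock(zspage);	/* spins, burning CPU, until the
					 * sleeping writer runs again */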

I wonder if there's a way to rework the locking instead to avoid the
nesting. It seems like sometimes we lock the zspage with the pool lock
held, sometimes with the class lock held, and sometimes with no lock
held.

What are the rules here for acquiring the zspage lock? Do we need to
hold another lock just to make sure the zspage does not go away from
under us? Can we use RCU or something similar to do that instead?
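
Something along these lines, perhaps - very hand-wavy, lookup_zspage() and
get_zspage_ref() are made-up helpers, and it assumes zspages are freed only
after an RCU grace period:

	rcu_read_lock();
	zspage = lookup_zspage(pool, handle);	/* made-up helper */
	if (!zspage || !get_zspage_ref(zspage)) {
		/* the zspage is being freed, retry or fail */
		rcu_read_unlock();
		return NULL;
	}
	rcu_read_unlock();

	/*
	 * No pool/class lock is held at this point, so even a sleepable
	 * per-zspage lock (e.g. an rwsem) would be fine here.
	 */
	zspage_read_lock(zspage);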

> 
> > Seems like it uses ~32 bytes more than rwlock_t on x86_64.
> > That's per zspage. Depending on how many compressed pages we have
> > per-zspage this may not be too bad.
> 
> So on a 16GB laptop our memory pressure test used approx 1M zspages at peak.
> That is 32 bytes * 1M ~ 32MB of extra memory use.  Not alarmingly much -
> less than what a single browser tab needs nowadays.  I suppose on 4GB/8GB
> devices it will be even smaller (because those devices generate fewer
> zspages).  The numbers are not the main issue, however.
> 
