linux-kernel - Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

Message-ID: <6vtpamir4bvn3snlj36tfmnmpcbd6ks6m3sdn7ewmoles7jhau@nbezqbnoukzv>
Date: Wed, 5 Feb 2025 11:43:16 +0900
From: Sergey Senozhatsky <senozhatsky@...omium.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>, 
	Andrew Morton <akpm@...ux-foundation.org>, Minchan Kim <minchan@...nel.org>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

On (25/02/04 17:19), Yosry Ahmed wrote:
> > sizeof(struct zspage) change is one thing.  Another thing is that
> > zspage->lock is taken from atomic sections, pretty much everywhere.
> > compaction/migration write-lock it under pool rwlock and class spinlock,
> > but both compaction and migration now EAGAIN if the lock is locked
> > already, so that is sorted out.
> > 
> > The remaining problem is map(), which takes zspage read-lock under pool
> > rwlock.  RFC series (which you hated with passion :P) converted all zsmalloc
> > locks into preemptible ones because of this - zspage->lock is a nested leaf-lock,
> > so it cannot schedule unless locks it's nested under permit it (needless to
> > say neither rwlock nor spinlock permit it).
> 
> Hmm, so we want the lock to be preemptible, but we don't want to use an
> existing preemptible lock because it may be taken from atomic context.
> 
> I think one problem here is that the lock you are introducing is a
> spinning lock but the lock holder can be preempted. This is why spinning
> locks do not allow preemption. Others waiting for the lock can spin
> waiting for a process that is scheduled out.
> 
> For example, the compaction/migration code could be sleeping holding the
> write lock, and a map() call would spin waiting for that sleeping task.

write-lock holders cannot sleep, that's the key part.

So the rules are:

1) writer cannot sleep
   - migration/compaction runs in atomic context and grabs
     write-lock only from atomic context
   - write-locking function disables preemption before lock(), just to be
     safe, and enables it after unlock()

2) writer does not spin waiting
   - that's why there is only a write_try_lock function
   - compaction and migration bail out when they cannot lock the
     zspage

3) readers can sleep and can spin waiting for a lock
   - other (even preempted) readers don't block new readers
   - writers don't sleep, they always unlock
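
To make the three rules concrete, here is a minimal userspace sketch using
C11 atomics.  It is not the code from the patch: the zsl_* names and the
single-counter layout are made up for illustration, and the preemption
disabling from rule 1 has no userspace equivalent, so it only appears as
a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: state > 0 counts readers, -1 means write-locked. */
#define ZSL_UNLOCKED      0
#define ZSL_WRITE_LOCKED  (-1)

struct zsl_lock {
	atomic_int state;
};

static void zsl_init(struct zsl_lock *l)
{
	atomic_init(&l->state, ZSL_UNLOCKED);
}

/* Rule 3: readers may spin, but only while a writer holds the lock. */
static void zsl_read_lock(struct zsl_lock *l)
{
	int old = atomic_load_explicit(&l->state, memory_order_relaxed);

	for (;;) {
		if (old == ZSL_WRITE_LOCKED) {
			/* The writer cannot sleep, so this wait is short. */
			old = atomic_load_explicit(&l->state,
						   memory_order_relaxed);
			continue;
		}
		/* Other readers, even preempted ones, do not block us. */
		if (atomic_compare_exchange_weak_explicit(&l->state, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return;
	}
}

static void zsl_read_unlock(struct zsl_lock *l)
{
	atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
}

/* Rule 2: the writer never waits, it either gets the lock or bails out. */
static bool zsl_write_trylock(struct zsl_lock *l)
{
	int old = ZSL_UNLOCKED;

	/* Rule 1: a kernel version would disable preemption around here. */
	return atomic_compare_exchange_strong_explicit(&l->state, &old,
			ZSL_WRITE_LOCKED, memory_order_acquire,
			memory_order_relaxed);
}

static void zsl_write_unlock(struct zsl_lock *l)
{
	atomic_store_explicit(&l->state, ZSL_UNLOCKED, memory_order_release);
	/* ... and re-enable preemption here. */
}

/* Single-threaded walk through the rules; returns 0 on success. */
static int zsl_demo(void)
{
	struct zsl_lock l;
	bool got;

	zsl_init(&l);
	zsl_read_lock(&l);
	zsl_read_lock(&l);		/* a second reader is not blocked */
	got = zsl_write_trylock(&l);	/* writer sees readers, bails out */
	assert(!got);
	zsl_read_unlock(&l);
	zsl_read_unlock(&l);
	got = zsl_write_trylock(&l);	/* no readers left, writer wins */
	assert(got);
	zsl_write_unlock(&l);
	return 0;
}
```

In this model a caller in the compaction or migration role would treat a
false return from zsl_write_trylock() as "skip this zspage and retry
later", which is the EAGAIN behaviour mentioned earlier.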

> I wonder if there's a way to rework the locking instead to avoid the
> nesting. It seems like sometimes we lock the zspage with the pool lock
> held, sometimes with the class lock held, and sometimes with no lock
> held.
> 
> What are the rules here for acquiring the zspage lock?

Most of that code is not written by me, but I think the rule is to disable
"migration", be it via the pool lock or the class lock.

> Do we need to hold another lock just to make sure the zspage does not go
> away from under us?

Yes, the page cannot go away via "normal" path:
   zs_free(last object) -> zspage becomes empty -> free zspage

So when there is an active mapping, only migration and compaction
can free the zspage (its content is migrated, so it becomes empty).

> Can we use RCU or something similar to do that instead?

Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
patterns the clients have.  I suspect we'd need to synchronize RCU every
time a zspage is freed: zs_free() [this one is complicated], or migration,
or compaction?  Sounds like an anti-pattern for RCU?