Message-ID: <Z6O2oPP7lyRGXer_@google.com>
Date: Wed, 5 Feb 2025 19:06:08 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Minchan Kim <minchan@...nel.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

On Wed, Feb 05, 2025 at 11:43:16AM +0900, Sergey Senozhatsky wrote:
> On (25/02/04 17:19), Yosry Ahmed wrote:
> > > sizeof(struct zspage) change is one thing.  Another thing is that
> > > zspage->lock is taken from atomic sections, pretty much everywhere.
> > > compaction/migration write-lock it under pool rwlock and class spinlock,
> > > but both compaction and migration now bail out with -EAGAIN if the lock
> > > is already locked, so that is sorted out.
> > > 
> > > The remaining problem is map(), which takes zspage read-lock under pool
> > > rwlock.  The RFC series (which you hated with passion :P) converted all
> > > zsmalloc locks into preemptible ones because of this - zspage->lock is a
> > > nested leaf-lock, so it cannot schedule unless the locks it's nested
> > > under permit it (needless to say, neither rwlock nor spinlock permits that).
> > 
> > Hmm, so we want the lock to be preemptible, but we don't want to use an
> > existing preemptible lock because it may be held from atomic context.
> > 
> > I think one problem here is that the lock you are introducing is a
> > spinning lock whose holder can be preempted. This is exactly why
> > spinning locks do not allow preemption: other tasks waiting for the
> > lock can end up spinning while the holder is scheduled out.
> > 
> > For example, the compaction/migration code could be sleeping holding the
> > write lock, and a map() call would spin waiting for that sleeping task.
> 
> write-lock holders cannot sleep, that's the key part.
> 
> So the rules are:
> 
> 1) writer cannot sleep
>    - migration/compaction runs in atomic context and grabs the
>      write-lock only from atomic context
>    - the write-locking function disables preemption before lock(), just
>      to be safe, and enables it after unlock()
> 
> 2) writer does not spin waiting
>    - that's why there is only a write_try_lock function
>    - compaction and migration bail out when they cannot lock the
>      zspage
> 
> 3) readers can sleep and can spin waiting for a lock
>    - other (even preempted) readers don't block new readers
>    - writers don't sleep, they always unlock

That's useful, thanks. If we go with custom locking we need to document
this clearly and add debug checks where possible.
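
For the archives, here is a rough sketch of a lock with those semantics
(illustrative only: the names and the atomic-counter scheme are mine, not
the actual patch, and a real implementation would have to think harder
about memory ordering):

#include <linux/atomic.h>
#include <linux/preempt.h>
#include <linux/sched.h>
#include <linux/types.h>

/*
 * Hypothetical sketch, not the patch code: the lock word is 0 when
 * unlocked, >0 while readers hold it, -1 while the writer holds it.
 */
typedef struct {
	atomic_t state;
} zspage_lock_t;

#define ZSPAGE_LOCK_WRITER	(-1)

/* Rule 3: readers may spin (and even reschedule) while waiting; the
 * wait is bounded because the writer never sleeps under the lock. */
static void zspage_read_lock(zspage_lock_t *l)
{
	int old;

	for (;;) {
		old = atomic_read(&l->state);
		if (old == ZSPAGE_LOCK_WRITER) {
			cond_resched();
			continue;
		}
		if (atomic_cmpxchg(&l->state, old, old + 1) == old)
			return;
	}
}

static void zspage_read_unlock(zspage_lock_t *l)
{
	atomic_dec(&l->state);
}

/* Rules 1 and 2: the writer only try-locks, never spins, and runs its
 * critical section with preemption disabled so it cannot sleep. */
static bool zspage_write_trylock(zspage_lock_t *l)
{
	preempt_disable();
	if (atomic_cmpxchg(&l->state, 0, ZSPAGE_LOCK_WRITER) == 0)
		return true;
	preempt_enable();
	return false;	/* caller bails out, e.g. with -EAGAIN */
}

static void zspage_write_unlock(zspage_lock_t *l)
{
	atomic_set(&l->state, 0);
	preempt_enable();
}

The debug checks could then assert, for instance, that the write path
always runs with preemption disabled.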

> 
> > I wonder if there's a way to rework the locking instead to avoid the
> > nesting. It seems like sometimes we lock the zspage with the pool lock
> > held, sometimes with the class lock held, and sometimes with no lock
> > held.
> > 
> > What are the rules here for acquiring the zspage lock?
> 
> Most of that code was not written by me, but I think the rule is to
> disable "migration", be it via the pool lock or the class lock.

It seems like we're not holding either of these locks in
async_free_zspage() when we call lock_zspage(). Is it safe for a
different reason?

> 
> > Do we need to hold another lock just to make sure the zspage does not go
> > away from under us?
> 
> Yes, the zspage cannot go away via the "normal" path:
>    zs_free(last object) -> zspage becomes empty -> free zspage
> 
> so when we have an active map() it's only migration and compaction
> that can free the zspage (its content is migrated and so it becomes empty).
> 
> > Can we use RCU or something similar to do that instead?
> 
> Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> patterns the clients have.  I suspect we'd need to synchronize RCU every
> time a zspage is freed: in zs_free() [this one is complicated], in
> migration, or in compaction?  Sounds like an anti-pattern for RCU?

Can't we use kfree_rcu() instead of synchronizing? Not sure if this
would still be an antipattern tbh. It just seems like the current
locking scheme is really complicated :/
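
To make that concrete, the pattern I have in mind is the usual
call_rcu() deferral (a sketch; zspage_cachep is a hypothetical stand-in
for however zspages are allocated, and it assumes struct zspage can
afford an rcu_head, which runs straight into the sizeof() concern
above):

#include <linux/rcupdate.h>
#include <linux/slab.h>

static struct kmem_cache *zspage_cachep;	/* hypothetical cache */

struct zspage {
	/* ... existing fields ... */
	struct rcu_head rcu;	/* assumed size cost, see above */
};

static void free_zspage_rcu(struct rcu_head *head)
{
	struct zspage *zspage = container_of(head, struct zspage, rcu);

	kmem_cache_free(zspage_cachep, zspage);
}

/*
 * Instead of freeing immediately, defer past a grace period so that a
 * reader that looked the zspage up under rcu_read_lock() stays safe.
 */
static void release_zspage(struct zspage *zspage)
{
	call_rcu(&zspage->rcu, free_zspage_rcu);
}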
