Message-ID: <etumn4tax7g5c3wygn2aazmo5m7f4ydfji7ehno5i6jckkf27e@mu3fisrw5jcc>
Date: Thu, 13 Feb 2025 10:20:26 +0900
From: Sergey Senozhatsky <senozhatsky@...omium.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <ryncsn@...il.com>, Minchan Kim <minchan@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
On (25/02/12 17:14), Yosry Ahmed wrote:
> On Wed, Feb 12, 2025 at 03:27:10PM +0900, Sergey Senozhatsky wrote:
> > Switch over from rwlock_t to an atomic_t variable that takes a negative
> > value when the page is under migration, or positive values when the
> > page is used by zsmalloc users (object map, etc.). Using a per-zspage
> > rwsem is a little too memory heavy; a simple atomic_t should suffice.
>
> We should also explain that rwsem cannot be used due to the locking
> context (we need to hold it in an atomic context). Basically what you
> explained to me before :)
>
> > zspage lock is a leaf lock for zs_map_object(), where it's read-acquired.
> > Since this lock now permits preemption, extra care needs to be taken when
> > it is write-acquired - all writers grab it in atomic context, so they
> > cannot spin and wait for a (potentially preempted) reader to unlock zspage.
> > There are only two writers at this moment - migration and compaction. In
> > both cases we use write-try-lock and bail out if zspage is read-locked.
> > Writers, on the other hand, never get preempted, so readers can spin
> > waiting for the writer to unlock zspage.
>
> The details are important, but I think we want to concisely state the
> problem statement either before or after. Basically we want a lock that
> we *never* sleep while acquiring but *can* sleep while holding in read
> mode. This allows holding the lock from any context, but also being
> preemptible if the context allows it.
Ack.
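For anyone skimming the thread: the encoding this boils down to is
roughly the following (a sketch of the patch, using the ZS_PAGE_* names
from the hunks quoted below):

#define ZS_PAGE_UNLOCKED	0
#define ZS_PAGE_WRLOCKED	-1

/*
 * zspage->lock is an atomic_t:
 *	ZS_PAGE_WRLOCKED (-1)	- write-locked (migration/compaction)
 *	ZS_PAGE_UNLOCKED  (0)	- unlocked
 *	n > 0			- read-locked by n users (zs_map_object() etc.)
 */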
[..]
> > +/*
> > + * zspage locking rules:
>
> Also here we need to state our key rule:
> Never sleep while acquiring, preemptible while holding (if possible). The
> following rules are basically how we make sure we keep this true.
>
> > + *
> > + * 1) writer-lock is exclusive
> > + *
> > + * 2) writer-lock owner cannot sleep
> > + *
> > + * 3) writer-lock owner cannot spin waiting for the lock
> > + * - caller (e.g. compaction and migration) must check return value and
> > + * handle locking failures
> > + * - there is only a TRY variant of the writer-lock function
> > + *
> > + * 4) reader-lock owners (multiple) can sleep
> > + *
> > + * 5) reader-lock owners can spin waiting for the lock, in any context
> > + * - existing readers (even preempted ones) don't block new readers
> > + * - writer-lock owners never sleep, always unlock at some point
>
>
> May I suggest something more concise and to the point?
>
> /*
> * The zspage lock can be held from atomic contexts, but it needs to remain
> * preemptible when held for reading because it remains held outside of those
> * atomic contexts, otherwise we unnecessarily lose preemptibility.
> *
> * To achieve this, the following rules are enforced on readers and writers:
> *
> * - Writers are blocked by both writers and readers, while readers are only
> * blocked by writers (i.e. normal rwlock semantics).
> *
> * - Writers are always atomic (to allow readers to spin waiting for them).
> *
> * - Writers always use trylock (as the lock may be held by sleeping readers).
> *
> * - Readers may spin on the lock (as they can only wait for atomic writers).
> *
> * - Readers may sleep while holding the lock (as writes only use trylock).
> */
Looks good, thanks.
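(For completeness, both the counter and the lockdep map used below live
in struct zspage, roughly:

struct zspage {
	...
	atomic_t lock;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map lockdep_map;
#endif
};

the lockdep_map is what lets us reuse the rwsem lockdep annotations for
this hand-rolled lock.)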
> > + */
> > +static void zspage_read_lock(struct zspage *zspage)
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = atomic_read_acquire(lock);
> > +
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > +#endif
> > +
> > + do {
> > + if (old == ZS_PAGE_WRLOCKED) {
> > + cpu_relax();
> > + old = atomic_read_acquire(lock);
> > + continue;
> > + }
> > + } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
> > +}
> > +
> > +static void zspage_read_unlock(struct zspage *zspage)
> > +{
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + rwsem_release(&zspage->lockdep_map, _RET_IP_);
> > +#endif
> > + atomic_dec_return_release(&zspage->lock);
> > +}
> > +
> > +static __must_check bool zspage_try_write_lock(struct zspage *zspage)
>
> I believe zspage_write_trylock() would be closer to the normal rwlock
> naming.
It derived its name from the rwsem API. Can rename.
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = ZS_PAGE_UNLOCKED;
> > +
> > + WARN_ON_ONCE(preemptible());
>
> Hmm I know I may have been the one suggesting this, but do we actually
> need it? We disable preemption explicitly anyway before holding the
> lock.
This is just to make sure that the precondition for
"writer is always atomic" is satisfied. But I can drop it.
> > size_class_lock(class);
> > - /* the migrate_write_lock protects zpage access via zs_map_object */
> > - migrate_write_lock(zspage);
> > + /* the zspage write_lock protects zpage access via zs_map_object */
> > + if (!zspage_try_write_lock(zspage)) {
> > + size_class_unlock(class);
> > + pool_write_unlock(pool);
> > + return -EINVAL;
> > + }
> > +
> > + /* We're committed, tell the world that this is a Zsmalloc page. */
> > + __zpdesc_set_zsmalloc(newzpdesc);
>
> We used to do this earlier on, before any locks are held. Why is it
> moved here?
I want to do that only if zspage write-trylock has succeeded (we didn't
have any error-out paths before).
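
FWIW, the matching write-unlock on the success path just puts the lock
back to UNLOCKED with release semantics, mirroring zspage_read_unlock()
(again a sketch):

static void zspage_write_unlock(struct zspage *zspage)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
	atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
}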