Message-Id: <20240126122230.838-1-hdanton@sina.com>
Date: Fri, 26 Jan 2024 20:22:30 +0800
From: Hillf Danton <hdanton@...a.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] locking/percpu-rwsem: do not do lock handoff in percpu_up_write
On Thu, 25 Jan 2024 13:08:02 -0800 Benjamin Segall <bsegall@...gle.com>
> Hillf Danton <hdanton@...a.com> writes:
> > On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@...gle.com>
> >> Hillf Danton <hdanton@...a.com> writes:
> >> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@...gle.com>
> >> >> So the actual problem we saw was that one job had severe slowdowns
> >> >> during startup with certain other jobs on the machine, and the slowdowns
> >> >> turned out to be some cgroup moves it did during startup. The antagonist
> >> >> jobs were spawning huge numbers of threads and some other internal bugs
> >> >> were exacerbating their contention. The lock handoff meant that a batch
> >> >> of antagonist threads would receive the read lock of
> >> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
> >> >> long time to be scheduled.
> >> >
> >> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> >> > in rwsem_down_read_slowpath().
> >>
> >> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
> >
> > You and I are not on the same page.
> >
> >> Percpu-rwsem's current code has perfect handoff for read->write, and a very
> >> short window for write->read (or write->write) to be beaten by a new writer.
> >
> > Given no chance left for spin on owner who is legal to take a ten-minute nap,
> > the right thing known to do on behalf of starved waiters is to add the HANDOFF
> > mechanism without any heuristic like you proposed for instance, in order to
> > force lock acquirers to go the slow path.
> >
> > Only for thoughts.
>
> This is not the type of slowdown that is the problem my patch is trying
> to address. (And due to the way percpu-rwsem works sem->ww is nearly
> entirely redundant with sem->block - the first waiting writer is instead
> waiting on rcuwait and holds sem->block while doing so)
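For context, a rough sketch of the write-side slow path referred to above,
paraphrased from kernel/locking/percpu-rwsem.c (might_sleep/lockdep
annotations and barrier comments dropped, details approximate):

/*
 * The first waiting writer already owns sem->block and parks on
 * sem->writer (an rcuwait) until the readers drain, which is why a
 * separate sem->ww flag mostly duplicates sem->block.
 */
void percpu_down_write(struct percpu_rw_semaphore *sem)
{
	/* Force new readers into the slow path. */
	rcu_sync_enter(&sem->rss);

	/* Writer-writer exclusion: whoever sets sem->block owns the lock. */
	if (!__percpu_down_write_trylock(sem))
		percpu_rwsem_wait(sem, /* .reader = */ false);

	/* Still holding sem->block, sleep on rcuwait until readers drain. */
	rcuwait_wait_event(&sem->writer, readers_active_check(sem),
			   TASK_UNINTERRUPTIBLE);
}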
>
> The problem that my patch addresses is:
>
> Writer is done: percpu_up_write
> atomic_set_release(&sem->block, 0); // #1
> wake a batch of readers:
> percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
> wake a single writer
> percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
> new writer wakes up (holding sem->block from #3)
> sees the readers holding the lock from #2, now sleeps on rcuwait
> time passes // #4
> readers finally get to run, run quickly and release the lock
> now the writer gets to run
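And a similarly simplified sketch of the release/wakeup path behind #1..#3,
again paraphrased from kernel/locking/percpu-rwsem.c with the waitqueue
bookkeeping and task refcounting left out:

void percpu_up_write(struct percpu_rw_semaphore *sem)
{
	atomic_set_release(&sem->block, 0);		/* #1: lock is free */
	__wake_up(&sem->waiters, TASK_NORMAL, 1, sem);	/* runs the callback below */
	rcu_sync_exit(&sem->rss);
}

/* Wake callback: takes the lock on behalf of each sleeper (the handoff). */
static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
				      unsigned int mode, int wake_flags,
				      void *key)
{
	struct percpu_rw_semaphore *sem = key;
	bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;

	/* #2 for each queued reader, #3 for the first queued writer. */
	if (!__percpu_rwsem_trylock(sem, reader))
		return 1;	/* lock stolen by a new acquirer, stop waking */

	wake_up_process(wq_entry->private);
	return !reader;		/* keep waking readers, stop after one writer */
}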
>
> Currently the only source of unfairness/optimistic locking is the window
> between #1 and #2, which occur in quick succession, on the same thread,
> and with no SPIN_ON_OWNER to make this window more likely than it
> otherwise would be.
The sem->ww introduced closes the window between #1 and #2 by definition,
as it is derived from rwsem's HANDOFF.
>
> My patch makes the entire #4 available to writers (or new readers), so
> that the woken writer will instead get to run immediately. This is
Victims arise in case the woken readers at #2 have been waiting for more
than a minute while the woken writer has waited for less than 20ms.
> obviously much less fair, but provides much better throughput (ideally
> it might have some sort of delay, so that in more normal circumstances
> readers don't have to win the wakeup race by luck and being woken
> slightly sooner, but I don't have that).
>
> This is also only useful because of the assumption that readers will
> almost always not actually block (among other required assumptions) - if
Like a heuristic, any assumption makes the locking game more complex than
expected, without a real win.
> they regularly sleep while holding the lock, then saving writers from
> that first wakeup latency of readers isn't particularly helpful anymore,
> because they'll still be delayed by the other wakeup latencies.