Message-ID: <808181ffe87d83f8cb36ebb4afbf6cd90778c763.camel@kernel.org>
Date: Sat, 03 Aug 2024 07:32:05 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner
	 <brauner@...nel.org>, Jan Kara <jack@...e.cz>, Andrew Morton
	 <akpm@...ux-foundation.org>, Josef Bacik <josef@...icpanda.com>, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle
 contention better

On Sat, 2024-08-03 at 13:21 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@...nel.org> wrote:
> > 
> > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@...il.com> wrote:
> > > > 
> > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > > When we tested that initially, performance degraded significantly
> > > > > due to contention on the parent's d_lockref spinlock.
> > > > > 
> > > > > There are two problems with lockrefs today: First, once any task takes
> > > > > the spinlock, all concurrent tasks end up taking it as well, which is
> > > > > much more costly than a single cmpxchg operation. The second problem is
> > > > > that once any task fails its cmpxchg 100 times, it falls back to the
> > > > > spinlock. The upshot is that even moderate contention can cause a
> > > > > fallback to serialized spinlocking, which worsens performance.
> > > > > 
> > > > > This patch changes CMPXCHG_LOOP in two ways:
> > > > > 
> > > > > First, change the loop to spin instead of falling back to a locked
> > > > > codepath when the spinlock is held. Once the lock is released, let the
> > > > > task resume its cmpxchg loop as before instead of taking the lock.
> > > > > Second, don't allow the cmpxchg loop to give up after 100 retries;
> > > > > just retry indefinitely.
> > > > > 
> > > > > This greatly reduces contention on the lockref when there are large
> > > > > numbers of concurrent increments and decrements occurring.
> > > > > 
> > > > 
> > > > I already tried this, and unfortunately it can reduce performance.
> > > > 
> > > 
> > > Oh wait, I misread the patch based on what I tried there. Spinning
> > > indefinitely waiting for the lock to be free is a no-go, as it loses
> > > the forward-progress guarantee (the lock can end up being held
> > > continuously). Spinning only up to an arbitrary point wins in some
> > > tests and loses in others.
> > > 
> > 
> > I'm a little confused about the forward progress guarantee here. Does
> > that exist today at all? ISTM that falling back to spin_lock() after a
> > certain number of retries doesn't guarantee any forward progress. You
> > can still just end up spinning on the lock forever once that happens,
> > no?
> > 
> 
> There is the implicit assumption that everyone holds locks for a
> finite time. I agree there are no guarantees otherwise if that's what
> you meant.
> 
> In this case, since spinlocks are queued, a constant stream of lock
> holders will make the lock appear taken indefinitely even if they all
> hold it for a short period.
> 
> Stock lockref will give up on atomics immediately, and thanks to
> queueing up on the lock it is still guaranteed to eventually change
> the ref.
> 
> Lockref as proposed in this patch won't be able to make progress as
> long as the lock trading is taking place.
> 

Got it, thanks. This spinning is very simplistic, so I can see how one
task could continually get shuffled to the end of the queue.
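
To make sure I follow, here's the failure mode as I understand it, as a
small userspace analogue of the patched loop (my paraphrase using C11
atomics; the bit-packing is made up for illustration, not the real
struct lockref layout):

#include <stdatomic.h>
#include <stdint.h>

#define LOCKED	1ull			/* bit 0: "spinlock held" */

static _Atomic uint64_t lockref;	/* bit 0: lock, upper bits: count */

static void toy_lockref_get(void)
{
	uint64_t old = atomic_load(&lockref);

	for (;;) {
		/* Patched behavior: spin with no retry cap while locked. */
		while (old & LOCKED)
			old = atomic_load(&lockref);

		/* Bump the count, which lives in bits 1 and up. */
		if (atomic_compare_exchange_weak(&lockref, &old, old + 2))
			return;

		/*
		 * The cmpxchg failed and "old" was refreshed. Under a
		 * constant stream of lock holders, this task can observe
		 * LOCKED on every iteration and never complete its get,
		 * whereas stock lockref would have joined the fair,
		 * queued spinlock and been guaranteed a turn.
		 */
	}
}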

> > > Either way, as described below, chances are decent that:
> > > 1. there is an easy way to avoid the lockref_get/put on the parent if
> > > the file is already there, dodging the problem
> > > ... and even if that's not true,
> > > 2. lockref can be ditched in favor of atomics. Apart from some minor
> > > refactoring this all looks perfectly doable, and I have a WIP. I will
> > > try to find the time next week to sort it out.
> > > 
> > 
> > Like I said in the earlier mail, I don't think we can stay in RCU mode
> > because of the audit_inode call. I'm definitely interested in your WIP
> > though!
> > 
> 
> Well, audit may be hackable so that it works under RCU most of the
> time, but that's not something I'm interested in.

Audit is not my favorite area of the kernel to work in either. I don't
see a good way to make it RCU-friendly, but I haven't looked too hard
yet. It would be nice to be able to do some of the auditing under RCU
or a spinlock.

>
> Sorting out the lockref situation would definitely help other things
> (notably opening the same file read-only).
> 

Indeed. It's clear that the current implementation is a real
scalability problem in a lot of situations.
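
For anyone following along, the fast path in lib/lockref.c today looks
roughly like this (paraphrased from memory and specialized for a "get";
the real code is the CMPXCHG_LOOP macro shared by all the operations):

	int retry = 100;
	struct lockref old;

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		struct lockref new = old;

		new.count++;			/* the per-op "CODE" step */
		if (try_cmpxchg64_relaxed(&lockref->lock_count,
					  &old.lock_count,
					  new.lock_count))
			return;			/* fast path succeeded */
		if (!--retry)
			break;			/* give up on atomics... */
	}
	spin_lock(&lockref->lock);		/* ...and serialize */

Both failure modes fall out of that: the arch_spin_value_unlocked()
check sends every task to the lock as soon as anyone holds it, and the
retry cap turns moderate contention into fully serialized locking.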

> Anyhow, one idea is to temporarily disable atomic ops with a flag in
> the counter; a fallback plan is to loosen lockref so that it can do
> transitions other than 0->1->2 with atomics, even while the lock is
> held.
> 
> I have not looked at this in over a month, so I'm going to need to
> refresh my memory on the details; I do remember there was some stuff
> to massage first.
> 
> Anyhow, I expect a working WIP some time in the upcoming week.
> 
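
Out of curiosity, is the flag idea roughly this shape? Total guess on
my part, just to check my understanding; the field and flag names here
are made up:

#define LOCKREF_NO_ATOMICS	(1U << 31)	/* hypothetical flag bit */

/*
 * Hypothetical fast path: the lock holder sets the flag to push
 * everyone off the atomic path and clears it again on unlock.
 */
static bool lockref_get_fast(struct lockref *lr)
{
	unsigned int old = READ_ONCE(lr->count);

	for (;;) {
		unsigned int prev;

		if (old & LOCKREF_NO_ATOMICS)
			return false;	/* holder opted us out; take the lock */

		prev = cmpxchg(&lr->count, old, old + 1);
		if (prev == old)
			return true;	/* got the reference atomically */
		old = prev;
	}
}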

Great, I'll stay tuned.

Thanks!
-- 
Jeff Layton <jlayton@...nel.org>
