linux-kernel - Re: [PATCH v4 00/30] NT synchronization primitive driver

Message-ID: <23472492.6Emhk5qWAg@terabithia>
Date: Tue, 16 Apr 2024 16:18:17 -0500
From: Elizabeth Figura <zfigura@...eweavers.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Arnd Bergmann <arnd@...db.de>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
 Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>,
 linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
 wine-devel@...ehq.org,
 André Almeida <andrealmeid@...lia.com>,
 Wolfram Sang <wsa@...nel.org>, Arkadiusz Hiler <ahiler@...eweavers.com>,
 Andy Lutomirski <luto@...nel.org>, linux-doc@...r.kernel.org,
 linux-kselftest@...r.kernel.org, Randy Dunlap <rdunlap@...radead.org>,
 Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
 Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tuesday, 16 April 2024 11:19:17 CDT Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 05:53:45PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 16, 2024 at 05:50:14PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:
> > > > > Some aspects of the implementation may deserve particular comment:
> > > > > 
> > > > > > * In the interest of performance, each object is governed only by a
> > > > > >   single spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the
> > > > > >   state of multiple objects be changed as a single atomic operation.
> > > > > >   In order to achieve this, we first take a device-wide lock
> > > > > >   ("wait_all_lock") any time we are going to lock more than one
> > > > > >   object at a time.
> > > > > >
> > > > > >   The maximum number of objects that can be used in a vectored wait,
> > > > > >   and therefore the maximum that can be locked simultaneously, is 64.
> > > > > >   This number is NT's own limit.
> > > 
> > > AFAICT:
> > > 	spin_lock(&dev->wait_all_lock);
> > > 	
> > > 	  list_for_each_entry(entry, &obj->all_waiters, node)
> > > 	  
> > > 	    for (i=0; i<count; i++)
> > > 	    
> > > 	      spin_lock_nest_lock(q->entries[i].obj->lock,
> > > 	      &dev->wait_all_lock);
> > > 
> > > Where @count <= NTSYNC_MAX_WAIT_COUNT.
> > > 
> > > So while this nests at most 65 spinlocks, there is no actual bound on
> > > the amount of nested lock sections in total. That is, all_waiters list
> > > can be grown without limits.
> > > 
> > > Can we pretty please make wait_all_lock a mutex ?

That should be fine, at least.

> > Hurmph, it's worse, you do that list walk while holding some obj->lock
> > spinlock too. Still need to figure out how all that works....
> 
> So the point of having that other lock around is so that things like:
> 
> 	try_wake_all_obj(dev, sem)
> 	try_wake_any_sem(sem)
> 
> are done under the same lock?

The point of having the other lock around is that try_wake_all() needs to lock 
multiple objects at the same time. It's a way of avoiding lock inversion.

Consider a task A that does a wait-for-all on objects X, Y, and Z. Task B then
signals Y, so we call try_wake_all_obj() on Y, which calls try_wake_all() on
A's queue entry; that needs to check X and Z and consume the state of all
three objects atomically. Another task could be signaling Z at the same time
and hit a task waiting on Z, Y, X. The two wakers would then take the same
object locks in opposite orders, and that causes inversion.
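
To make the inversion concrete, here is a rough userspace sketch. It is not
driver code: the function name is made up, pthread mutexes stand in for the
per-object spinlocks, and a single thread plays both wakers so the result is
deterministic. Each waker takes its first lock, then probes for its second
with pthread_mutex_trylock() instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical demonstration: waker 1 locks in the order X, Y, Z and
 * waker 2 in the order Z, Y, X. Y is left out because the first and last
 * locks are enough to show the cycle.
 */
bool sketch_opposite_orders_block(void)
{
	pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
	pthread_mutex_t z = PTHREAD_MUTEX_INITIALIZER;
	int rc, first, second;

	rc = pthread_mutex_lock(&x);	/* waker 1 holds X */
	assert(rc == 0);
	rc = pthread_mutex_lock(&z);	/* waker 2 holds Z */
	assert(rc == 0);
	(void)rc;

	/* Each waker now wants the lock the other one holds. */
	first = pthread_mutex_trylock(&z);	/* waker 1 wants Z */
	second = pthread_mutex_trylock(&x);	/* waker 2 wants X */

	pthread_mutex_unlock(&z);
	pthread_mutex_unlock(&x);
	pthread_mutex_destroy(&z);
	pthread_mutex_destroy(&x);

	/* With blocking locks both wakers would wait forever here. */
	return first == EBUSY && second == EBUSY;
}
```

If both probes report EBUSY, then with real blocking locks neither waker
could make progress, which is the situation the outer lock is there to
prevent.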

The simple and easy way to implement everything is just to have a global lock 
on the whole device, but this is kind of known to be a performance bottleneck 
(this was NT's BKL, and they ditched it starting with Vista or 7 or 
something).

Instead we use a lock per object. Normally, in the wait-for-any case, we only
ever need to grab one lock at a time. When we need to do a wait-for-all, we
need to lock multiple objects at once, and we grab the outer lock first to
avoid potential lock inversion.
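
As a rough illustration of that scheme, here is a minimal userspace sketch.
It assumes pthread mutexes in place of the kernel locks, a semaphore-like
counter as the only object state, and distinct objects in each set; every
sketch_* name is hypothetical and nothing here is taken from the patches
themselves.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WAIT_COUNT 64	/* limit quoted in the cover letter */

/* Hypothetical userspace stand-ins; not the driver's real structures. */
struct sketch_obj {
	pthread_mutex_t lock;	/* per-object lock */
	int count;		/* semaphore-like state, > 0 means signaled */
};

struct sketch_dev {
	pthread_mutex_t wait_all_lock;	/* outer lock for multi-object sections */
};

void sketch_dev_init(struct sketch_dev *dev)
{
	pthread_mutex_init(&dev->wait_all_lock, NULL);
}

void sketch_obj_init(struct sketch_obj *obj, int count)
{
	pthread_mutex_init(&obj->lock, NULL);
	obj->count = count;
}

/* Single-object path: one lock, no outer lock needed. */
bool sketch_try_consume_one(struct sketch_obj *obj)
{
	bool ok;

	pthread_mutex_lock(&obj->lock);
	ok = obj->count > 0;
	if (ok)
		obj->count--;
	pthread_mutex_unlock(&obj->lock);
	return ok;
}

/*
 * Multi-object path: take the outer lock first, so only one thread at a
 * time ever holds more than one object lock. The order in which the
 * object locks are then taken no longer matters. Assumes the objects in
 * objs[] are distinct.
 */
bool sketch_try_consume_all(struct sketch_dev *dev,
			    struct sketch_obj **objs, size_t n)
{
	bool ok = true;
	size_t i;

	assert(n <= SKETCH_MAX_WAIT_COUNT);

	pthread_mutex_lock(&dev->wait_all_lock);
	for (i = 0; i < n; i++)
		pthread_mutex_lock(&objs[i]->lock);

	for (i = 0; i < n; i++)
		if (objs[i]->count <= 0)
			ok = false;

	/* Consume every object or none of them. */
	if (ok)
		for (i = 0; i < n; i++)
			objs[i]->count--;

	for (i = n; i-- > 0;)
		pthread_mutex_unlock(&objs[i]->lock);
	pthread_mutex_unlock(&dev->wait_all_lock);
	return ok;
}
```

In this sketch a caller may pass the same objects as X, Y, Z or as Z, Y, X.
Both orders are safe because the single-object path never holds one object
lock while waiting for another.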

> Where I seem to note that both those functions do that same list
> iteration.

Over different lists. I don't know if there's a better way to name things to 
make that clearer.

There's the "any" wait queue, which tasks doing a wait-for-any add themselves
to, and the "all" wait queue, which tasks doing a wait-for-all add themselves
to. Signaling an object could potentially wake a task on either one, but
checking whether a task is eligible is a different process for each queue.
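
A small sketch of the difference, with the caveat that it is a
single-threaded model with all locking left out, fixed-size arrays instead of
kernel lists, and hypothetical sketch_* names throughout. It only models a
counter-style object, and it walks the "all" queue before the "any" queue as
an assumption based on the call order quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WAIT_COUNT 64	/* limit quoted in the cover letter */
#define SKETCH_QUEUE_LEN 8		/* arbitrary capacity for this sketch */

struct sketch_obj;

/* One waiting task and the set of objects it waits on. */
struct sketch_waiter {
	struct sketch_obj *objs[SKETCH_MAX_WAIT_COUNT];
	size_t nobjs;
	bool woken;
};

struct sketch_obj {
	int count;	/* > 0 means signaled */
	struct sketch_waiter *any_waiters[SKETCH_QUEUE_LEN];
	size_t n_any;
	struct sketch_waiter *all_waiters[SKETCH_QUEUE_LEN];
	size_t n_all;
};

void sketch_queue_any(struct sketch_obj *obj, struct sketch_waiter *w)
{
	assert(obj->n_any < SKETCH_QUEUE_LEN);
	obj->any_waiters[obj->n_any++] = w;
}

void sketch_queue_all(struct sketch_obj *obj, struct sketch_waiter *w)
{
	assert(obj->n_all < SKETCH_QUEUE_LEN);
	obj->all_waiters[obj->n_all++] = w;
}

/* "all" eligibility: every object in the waiter's set must be signaled. */
static bool sketch_all_signaled(const struct sketch_waiter *w)
{
	size_t i;

	for (i = 0; i < w->nobjs; i++)
		if (w->objs[i]->count <= 0)
			return false;
	return true;
}

/* Signal one object, then check both of its queues. Returns tasks woken. */
int sketch_signal(struct sketch_obj *obj)
{
	int woken = 0;
	size_t i, j;

	obj->count++;

	for (i = 0; i < obj->n_all; i++) {
		struct sketch_waiter *w = obj->all_waiters[i];

		if (w->woken || !sketch_all_signaled(w))
			continue;
		for (j = 0; j < w->nobjs; j++)
			w->objs[j]->count--;	/* consume the whole set */
		w->woken = true;
		woken++;
	}

	/* "any" eligibility: only the signaled object itself matters. */
	for (i = 0; i < obj->n_any; i++) {
		struct sketch_waiter *w = obj->any_waiters[i];

		if (w->woken || obj->count <= 0)
			continue;
		obj->count--;
		w->woken = true;
		woken++;
	}
	return woken;
}
```

The "any" check looks at one object only, while the "all" check has to look
at objects other than the one that was signaled. That is why the "all" path
is the one that ends up needing more than one lock at a time.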