linux-kernel - Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YjoCqqs+UDxZ55nM@slm.duckdns.org>
Date:   Tue, 22 Mar 2022 07:08:58 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Al Viro <viro@...iv.linux.org.uk>
Cc:     Imran Khan <imran.f.khan@...cle.com>, gregkh@...uxfoundation.org,
        akpm@...ux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed
 rwsems.

Hello,

On Tue, Mar 22, 2022 at 02:40:54AM +0000, Al Viro wrote:
> Sorry, misread that thing - the reason it copies the damn thing at all is
> the use of strsep().  Yecch...  Rule of the thumb regarding strsep() use,
> be it in kernel or in the userland: don't.  It's almost never the right
> primitive to use.

Yeah, it's ugly. I was being lazy on top of the existing interface.

> Lookups should use qstr; it has both the length and place for hash.
> Switch kernfs_find_ns() to that (and lift the calculation of length
> into the callers that do not have it - note that kernfs_iop_lookup()
> does) and you don't need the strsep() shite (or copying) anymore.

and yeah this would be cleaner.

> That would allow for kernfs_walk_ns() to take kernfs_rename_lock shared.
> 
> HOWEVER, that's not the only lock needed there and this patchset is
> broken in that respect - it locks the starting node, then walks the
> path.  Complete with lookups in rbtrees of children in the descendents
> of that node and those are *not* locked.
> 
> > > Wait a sec; what happens if e.g. kernfs_path_from_node() races with
> > > __kernfs_remove()?  We do _not_ clear ->parent, but we do drop references
> > > that used to pin what it used to point to, unless I'm misreading that
> > > code...  Or is it somehow prevented by drain-related logics?  Seeing
> > > that it seems to be possible to have kernfs_path_from_node() called from
> > > an interrupt context, that could be delicate...
> > 
> > kernfs_remove() is akin to freeing of the node and all its descendants. The
> > caller shouldn't be racing that against any other operations in the subtree.
> 
> That's interesting...  My impression had been that some of these functions
> could be called from interrupt contexts (judging by the spin_lock_irqsave()
> in there).  What kind of async contexts those are, and what do you use to
> make sure they don't leak into overlap with kernfs_remove()?

The spin_lock_irqsave()'s are there because they're often used when printing
messages which can happen from any context. e.g. cpuset ends up calling into
them to print current's cgroup under rcu_read_lock(), iocost to print
warning message under an irq-safe lock. In both and similar cases, the
caller knows that the cgroup is accessible which in turn guarantees that the
kernfs node hasn't be deleted.

> In particular, if you ever use those from RCU callbacks, you need to make
> sure that you have a grace period somewhere; the wait_event() you have in
> kernfs_drain() won't do it.

I don't know of cases where they are called from rcu callbacks although
there are cases where they're used from rcu criticial sections. That said,
at least in cgroup's case, cgroup destruction path has a grace period and
that's how a rcu critical section is safe to deref a cgroup and the
kernfs_node inside it.

Thanks.

-- 
tejun