linux-kernel - Re: [RFC v3 1/1] fs/namespace: remove RCU sync for MNT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <a9963f50-6349-4e76-8f12-c12c2ad4d2ab@redhat.com>
Date: Tue, 2 Jul 2024 15:01:54 +0800
From: Ian Kent <ikent@...hat.com>
To: Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>
Cc: Matthew Wilcox <willy@...radead.org>,
 Lucas Karpinski <lkarpins@...hat.com>, viro@...iv.linux.org.uk,
 raven@...maw.net, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, Alexander Larsson <alexl@...hat.com>,
 Eric Chanudet <echanude@...hat.com>
Subject: Re: [RFC v3 1/1] fs/namespace: remove RCU sync for MNT_DETACH umount

On 2/7/24 12:58, Christian Brauner wrote:
> On Fri, Jun 28, 2024 at 01:13:45PM GMT, Jan Kara wrote:
>> On Fri 28-06-24 10:58:54, Ian Kent wrote:
>>> On 27/6/24 19:54, Jan Kara wrote:
>>>> On Thu 27-06-24 09:11:14, Ian Kent wrote:
>>>>> On 27/6/24 04:47, Matthew Wilcox wrote:
>>>>>> On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote:
>>>>>>> +++ b/fs/namespace.c
>>>>>>> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
>>>>>>>     static DECLARE_RWSEM(namespace_sem);
>>>>>>>     static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>>>>>>>     static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>>>>>>> +static bool lazy_unlock = false; /* protected by namespace_sem */
>>>>>> That's a pretty ugly way of doing it.  How about this?
>>>>> Ha!
>>>>>
>>>>> That was my original thought but I also didn't much like changing all the
>>>>> callers.
>>>>>
>>>>> I don't really like the proliferation of these small helper functions either
>>>>> but if everyone
>>>>>
>>>>> is happy to do this I think it's a great idea.
>>>> So I know you've suggested removing synchronize_rcu_expedited() call in
>>>> your comment to v2. But I wonder why is it safe? I *thought*
>>>> synchronize_rcu_expedited() is there to synchronize the dropping of the
>>>> last mnt reference (and maybe something else) - see the comment at the
>>>> beginning of mntput_no_expire() - and this change would break that?
>>> Interesting, because of the definition of lazy umount I didn't look closely
>>> enough at that.
>>>
>>> But I wonder, how exactly would that race occur, is holding the rcu read
>>> lock sufficient since the rcu'd mount free won't be done until it's
>>> released (at least I think that's how rcu works).
>> I'm concerned about a race like:
>>
>> [path lookup]				[umount -l]
>> ...
>> path_put()
>>    mntput(mnt)
>>      mntput_no_expire(m)
>>        rcu_read_lock();
>>        if (likely(READ_ONCE(mnt->mnt_ns))) {
>> 					do_umount()
>> 					  umount_tree()
>> 					    ...
>> 					    mnt->mnt_ns = NULL;
>> 					    ...
>> 					  namespace_unlock()
>> 					    mntput(&m->mnt)
>> 					      mntput_no_expire(mnt)
>> 				              smp_mb();
>> 					      mnt_add_count(mnt, -1);
>> 					      count = mnt_get_count(mnt);
>> 					      if (count != 0) {
>> 						...
>> 						return;
>>          mnt_add_count(mnt, -1);
>>          rcu_read_unlock();
>>          return;
>> -> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount!
>>        }
> Yeah, I think that's a valid concern. mntput_no_expire() requires that
> the last reference is dropped after an rcu grace period and that can
> only be done by synchronize_rcu_*() (It could be reworked but that would
> be quite ugly.). See also mnt_make_shortterm() caller's for kernel
> initiated unmounts.

I've thought about this a couple of times now.


Isn't it the case here that the path lookup thread will have taken a 
reference

(because it's calling path_put()) and the umount will have taken a 
reference on

system call entry.


So for the mount being umounted the starting count will be at lest three 
then if

the umount mntput() is called from namespace_unlock() it will correctly see

count != 0 and the path lookup mntput() to release it's reference 
finally leaving

the mntput() of the path_put() from the top level system call function 
to release

the last reference.


Once again I find myself thinking this should be independent of the rcu 
wait because

only path walks done before the mount being detached can be happening 
and the lockless

walks are done holding the rcu read lock and how likely is it a ref-walk 
path lookup

(that should follow a failed rcu-walk in this case) has been able to 
grab a reference

anyway?


I think the only reason the wait could be significant is to prevent 
changes to the

structures concerned causing problems because they happen earlier than 
can be

tolerated. That I can understand.


Mmm ... I feel like I'm starting to sound like a broken record ... oops!

Ian