linux-kernel - Re: [PATCH RFC fs/namespace] Make kern_unmount() use synchronize_rcu


Message-ID: <8E281DF1-248F-4861-A3C0-2573A5EFEE61@fb.com>
Date:   Tue, 15 Feb 2022 18:28:21 +0000
From:   Chris Mason <clm@...com>
To:     "Paul E. McKenney" <paulmck@...nel.org>,
        Giuseppe Scrivano <gscrivan@...hat.com>
CC:     "riel@...riel.com" <riel@...riel.com>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH RFC fs/namespace] Make kern_unmount() use
 synchronize_rcu_expedited()


> On Feb 14, 2022, at 2:26 PM, Chris Mason <clm@...com> wrote:
> 
> 
> 
>> On Feb 14, 2022, at 2:05 PM, Paul E. McKenney <paulmck@...nel.org> wrote:
>> 
>> Experimental.  Not for inclusion.  Yet, anyway.
>> 
>> Freeing large numbers of namespaces in quick succession can result in
>> a bottleneck on the synchronize_rcu() invoked from kern_unmount().
>> This patch applies the synchronize_rcu_expedited() hammer to allow
>> further testing and fault isolation.
>> 
>> Hey, at least there was no need to change the comment!  ;-)
>> 
> 
> I don’t think this will be fast enough.  I think the problem is that commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7 is putting all of the ipc namespace frees onto a list, and every free includes one call to synchronize_rcu()
> 
> The end result is that we can create new namespaces much much faster than we can free them, and eventually we run out.  I found this while debugging clone() returning ENOSPC because create_ipc_ns() was returning ENOSPC.

I’m going to try Rik’s patch, but first I changed Giuseppe’s benchmark from that commit so that it runs for a million iterations instead of 1000.

#define _GNU_SOURCE
#include <sched.h>
#include <error.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
        int i;

        /* each unshare() creates a new ipc namespace and drops the old one */
        for (i = 0; i < 1000000; i++) {
                if (unshare(CLONE_NEWIPC) < 0)
                        error(EXIT_FAILURE, errno, "unshare");
        }
        return 0;
}

Then I put together a drgn script to print the size of the free_ipc_list:

#!/usr/bin/env drgn
# usage: ./check_list.drgn <pid of worker thread doing free_ipc calls>

from drgn import *
from drgn.helpers.linux.pid import find_task
import sys,os,time

def llist_count(cur):
    count = 0
    while cur:
        count += 1
        cur = cur.next
    return count

pid = int(sys.argv[1])

# sometimes the worker is in different functions, so this
# will throw exceptions if we can't find the free_ipc call;
# retry a few times before giving up
for attempt in range(1, 5):
    try:
        task = find_task(prog, pid)
        trace = prog.stack_trace(task)
        head = prog['free_ipc_list']
        # reset on every attempt so a stale index is never reused
        free_ipc_index = None
        for i in range(len(trace)):
            if "free_ipc at" in str(trace[i]):
                free_ipc_index = i
        if free_ipc_index is None:
            raise LookupError("free_ipc is not on the worker's stack")
        n = trace[free_ipc_index]['n']
        print("ipc free list is %d worker %d remaining %d" % (llist_count(head.first), pid, llist_count(n.mnt_llist.next)))
        break
    except Exception:
        time.sleep(0.5)

I was expecting the run to hit ENOSPC pretty quickly, after which I would try Rik’s patch, celebrate, and move on.  What seems to be happening instead is that unshare is spending all of its time creating super blocks:

    48.07%  boom             [kernel.vmlinux]                                                 [k] test_keyed_super
            |
            ---0x5541f689495641d7
               __libc_start_main
               unshare
               entry_SYSCALL_64
               do_syscall_64
               __x64_sys_unshare
               ksys_unshare
               unshare_nsproxy_namespaces
               create_new_namespaces
               copy_ipcs
               mq_init_ns
               mq_create_mount
               fc_mount
               vfs_get_tree
               vfs_get_super
               sget_fc
               test_keyed_super

But, this does nicely show the backlog on the free_ipc_list.  It gets up to around 150K entries, with our worker thread stuck:

196 kworker/0:2+events D
[<0>] __wait_rcu_gp+0x105/0x120
[<0>] synchronize_rcu+0x64/0x70
[<0>] kern_unmount+0x27/0x50
[<0>] free_ipc+0x6b/0xe0
[<0>] process_one_work+0x1ee/0x3c0
[<0>] worker_thread+0x23a/0x3b0
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

# ./check_list.drgn 196
ipc free list is 58099 worker 196 remaining 98012
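
The two counts in that output add up to 58099 + 98012 = 156111 outstanding frees, in line with the "around 150K" figure. As a rough illustration of why there are two counts, here is a toy Python model (not kernel code) of a list that is drained in batches: the worker detaches everything queued so far and walks that batch, while new frees keep landing on the shared list. The class and function names (ToyLlist, backlog) and the item counts are made up for the sketch, and it assumes the worker takes the whole list in one step, which is what the "remaining" count in the script suggests.

```python
# Toy model only: names and numbers are hypothetical, not taken from the kernel.
class ToyLlist:
    """A list that producers append to and a worker drains all at once."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def del_all(self):
        # Detach everything queued so far and leave an empty list behind.
        batch, self.items = self.items, []
        return batch


def backlog(pending, batch, done):
    """Outstanding frees: still queued plus not yet processed in the batch."""
    return len(pending.items) + (len(batch) - done)


free_list = ToyLlist()
for ns in range(10):
    free_list.add(ns)            # ten namespaces queued for freeing

batch = free_list.del_all()      # worker takes the whole list as one batch

for ns in range(10, 25):
    free_list.add(ns)            # fifteen more arrive while the worker is busy

done = 3                         # worker has finished three entries so far
print("queued", len(free_list.items),
      "remaining", len(batch) - done,
      "total", backlog(free_list, batch, done))
```

In this picture, "ipc free list" corresponds to the queued count and "remaining" to the unprocessed part of the worker's current batch.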

Eventually, the hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) walk in sget_fc() becomes slower than synchronize_rcu(), and the worker thread is able to make progress?  Production in this case is a few nsjail procs, so it’s not a crazy workload.  My guess is that prod tends to have longer grace periods than this test box, so the worker thread loses there, but I haven’t been able to figure out why the worker suddenly catches up from time to time on the test box.
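
A back-of-the-envelope sketch of that guess, in Python. It assumes each free costs one grace period and each create costs a fixed amount plus a linear scan over the superblocks that are still live. All three constants are invented for illustration and were not measured on any machine; the function names are hypothetical too. The model only shows where creation would slow down to the free rate. It says nothing about why the worker sometimes catches up suddenly.

```python
def create_rate(live, base_cost_s, scan_cost_s):
    """Creates per second when `live` superblocks have to be scanned first."""
    return 1.0 / (base_cost_s + scan_cost_s * live)


def free_rate(grace_period_s):
    """Frees per second, assuming one grace period per free."""
    return 1.0 / grace_period_s


def equilibrium_backlog(grace_period_s, base_cost_s, scan_cost_s):
    """Number of live entries at which a create takes as long as a free."""
    return (grace_period_s - base_cost_s) / scan_cost_s


# Hypothetical numbers, chosen only to make the shape of the curve visible.
GRACE_S = 0.010      # assumed grace period: 10 ms
BASE_S = 10e-6       # assumed fixed cost of one unshare(): 10 us
SCAN_S = 100e-9      # assumed cost per superblock visited: 100 ns

n_eq = equilibrium_backlog(GRACE_S, BASE_S, SCAN_S)
for live in (0, n_eq / 2, n_eq, 2 * n_eq):
    print("live=%9.0f  create/s=%10.1f  free/s=%6.1f"
          % (live, create_rate(live, BASE_S, SCAN_S), free_rate(GRACE_S)))
```

Under these assumptions a longer grace period pushes the break-even point higher in direct proportion, which would fit prod running out of namespaces before the scan cost ever throttles creation.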

-chris