netdev - Re: Persisting mounts between 'ip netns' invocations

Message-ID: <a68b135f-12ee-3c75-8b12-d039c9036d53@6wind.com>
Date: Fri, 29 Sep 2023 10:26:32 +0200
From: Nicolas Dichtel <nicolas.dichtel@...nd.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>,
 Christian Brauner <brauner@...nel.org>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org,
 "Eric W. Biederman" <ebiederm@...ssion.com>, David Ahern <dsahern@...nel.org>
Subject: Re: Persisting mounts between 'ip netns' invocations

On 28/09/2023 at 20:21, Toke Høiland-Jørgensen wrote:
> Christian Brauner <brauner@...nel.org> writes:
> 
>> On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
>>> + Eric
>>>
>>> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
>>>> Hi everyone
>>>>
>>>> I recently ran into this problem again, and so I figured I'd ask if
>>>> anyone has any good idea how to solve it:
>>>>
>>>> When running a command through 'ip netns exec', iproute2 will
>>>> "helpfully" create a new mount namespace and remount /sys inside it,
>>>> AFAICT to make sure /sys/class/net/* refers to the right devices inside
>>>> the namespace. This makes sense, but unfortunately it has the side
>>>> effect that no mount commands executed inside the ns persist. In
>>>> particular, this makes it difficult to work with bpffs; even when
>>>> mounting a bpffs inside the ns, it will disappear along with the
>>>> namespace as soon as the process exits.
>>>>
>>>> To illustrate:
>>>>
>>>> # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
>>>> # ip netns exec <nsname> ls /sys/fs/bpf
>>>> <nothing>
>>>>
>>>> This happens because namespaces are cleaned up as soon as they have no
>>>> processes, unless they are persisted by some other means. For the
>>>> network namespace itself, iproute2 will bind mount /proc/self/ns/net to
>>>> /var/run/netns/<nsname> (in the root mount namespace) to persist the
>>>> namespace. I tried implementing something similar for the mount
>>>> namespace, but that doesn't work; I can't manually bind mount the 'mnt'
>>>> ns reference either:
>>>>
>>>> # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
>>>> mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
>>>>        dmesg(1) may have more information after failed mount system call.
>>>>
>>>> When running strace on that mount command, it seems the move_mount()
>>>> syscall returns EINVAL, which, AFAICT, is because the mount namespace
>>>> file references itself as its namespace, which means it can't be
>>>> bind-mounted into the containing mount namespace.
>>>>
>>>> So, my question is, how to overcome this limitation? I know it's
>>>> possible to get a reference to the namespace of a running process, but
>>>> there is no guarantee there are any processes running inside the
>>>> namespace (hence the persisting bind mount for the netns). So is there
>>>> some other way to persist the mount namespace reference, so we can pick
>>>> it back up on the next 'ip netns' invocation?
>>>>
>>>> Hoping someone has a good idea :)
>>> We ran into similar problems. The only solution we found was to use nsenter
>>> instead of 'ip netns exec'.
>>>
>>> To be able to bind mount a mount namespace on a file, the directory of this file
>>> should be private. For example:
>>>
>>> mkdir -p /run/foo
>>> mount --make-rshared /
>>> mount --bind /run/foo /run/foo
>>> mount --make-private /run/foo
>>> touch /run/foo/ns
>>> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
>>>         read -r pid &&
>>>         mount --bind /proc/$pid/ns/mnt /run/foo/ns
>>> }
>>> nsenter --mount=/run/foo/ns ls /
>>>
>>> But this doesn't work under 'ip netns exec'.
>>
>> Afaiu, each ip netns exec invocation allocates a new mount namespace.
>> If you run multiple concurrent ip netns exec commands and leave them
>> around then they all get a separate mount namespace. Not sure what the
>> design behind that was. So even if you could persist the mount namespace
>> of one there's still no way for ip netns exec to pick that up iiuc.
>>
>> So imho, the solution is to change ip netns exec to persist a mount
>> namespace and netns namespace pair. unshare does this easily via:
>>
>> sudo mkdir /run/mntns
>> sudo mount --bind /run/mntns /run/mntns
>> sudo mount --make-slave /run/mntns
>>
>> sudo mkdir /run/netns
>>
>> sudo touch /run/mntns/mnt1
>> sudo touch /run/netns/net1
>>
>> sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true
I fear that creating a new mount ns for each net ns will introduce more problems.

>>
>> So I'd probably patch iproute2.
> 
> Patching iproute2 is what I'm trying to do - sorry if that wasn't clear :)
> 
> However, I couldn't get it to work. I think it's probably because I was
> missing the bind-to-self/--make-slave dance on the containing folder, as
> Nicolas pointed out. Will play around with that a bit more, thanks for
> the pointers both of you!

The fundamental problem is that the remount of /sys should not be propagated to
the parent mount ns (and, in fact, neither should the /etc remount).
You will have to choose between 'propagating the new mount points to the parent
mount ns' and 'having the right view of /sys (i.e. the /sys corresponding to the
current netns)'.
Maybe this could be done via a new command, something like 'ip netns light-exec'
(which would be equivalent to 'nsenter --net=/run/netns/foo').
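
A minimal sketch of what such a 'light-exec' could look like as a shell
wrapper, assuming nsenter from util-linux is available. The function name
light_exec and the NETNS_DIR and LIGHT_EXEC_DRY_RUN variables are hypothetical
and only serve the illustration; this is not an existing iproute2 command. It
enters only the network namespace, so mounts made by the command land in the
caller's mount namespace and persist, at the cost of /sys still showing the
caller's view:

```shell
# Hypothetical wrapper, not part of iproute2: run a command in a named
# netns without creating a new mount namespace or remounting /sys.
light_exec() {
    if [ "$#" -lt 2 ]; then
        echo "usage: light_exec <nsname> <command> [args...]" >&2
        return 2
    fi
    ns=$1
    shift
    # /run/netns is the path used in the nsenter example above;
    # NETNS_DIR is an assumed override added for illustration only.
    nsfile="${NETNS_DIR:-/run/netns}/$ns"
    if [ ! -e "$nsfile" ]; then
        echo "light_exec: no such netns: $ns" >&2
        return 1
    fi
    if [ -n "${LIGHT_EXEC_DRY_RUN:-}" ]; then
        # Print the command instead of running it (needs no privileges).
        echo "nsenter --net=$nsfile -- $*"
        return 0
    fi
    # Join only the network namespace; the mount namespace is left alone.
    nsenter --net="$nsfile" -- "$@"
}
```

With this, 'light_exec foo bpftool map pin id 2 /sys/fs/bpf/mymap' followed by
'light_exec foo ls /sys/fs/bpf' should show the pinned map, since both commands
share the caller's mount namespace. Anything that reads /sys/class/net/* would
still see the devices of the caller's netns, which is the trade-off described
above.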

FWIW, here is a nice doc about mount subtleties:
https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt

Regards,
Nicolas