netdev - Re: [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces along with network namespaces

Date: Mon, 09 Oct 2023 19:14:37 -0500
From: "Eric W. Biederman" <ebiederm@...ssion.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: David Ahern <dsahern@...il.com>,  Stephen Hemminger
 <stephen@...workplumber.org>,  netdev@...r.kernel.org,  Nicolas Dichtel
 <nicolas.dichtel@...nd.com>,  Christian Brauner <brauner@...nel.org>,
  David Laight <David.Laight@...LAB.COM>
Subject: Re: [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces
 along with network namespaces

Toke Høiland-Jørgensen <toke@...hat.com> writes:

> "Eric W. Biederman" <ebiederm@...ssion.com> writes:
>
>> Toke Høiland-Jørgensen <toke@...hat.com> writes:
>>
>>> The 'ip netns' command is used for setting up network namespaces with persistent
>>> named references, and is integrated into various other commands of iproute2 via
>>> the -n switch.
>>>
>>> This is useful both for testing setups and for simple script-based namespacing
>>> but has one drawback: the lack of persistent mounts inside the spawned
>>> namespace. This is particularly apparent when working with BPF programs that use
>>> pinning to bpffs: by default no bpffs is available inside a namespace, and
>>> even if one is mounted, that fs disappears as soon as the calling
>>> command exits.
>>
>> It would be entirely reasonable to copy mounts like /sys/fs/bpf from the
>> original mount namespace into the temporary mount namespace used by
>> "ip netns".
>>
>> I would call it a bug that "ip netns" doesn't do that already.
>>
>> I suspect that "ip netns" not copying the mounts from the old sysfs onto
>> the new sysfs is your entire problem.
>
> How would it do that? Walk mtab and remount everything identically after
> remounting /sys? Or is there a smarter way to go about this?

There are not many places to look so something like this is probably sufficient:

# stat all of the possible/probable mount points and see if there is
# something mounted there.  If so recursive bind whatever is there onto
# the new /sys

for dir in /old/sys/fs/* /old/sys/kernel/*; do
	# A mount point lives on a different device than its parent directory.
	if [ "$(stat --format '%d' "$dir")" != "$(stat --format '%d' "$dir/..")" ]; then
		newdir=$(echo "$dir" | sed -e 's/old/new/')
		mount --rbind "$dir/" "$newdir/"
	fi
done

If the concern is being robust for the future the mount points can also
be enumerated by looking in one of /proc/self/mounts,
/proc/self/mountinfo, or /proc/self/mountstats.

I am not certain which is less work: parsing a file with lots of fields,
or reading a directory and stat-ing the entries returned by readdir.
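
As a rough sketch of the /proc/self/mountinfo route: the mount point is
the fifth whitespace-separated field of each line, so a short awk filter
is enough to list everything mounted below a given directory. The
function name list_mounts_under is made up for illustration; it is not
part of iproute2.

```shell
# Hypothetical helper: print every mount point strictly below PREFIX.
# Reads mountinfo-format lines on stdin; field 5 is the mount point.
list_mounts_under() {
	prefix=$1
	# Appending "/" keeps /system from matching a prefix of /sys.
	awk -v p="$prefix/" 'index($5, p) == 1 { print $5 }'
}

# Usage: list_mounts_under /sys < /proc/self/mountinfo
```

Each path printed this way could then be fed to the same recursive bind
mount as in the loop above.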

>> Or is there a reason that bpffs should be per network namespace?
>
> Well, I first ran into this issue because of a bug report to
> xdp-tools/libxdp about things not working correctly in network
> namespaces:
>
> https://github.com/xdp-project/xdp-tools/issues/364
>
> And libxdp does assume that there's a separate bpffs per network
> namespace: it persists things into the bpffs that is tied to the network
> devices in the current namespace. So if the bpffs is shared, an
> application running inside the network namespace could access XDP
> programs loaded in the root namespace. I don't know, but suspect, that
> such assumptions would be relatively common in networking BPF programs
> that use pinning (the pinning support in libbpf and iproute2 itself at
> least have the same leaking problem if the bpffs is shared).

Are the names of the values truly network namespace specific?

I did not see any mention of the things that are persisted in the ticket
you pointed me at, and unfortunately I am not familiar with XDP.

Last I looked, until all of the CPU side channels are closed, it is
unfortunately unsafe to load eBPF programs with anything less than
CAP_SYS_ADMIN (aka with permission to see and administer the entire
system).  So from a system point of view I really don't see a
fundamental danger from having a global /sys/fs/bpf.

If there are name conflicts in /sys/fs/bpf because of duplicate names in
different network namespaces I can see that being a problem.

At that point the name conflicts either need to be fixed or we
fundamentally need to have multiple mount points for bpffs, probably
under something like /run/netns-mounts/NAME/, with ip netns updated to
mount the appropriate filesystem.
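
To make the per-namespace mount point idea concrete, here is a minimal
sketch. It assumes the /run/netns-mounts/NAME/ layout floated above plus
a "bpf" subdirectory, both of which are illustrative rather than
anything "ip netns" implements today, and the function names are
hypothetical. Mounting needs root.

```shell
# Hypothetical: directory that would hold the bpffs for network namespace NAME.
netns_bpffs_dir() {
	printf '/run/netns-mounts/%s/bpf\n' "$1"
}

# Hypothetical: create that directory and mount a fresh bpffs instance on it.
mount_netns_bpffs() {
	dir=$(netns_bpffs_dir "$1")
	mkdir -p "$dir" && mount -t bpf bpf "$dir"
}

# Usage (as root): mount_netns_bpffs blue
```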


>>> The underlying cause for this is that iproute2 will create a new mount namespace
>>> every time it switches into a network namespace. This is needed to be able to
>>> mount a /sys filesystem that shows the correct network device information, but
>>> has the unfortunate side effect of making mounts entirely transient for any 'ip
>>> netns' invocation.
>>
>> Mount propagation can be made to work if necessary, that would solve the
>> transient problem.
>
> Is mount propagation different from the remount thing you mentioned
> above, or is it the same thing?
>
> (Sorry for being hopelessly naive about this, as you probably guessed
> from my previous email asking about this, I'm only now learning about
> all the intricacies of fs mounts).

Mount propagation is a way to configure a mount namespace (before
creating a new one) that will cause mounts created in the first mount
namespace to be created in its children, and cause mounts created in
the children to be created in the parent (depending on how things are
configured).
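
One way to see how a given mount is configured is to read the optional
fields in /proc/self/mountinfo, which sit between the sixth field and
the lone "-" separator and carry tags such as shared:N or master:N. A
small sketch, with the made-up helper name propagation_of; a mount with
no tags at all is reported as private:

```shell
# Hypothetical helper: print the propagation tags of mount point MP.
# Reads mountinfo-format lines on stdin; optional fields start at
# field 7 and run up to the "-" separator.
propagation_of() {
	awk -v mp="$1" '$5 == mp {
		tags = ""
		for (i = 7; i <= NF && $i != "-"; i++)
			tags = tags (tags == "" ? "" : " ") $i
		print (tags == "" ? "private" : tags)
	}'
}

# Usage: propagation_of /sys < /proc/self/mountinfo
```

The state itself is changed with mount(8), for example
"mount --make-rshared /" or "mount --make-rprivate /".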

It is not my favorite feature (it makes locking of mount namespaces
terrible) and it is probably too clever by half. Unfortunately, systemd
started enabling mount propagation by default, so we are stuck with it.

>>> This series is an attempt to fix this situation, by persisting a mount namespace
>>> alongside the persistent network namespace (in a separate directory,
>>> /run/netns-mnt). Doing this allows us to still have a consistent /sys inside
>>> the namespace, but with persistence so any mounts survive.
>>
>> I really don't like that direction.
>>
>> "ip netns" was designed and really should continue to be a command that
>> makes the world look like it has a single network namespace, for
>> compatibility with old code.  Part of that old code "ip netns" supports
>> is "ip" itself.
>
> Well, my idea with this change was to keep the functionality as close as possible to
> what 'ip' currently does, but just have mounts persist across
> invocations.
>
>> I think you are making bpffs unnecessarily per network namespace.
>
> See above.
>
>>> This mode does come with some caveats. I'm sending this as RFC to get feedback
>>> on whether this is the right thing to do, especially considering backwards
>>> compatibility. On balance, I think that the approach taken here of
>>> unconditionally persisting the mount namespace, and using that persistent
>>> reference whenever it exists, is better than the current behaviour, and that
>>> while it does represent a change in behaviour it is backwards compatible in a
>>> way that won't cause issues. But please do comment on this; see the patch
>>> description of patch 4 for details.
>>
>> As I understand it this will cause a problem for any application that
>> is network namespace aware and does not use "ip netns" to wrap itself.
>>
>> I am fairly certain that pinning the mount namespace will result in
>> never seeing an update of /etc/resolv.conf.  At least if you
>> are on a system that has /etc/netns/NAME/resolv.conf.
>
> I was actually wondering about that /etc bind mounting support while I
> was looking at this code. Could you please elaborate a bit on what that
> is used for, exactly? :)

The idea is that you can have separate static configuration depending
upon your network namespace.

A real-world case is VPNing into several company internal networks.
Each company network uses overlapping portions of the 192.168.x.x
network.
Each company network will want its own DNS servers and possibly other
network configuration as well.

For it to make sense you really only need one company network and a home
network, one of which you could stash in a network namespace to prevent
conflicts.

I don't know if supporting that ever caught on very strongly, but
I tried to build a template that would work for that and similar cases.

> Also, if staleness of the /etc bind mounts is an issue, those could be
> redone on every entry, couldn't they?

They already are ;)

Eric