linux-kernel - Re: [PATCH net-next] netns: filter uevents correctly

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <941de2b9-332f-75fc-f8ac-4059a9b5426f@virtuozzo.com>
Date:   Thu, 5 Apr 2018 17:26:59 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Christian Brauner <christian.brauner@...onical.com>
Cc:     ebiederm@...ssion.com, davem@...emloft.net,
        gregkh@...uxfoundation.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, avagin@...tuozzo.com,
        serge@...lyn.com
Subject: Re: [PATCH net-next] netns: filter uevents correctly

On 05.04.2018 17:07, Christian Brauner wrote:
> On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> On 04.04.2018 22:48, Christian Brauner wrote:
>>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>>>
>>> enabled sending hotplug events into all network namespaces back in 2010.
>>> Over time the set of uevents that get sent into all network namespaces has
>>> shrunk. We have now reached the point where hotplug events for all devices
>>> that carry a namespace tag are filtered according to that namespace.
>>>
>>> Specifically, they are filtered whenever the namespace tag of the kobject
>>> does not match the namespace tag of the netlink socket. One example are
>>> network devices. Uevents for network devices only show up in the network
>>> namespaces these devices are moved to or created in.
>>>
>>> However, any uevent for a kobject that does not have a namespace tag
>>> associated with it will not be filtered and we will *try* to broadcast it
>>> into all network namespaces.
>>>
>>> The original patchset was written in 2010 before user namespaces were a
>>> thing. With the introduction of user namespaces sending out uevents became
>>> partially isolated as they were filtered by user namespaces:
>>>
>>> net/netlink/af_netlink.c:do_one_broadcast()
>>>
>>> if (!net_eq(sock_net(sk), p->net)) {
>>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>>>                 return;
>>>
>>>         if (!peernet_has_id(sock_net(sk), p->net))
>>>                 return;
>>>
>>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>>>                              CAP_NET_BROADCAST))
>>>         j       return;
>>> }
>>>
>>> The file_ns_capable() check will check whether the caller had
>>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>>> namespace of interest. This check is fine in general but seems insufficient
>>> to me when paired with uevents. The reason is that devices always belong to
>>> the initial user namespace so uevents for kobjects that do not carry a
>>> namespace tag should never be sent into another user namespace. This has
>>> been the intention all along. But there's one case where this breaks,
>>> namely if a new user namespace is created by root on the host and an
>>> identity mapping is established between root on the host and root in the
>>> new user namespace. Here's a reproducer:
>>>
>>>  sudo unshare -U --map-root
>>>  udevadm monitor -k
>>>  # Now change to initial user namespace and e.g. do
>>>  modprobe kvm
>>>  # or
>>>  rmmod kvm
>>>
>>> will allow the non-initial user namespace to retrieve all uevents from the
>>> host. This seems very anecdotal given that in the general case user
>>> namespaces do not see any uevents and also can't really do anything useful
>>> with them.
>>>
>>> Additionally, it is now possible to send uevents from userspace. As such we
>>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>>> namespace of the network namespace of the netlink socket) userspace process
>>> make a decision what uevents should be sent.
>>>
>>> This makes me think that we should simply ensure that uevents for kobjects
>>> that do not carry a namespace tag are *always* filtered by user namespace
>>> in kobj_bcast_filter(). Specifically:
>>> - If the owning user namespace of the uevent socket is not init_user_ns the
>>>   event will always be filtered.
>>> - If the network namespace the uevent socket belongs to was created in the
>>>   initial user namespace but was opened from a non-initial user namespace
>>>   the event will be filtered as well.
>>> Put another way, uevents for kobjects not carrying a namespace tag are now
>>> always only sent to the initial user namespace. The regression potential
>>> for this is near to non-existent since user namespaces can't really do
>>> anything with interesting devices.
>>>
>>> Signed-off-by: Christian Brauner <christian.brauner@...ntu.com>
>>> ---
>>>  lib/kobject_uevent.c | 10 +++++++++-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 15ea216a67ce..cb98cddb6e3b 100644
>>> --- a/lib/kobject_uevent.c
>>> +++ b/lib/kobject_uevent.c
>>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>>>  		return sock_ns != ns;
>>>  	}
>>>  
>>> -	return 0;
>>> +	/*
>>> +	 * The kobject does not carry a namespace tag so filter by user
>>> +	 * namespace below.
>>> +	 */
>>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>>> +		return 1;
>>> +
>>> +	/* Check if socket was opened from non-initial user namespace. */
>>> +	return sk_user_ns(dsk) != &init_user_ns;
>>>  }
>>>  #endif
>>
>> So, this prohibits to listen events of all devices except network-related
>> in containers? If it's so, I don't think it's a good solution. Uevents is not
> 
> No, this is not correct: As it is right now *without my patch* no
> non-initial user namespace is receiving *any uevents* but those
> specifically namespaced such as those for network devices. This patch
> doesn't change that at all. The commit message outlines this in detail
> how this comes about.
> There is only one case where this currently breaks and this is as I
> outlined explicitly in my commit message when you create a new user
> namespace and map container(0) -> host(0). This patch fixes this.

Could you please point the place, where non-initial user namespaces are filtered?
I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
Now it will return 1 sometimes.

>> net-devices-only related interface and it's used for all devices in system.
>> People may want to delegate block devices to nested user_ns, for example.
> 
> That's fine but that's why I added uevent injection in a previous patch
> series: I repeat no non-initial user namespace will by default receive
> uevents.

Thanks,
Kirill