[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5779f3c6-7432-a3d5-1199-0b28df69ca90@virtuozzo.com>
Date: Thu, 15 Mar 2018 17:14:13 +0300
From: Kirill Tkhai <ktkhai@...tuozzo.com>
To: Christian Brauner <christian.brauner@...onical.com>
Cc: Christian Brauner <christian.brauner@...ntu.com>,
ebiederm@...ssion.com, gregkh@...uxfoundation.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
serge@...lyn.com, avagin@...tuozzo.com
Subject: Re: netns: send uevent messages
On 15.03.2018 16:39, Christian Brauner wrote:
> On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
>> CC Andrey Vagin
>
> Hey Kirill,
>
> Thanks for CCing Andrey.
>
>>
>> On 15.03.2018 03:12, Christian Brauner wrote:
>>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
>>> to allow sending uevent messages into the network namespace the socket
>>> belongs to.
>>>
>>> Currently non-initial network namespaces are already isolated and don't
>>> receive uevents. There are a number of cases where it is beneficial for a
>>> sufficiently privileged userspace process to send a uevent into a network
>>> namespace.
>>>
>>> One such use case would be debugging and fuzzing of a piece of software
>>> which listens and reacts to uevents. By running a copy of that software
>>> inside a network namespace, specific uevents could then be presented to it.
>>> More concretely, this would allow for easy testing of udevd/ueventd.
>>>
>>> This will also allow some piece of software to run components inside a
>>> separate network namespace and then effectively filter what that software
>>> can receive. Some examples of software that do directly listen to uevents
>>> and that we have in the past attempted to run inside a network namespace
>>> are rbd (CEPH client) or the X server.
>>>
>>> Implementation:
>>> The implementation has been kept as simple as possible from the kernel's
>>> perspective. Specifically, a simple input method uevent_net_rcv() is added
>>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
>>> af_netlink infrastructure and does neither add an additional netlink family
>>> nor requires any user-visible changes.
>>>
>>> For example, by using netlink_rcv_skb() we can make use of existing netlink
>>> infrastructure to report back informative error messages to userspace.
>>>
>>> Furthermore, this implementation does not introduce any overhead for
>>> existing uevent generating codepaths. The struct netns gets a new uevent
>>> socket member that records the uevent socket associated with that network
>>> namespace. Since we record the uevent socket for each network namespace in
>>> struct net we don't have to walk the whole uevent socket list.
>>> Instead we can directly retrieve the relevant uevent socket and send the
>>> message. This keeps the codepath very performant without introducing
>>> needless overhead.
>>>
>>> Uevent sequence numbers are kept global. When a uevent message is sent to
>>> another network namespace the implementation will simply increment the
>>> global uevent sequence number and append it to the received uevent. This
>>> has the advantage that the kernel will never need to parse the received
>>> uevent message to replace any existing uevent sequence numbers. Instead it
>>> is up to the userspace process to remove any existing uevent sequence
>>> numbers in case the uevent message to be sent contains any.
>>>
>>> Security:
>>> In order for a caller to send uevent messages to a target network namespace
>>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
>>> target network namespace. Additionally, any received uevent message is
>>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
>>> needed to append the uevent sequence number.
>>>
>>> Testing:
>>> This patch has been tested and verified to work with the following udev
>>> implementations:
>>> 1. CentOS 6 with udevd version 147
>>> 2. Debian Sid with systemd-udevd version 237
>>> 3. Android 7.1.1 with ueventd
>>>
>>> Signed-off-by: Christian Brauner <christian.brauner@...ntu.com>
>>> ---
>>> include/net/net_namespace.h | 1 +
>>> lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
>>> 2 files changed, 88 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>>> index f306b2aa15a4..467bde763a9b 100644
>>> --- a/include/net/net_namespace.h
>>> +++ b/include/net/net_namespace.h
>>> @@ -78,6 +78,7 @@ struct net {
>>>
>>> struct sock *rtnl; /* rtnetlink socket */
>>> struct sock *genl_sock;
>>> + struct sock *uevent_sock; /* uevent socket */
>>
>> Since you add this per-net uevent_sock pointer, currently existing iterations in uevent_net_exit()
>> become to look confusing. There are:
>>
>> mutex_lock(&uevent_sock_mutex);
>> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>> if (sock_net(ue_sk->sk) == net)
>> goto found;
>> }
>>
>> Can't we make a small cleanup in lib/kobject_uevent.c after this change
>> and before the main part of the patch goes?
>
> Hm, not sure. It seems it makes sense to maintain them in a separate
> list. Looks like this lets us keep the locking simpler. Otherwise we
> would have to do something like for_each_net() and it seems that this
> would force us to use rntl_{un}lock().
I'm about:
mutex_lock();
list_del(net->ue_sk->list);
mutex_unlock();
kfree();
Thus we avoid iterations in uevent_net_exit().
>>
>>>
>>> struct list_head dev_base_head;
>>> struct hlist_head *dev_name_head;
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 9fe6ec8fda28..10b2144b9fc3 100644
>>> --- a/lib/kobject_uevent.c
>>> +++ b/lib/kobject_uevent.c
>>> @@ -25,6 +25,7 @@
>>> #include <linux/uuid.h>
>>> #include <linux/ctype.h>
>>> #include <net/sock.h>
>>> +#include <net/netlink.h>
>>> #include <net/net_namespace.h>
>>>
>>>
>>> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>>> EXPORT_SYMBOL_GPL(add_uevent_var);
>>>
>>> #if defined(CONFIG_NET)
>>> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
>>> + struct netlink_ext_ack *extack)
>>> +{
>>> + int ret;
>>> + u64 seqnum;
>>> + /* u64 to chars: 2^64 - 1 = 21 chars */
>>
>> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think
>> we forgot +1 in buf declaration.
>
> sizeof("SEQNUM=") will include the '\0' pointer in contrast to
> strlen("SEQNUM=") so this is correct if I'm not completely mistaken.
The code is OK, I'm worrying about comment. But I've missed this sizeof().
So, there is only 1 bytes excess allocated as 0xFFFFFFFFFFFFFFFF=18446744073709551615
Not so important...
>>
>>> + char buf[sizeof("SEQNUM=") + 21];
>>> + struct sk_buff *skbc;
>>> +
>>> + /* bump sequence number */
>>> + mutex_lock(&uevent_sock_mutex);
>>> + seqnum = ++uevent_seqnum;
>>> + mutex_unlock(&uevent_sock_mutex);
>>
>> Commit 7b60a18da393 from Andrew Vagin says:
>>
>> uevent: send events in correct order according to seqnum (v3)
>>
>> The queue handling in the udev daemon assumes that the events are
>> ordered.
>>
>> Before this patch uevent_seqnum is incremented under sequence_lock,
>> than an event is send uner uevent_sock_mutex. I want to say that code
>> contained a window between incrementing seqnum and sending an event.
>>
>> This patch locks uevent_sock_mutex before incrementing uevent_seqnum.
>> Signed-off-by: Andrew Vagin <avagin@...nvz.org>
>>
>> After this change the order will be lost. Also the rest of namespaces
>> will have holes in uevent numbers, as they won't receive a number sent
>> to specific namespace.
>
> Afaict from looking into udevd when I wrote the patch it only cares
> about numbers being ordered (which is also what Andrey's patch states)
> not that they are sequential so holes should be fine. udevd will use
> the DEVPATH to determine whether the sequence number of the current
> uevent should be used as "even->delaying_seqnum" number. All that
> matters is that it is greater than the previous one. About the ordering,
> if that's an issue then we should simply do what Andrey has been doing
> for kobject_uevent_env() and extend the lock being held until after we
> sent out the uevent. Since we're not serving all listeners but only
> ever one this is also way more lightweight then kobject_uevent_env().
Yes, extending the lock to netlink_broadcast() should fix the problem.
>>
>> It seems we should made uevent_seqnum per-net to solve this problem.
>
> Yes, Eric and I have been discussing this already. The idea was to do
> this in a follow-up patch to keep this patch as simple as possible. If
> we agree that it would make sense right away then I will dig out the
> corresponding patch.
> It would basically just involve giving each struct net a uevent_seqnum
> member.
pernet seqnum may have a sense if we introduce per-net locks to protect it.
If there is single mutex, it does not matter either they are pernet or not.
Per-net mutex may be useful only if we have many single-net messages like
you introduce in this patch, or if we are going to filter net in existing
kobject_uevent_net_broadcast() (by user_ns?!) in the future.
Kirill
Powered by blists - more mailing lists