Message-ID: <80574175-3692-0278-a74e-23b752d44f73@gmail.com>
Date: Mon, 19 Dec 2016 19:52:15 -0700
From: David Ahern <dsahern@...il.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Andy Lutomirski <luto@...nel.org>,
Daniel Mack <daniel@...que.org>,
Mickaël Salaün <mic@...ikod.net>,
Kees Cook <keescook@...omium.org>, Jann Horn <jann@...jh.net>,
Tejun Heo <tj@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Thomas Graf <tgraf@...g.ch>,
Michael Kerrisk <mtk.manpages@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
Linux API <linux-api@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>
Subject: Re: Potential issues (security and otherwise) with the current
cgroup-bpf API
On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern@...il.com> wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter because it is a very simple use case. It does not work for the ingress and egress hooks, which allow generic bpf filters.
>
> Can you elaborate on what goes wrong? (Obviously the
> "address_family_list" example makes no sense in that context.)
Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are special cases of the moment, which go against the intent of eBPF and generic programmability.
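
To illustrate the point about generality, here is a rough model in Python. It is not kernel code, and every name in it is made up for illustration. A filter is treated as a function that returns 1 to allow and 0 to deny, and the named modes from the proposal reduce to trivial instances of it:

```python
import socket

# Hypothetical model: a "filter" takes a socket address family and
# returns 1 (allow) or 0 (deny), like a cgroup socket-create program.

def no_filter(family):
    # "none": everything is allowed
    return 1

def disallow(family):
    # "disallow": just return 0, no complicated logic needed
    return 0

def address_family_list(allowed):
    # "address_family_list:...": allow only the listed families
    def prog(family):
        return 1 if family in allowed else 0
    return prog

# Example instance: allow IPv4 sockets only.
inet_only = address_family_list({socket.AF_INET})
```

A generic program can express all of these, while a fixed list of string modes cannot express an arbitrary program.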
>>
>> ...
>>
>>>> you're ignoring use cases I described earlier.
>>>> In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost. Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is
>>
>> cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.
>
> I tend to agree. I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live. I
> have no desire to cause problems for the vrf use case.
>
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism. Suppose I create a container and want to trace all of its
> created sockets. I'll set up cgrp2/container and load my tracer as a
> socket creation hook. Then a container sets up
> cgrp2/container/vrf/NAME (using delegation) and loads your vrf binding
> filter. Now the tracing stops working -- oops.
There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want.
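
For what it's worth, the nested case can be modeled in a few lines. This is a sketch only, assuming the override behavior discussed in this thread, where the program attached nearest to the task's cgroup is the only one that runs. The Cgroup class, the tracer, and the ifindex value 7 are all invented for illustration; the real attach path goes through the bpf() syscall and is not shown.

```python
class Cgroup:
    """Toy cgroup node holding at most one socket-create program."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.prog = None

    def effective_prog(self):
        # Override semantics: the nearest attached program wins.
        node = self
        while node is not None:
            if node.prog is not None:
                return node.prog
            node = node.parent
        return None


def create_socket(cgroup):
    """Run the effective program on a new (toy) socket."""
    sock = {"bound_dev_if": 0}
    prog = cgroup.effective_prog()
    allowed = prog(sock) if prog is not None else 1
    return sock if allowed else None


traced = []

def tracer(sock):
    # Hypothetical tracing hook: record the socket, then allow it.
    traced.append(sock)
    return 1

def vrf_bind(ifindex):
    # Hypothetical VRF filter: bind the socket to one device index.
    def prog(sock):
        sock["bound_dev_if"] = ifindex
        return 1
    return prog

container = Cgroup("cgrp2/container")
container.prog = tracer
# The intermediate cgrp2/container/vrf level is omitted for brevity.
vrf = Cgroup("cgrp2/container/vrf/NAME", parent=container)
vrf.prog = vrf_bind(7)

create_socket(container)   # seen by the tracer
sock = create_socket(vrf)  # bound to ifindex 7, never traced
```

The socket created in the vrf cgroup gets its device binding but never reaches the tracer, which is the breakage described above.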
>>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
>>>>> have to do it now (or disable the feature for 4.10). This is why I'm
>>>>> bringing this whole thing up now.
>>>>
>>>> We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh? My example in the original email attaches a program in a
>>> sub-hierarchy. Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
>
> Yes, exactly. I think there are two sensible behaviors:
>
> a) sub-cgroups cannot have a filter at all if the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
>
> b) sub-cgroups can have a filter if a parent does, too. The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped. If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well. (And so on, all the way up to the root.)
That comes with a big performance hit for skb / data path cases.
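
To make the cost concrete, here is a toy model of option (b). The names are made up, and this is not how the kernel implements anything. Every accepted packet runs one program per level that has a filter, so the per-packet work grows with the depth of the hierarchy:

```python
def run_chain(progs, pkt):
    """Option (b): progs are ordered from the sub-cgroup up to the root.

    Returns (verdict, number_of_programs_run).
    """
    runs = 0
    for prog in progs:
        runs += 1
        if prog(pkt) == 0:
            # Reject: ancestor filters are skipped.
            return 0, runs
    return 1, runs


def accept(pkt):
    return 1

def reject(pkt):
    return 0

# Three levels, all accepting: three program runs for one packet.
verdict, runs = run_chain([accept, accept, accept], b"payload")

# A reject in the sub-cgroup stops the walk after one run.
early_verdict, early_runs = run_chain([reject, accept, accept], b"payload")
```

In this model the common case, an accepted packet, is also the most expensive one, since it walks the whole chain.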
I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.