Date:   Thu, 15 Mar 2018 09:22:24 -0700
From:   Mahesh Bandewar (महेश बंडेवार) 
        <maheshb@...gle.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Alexei Starovoitov <ast@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        Kernel Team <kernel-team@...com>
Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

On Wed, Mar 14, 2018 at 8:37 PM, Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
> On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
>> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
>> > <alexei.starovoitov@...il.com> wrote:
>> >>
>> >>> It seems this is exactly the case where a netns would be the correct answer.
>> >>
>> >> Unfortunately that's not the case. That's what I tried to explain
>> >> in the cover letter:
>> >> "The setup involves per-container IPs, policy, etc, so traditional
>> >> network-only solutions that involve VRFs, netns, acls are not applicable."
>> >> To elaborate more on that:
>> >> netns is l2 isolation.
>> >> vrf is l3 isolation.
>> >> whereas to containerize an application we need to punch connectivity holes
>> >> in these layered techniques.
>> >> We also considered resurrecting Hannes's afnetns work
>> >> and even went as far as designing a new namespace for L4 isolation.
>> >> Unfortunately, all hierarchical namespace abstractions don't work.
>> >> To run an application inside a cgroup container when it was not
>> >> written with containers in mind, we have to create the illusion
>> >> that it is running in a non-containerized environment.
>> >> In some cases we remember the port and container id in a bpf map
>> >> in the post-bind hook, and when some other task in a different
>> >> container tries to connect to a service, we need to know where
>> >> that service is running. It can be remote or it can be local.
>> >> Both client and service may or may not be written with containers
>> >> in mind, and this sockaddr rewrite provides connectivity and load
>> >> balancing features that you simply cannot get with hierarchical
>> >> networking primitives.
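
To make the mechanism concrete, here is a minimal sketch of the
connect-time half using the proposed BPF_PROG_TYPE_CGROUP_SOCK_ADDR
hooks. The services map, its key/value layout, and the modern libbpf
declaration syntax are illustrative assumptions, not part of this
patchset; a companion post-bind program plus a userspace control plane
would populate the map:

/* Sketch only: services maps the (virtual IP, port) a client connects
 * to onto the (real IP, port) where the service actually runs. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
	__u32 vip;	/* service address, network byte order */
	__u16 port;	/* service port, network byte order */
	__u16 pad;
};

struct svc_backend {
	__u32 ip;	/* real address of the running service */
	__u16 port;
	__u16 pad;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, struct svc_key);
	__type(value, struct svc_backend);
} services SEC(".maps");

SEC("cgroup/connect4")
int rewrite_connect4(struct bpf_sock_addr *ctx)
{
	struct svc_key key = {
		.vip  = ctx->user_ip4,		/* where the app thinks it connects */
		.port = (__u16)ctx->user_port,
	};
	struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

	if (be) {
		/* Transparently redirect to the real backend. */
		ctx->user_ip4  = be->ip;
		ctx->user_port = be->port;
	}
	return 1;	/* let the connect() proceed */
}

char _license[] SEC("license") = "GPL";

Attached to a container's cgroup, this lets an unmodified client
connect to a well-known service address while the kernel steers it to
wherever the service actually bound.
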
>> >
>> > I have to explain this a bit further...
>> > We also considered hacking these 'connectivity holes' into
>> > netns and/or vrf, but that would be a real layering violation,
>> > since a clean l2 or l3 abstraction would suddenly support
>> > something that breaks through the layers.
>> > Just like many consider ipvlan a bad hack that punches
>> > through the layers and connects the l2 abstraction of netns
>> > at the l3 layer, this is not something the kernel should ever do.
>> > We really didn't want another ipvlan-like hack in the kernel.
>> > Instead, bpf programs at bind/connect time _help_
>> > applications discover and connect to each other.
>> > All containers are running in init_netns and there are no vrfs.
>> > After bind/connect the normal fib/neighbor core networking
>> > logic works as it should always do. The whole system is
>> > clean from network point of view.
>>
>>
>> We apparently missed something when deploying ipvlan and one netns per
>> container/job
>
> Hannes expressed the reasons why RHEL doesn't support ipvlan long ago.

I had a long discussion with Hannes, and there are two pending issues
(discounting minor bug fixes / improvements): (a) multicast group
membership and (b) early demux. Multicast group membership is just a
matter of adding some code to fix it, while early demux is a little
harder to fix without violating isolation boundaries. To me isolation
is critical, and if we find the right solution that doesn't violate
isolation, we'll fix it.

> I couldn't find a link to the full discussion. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
> Since ipvlan works for you, great, but it's clearly a layering violation.
> ipvlan connects L2 namespaces via L3 by doing its own fib lookups.
> To me that is by definition 'punching a connectivity hole' in the L2 abstraction.
> In a normal L2 setup of netns+veth, traffic from one netns should
> have gone into the other netns via the full L2 path. ipvlan cheats by
> giving L3 connectivity. It's not clean to me.

IPvlan supports three different modes, and you have mixed all of them
while explaining your understanding of IPvlan. One probably needs to
digest all these modes and evaluate them in the context of their use
case. Well, I'm not even going to attempt to explain the differences;
if you were serious, you could have figured them out.
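
(For reference, the three modes are l2, l3, and l3s, selected when the
slave link is created; a minimal iproute2 sketch, with eth0 as an
assumed master device:

  # l2: the slave does its own L2 processing (ARP, multicast)
  ip link add link eth0 name ipvl0 type ipvlan mode l2
  # l3: L2 stays on the master; the slave handles only L3 and above
  ip link add link eth0 name ipvl1 type ipvlan mode l3
  # l3s: like l3, but ingress also traverses netfilter/conntrack
  ip link add link eth0 name ipvl2 type ipvlan mode l3s

Each mode draws the isolation boundary at a different layer, which is
why lumping them together misstates what ipvlan does.)
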

There are lots of use cases, and people use it in interesting ways.
Each case can be better handled by VRF, or macvlan, or IPvlan, or
whatever else is out there. It would be childish to say one use case
is better than the others, as these are *different* use cases. All
these solutions come with their own caveats, and you choose what you
can live with. Well, you can always improve, and I can see the Red Hat
folks are doing that, and I appreciate their efforts.

Like I said, there are several different ways to make this work with
namespaces in a much cleaner way, and IPvlan does not need to be
involved. However, adding another eBPF hook just because we can, in a
hackish way, is *not* the right way. Especially when the problem has
already been solved (with namespaces), these 2000 lines don't deserve
to be in the kernel. eBPF is a good tool, and there is a thin line
between using it appropriately and misusing it. I don't want to argue,
and we can agree to disagree!


> There are still neighbour
> tables in netnses that are duplicated.
> Because netns is L2, there is full requeuing for traffic across netnses.
> I guess Google doesn't prioritize container-to-container traffic,
> while traffic from outside into a netns via ipvlan works ok, similar
> to bond, but imo it's cheating too.
> imo afnetns would have been a much better alternative for your
> use case, without the ipvlan pitfalls, but as you said, ipvlan is
> already in the tree and afnetns is not.
> With afnetns, early demux would have worked not only for traffic from
> the network, but also for traffic across afnetns-es.
>
Isolation is a critical piece of our puzzle, and none of the
suggestions you have given solves it. cgroups clearly don't! However,
those could be good solutions in some other use cases.

>> I find netns isolation very clean, powerful, and it is there already.
>
> netns+veth is a clean abstraction, but netns+ipvlan is imo not.
> imo VRF is another clean L3 abstraction. Yet some folks tried
> to do VRF-like things with netns.
> David Ahern wrote a nice blog post about the issues with that.
> I suspect VRF also could have worked for google use case
> and would have been easier to use than netns+ipvlan.
> But since ipvlan works for you in the current shape, great,
> I'm not going to argue further.
> Let's agree to disagree on cleanliness of the solution.
>
>> It also works with UDP just fine. Are you considering adding a hook
>> later for sendmsg() (unconnected socket or not), or do you want to use
>> the existing one in ip_finish_output(), adding per-packet overhead?
>
> Currently that's indeed the case. The existing cgroup-bpf hooks
> at ip_finish_output work for many use cases, but the per-packet
> overhead is bad. With bind/connect hooks we avoid that overhead for
> good traffic (which is tcp and connected udp). We still need
> to solve it for unconnected udp. The rough idea is to do a similar
> sockaddr rewrite/drop in the unconnected part of udp_sendmsg.
>
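
To make that rough idea concrete, here is a minimal sketch of the same
rewrite applied per-sendmsg for unconnected UDP. It assumes a
cgroup/sendmsg4 attach point analogous to the connect4 hook (that
attach point is not part of this RFC) and reuses the services map and
svc_key/svc_backend types from the sketch earlier in the thread:

SEC("cgroup/sendmsg4")
int rewrite_sendmsg4(struct bpf_sock_addr *ctx)
{
	/* Same lookup as at connect() time, but run once per datagram,
	 * since an unconnected socket picks a destination per sendmsg(). */
	struct svc_key key = {
		.vip  = ctx->user_ip4,
		.port = (__u16)ctx->user_port,
	};
	struct svc_backend *be = bpf_map_lookup_elem(&services, &key);

	if (be) {
		ctx->user_ip4  = be->ip;
		ctx->user_port = be->port;
	}
	return 1;	/* returning 0 would drop the datagram instead */
}

Unlike the hook at ip_finish_output, this runs per sendmsg() call
rather than per packet, which is where the overhead saving would come
from.
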
