Message-ID: <20180315033657.jj7ozjx66p27h3ar@ast-mbp>
Date:   Wed, 14 Mar 2018 20:37:00 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        Kernel Team <kernel-team@...com>
Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> > <alexei.starovoitov@...il.com> wrote:
> >>
> >>> It seems this is exactly the case where a netns would be the correct answer.
> >>
> >> Unfortunately, that's not the case. That's what I tried to explain
> >> in the cover letter:
> >> "The setup involves per-container IPs, policy, etc, so traditional
> >> network-only solutions that involve VRFs, netns, acls are not applicable."
> >> To elaborate more on that:
> >> netns is L2 isolation.
> >> vrf is L3 isolation.
> >> whereas to containerize an application we need to punch connectivity holes
> >> in these layered techniques.
> >> We also considered resurrecting Hannes's afnetns work
> >> and even went as far as designing a new namespace for L4 isolation.
> >> Unfortunately, all hierarchical namespace abstractions don't work here.
> >> To run an application inside a cgroup container when it was not written
> >> with containers in mind, we have to create the illusion of running
> >> in a non-containerized environment.
> >> In some cases we remember the port and container id in a bpf map
> >> at the post-bind hook, and when some other task in a different
> >> container tries to connect to a service, we need to know where that
> >> service is running. It can be remote or local. Both client and
> >> service may or may not have been written with containers in mind,
> >> and this sockaddr rewrite provides connectivity and load balancing
> >> that you simply cannot get from hierarchical networking primitives.
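
To make the quoted mechanism concrete, the post-bind half could look
roughly like the sketch below. It is untested and purely illustrative:
the section name and bpf_sock fields follow what this series proposes,
and the container id is a made-up placeholder.

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct service_loc {
	__u32 ip4;		/* address the service bound to, network order */
	__u64 container_id;	/* placeholder: whatever container id we track */
};

struct bpf_map_def SEC("maps") services = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(__u16),		/* service port */
	.value_size = sizeof(struct service_loc),
	.max_entries = 65536,
};

SEC("cgroup/post_bind4")
int post_bind4(struct bpf_sock *sk)
{
	__u16 port = sk->src_port;	/* host byte order */
	struct service_loc loc = {
		.ip4 = sk->src_ip4,	/* network byte order */
		.container_id = 42,	/* hypothetical id of this container */
	};

	/* remember where this service actually lives */
	bpf_map_update_elem(&services, &port, &loc, BPF_ANY);
	return 1;			/* 1 = let the bind complete */
}

char _license[] SEC("license") = "GPL";
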
> > 
> > I have to explain this a bit further...
> > We also considered hacking these 'connectivity holes' into
> > netns and/or vrf, but that would be a real layering violation,
> > since a clean L2 or L3 abstraction would suddenly support
> > something that breaks through the layers.
> > Many consider ipvlan a bad hack precisely because it punches
> > through the layers and connects the L2 abstraction of netns
> > at the L3 layer; that is not something the kernel should ever do,
> > and we really didn't want another ipvlan-like hack in the kernel.
> > Instead, bpf programs at bind/connect time _help_
> > applications discover and connect to each other.
> > All containers run in init_netns and there are no vrfs.
> > After bind/connect, the normal fib/neighbour core networking
> > logic works as it always does. The whole system stays
> > clean from the network point of view.
> 
> 
> We apparently missed something when deploying ipvlan and one netns per
> container/job.

Hannes laid out the reasons why RHEL doesn't support ipvlan long ago.
I couldn't find the full thread; this message mentions some of the issues:
https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
Since ipvlan works for you, great, but it's clearly a layering violation:
ipvlan connects L2 namespaces at L3 by doing its own fib lookups.
To me that is the definition of 'punching a connectivity hole' in the
L2 abstraction. In a normal L2 setup of netns+veth, traffic from one
netns would have gone into the other netns via full L2. ipvlan cheats
by providing L3 connectivity instead. It's not clean to me, and the
neighbour tables are still duplicated across netnses.
Because netns is an L2 abstraction, there is full requeuing for traffic
that crosses netnses. I guess google doesn't prioritize
container-to-container traffic, while traffic from outside into a netns
via ipvlan works OK, similar to bonding, but imo that's cheating too.
imo afnetns would have been a much better alternative for your
use case, without the ipvlan pitfalls, but as you said, ipvlan is
already in the tree and afnetns is not.
With afnetns, early demux would have worked not only for traffic from
the network, but also for traffic across afnetns-es.

> I find netns isolation very clean, powerful, and it is there already.

netns+veth is a clean abstraction, but netns+ipvlan imo is not.
VRF is another clean L3 abstraction, yet some folks have tried
to do VRF-like things with netns; David Ahern wrote a nice blog post
about the issues with that.
I suspect VRF could also have worked for the google use case
and would have been easier to use than netns+ipvlan.
But since ipvlan works for you in its current shape, great,
I'm not going to argue further.
Let's agree to disagree on the cleanliness of the solution.

> It also works with UDP just fine. Are you considering adding a hook
> later for sendmsg() (unconnected socket or not), or do you want to use
> the existing one in ip_finish_output(), adding per-packet overhead?

Currently that's indeed the case. The existing cgroup-bpf hooks
at ip_finish_output work for many use cases, but the per-packet
overhead is bad. With the bind/connect hooks we avoid that overhead
for well-behaved traffic (tcp and connected udp). We still need
to solve it for unconnected udp. The rough idea is to do a similar
sockaddr rewrite/drop in the unconnected path of udp_sendmsg.
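
To sketch the connect-time half as well (same caveats: untested, the
struct bpf_sock_addr fields and section name follow what this series
proposes, and the services map mirrors the post-bind sketch earlier in
the thread; in a real deployment both programs would share one object
file and one map):

#include <linux/bpf.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

struct service_loc {
	__u32 ip4;
	__u64 container_id;
};

/* same map the post-bind program fills in */
struct bpf_map_def SEC("maps") services = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(__u16),
	.value_size = sizeof(struct service_loc),
	.max_entries = 65536,
};

SEC("cgroup/connect4")
int connect4(struct bpf_sock_addr *ctx)
{
	/* user_ip4/user_port hold the sockaddr the task passed to
	 * connect(), both in network byte order */
	__u16 port = bpf_ntohs(ctx->user_port);
	struct service_loc *loc;

	loc = bpf_map_lookup_elem(&services, &port);
	if (!loc)
		return 1;	/* unknown service, leave the sockaddr alone */

	/* rewrite the destination to where the service actually runs
	 * before the connect proceeds */
	ctx->user_ip4 = loc->ip4;
	return 1;		/* 1 = proceed with the rewritten connect */
}

char _license[] SEC("license") = "GPL";

Both programs attach to the container's cgroup via BPF_PROG_ATTACH like
the existing cgroup-bpf hooks, so nothing per-packet is left in the
datapath, and the same shape should carry over nearly as-is to an
unconnected udp_sendmsg hook once we add one.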
