netdev - Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <97dc8c66-9701-7970-bb38-750d79f767c8@gmail.com>
Date:   Wed, 14 Mar 2018 17:17:54 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        Kernel Team <kernel-team@...com>
Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind



On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortuantely that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstraction don't work.
>> To run an application inside cgroup container that was not written
>> with containers in mind we have to make an illusion of running
>> in non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
> 
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_nens and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.


We apparently missed something when deploying ipvlan and one netns per
container/job

Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.

It also works with UDP just fine. Are you considering adding a hook
later for sendmsg() (unconnected socket or not), or do you want to use
the existing one in ip_finish_output(), adding per-packet overhead ?

This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.