netdev - Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180314180006.z5tphbscw7gotqyl@ast-mbp>
Date:   Wed, 14 Mar 2018 11:00:08 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>, davem@...emloft.net,
        daniel@...earbox.net, netdev@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

On Tue, Mar 13, 2018 at 11:21:08PM -0700, Eric Dumazet wrote:
> 
> If I understand well,  strace(1) will not show the real (after modification
> by eBPF) IP/port ?

correct. Just like it won't show anything after syscall entry, whether
lsm acted, seccomp, etc

> What about selinux and other LSM ?

clearly lsm is not place to do ip/port enforcement for containers.
lsm in general is missing post-bind lsm hook and visibility in cgroups.
This patch set is not about policy, but more about connectivity.
That's why sockaddr rewrite is must have.

> We have now network namespaces for full isolation. Soon ILA will come.

we're already using a form of ila. That's orthogonal to this feature.

> The argument that it is not convenient (or even possible) to change the
> application or using modern isolation is quite strange, considering the

just like any other datacenter there are thousands of third party
applications that we cannot control. Including open source code
written by google. Would golang switch to use glibc? I very much doubt.
Statically linked apps also don't work with ld_preload.

> added burden/complexity/bloat to the kernel.

bloat? that's very odd to hear. bpf is very much anti-bloat technique.
If you were serious with that comment, please argue with tracing folks
who add thousand upon thousand lines of code to the kernel to do
hard coded things while bpf already does all that and more
without any extra kernel code.

> The post hook for sys_bind is clearly a failure of the model, since
> releasing the port might already be too late, another thread might fail to
> get it during a non zero time window.

I suspect commit log wasn't clear. In post-bind hook we don't release
the port, we only fail sys_bind and user space will eventually close
the socket and release the port.
I don't think it's safe to call inet_put_port() here. It is also
racy as you pointed out.

> If you want to provide an alternate port allocation strategy, better provide
> a correct eBPF hook.

right. that's another separate work indepedent from this feature.
port allocation/free from bpf via helper is also necessary, but
for different use case.

> It seems this is exactly the case where a netns would be the correct answer.

Unfortuantely that's not the case. That's what I tried to explain
in the cover letter:
"The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable."
To elaborate more on that:
netns is l2 isolation.
vrf is l3 isolation.
whereas to containerize an application we need to punch connectivity holes
in these layered techniques.
We also considered resurrecting Hannes's afnetns work
and even went as far as designing a new namespace for L4 isolation.
Unfortunately all hierarchical namespace abstraction don't work.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature that you simply cannot do
with hierarchical networking primitives.

btw the per-container policy enforcement of ip+port via these hooks
wasn't our planned feature. It was requested by other folks and
we had to tweak the api a little bit to satisfy ours and theirs requirement.