[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <11584665-e3b5-3afe-d72c-ef92ad0b9313@iogearbox.net>
Date: Wed, 15 Sep 2021 11:45:36 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Lukas Wunner <lukas@...ner.de>
Cc: Pablo Neira Ayuso <pablo@...filter.org>,
Jozsef Kadlecsik <kadlec@...filter.org>,
Florian Westphal <fw@...len.de>,
netfilter-devel@...r.kernel.org, coreteam@...filter.org,
netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Thomas Graf <tgraf@...g.ch>,
Laura Garcia Liebana <nevola@...il.com>,
John Fastabend <john.fastabend@...il.com>
Subject: Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook
Hi Lukas,
On 9/11/21 11:26 PM, Lukas Wunner wrote:
> On Tue, Jan 26, 2021 at 08:13:19PM +0100, Daniel Borkmann wrote:
>> On 1/22/21 9:47 AM, Lukas Wunner wrote:
[...]
>> Wrt above objection, what needs to be done for the minimum case is to
>> i) fix the asymmetry problem from here, and
>> ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *;
>> with that in place this concern should be resolved. Thanks!
>
> Daniel, your reply above has left me utterly confused. After thinking
> about it for a while, I'm requesting clarification what you really want.
> We do want the netfilter egress hook in mainline and we're happy to
> accommodate to your requirements, but we need you to specify them clearly.
>
> Regarding the data path for packets which are going out from a container,
> I think you've erroneously mixed up the last two elements above:
> If the nf egress hook is placed after tc egress (as is done in the present
> patch), the data path is actually as follows:
>
> [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] ->
> [tc egress (phys,hostns)] -> [nf egress (phys,hostns)*]
>
> By contrast, if nf egress is put in front of tc egress (as you're
> requesting above), the data path becomes:
>
> [nf egress (veth,podns)] -> [tc egress (veth,podns)] ->
> [tc ingress (veth,hostns)] ->
> [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)]
>
> So this order results in an *additional* nf egress hook in the data path,
> which may cost performance. Previously you voiced concerns that the
> nf egress hook may degrade performance. To address that concern,
> we ordered nf egress *after* tc egress, thereby avoiding any performance
> impact as much as possible.
>
> I agree with your argument that an inverse order vis-a-vis ingress is
> more logical because it avoids NAT incongruencies etc. So I'll be happy
> to reorder nf egress before tc egress. I'm just confused that you're now
> requesting an order which may be *more* detrimental to performance.
Right, the issue is that placing it either in front or after the existing tc
egress hook just as-is results in layering violations given both tc/nf subsystems
will inevitably step on each other when both dataplanes are used by different
components where things break. To provide a walk-through on what I think would
work w/o breaking layers:
1) Packet going up hostns stack. Here, we'll end up with:
[tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
upper stack ->
[nf egress (phys,hostns)] -> [tc egress (phys,hostns)]
2) Packet going up podns stack. Here, if the forwarding happens entirely in tc
layer, then the datapath (as-is today) operates _below_ nf and must look like:
[tc ingress (phys,hostns)] -> [tc egress (veth,hostns)] ->
netns switch -> (*) -> netns switch ->
[tc ingress (veth,hostns)] -> [tc egress (phys,hostns)]
For simplicity, I left out (*), but it's essentially the same as case 1)
just for the Pod's/container's stack PoV.
3) Packet is locally forwarded between Pods. Same as 2) for the case where the
forwarding happens entirely in tc (as-is today) which operates _below_ nf and
must look like:
(+) -> [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] ->
netns switch -> (*) -> netns switch ->
[tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)
The (+) denotes the src/sink where we enter/exit the hostns after netns switch.
The skb bit marker would be that tc [& BPF]-related redirect functions are internally
setting that bit, so that nf egress hook is skipped for cases like 2)/3).
Walking through a similar 1/2/3) scenario from nf side one layer higher if you were
to do an equivalent, things would look like:
1) Packet going up hostns stack. Same as above:
[tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
upper stack ->
[nf egress (phys,hostns)] -> [tc egress (phys,hostns)]
2) Packet going up podns stack with forwarding from nf side:
[tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
[nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
netns switch -> (*) -> netns switch ->
[tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
[nf egress (phys,hostns)] -> [tc egress (phys,hostns)]
3) Packet is locally forwarded between Pods with forwarding from nf side:
(+) -> [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
[nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
netns switch -> (*) -> netns switch ->
[tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
[nf egress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)
Thanks,
Daniel
Powered by blists - more mailing lists