netdev - Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <11584665-e3b5-3afe-d72c-ef92ad0b9313@iogearbox.net>
Date:   Wed, 15 Sep 2021 11:45:36 +0200
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Lukas Wunner <lukas@...ner.de>
Cc:     Pablo Neira Ayuso <pablo@...filter.org>,
        Jozsef Kadlecsik <kadlec@...filter.org>,
        Florian Westphal <fw@...len.de>,
        netfilter-devel@...r.kernel.org, coreteam@...filter.org,
        netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Thomas Graf <tgraf@...g.ch>,
        Laura Garcia Liebana <nevola@...il.com>,
        John Fastabend <john.fastabend@...il.com>
Subject: Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook

Hi Lukas,

On 9/11/21 11:26 PM, Lukas Wunner wrote:
> On Tue, Jan 26, 2021 at 08:13:19PM +0100, Daniel Borkmann wrote:
>> On 1/22/21 9:47 AM, Lukas Wunner wrote:
[...]
>> Wrt above objection, what needs to be done for the minimum case is to
>> i) fix the asymmetry problem from here, and
>> ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *;
>> with that in place this concern should be resolved. Thanks!
> 
> Daniel, your reply above has left me utterly confused.  After thinking
> about it for a while, I'm requesting clarification what you really want.
> We do want the netfilter egress hook in mainline and we're happy to
> accommodate to your requirements, but we need you to specify them clearly.
> 
> Regarding the data path for packets which are going out from a container,
> I think you've erroneously mixed up the last two elements above:
> If the nf egress hook is placed after tc egress (as is done in the present
> patch), the data path is actually as follows:
> 
>    [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] ->
>    [tc egress (phys,hostns)] -> [nf egress (phys,hostns)*]
> 
> By contrast, if nf egress is put in front of tc egress (as you're
> requesting above), the data path becomes:
> 
>    [nf egress (veth,podns)] -> [tc egress (veth,podns)] ->
>    [tc ingress (veth,hostns)] ->
>    [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)]
> 
> So this order results in an *additional* nf egress hook in the data path,
> which may cost performance.  Previously you voiced concerns that the
> nf egress hook may degrade performance.  To address that concern,
> we ordered nf egress *after* tc egress, thereby avoiding any performance
> impact as much as possible.
> 
> I agree with your argument that an inverse order vis-a-vis ingress is
> more logical because it avoids NAT incongruencies etc.  So I'll be happy
> to reorder nf egress before tc egress.  I'm just confused that you're now
> requesting an order which may be *more* detrimental to performance.

Right, the issue is that placing it either in front or after the existing tc
egress hook just as-is results in layering violations given both tc/nf subsystems
will inevitably step on each other when both dataplanes are used by different
components where things break. To provide a walk-through on what I think would
work w/o breaking layers:

1) Packet going up hostns stack. Here, we'll end up with:

    [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
      upper stack ->
        [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

2) Packet going up podns stack. Here, if the forwarding happens entirely in tc
    layer, then the datapath (as-is today) operates _below_ nf and must look like:

    [tc ingress (phys,hostns)] -> [tc egress (veth,hostns)] ->
      netns switch -> (*) -> netns switch ->
        [tc ingress (veth,hostns)] -> [tc egress (phys,hostns)]

    For simplicity, I left out (*), but it's essentially the same as case 1)
    just for the Pod's/container's stack PoV.

3) Packet is locally forwarded between Pods. Same as 2) for the case where the
    forwarding happens entirely in tc (as-is today) which operates _below_ nf and
    must look like:

    (+) -> [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] ->
      netns switch -> (*) -> netns switch ->
        [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)

    The (+) denotes the src/sink where we enter/exit the hostns after netns switch.

The skb bit marker would be that tc [& BPF]-related redirect functions are internally
setting that bit, so that nf egress hook is skipped for cases like 2)/3).

Walking through a similar 1/2/3) scenario from nf side one layer higher if you were
to do an equivalent, things would look like:

1) Packet going up hostns stack. Same as above:

    [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
      upper stack ->
        [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

2) Packet going up podns stack with forwarding from nf side:

    [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
      [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
        netns switch -> (*) -> netns switch ->
          [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
            [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

3) Packet is locally forwarded between Pods with forwarding from nf side:

    (+) -> [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
      [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
        netns switch -> (*) -> netns switch ->
          [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
            [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)

Thanks,
Daniel