netdev - Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190326150712.4zi6idzva3ivaupg@ast-mbp>
Date:   Tue, 26 Mar 2019 08:07:14 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Eric Dumazet <erdnetdev@...il.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>, brakmo <brakmo@...com>,
        netdev <netdev@...r.kernel.org>, Martin Lau <kafai@...com>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP

On Tue, Mar 26, 2019 at 01:06:23AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> >>>
> >>>
> >>> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> >>
> >>>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >>>> Please realize that existing qdiscs already doing this.
> >>>> The patchset allows bpf-cgroup to do the same.
> >>>
> >>> Not the same thing I am afraid.
> >>
> >> To be clear Alexei :
> >>
> >> Existing qdisc set CE mark on a packet, exactly like a router would do.
> >> Simple and universal.
> >> This can be stacked, and done far away from the sender.
> >>
> >> We do not _call_ back local TCP to propagate cn.
> > 
> > How do you classify NET_XMIT_CN ?
> 
> 
> > It's exactly local call back to indicate CN into tcp from layers below tcp.
> > tc-bpf prog returning 'drop' code means drop+cn
> > whereas cgroup-bpf prog returning 'drop' means drop only.
> > This patch set is fixing this discrepancy.
> 
> 
> Except this does not work universally.
> team or bonding or any tunnel wont propagate this signal.
> (This is not something that can be fixed either, since qdisc can be installed there)
> 
> This is a wrong design, and we want to get rid of it, and not 'fix it'.

so after 20+ years linux qdisc design is wrong?
bpf is about choice. We have to give people tools to experiment even
when we philosophically disagree on the design.
We need to make sure that new changes don't hurt existing use cases though.
Performance bar has to remain high.

> Back to my original concerns :
> 
> 1) Larry just upstreamed the bpf_skb_ecn_set_ce().
>    This part was fine, we did the same in fq_codel an fq already.

ecn doesn't work with all congestion controls.

> 2)
>  This ability to inject directly to TCP stack local drops is exactly confirming
>  why I complained earlier about why policers are often wrong. Trying to cope with them
>  is a nightmare.
> 
> I feel you need this because your ebpf code was only able to accept or drop a packet (aka a policer)

I've tried to explain several times that it's not about policer.
That's why we called it NRM. Network Resource Manager.
The goal is to control neworking resources in containerized environment
where container scope is a cgroup.
we could have done some of this logic from tc-bpf layer, but overhead is prohibitive there,
since tc sees all packets while we need to control only certain cgroups and
work properly within cgroup hierarchy.

> EDT model allows for extending the skb->tstamp and not breaking TSQ logic : no local drops.

we already discussed this. EDT is not applicable in all cases.

> One of the ECN goal is to avoid drops in the first place, so adding something to favor
> local drops sounds a step in the wrong direction.

according to this statement BBR is wrong design then, since it ignores ecn?

Anyway, to move forward...
We're going to explore few ways to reduce tailcall pessimization in patch 4 and
will proceed with the rest as-is.