netdev - RE: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc

Message-ID:
 <AM6PR07MB4456C6F8010187F5783D959BB94A2@AM6PR07MB4456.eurprd07.prod.outlook.com>
Date: Mon, 28 Oct 2024 18:37:11 +0000
From: "Koen De Schepper (Nokia)" <koen.de_schepper@...ia-bell-labs.com>
To: Dave Taht <dave.taht@...il.com>, "Chia-Yu Chang (Nokia)"
	<chia-yu.chang@...ia-bell-labs.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "davem@...emloft.net"
	<davem@...emloft.net>, "stephen@...workplumber.org"
	<stephen@...workplumber.org>, "jhs@...atatu.com" <jhs@...atatu.com>,
	"edumazet@...gle.com" <edumazet@...gle.com>, "kuba@...nel.org"
	<kuba@...nel.org>, "pabeni@...hat.com" <pabeni@...hat.com>,
	"dsahern@...nel.org" <dsahern@...nel.org>, "ij@...nel.org" <ij@...nel.org>,
	"ncardwell@...gle.com" <ncardwell@...gle.com>, "g.white@...lelabs.com"
	<g.white@...lelabs.com>, "ingemar.s.johansson@...csson.com"
	<ingemar.s.johansson@...csson.com>, "mirja.kuehlewind@...csson.com"
	<mirja.kuehlewind@...csson.com>, "cheshire@...le.com" <cheshire@...le.com>,
	"rs.ietf@....at" <rs.ietf@....at>, "Jason_Livingood@...cast.com"
	<Jason_Livingood@...cast.com>, "vidhi_goel@...le.com" <vidhi_goel@...le.com>,
	Olga Albisser <olga@...isser.org>, "Olivier Tilmans (Nokia)"
	<olivier.tilmans@...ia.com>, Henrik Steen <henrist@...rist.net>, Bob Briscoe
	<research@...briscoe.net>
Subject: RE: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc

See below,

Regards,
Koen.

> -----Original Message-----
> From: Dave Taht <dave.taht@...il.com> 
> Sent: Saturday, October 26, 2024 8:57 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@...ia-bell-labs.com>
> Cc: netdev@...r.kernel.org; davem@...emloft.net; stephen@...workplumber.org; jhs@...atatu.com; edumazet@...gle.com; kuba@...nel.org; pabeni@...hat.com; dsahern@...nel.org; ij@...nel.org; ncardwell@...gle.com; Koen De Schepper (Nokia) <koen.de_schepper@...ia-bell-labs.com>; g.white@...lelabs.com; ingemar.s.johansson@...csson.com; mirja.kuehlewind@...csson.com; cheshire@...le.com; rs.ietf@....at; Jason_Livingood@...cast.com; vidhi_goel@...le.com; Olga Albisser <olga@...isser.org>; Olivier Tilmans (Nokia) <olivier.tilmans@...ia.com>; Henrik Steen <henrist@...rist.net>; Bob Briscoe <research@...briscoe.net>
> Subject: Re: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc


> Has this been tested mq->an_aqm_queue_per_core or just as a
> htb+dualpi, and on what platforms?

It is a qdisc that should work in any combination. We mainly tested with HTB, directly on the real interface and with multiple instances in namespaces. We didn't test all the combinations. Did you see any indication that made you expect problems?

> I was also under the impression that 2ms was a more robust target from
> tests given typical scheduling delays and virtualization.

It is a parameter with a default of 1ms, which is a very achievable target on Ethernet links. If it is not achievable in certain deployments, it can be relaxed with a simple parameter change. On wireless links, dedicated integration with the driver is needed for best performance, but that is outside the scope of this AQM.

> It appears that gso-splitting is the default? What happens with that off?

It might work under certain environment conditions or with certain combinations of more relaxed parameters, but it will create problems in other cases. Do you suggest we should force gso-splitting always on, without the option? I guess it is useful to be able to disable it if the conditions are present in certain deployments?

> What would be a good DC setting?

The alpha and beta parameters do not need to be set directly. The easy way to configure DualPI2 is to set the typical RTT and the max RTT; the optimal parameters are derived from those. So, for a DC it could be:
     Auto-configuring parameters using [max_rtt: 5ms, typical_rtt: 100us]: target=100us tupdate=100us alpha=0.400000 beta=60.000002
Showing the following full config:
     qdisc dualpi2 1: root refcnt 17 limit 10000p target 100us tupdate 100us alpha 0.394531 beta 59.996094 l4s_ect coupling_factor 2 drop_on_overload step_thresh 1ms drop_dequeue split_gso classic_protection 10%
Or any other typical values can be used.
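
For illustration only, the derivation behind the auto-configured values above can be sketched as follows. This assumes the heuristics given in RFC 9332 Appendix A (target = typical RTT, tupdate = min(target, max RTT / 3), alpha = 0.1 * tupdate / max_rtt^2, beta = 0.3 / max_rtt); the struct and function names are hypothetical and this is not the code from the patch or from tc. With max_rtt 5ms and typical_rtt 100us it reproduces target=100us, tupdate=100us, alpha=0.4 and beta=60. The slightly different values in the full config dump are presumably due to rounding when the gains are stored.

```c
#include <assert.h>

/* Hypothetical container for the derived settings (illustrative names). */
struct dualpi2_auto {
	double target_s;	/* queue delay target, in seconds */
	double tupdate_s;	/* PI update interval, in seconds */
	double alpha;		/* integral gain */
	double beta;		/* proportional gain */
};

/*
 * Derive the PI parameters from the two RTT inputs (both in seconds).
 * Assumption: the RFC 9332 Appendix A heuristics, not the patch itself.
 */
static struct dualpi2_auto dualpi2_autoconf(double max_rtt_s,
					    double typical_rtt_s)
{
	struct dualpi2_auto p;
	double third = max_rtt_s / 3.0;

	assert(max_rtt_s > 0.0 && typical_rtt_s > 0.0);

	p.target_s = typical_rtt_s;
	p.tupdate_s = typical_rtt_s < third ? typical_rtt_s : third;
	p.alpha = 0.1 * p.tupdate_s / (max_rtt_s * max_rtt_s);
	p.beta = 0.3 / max_rtt_s;
	return p;
}
```

Calling dualpi2_autoconf(0.005, 0.0001) corresponds to the DC example above.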

We will clarify this better in the description and man pages and promote the "simple and safe parameters". Probably we should also list the "experiment at own risk" parameters...

>> +        name: maxq
>> +        type: u32
>> +        doc: Maximum number of packets seen in the DualPI2
>
> Seen "by". Also this number will tend towards a peak and stay there,
> and thus is not a particularly useful stat.

Thanks, will be fixed. The stats can be reset, so this can be used to find the peak queue occupancy in an interval. It can be removed if people think it is not useful.

>> +        name: ecn_mark
>> +        type: u32
>> +        doc: All packets marked with ecn
>
>Since this has higher rates of marking than drop perhaps this should be 64 bits.

All packet counters are typically 32 bits. Would need to be changed in a lot of qdiscs...
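
As a rough illustration of the trade-off, the time until a 32-bit counter wraps can be estimated as below. The function name and the 1 million marks per second rate are assumptions chosen for the example, not measurements; at that rate a u32 counter wraps after roughly 4295 seconds (about 72 minutes).

```c
#include <assert.h>

/* Seconds until an unsigned 32-bit event counter wraps at a constant rate. */
static double u32_wrap_seconds(double events_per_sec)
{
	assert(events_per_sec > 0.0);
	return 4294967296.0 / events_per_sec;	/* 2^32 events */
}
```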

>>      name: tc-fq-pie-xstats
>
>? fq-pie?

Thanks, typo that will be fixed.

>> +        name: limit
>> +        type: u32
>> +        doc: Limit of total number of packets in queue
>
>I have noted previously that memlimits make more sense than packet
>limits given the dynamic range of
>64b-64kb of a modern gso/tso packet.

All qdiscs use packet limits. Would again deviate from the common practice...

>> +        name: max_rtt
>> +        type: u32
>> +        doc: The maximum expected RTT of the traffic that is controlled by DualPI2
>
>In what units?

In the tc command it needs to be specified (although the default unit is currently us), in the data structure it is not present as it is converted to the other parameters. We can mention the default unit.

>> + * note: DCTCP is not Prague compliant, so DCTCP & DualPI2 can only be
>> + *   used in DC context; BBRv3 (overwrites bbr) stopped Prague support,
>
>This is really confusing and up until this moment I thought bbrv3 used
>an ecn marking strategy compatible with prague.

As far as I know, the BBRv3 ECN implementation does not implement all Prague requirements and is not intended for use on the Internet or for real-time interactive apps. Tests show that BBR's RTT probes pause the throughput unnecessarily and that it still does throughput probes (creating unnecessary latency spikes). We will verify with the BBR maintainers and clarify the text.

>> + *   you should use TCP-Prague instead for low latency apps
>
>This is kind of opinionated.

We will change it into "should use a Prague compliant CC for use on the Internet".

>> +       if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
>> +               qdisc_qstats_overlimit(sch);
>> +               if (skb_in_l_queue(skb))
>> +                       qdisc_qstats_overlimit(q->l_queue);
>
>shouldn't this be:
>
>               if (skb_in_l_queue(skb))
>                       qdisc_qstats_overlimit(q->l_queue);
>                else
>                       qdisc_qstats_overlimit(sch);

No, it increments 2 different counters. At the first level we keep the overall stats and increment for all packets (although the queue at this level only contains the C-queue); at the l_queue level we keep the L-stats only. If C-stats only are required, the L-stats need to be subtracted from the overall stats.
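
A minimal sketch of this two-level accounting, with a plain struct standing in for the qdisc stats (all names here are hypothetical, not the kernel API): every overlimit event bumps the overall counter, L-queue packets additionally bump the L counter, and the C-only figure is derived by subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the two stats levels described above. */
struct overlimit_stats {
	uint32_t total;		/* overall level: counts every packet */
	uint32_t l_only;	/* l_queue level: counts L packets only */
};

/* Mirror of the quoted enqueue logic: always count overall, L in addition. */
static void count_overlimit(struct overlimit_stats *s, int in_l_queue)
{
	s->total++;
	if (in_l_queue)
		s->l_only++;
}

/* C-only overlimits are not stored; derive them from the two counters. */
static uint32_t c_overlimits(const struct overlimit_stats *s)
{
	assert(s->l_only <= s->total);
	return s->total - s->l_only;
}
```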

>> +/* Optionally, dualpi2 will split GSO skbs into independent skbs and enqueue
>
>By default

Thanks, will be fixed.

> I had really grave doubts as to whether L4S would work with GSO at all.

It will definitely behave differently, and most likely won't be usable for the Internet. If used as a DC AQM, it might still be possible to disable splitting. But as said before, we can remove this option. We haven’t further explored the possibilities, but don't want to prevent others from doing so.

>> +                               /* Compute the backlog adjustement that needs
>
>spelling: "adjustment"

Thanks, will be fixed.

>> +       q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
>
>... assuming gso/gro is not in use

"At least 125ms at 1Gbps" is intended. Typically, the limit causes taildrop if not big enough. True is GSO is in full use the time (and memory used could be up to 40 times bigger.