netdev - Re: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJNi1=+gAx6P4keDb9wuHoTjZnN0DNRgBEZ5cJuUcaZHg@mail.gmail.com>
Date: Thu, 31 Oct 2024 15:30:48 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: Paolo Abeni <pabeni@...hat.com>, chia-yu.chang@...ia-bell-labs.com, 
	netdev@...r.kernel.org, davem@...emloft.net, stephen@...workplumber.org, 
	jhs@...atatu.com, kuba@...nel.org, dsahern@...nel.org, ij@...nel.org, 
	koen.de_schepper@...ia-bell-labs.com, g.white@...lelabs.com, 
	ingemar.s.johansson@...csson.com, mirja.kuehlewind@...csson.com, 
	cheshire@...le.com, rs.ietf@....at, Jason_Livingood@...cast.com, 
	vidhi_goel@...le.com, Olga Albisser <olga@...isser.org>, 
	Olivier Tilmans <olivier.tilmans@...ia.com>, Henrik Steen <henrist@...rist.net>, 
	Bob Briscoe <research@...briscoe.net>
Subject: Re: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc

On Thu, Oct 31, 2024 at 2:28 PM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Tue, Oct 29, 2024 at 12:53 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 1:56 PM Paolo Abeni <pabeni@...hat.com> wrote:
> > >
> > > On 10/22/24 00:12, chia-yu.chang@...ia-bell-labs.com wrote:
> > > > +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> > > > +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> > > > +{
> > > > +     q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
> > > > +
> > > > +     q->pi2.target = 15 * NSEC_PER_MSEC;
> > > > +     q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> > > > +     q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
> > > > +     q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
> > > > +
> > > > +     q->step.thresh = 1 * NSEC_PER_MSEC;
> > > > +     q->step.in_packets = false;
> > > > +
> > > > +     dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, wl=90% */
> > > > +
> > > > +     q->ecn_mask = INET_ECN_ECT_1;
> > > > +     q->coupling_factor = 2;         /* window fairness for equal RTTs */
> > > > +     q->drop_overload = true;        /* Preserve latency by dropping */
> > > > +     q->drop_early = false;          /* PI2 drops on dequeue */
> > > > +     q->split_gso = true;
> > >
> > > This is a very unexpected default. Splitting GSO packets earlier WRT the
> > > H/W constaints definitely impact performances in a bad way.
> > >
> > > Under which condition this is expected to give better results?
> > > It should be at least documented clearly.
> >
> > I agree, it is very strange to see this orthogonal feature being
> > spread in some qdisc.
>
> IMHO it makes sense to offer this split_gso feature in the dualpi2
> qdisc because the dualpi2 qdisc is targeted at reducing latency and
> targeted mostly at hops in the last mile of the public Internet, where
> there can be orders of magnitude disparities in bandwidth between
> upstream and downstream links (e.g., packets arriving over 10G
> ethernet and leaving destined for a 10M DSL link). In such cases, GRO
> may aggregate many packets into a single skb receiving data on a fast
> ingress link, and then may want to reduce latency issues on the slow
> link by allowing smaller skbs to be enqueued on the slower egress
> link.
>
> > Also, it seems this qdisc could be a mere sch_prio queue, with two
> > sch_pie children, or two sch_fq or sch_fq_codel ?
>
> Having two independent children would not allow meeting the dualpi2
> goal to "preserve fairness between ECN-capable and non-ECN-capable
> traffic." (quoting text from https://datatracker.ietf.org/doc/rfc9332/
> ). The main issue is that there may be differing numbers of flows in
> the ECN-capable and non-ECN-capable queues, and yet dualpi2 wants to
> maintain approximate per-flow fairness on both sides. To do this, it
> uses a single qdisc with coupling of the ECN mark rate in the
> ECN-capable queue and drop rate in the non-ECN-capable queue.

Not sure I understand this argument.

The dequeue  seems to use WRR, so this means that instead of prio,
this could use net/sched/sch_drr.c,
then two PIE (with different settings) as children, and a proper
classify at enqueue to choose one queue or the other.

Reviewing ~1000 lines of code, knowing that in one year another
net/sched/sch_fq_dualpi2.c
will follow (as net/sched/sch_fq_pie.c followed net/sched/sch_pie.c )
is not exactly appealing to me.

/* Select the queue from which the next packet can be dequeued, ensuring that
+ * neither queue can starve the other with a WRR scheduler.
+ *
+ * The sign of the WRR credit determines the next queue, while the size of
+ * the dequeued packet determines the magnitude of the WRR credit change. If
+ * either queue is empty, the WRR credit is kept unchanged.
+ *
+ * As the dequeued packet can be dropped later, the caller has to perform the
+ * qdisc_bstats_update() calls.
+ */
+static struct sk_buff *dequeue_packet(struct Qdisc *sch,
+                                     struct dualpi2_sched_data *q,
+                                     int *credit_change,
+                                     u64 now)
+{


>
> This could probably be made more clear in the commit message.
>
> > Many of us are using fq_codel or fq, there is no way we can switch to
> > dualpi2 just to experiment things.
>
> Yes, sites that are using fq_codel or fq do not need to switch to dualpi2.
>
> AFAIK the idea with dualpi2 is to offer a new qdisc for folks
> developing hardware for the last mile of the Internet where you want
> low latency via L4S, and want approximate per-flow fairness between
> L4S and non-L4S traffic, even in the presence of VPN-encrypted traffic
> (where flow identifiers are not available for fq_codel or fq fair
> queuing).
>
> Sites that don't have VPN traffic or don't care about the VPN issue
> can use fq or fq_codel with the ce_threshold parameter to allow low
> latency via L4S while achieving approximate per-flow fairness.
>
> best regards,
> neal