Message-ID:
<AS8PR07MB7973072CDA18FB16109772C4A39BA@AS8PR07MB7973.eurprd07.prod.outlook.com>
Date: Sat, 24 May 2025 00:50:51 +0000
From: "Chia-Yu Chang (Nokia)" <chia-yu.chang@...ia-bell-labs.com>
To: Paolo Abeni <pabeni@...hat.com>, "horms@...nel.org" <horms@...nel.org>,
"donald.hunter@...il.com" <donald.hunter@...il.com>, "xandfury@...il.com"
<xandfury@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"dave.taht@...il.com" <dave.taht@...il.com>, "jhs@...atatu.com"
<jhs@...atatu.com>, "kuba@...nel.org" <kuba@...nel.org>,
"stephen@...workplumber.org" <stephen@...workplumber.org>,
"xiyou.wangcong@...il.com" <xiyou.wangcong@...il.com>, "jiri@...nulli.us"
<jiri@...nulli.us>, "davem@...emloft.net" <davem@...emloft.net>,
"edumazet@...gle.com" <edumazet@...gle.com>, "andrew+netdev@...n.ch"
<andrew+netdev@...n.ch>, "ast@...erby.net" <ast@...erby.net>,
"liuhangbin@...il.com" <liuhangbin@...il.com>, "shuah@...nel.org"
<shuah@...nel.org>, "linux-kselftest@...r.kernel.org"
<linux-kselftest@...r.kernel.org>, "ij@...nel.org" <ij@...nel.org>,
"ncardwell@...gle.com" <ncardwell@...gle.com>, "Koen De Schepper (Nokia)"
<koen.de_schepper@...ia-bell-labs.com>, g.white <g.white@...lelabs.com>,
"ingemar.s.johansson@...csson.com" <ingemar.s.johansson@...csson.com>,
"mirja.kuehlewind@...csson.com" <mirja.kuehlewind@...csson.com>,
"cheshire@...le.com" <cheshire@...le.com>, "rs.ietf@....at" <rs.ietf@....at>,
"Jason_Livingood@...cast.com" <Jason_Livingood@...cast.com>, vidhi_goel
<vidhi_goel@...le.com>
CC: Olga Albisser <olga@...isser.org>, "Olivier Tilmans (Nokia)"
<olivier.tilmans@...ia.com>, Henrik Steen <henrist@...rist.net>, Bob Briscoe
<research@...briscoe.net>
Subject: RE: [PATCH v16 net-next 3/5] sched: Add enqueue/dequeue of dualpi2
qdisc
> -----Original Message-----
> From: Paolo Abeni <pabeni@...hat.com>
> Sent: Tuesday, May 20, 2025 3:34 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@...ia-bell-labs.com>; horms@...nel.org; donald.hunter@...il.com; xandfury@...il.com; netdev@...r.kernel.org; dave.taht@...il.com; jhs@...atatu.com; kuba@...nel.org; stephen@...workplumber.org; xiyou.wangcong@...il.com; jiri@...nulli.us; davem@...emloft.net; edumazet@...gle.com; andrew+netdev@...n.ch; ast@...erby.net; liuhangbin@...il.com; shuah@...nel.org; linux-kselftest@...r.kernel.org; ij@...nel.org; ncardwell@...gle.com; Koen De Schepper (Nokia) <koen.de_schepper@...ia-bell-labs.com>; g.white <g.white@...lelabs.com>; ingemar.s.johansson@...csson.com; mirja.kuehlewind@...csson.com; cheshire@...le.com; rs.ietf@....at; Jason_Livingood@...cast.com; vidhi_goel <vidhi_goel@...le.com>
> Cc: Olga Albisser <olga@...isser.org>; Olivier Tilmans (Nokia) <olivier.tilmans@...ia.com>; Henrik Steen <henrist@...rist.net>; Bob Briscoe <research@...briscoe.net>
> Subject: Re: [PATCH v16 net-next 3/5] sched: Add enqueue/dequeue of dualpi2 qdisc
>
> On 5/16/25 2:01 AM, chia-yu.chang@...ia-bell-labs.com wrote:
> > From: Koen De Schepper <koen.de_schepper@...ia-bell-labs.com>
> >
> > DualPI2 provides L4S-type low latency & loss to traffic that uses a
> > scalable congestion controller (e.g. TCP-Prague, DCTCP) without
> > degrading the performance of 'classic' traffic (e.g. Reno, Cubic
> > etc.). It is to be the reference implementation of IETF RFC9332 DualQ
> > Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332).
> >
> > Note that creating two independent queues cannot meet the goal of
> > DualPI2 mentioned in RFC9332: "...to preserve fairness between
> > ECN-capable and non-ECN-capable traffic." Further, it could even lead
> > to starvation of Classic traffic, which is also inconsistent with the
> > requirements in RFC9332: "...although priority MUST be bounded in
> > order not to starve Classic traffic." DualPI2 is designed to maintain
> > approximate per-flow fairness on L-queue and C-queue by forming a
> > single qdisc using the coupling factor and scheduler between two
> > queues.
> >
> > The qdisc provides two queues called low latency and classic. It
> > classifies packets based on the ECN field in the IP headers. By
> > default it directs non-ECN and ECT(0) into the classic queue and
> > ECT(1) and CE into the low latency queue, as per the IETF spec.
> >
> > Each queue runs its own AQM:
> > * The classic AQM is called PI2, which is similar to the PIE AQM but
> > more responsive and simpler. Classic traffic requires a decent
> > target queue (default 15ms for Internet deployment) to fully
> > utilize the link and to avoid high drop rates.
> > * The low latency AQM is, by default, a very shallow ECN marking
> > threshold (1ms) similar to that used for DCTCP.
> >
> > The DualQ isolates the low queuing delay of the Low Latency queue from
> > the larger delay of the 'Classic' queue. However, from a bandwidth
> > perspective, flows in either queue will share out the link capacity as
> > if there was just a single queue. This bandwidth pooling effect is
> > achieved by coupling together the drop and ECN-marking probabilities
> > of the two AQMs.
> >
> > The PI2 AQM has two main parameters in addition to its target delay.
> > The integral gain factor alpha is used to slowly correct any
> > persistent standing queue error from the target delay, while the
> > proportional gain factor beta is used to quickly compensate for queue
> > changes (growth or shrinkage). Alpha and beta are either given as
> > parameters, or they can be calculated by tc from alternative typical
> > and maximum RTT parameters.
> >
> > Internally, the output of a linear Proportional Integral (PI)
> > controller is used for both queues. This output is squared to
> > calculate the drop or ECN-marking probability of the classic queue.
> > This counterbalances the square-root rate equation of Reno/Cubic,
> > which is the trick that balances flow rates across the queues. For the
> > ECN-marking probability of the low latency queue, the output of the
> > base AQM is multiplied by a coupling factor. This determines the
> > balance between the flow rates in each queue. The default setting
> > makes the flow rates roughly equal, which should be generally
> > applicable.
> >
> > If DUALPI2 AQM has detected overload (due to excessive non-responsive
> > traffic in either queue), it will switch to signaling congestion
> > solely using drop, irrespective of the ECN field. Alternatively, it
> > can be configured to limit the drop probability and let the queue grow
> > and eventually overflow (like tail-drop).
> >
> > GSO splitting in DUALPI2 is configurable from userspace, while the
> > default behavior is to split GSO. When running DUALPI2 unshaped at
> > 10GigE with a 4-download-stream test, splitting GSO halves the latency
> > with no loss in throughput:
> >
> > Summary of tcp_4down run 'no_split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 0.53 0.30 ms 350
> > TCP download avg : 2326.86 N/A Mbits/s 350
> > TCP download sum : 9307.42 N/A Mbits/s 350
> > TCP download::1 : 2672.99 2568.73 Mbits/s 350
> > TCP download::2 : 2586.96 2570.51 Mbits/s 350
> > TCP download::3 : 1786.26 1798.82 Mbits/s 350
> > TCP download::4 : 2261.21 2309.49 Mbits/s 350
> >
> > Summary of tcp_4down run 'split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 0.22 0.23 ms 350
> > TCP download avg : 2335.02 N/A Mbits/s 350
> > TCP download sum : 9340.09 N/A Mbits/s 350
> > TCP download::1 : 2335.30 2334.22 Mbits/s 350
> > TCP download::2 : 2334.72 2334.20 Mbits/s 350
> > TCP download::3 : 2335.28 2334.58 Mbits/s 350
> > TCP download::4 : 2334.79 2334.39 Mbits/s 350
> >
> > A similar result is observed when running DUALPI2 unshaped at 1GigE
> > with a 1-download-stream test:
> >
> > Summary of tcp_1down run 'no_split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 1.13 1.25 ms 350
> > TCP download : 941.41 941.46 Mbits/s 350
> >
> > Summary of tcp_1down run 'split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 0.51 0.55 ms 350
> > TCP download : 941.41 941.45 Mbits/s 350
> >
> > Additional details can be found in RFC 9332:
> > https://datatracker.ietf.org/doc/html/rfc9332
> >
> > Signed-off-by: Koen De Schepper <koen.de_schepper@...ia-bell-labs.com>
> > Co-developed-by: Olga Albisser <olga@...isser.org>
> > Signed-off-by: Olga Albisser <olga@...isser.org>
> > Co-developed-by: Olivier Tilmans <olivier.tilmans@...ia.com>
> > Signed-off-by: Olivier Tilmans <olivier.tilmans@...ia.com>
> > Co-developed-by: Henrik Steen <henrist@...rist.net>
> > Signed-off-by: Henrik Steen <henrist@...rist.net>
> > Signed-off-by: Bob Briscoe <research@...briscoe.net>
> > Signed-off-by: Ilpo Järvinen <ij@...nel.org>
> > Co-developed-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
> > Acked-by: Dave Taht <dave.taht@...il.com>
> > ---
> > include/net/dropreason-core.h | 6 +
> > net/sched/Kconfig | 12 +
> > net/sched/Makefile | 1 +
> > net/sched/sch_dualpi2.c | 449 ++++++++++++++++++++++++++++++++++
> > 4 files changed, 468 insertions(+)
> >
> > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > index bea77934a235..faae9f416e54 100644
> > --- a/include/net/dropreason-core.h
> > +++ b/include/net/dropreason-core.h
> > @@ -120,6 +120,7 @@
> > FN(ARP_PVLAN_DISABLE) \
> > FN(MAC_IEEE_MAC_CONTROL) \
> > FN(BRIDGE_INGRESS_STP_STATE) \
> > + FN(DUALPI2_STEP_DROP) \
> > FNe(MAX)
> >
> > /**
> > @@ -570,6 +571,11 @@ enum skb_drop_reason {
> > * ingress bridge port does not allow frames to be forwarded.
> > */
> > SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE,
> > + /**
> > + * @SKB_DROP_REASON_DUALPI2_STEP_DROP: dropped by the step drop
> > + * threshold of DualPI2 qdisc.
> > + */
> > + SKB_DROP_REASON_DUALPI2_STEP_DROP,
> > /**
> > * @SKB_DROP_REASON_MAX: the maximum of core drop reasons, which
> > * shouldn't be used as a real 'reason' - only for tracing code gen
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 9f0b3f943fca..dda66a3590d8 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -415,6 +415,18 @@ config NET_SCH_BPF
> >
> > If unsure, say N.
> >
> > +config NET_SCH_DUALPI2
> > + tristate "Dual Queue PI Square (DUALPI2) scheduler"
> > + help
> > + Say Y here if you want to use the Dual Queue Proportional Integral
> > + Controller Improved with a Square scheduling algorithm.
> > +	  For more information, please see
> > +	  https://tools.ietf.org/html/rfc9332
> > +
> > + To compile this driver as a module, choose M here: the module
> > + will be called sch_dualpi2.
> > +
> > + If unsure, say N.
> > +
> > menuconfig NET_SCH_DEFAULT
> > bool "Allow override default queue discipline"
> > help
> > diff --git a/net/sched/Makefile b/net/sched/Makefile
> > index 904d784902d1..5078ea84e6ad 100644
> > --- a/net/sched/Makefile
> > +++ b/net/sched/Makefile
> > @@ -63,6 +63,7 @@ obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
> > obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
> > obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
> > obj-$(CONFIG_NET_SCH_BPF) += bpf_qdisc.o
> > +obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o
> >
> > obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
> > obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
> > diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> > index 97986c754e47..7ecd7502332c 100644
> > --- a/net/sched/sch_dualpi2.c
> > +++ b/net/sched/sch_dualpi2.c
> > @@ -113,8 +113,44 @@ struct dualpi2_sched_data {
> > u32 step_marks; /* ECN mark pkt counter due to step AQM */
> > u32 memory_used; /* Memory used of both queues */
> > u32 max_memory_used;/* Maximum used memory */
> > +
> > + /* Deferred drop statistics */
> > + u32 deferred_drops_cnt; /* Packets dropped */
> > + u32 deferred_drops_len; /* Bytes dropped */
> > +};
> > +
> > +struct dualpi2_skb_cb {
> > + u64 ts; /* Timestamp at enqueue */
> > + u8 apply_step:1, /* Can we apply the step threshold */
> > + classified:2, /* Packet classification results */
> > + ect:2; /* Packet ECT codepoint */
> > +};
> > +
> > +enum dualpi2_classification_results {
> > + DUALPI2_C_CLASSIC = 0, /* C-queue */
> > + DUALPI2_C_L4S = 1, /* L-queue (scale mark/classic drop) */
> > + DUALPI2_C_LLLL = 2, /* L-queue (no drops/marks) */
> > + __DUALPI2_C_MAX /* Keep last*/
> > };
> >
> > +static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb) {
> > + qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
> > + return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data; }
> > +
> > +static u64 dualpi2_sojourn_time(struct sk_buff *skb, u64 reference) {
> > + return reference - dualpi2_skb_cb(skb)->ts; }
> > +
> > +static u64 head_enqueue_time(struct Qdisc *q) {
> > + struct sk_buff *skb = qdisc_peek_head(q);
> > +
> > + return skb ? dualpi2_skb_cb(skb)->ts : 0; }
> > +
> >  static u32 dualpi2_scale_alpha_beta(u32 param)
> >  {
> >  	u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
> > @@ -136,6 +172,25 @@ static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q)
> >  	return ktime_add_ns(ktime_get_ns(), q->pi2_tupdate);
> >  }
> >
> > +static bool skb_is_l4s(struct sk_buff *skb) {
> > + return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S; }
> > +
> > +static bool skb_in_l_queue(struct sk_buff *skb) {
> > + return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC; }
> > +
> > +static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb)
> > +{
> > + if (INET_ECN_set_ce(skb)) {
> > + q->ecn_mark++;
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
> > {
> >  	q->c_protection_credit = q->c_protection_init;
> > @@ -155,6 +210,398 @@ static void dualpi2_calculate_c_protection(struct Qdisc *sch,
> > dualpi2_reset_c_protection(q);
> > }
> >
> > +static bool dualpi2_roll(u32 prob)
> > +{
> > + return get_random_u32() <= prob; }
> > +
> > +/* Packets in the C-queue are subject to a marking probability pC, which is
> > + * the square of the internal PI probability (i.e., have an overall lower
> > + * mark/drop probability). If the qdisc is overloaded, ignore ECT values and
> > + * only drop.
> > + *
> > + * Note that this marking scheme is also applied to L4S packets during
> > + * overload.
> > + * Return true if packet dropping is required in C queue
> > + */
> > +static bool dualpi2_classic_marking(struct dualpi2_sched_data *q,
> > +				    struct sk_buff *skb, u32 prob,
> > +				    bool overload)
> > +{
> > + if (dualpi2_roll(prob) && dualpi2_roll(prob)) {
> > + if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> > + return true;
> > + dualpi2_mark(q, skb);
> > + }
> > + return false;
> > +}
> > +
> > +/* Packets in the L-queue are subject to a marking probability pL given by
> > + * the internal PI probability scaled by the coupling factor.
> > + *
> > + * On overload (i.e., @local_l_prob is >= 100%):
> > + * - if the qdisc is configured to trade losses to preserve latency (i.e.,
> > + * @q->drop_overload), apply classic drops first before marking.
> > + * - otherwise, preserve the "no loss" property of ECN at the cost of queueing
> > + * delay, eventually resulting in taildrop behavior once sch->limit is
> > + * reached.
> > + * Return true if packet dropping is required in L queue
> > + */
> > +static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q,
> > +				      struct sk_buff *skb,
> > +				      u64 local_l_prob, u32 prob,
> > +				      bool overload)
> > +{
> > + if (overload) {
> > + /* Apply classic drop */
> > + if (!q->drop_overload ||
> > + !(dualpi2_roll(prob) && dualpi2_roll(prob)))
> > + goto mark;
> > + return true;
> > + }
> > +
> > + /* We can safely cut the upper 32b as overload==false */
> > + if (dualpi2_roll(local_l_prob)) {
> > + /* Non-ECT packets could have classified as L4S by filters. */
> > + if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> > + return true;
> > +mark:
> > + dualpi2_mark(q, skb);
> > + }
> > + return false;
> > +}
> > +
> > +/* Decide whether a given packet must be dropped (or marked if ECT),
> > + * according to the PI2 probability.
> > + *
> > + * Never mark/drop if we have a standing queue of less than 2 MTUs.
> > + */
> > +static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
> > + struct sk_buff *skb) {
> > + u64 local_l_prob;
> > + bool overload;
> > + u32 prob;
> > +
> > + if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
> > + return false;
> > +
> > + prob = READ_ONCE(q->pi2_prob);
> > + local_l_prob = (u64)prob * q->coupling_factor;
> > + overload = local_l_prob > MAX_PROB;
> > +
> > + switch (dualpi2_skb_cb(skb)->classified) {
> > + case DUALPI2_C_CLASSIC:
> > + return dualpi2_classic_marking(q, skb, prob, overload);
> > + case DUALPI2_C_L4S:
> > + return dualpi2_scalable_marking(q, skb, local_l_prob, prob,
> > + overload);
> > + default: /* DUALPI2_C_LLLL */
> > + return false;
> > + }
> > +}
> > +
> > +static void dualpi2_read_ect(struct sk_buff *skb) {
> > + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> > + int wlen = skb_network_offset(skb);
> > +
> > + switch (skb_protocol(skb, true)) {
> > + case htons(ETH_P_IP):
> > + wlen += sizeof(struct iphdr);
> > + if (!pskb_may_pull(skb, wlen) ||
> > + skb_try_make_writable(skb, wlen))
> > + goto not_ecn;
> > +
> > + cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
> > + break;
> > + case htons(ETH_P_IPV6):
> > + wlen += sizeof(struct ipv6hdr);
> > + if (!pskb_may_pull(skb, wlen) ||
> > + skb_try_make_writable(skb, wlen))
> > + goto not_ecn;
> > +
> > + cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
> > + break;
> > + default:
> > + goto not_ecn;
> > + }
> > + return;
> > +
> > +not_ecn:
> > + /* Non pullable/writable packets can only be dropped hence are
> > + * classified as not ECT.
> > + */
> > + cb->ect = INET_ECN_NOT_ECT;
> > +}
> > +
> > +static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
> > + struct sk_buff *skb) {
> > + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> > + struct tcf_result res;
> > + struct tcf_proto *fl;
> > + int result;
> > +
> > + dualpi2_read_ect(skb);
> > + if (cb->ect & q->ecn_mask) {
> > + cb->classified = DUALPI2_C_L4S;
> > + return NET_XMIT_SUCCESS;
> > + }
> > +
> > + if (TC_H_MAJ(skb->priority) == q->sch->handle &&
> > + TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) {
> > + cb->classified = TC_H_MIN(skb->priority);
> > + return NET_XMIT_SUCCESS;
> > + }
> > +
> > + fl = rcu_dereference_bh(q->tcf_filters);
> > + if (!fl) {
> > + cb->classified = DUALPI2_C_CLASSIC;
> > + return NET_XMIT_SUCCESS;
> > + }
> > +
> > + result = tcf_classify(skb, NULL, fl, &res, false);
> > + if (result >= 0) {
> > +#ifdef CONFIG_NET_CLS_ACT
> > + switch (result) {
> > + case TC_ACT_STOLEN:
> > + case TC_ACT_QUEUED:
> > + case TC_ACT_TRAP:
> > + return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> > + case TC_ACT_SHOT:
> > + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > + }
> > +#endif
> > + cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ?
> > + TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC;
> > + }
> > + return NET_XMIT_SUCCESS;
> > +}
> > +
> > +static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch,
> > + struct sk_buff **to_free) {
> > + struct dualpi2_sched_data *q = qdisc_priv(sch);
> > + struct dualpi2_skb_cb *cb;
> > +
> > + if (unlikely(qdisc_qlen(sch) >= sch->limit) ||
> > + unlikely((u64)q->memory_used + skb->truesize > q->memory_limit)) {
> > + qdisc_qstats_overlimit(sch);
> > + if (skb_in_l_queue(skb))
> > + qdisc_qstats_overlimit(q->l_queue);
> > + return qdisc_drop_reason(skb, sch, to_free,
> > + SKB_DROP_REASON_QDISC_OVERLIMIT);
> > + }
> > +
> > + if (q->drop_early && must_drop(sch, q, skb)) {
> > + qdisc_drop_reason(skb, sch, to_free,
> > + SKB_DROP_REASON_QDISC_OVERLIMIT);
>
> I think it's better to use a different drop reason here, or it will be hard to distinguish between packets dropped here and packets dropped above.
> Possibly SKB_DROP_REASON_QDISC_CONGESTED.
>
Hi Paolo,
Thanks for the feedback.
This will be changed to SKB_DROP_REASON_QDISC_CONGESTED as suggested.
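For reference, the early-drop branch would then look roughly as follows (a
sketch only, assuming the surrounding code stays as in v16):

	if (q->drop_early && must_drop(sch, q, skb)) {
		/* Congestion-based early drop, kept distinct from the
		 * limit/memory drops above, which still use
		 * SKB_DROP_REASON_QDISC_OVERLIMIT.
		 */
		qdisc_drop_reason(skb, sch, to_free,
				  SKB_DROP_REASON_QDISC_CONGESTED);
		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
	}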
> > + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > + }
> > +
> > + cb = dualpi2_skb_cb(skb);
> > + cb->ts = ktime_get_ns();
> > + q->memory_used += skb->truesize;
> > + if (q->memory_used > q->max_memory_used)
> > + q->max_memory_used = q->memory_used;
> > +
> > + if (qdisc_qlen(sch) > q->maxq)
> > + q->maxq = qdisc_qlen(sch);
> > +
> > + if (skb_in_l_queue(skb)) {
> > + /* Only apply the step if a queue is building up */
> > + dualpi2_skb_cb(skb)->apply_step = skb_is_l4s(skb) &&
> > + qdisc_qlen(q->l_queue) >= q->min_qlen_step;
> > + /* Keep the overall qdisc stats consistent */
> > + ++sch->q.qlen;
> > + qdisc_qstats_backlog_inc(sch, skb);
> > + ++q->packets_in_l;
> > + if (!q->l_head_ts)
> > + q->l_head_ts = cb->ts;
>
> This chunk is very hard to read. Please insert a blank line before the 2nd comment and consider introducing a helper to compute the 'apply_step' value.
A new helper function will be added to represent "qdisc_qlen(q->l_queue) >= q->min_qlen_step".
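For example (a sketch only; the helper name below is a placeholder to be
settled in v17):

	static bool dualpi2_apply_step_thresh(struct dualpi2_sched_data *q)
	{
		return qdisc_qlen(q->l_queue) >= q->min_qlen_step;
	}

and in dualpi2_enqueue_skb():

		/* Only apply the step if a queue is building up */
		dualpi2_skb_cb(skb)->apply_step = skb_is_l4s(skb) &&
						  dualpi2_apply_step_thresh(q);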
>
> > + return qdisc_enqueue_tail(skb, q->l_queue);
> > + }
> > + ++q->packets_in_c;
> > + if (!q->c_head_ts)
> > + q->c_head_ts = cb->ts;
> > + return qdisc_enqueue_tail(skb, sch); }
> > +
> > +/* By default, dualpi2 will split GSO skbs into independent skbs and
> > + * enqueue each of those individually. This yields the following benefits,
> > + * at the expense of CPU usage:
> > + * - Finer-grained AQM actions as the sub-packets of a burst no longer share the
> > + * same fate (e.g., the random mark/drop probability is applied individually)
> > + * - Improved precision of the starvation protection/WRR scheduler at dequeue,
> > + * as the size of the dequeued packets will be smaller.
> > + */
> > +static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > + struct sk_buff **to_free) {
> > + struct dualpi2_sched_data *q = qdisc_priv(sch);
> > + int err;
> > +
> > + err = dualpi2_skb_classify(q, skb);
> > + if (err != NET_XMIT_SUCCESS) {
> > + if (err & __NET_XMIT_BYPASS)
> > + qdisc_qstats_drop(sch);
> > + __qdisc_drop(skb, to_free);
> > + return err;
> > + }
> > +
> > + if (q->split_gso && skb_is_gso(skb)) {
> > + netdev_features_t features;
> > + struct sk_buff *nskb, *next;
> > + int cnt, byte_len, orig_len;
> > + int err;
> > +
> > + features = netif_skb_features(skb);
> > + nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> > + if (IS_ERR_OR_NULL(nskb))
> > + return qdisc_drop(skb, sch, to_free);
> > +
> > + cnt = 1;
> > + byte_len = 0;
> > + orig_len = qdisc_pkt_len(skb);
> > + skb_list_walk_safe(nskb, nskb, next) {
> > + skb_mark_not_on_list(nskb);
> > + qdisc_skb_cb(nskb)->pkt_len = nskb->len;
> > + dualpi2_skb_cb(nskb)->classified =
> > + dualpi2_skb_cb(skb)->classified;
>
> Possibly just:
> *qdisc_skb_cb(nskb) = *dualpi2_skb_cb(skb);
However, I would prefer to keep the original code, because it clearly shows
that the classified and ect values come from dualpi2_skb_cb(skb), the pkt_len
value comes from nskb->len, and the ts value is set later in
dualpi2_enqueue_skb().
I will add comments for that; I hope this is clear.
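Roughly as below (a sketch; the exact comment wording will be finalized in
v17):

		/* classified and ect are inherited from the original skb,
		 * pkt_len is taken from the segment itself, and ts is only
		 * stamped later in dualpi2_enqueue_skb().
		 */
		qdisc_skb_cb(nskb)->pkt_len = nskb->len;
		dualpi2_skb_cb(nskb)->classified =
			dualpi2_skb_cb(skb)->classified;
		dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;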
>
> > + dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
> > + err = dualpi2_enqueue_skb(nskb, sch, to_free);
> > + if (err == NET_XMIT_SUCCESS) {
> > + /* Compute the backlog adjustment that needs
> > + * to be propagated in the qdisc tree to reflect
> > + * all new skbs successfully enqueued.
> > + */
> > + ++cnt;
> > + byte_len += nskb->len;
> > + }
> > + }
> > + if (err == NET_XMIT_SUCCESS) {
> > + /* The caller will add the original skb stats to its
> > + * backlog, compensate this.
> > + */
> > + --cnt;
> > + byte_len -= orig_len;
> > + }
> > + qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
> > + consume_skb(skb);
> > + return err;
>
> Using the return value from the last enqueue operation is IMHO inaccurate and confusing. Instead you could return NET_XMIT_SUCCESS if at least one skb is enqueued successfully, or possibly always return NET_XMIT_SUCCESS unconditionally (that would simplify the code a bit).
Indeed, I will replace "if (err == NET_XMIT_SUCCESS)" with "if (cnt > 1)", which corresponds to the former case you mentioned.
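I.e., something like this (a sketch):

		if (cnt > 1) {
			/* At least one segment was enqueued successfully;
			 * the caller will add the original skb stats to its
			 * backlog, so compensate for that here.
			 */
			--cnt;
			byte_len -= orig_len;
		}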
BRs,
Chia-Yu
>
> > + }
> > + return dualpi2_enqueue_skb(skb, sch, to_free); }
> > +
> > +/* Select the queue from which the next packet can be dequeued, ensuring
> > + * that neither queue can starve the other with a WRR scheduler.
> > + *
> > + * The sign of the WRR credit determines the next queue, while the size of
> > + * the dequeued packet determines the magnitude of the WRR credit change.
> > + * If either queue is empty, the WRR credit is kept unchanged.
> > + *
> > + * As the dequeued packet can be dropped later, the caller has to perform
> > + * the qdisc_bstats_update() calls.
> > + */
> > +static struct sk_buff *dequeue_packet(struct Qdisc *sch,
> > + struct dualpi2_sched_data *q,
> > + int *credit_change,
> > + u64 now) {
> > + struct sk_buff *skb = NULL;
> > + int c_len;
> > +
> > + *credit_change = 0;
> > + c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> > + if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection_credit <= 0)) {
> > + skb = __qdisc_dequeue_head(&q->l_queue->q);
> > + WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue));
> > + if (c_len)
> > + *credit_change = q->c_protection_wc;
> > + qdisc_qstats_backlog_dec(q->l_queue, skb);
>
> Please add an empty line here.
>
> > + /* Keep the global queue size consistent */
> > + --sch->q.qlen;
> > + q->memory_used -= skb->truesize;
> > + } else if (c_len) {
> > + skb = __qdisc_dequeue_head(&sch->q);
> > + WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch));
> > + if (qdisc_qlen(q->l_queue))
> > + *credit_change = ~((s32)q->c_protection_wl) + 1;
> > + q->memory_used -= skb->truesize;
> > + } else {
> > + dualpi2_reset_c_protection(q);
> > + return NULL;
> > + }
> > + *credit_change *= qdisc_pkt_len(skb);
> > + qdisc_qstats_backlog_dec(sch, skb);
> > + return skb;
> > +}
> > +
> > +static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb,
> > + u64 now)
> > +{
> > + u64 qdelay = 0;
> > +
> > + if (q->step_in_packets)
> > + qdelay = qdisc_qlen(q->l_queue);
> > + else
> > + qdelay = dualpi2_sojourn_time(skb, now);
> > +
> > + if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step_thresh) {
> > + if (!dualpi2_skb_cb(skb)->ect)
> > + /* Drop this non-ECT packet */
> > + return 1;
>
> Please add brackets and an empty line:
>
> if (!dualpi2_skb_cb(skb)->ect) {
> /* Drop this non-ECT packet */
> return 1;
> }
>
> if (dualpi2_mark(q, skb))
>
>
> /P