Message-ID:
<AS8PR07MB7973072CDA18FB16109772C4A39BA@AS8PR07MB7973.eurprd07.prod.outlook.com>
Date: Sat, 24 May 2025 00:50:51 +0000
From: "Chia-Yu Chang (Nokia)" <chia-yu.chang@...ia-bell-labs.com>
To: Paolo Abeni <pabeni@...hat.com>, "horms@...nel.org" <horms@...nel.org>,
"donald.hunter@...il.com" <donald.hunter@...il.com>, "xandfury@...il.com"
<xandfury@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"dave.taht@...il.com" <dave.taht@...il.com>, "jhs@...atatu.com"
<jhs@...atatu.com>, "kuba@...nel.org" <kuba@...nel.org>,
"stephen@...workplumber.org" <stephen@...workplumber.org>,
"xiyou.wangcong@...il.com" <xiyou.wangcong@...il.com>, "jiri@...nulli.us"
<jiri@...nulli.us>, "davem@...emloft.net" <davem@...emloft.net>,
"edumazet@...gle.com" <edumazet@...gle.com>, "andrew+netdev@...n.ch"
<andrew+netdev@...n.ch>, "ast@...erby.net" <ast@...erby.net>,
"liuhangbin@...il.com" <liuhangbin@...il.com>, "shuah@...nel.org"
<shuah@...nel.org>, "linux-kselftest@...r.kernel.org"
<linux-kselftest@...r.kernel.org>, "ij@...nel.org" <ij@...nel.org>,
"ncardwell@...gle.com" <ncardwell@...gle.com>, "Koen De Schepper (Nokia)"
<koen.de_schepper@...ia-bell-labs.com>, g.white <g.white@...lelabs.com>,
"ingemar.s.johansson@...csson.com" <ingemar.s.johansson@...csson.com>,
"mirja.kuehlewind@...csson.com" <mirja.kuehlewind@...csson.com>,
"cheshire@...le.com" <cheshire@...le.com>, "rs.ietf@....at" <rs.ietf@....at>,
"Jason_Livingood@...cast.com" <Jason_Livingood@...cast.com>, vidhi_goel
<vidhi_goel@...le.com>
CC: Olga Albisser <olga@...isser.org>, "Olivier Tilmans (Nokia)"
<olivier.tilmans@...ia.com>, Henrik Steen <henrist@...rist.net>, Bob Briscoe
<research@...briscoe.net>
Subject: RE: [PATCH v16 net-next 3/5] sched: Add enqueue/dequeue of dualpi2
qdisc
> -----Original Message-----
> From: Paolo Abeni <pabeni@...hat.com>
> Sent: Tuesday, May 20, 2025 3:34 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@...ia-bell-labs.com>; horms@...nel.org; donald.hunter@...il.com; xandfury@...il.com; netdev@...r.kernel.org; dave.taht@...il.com; jhs@...atatu.com; kuba@...nel.org; stephen@...workplumber.org; xiyou.wangcong@...il.com; jiri@...nulli.us; davem@...emloft.net; edumazet@...gle.com; andrew+netdev@...n.ch; ast@...erby.net; liuhangbin@...il.com; shuah@...nel.org; linux-kselftest@...r.kernel.org; ij@...nel.org; ncardwell@...gle.com; Koen De Schepper (Nokia) <koen.de_schepper@...ia-bell-labs.com>; g.white <g.white@...lelabs.com>; ingemar.s.johansson@...csson.com; mirja.kuehlewind@...csson.com; cheshire@...le.com; rs.ietf@....at; Jason_Livingood@...cast.com; vidhi_goel <vidhi_goel@...le.com>
> Cc: Olga Albisser <olga@...isser.org>; Olivier Tilmans (Nokia) <olivier.tilmans@...ia.com>; Henrik Steen <henrist@...rist.net>; Bob Briscoe <research@...briscoe.net>
> Subject: Re: [PATCH v16 net-next 3/5] sched: Add enqueue/dequeue of dualpi2 qdisc
>
> On 5/16/25 2:01 AM, chia-yu.chang@...ia-bell-labs.com wrote:
> > From: Koen De Schepper <koen.de_schepper@...ia-bell-labs.com>
> >
> > DualPI2 provides L4S-type low latency & loss to traffic that uses a
> > scalable congestion controller (e.g. TCP-Prague, DCTCP) without
> > degrading the performance of 'classic' traffic (e.g. Reno, Cubic
> > etc.). It is to be the reference implementation of IETF RFC9332 DualQ
> > Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332).
> >
> > Note that creating two independent queues cannot meet the goal of
> > DualPI2 mentioned in RFC9332: "...to preserve fairness between
> > ECN-capable and non-ECN-capable traffic." Further, it could even lead
> > to starvation of Classic traffic, which is also inconsistent with the
> > requirements in RFC9332: "...although priority MUST be bounded in
> > order not to starve Classic traffic." DualPI2 is designed to maintain
> > approximate per-flow fairness on L-queue and C-queue by forming a
> > single qdisc using the coupling factor and scheduler between two
> > queues.
> >
> > The qdisc provides two queues called low latency and classic. It
> > classifies packets based on the ECN field in the IP headers. By
> > default it directs non-ECN and ECT(0) into the classic queue and
> > ECT(1) and CE into the low latency queue, as per the IETF spec.
> >
> > Each queue runs its own AQM:
> > * The classic AQM is called PI2, which is similar to the PIE AQM but
> > more responsive and simpler. Classic traffic requires a decent
> > target queue (default 15ms for Internet deployment) to fully
> > utilize the link and to avoid high drop rates.
> > * The low latency AQM is, by default, a very shallow ECN marking
> > threshold (1ms) similar to that used for DCTCP.
> >
> > The DualQ isolates the low queuing delay of the Low Latency queue from
> > the larger delay of the 'Classic' queue. However, from a bandwidth
> > perspective, flows in either queue will share out the link capacity as
> > if there was just a single queue. This bandwidth pooling effect is
> > achieved by coupling together the drop and ECN-marking probabilities
> > of the two AQMs.
> >
> > The PI2 AQM has two main parameters in addition to its target delay.
> > The integral gain factor alpha is used to slowly correct any
> > persistent standing queue error from the target delay, while the
> > proportional gain factor beta is used to quickly compensate for queue
> > changes (growth or shrinkage). Alpha and beta are either given as
> > parameters, or they can be calculated by tc from alternative typical
> > and maximum RTT parameters.
> >
> > Internally, the output of a linear Proportional Integral (PI)
> > controller is used for both queues. This output is squared to
> > calculate the drop or ECN-marking probability of the classic queue.
> > This counterbalances the square-root rate equation of Reno/Cubic,
> > which is the trick that balances flow rates across the queues. For the
> > ECN-marking probability of the low latency queue, the output of the
> > base AQM is multiplied by a coupling factor. This determines the
> > balance between the flow rates in each queue. The default setting
> > makes the flow rates roughly equal, which should be generally
> > applicable.
> >
> > If DUALPI2 AQM has detected overload (due to excessive non-responsive
> > traffic in either queue), it will switch to signaling congestion
> > solely using drop, irrespective of the ECN field. Alternatively, it
> > can be configured to limit the drop probability and let the queue grow
> > and eventually overflow (like tail-drop).
> >
> > GSO splitting in DUALPI2 is configurable from userspace, while the
> > default behavior is to split GSO. When running DUALPI2 unshaped at
> > 10GigE with a 4-download-stream test, splitting GSO halves the latency
> > with no loss in throughput:
> >
> > Summary of tcp_4down run 'no_split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 0.53 0.30 ms 350
> > TCP download avg : 2326.86 N/A Mbits/s 350
> > TCP download sum : 9307.42 N/A Mbits/s 350
> > TCP download::1 : 2672.99 2568.73 Mbits/s 350
> > TCP download::2 : 2586.96 2570.51 Mbits/s 350
> > TCP download::3 : 1786.26 1798.82 Mbits/s 350
> > TCP download::4 : 2261.21 2309.49 Mbits/s 350
> >
> > Summary of tcp_4down run 'split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 0.22 0.23 ms 350
> > TCP download avg : 2335.02 N/A Mbits/s 350
> > TCP download sum : 9340.09 N/A Mbits/s 350
> > TCP download::1 : 2335.30 2334.22 Mbits/s 350
> > TCP download::2 : 2334.72 2334.20 Mbits/s 350
> > TCP download::3 : 2335.28 2334.58 Mbits/s 350
> > TCP download::4 : 2334.79 2334.39 Mbits/s 350
> >
> > A similar result is observed when running DUALPI2 unshaped at 1GigE
> > with a 1-download-stream test:
> >
> > Summary of tcp_1down run 'no_split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 1.13 1.25 ms 350
> > TCP download : 941.41 941.46 Mbits/s 350
> >
> > Summary of tcp_1down run 'split_gso':
> > avg median # data pts
> > Ping (ms) ICMP : 0.51 0.55 ms 350
> > TCP download : 941.41 941.45 Mbits/s 350
> >
> > Additional details can be found in RFC 9332:
> > https://datatracker.ietf.org/doc/html/rfc9332
> >
> > Signed-off-by: Koen De Schepper <koen.de_schepper@...ia-bell-labs.com>
> > Co-developed-by: Olga Albisser <olga@...isser.org>
> > Signed-off-by: Olga Albisser <olga@...isser.org>
> > Co-developed-by: Olivier Tilmans <olivier.tilmans@...ia.com>
> > Signed-off-by: Olivier Tilmans <olivier.tilmans@...ia.com>
> > Co-developed-by: Henrik Steen <henrist@...rist.net>
> > Signed-off-by: Henrik Steen <henrist@...rist.net>
> > Signed-off-by: Bob Briscoe <research@...briscoe.net>
> > Signed-off-by: Ilpo Järvinen <ij@...nel.org>
> > Co-developed-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
> > Acked-by: Dave Taht <dave.taht@...il.com>
> > ---
> > include/net/dropreason-core.h | 6 +
> > net/sched/Kconfig | 12 +
> > net/sched/Makefile | 1 +
> > net/sched/sch_dualpi2.c | 449 ++++++++++++++++++++++++++++++++++
> > 4 files changed, 468 insertions(+)
> >
> > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > index bea77934a235..faae9f416e54 100644
> > --- a/include/net/dropreason-core.h
> > +++ b/include/net/dropreason-core.h
> > @@ -120,6 +120,7 @@
> > FN(ARP_PVLAN_DISABLE) \
> > FN(MAC_IEEE_MAC_CONTROL) \
> > FN(BRIDGE_INGRESS_STP_STATE) \
> > + FN(DUALPI2_STEP_DROP) \
> > FNe(MAX)
> >
> > /**
> > @@ -570,6 +571,11 @@ enum skb_drop_reason {
> > * ingress bridge port does not allow frames to be forwarded.
> > */
> > SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE,
> > + /**
> > + * @SKB_DROP_REASON_DUALPI2_STEP_DROP: dropped by the step drop
> > + * threshold of DualPI2 qdisc.
> > + */
> > + SKB_DROP_REASON_DUALPI2_STEP_DROP,
> > /**
> > * @SKB_DROP_REASON_MAX: the maximum of core drop reasons, which
> > * shouldn't be used as a real 'reason' - only for tracing code gen
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 9f0b3f943fca..dda66a3590d8 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -415,6 +415,18 @@ config NET_SCH_BPF
> >
> > If unsure, say N.
> >
> > +config NET_SCH_DUALPI2
> > + tristate "Dual Queue PI Square (DUALPI2) scheduler"
> > + help
> > + Say Y here if you want to use the Dual Queue Proportional Integral
> > + Controller Improved with a Square scheduling algorithm.
> > +	  For more information, please see
> > +	  https://tools.ietf.org/html/rfc9332
> > +
> > + To compile this driver as a module, choose M here: the module
> > + will be called sch_dualpi2.
> > +
> > + If unsure, say N.
> > +
> > menuconfig NET_SCH_DEFAULT
> > bool "Allow override default queue discipline"
> > help
> > diff --git a/net/sched/Makefile b/net/sched/Makefile
> > index 904d784902d1..5078ea84e6ad 100644
> > --- a/net/sched/Makefile
> > +++ b/net/sched/Makefile
> > @@ -63,6 +63,7 @@ obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
> > obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
> > obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
> > obj-$(CONFIG_NET_SCH_BPF) += bpf_qdisc.o
> > +obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o
> >
> > obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
> > obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
> > diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> > index 97986c754e47..7ecd7502332c 100644
> > --- a/net/sched/sch_dualpi2.c
> > +++ b/net/sched/sch_dualpi2.c
> > @@ -113,8 +113,44 @@ struct dualpi2_sched_data {
> > u32 step_marks; /* ECN mark pkt counter due to step AQM */
> > u32 memory_used; /* Memory used of both queues */
> > u32 max_memory_used;/* Maximum used memory */
> > +
> > + /* Deferred drop statistics */
> > + u32 deferred_drops_cnt; /* Packets dropped */
> > + u32 deferred_drops_len; /* Bytes dropped */
> > +};
> > +
> > +struct dualpi2_skb_cb {
> > + u64 ts; /* Timestamp at enqueue */
> > + u8 apply_step:1, /* Can we apply the step threshold */
> > + classified:2, /* Packet classification results */
> > + ect:2; /* Packet ECT codepoint */
> > +};
> > +
> > +enum dualpi2_classification_results {
> > + DUALPI2_C_CLASSIC = 0, /* C-queue */
> > + DUALPI2_C_L4S = 1, /* L-queue (scale mark/classic drop) */
> > + DUALPI2_C_LLLL = 2, /* L-queue (no drops/marks) */
> > + __DUALPI2_C_MAX /* Keep last*/
> > };
> >
> > +static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb) {
> > + qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
> > + return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data; }
> > +
> > +static u64 dualpi2_sojourn_time(struct sk_buff *skb, u64 reference) {
> > + return reference - dualpi2_skb_cb(skb)->ts; }
> > +
> > +static u64 head_enqueue_time(struct Qdisc *q) {
> > + struct sk_buff *skb = qdisc_peek_head(q);
> > +
> > + return skb ? dualpi2_skb_cb(skb)->ts : 0; }
> > +
> >  static u32 dualpi2_scale_alpha_beta(u32 param)
> >  {
> >  	u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
> > @@ -136,6 +172,25 @@ static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q)
> >  	return ktime_add_ns(ktime_get_ns(), q->pi2_tupdate);
> >  }
> >
> > +static bool skb_is_l4s(struct sk_buff *skb) {
> > + return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S; }
> > +
> > +static bool skb_in_l_queue(struct sk_buff *skb) {
> > + return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC; }
> > +
> > +static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb)
> > +{
> > + if (INET_ECN_set_ce(skb)) {
> > + q->ecn_mark++;
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
> > {
> >  	q->c_protection_credit = q->c_protection_init;
> > @@ -155,6 +210,398 @@ static void dualpi2_calculate_c_protection(struct Qdisc *sch,
> > dualpi2_reset_c_protection(q);
> > }
> >
> > +static bool dualpi2_roll(u32 prob)
> > +{
> > + return get_random_u32() <= prob; }
> > +
> > +/* Packets in the C-queue are subject to a marking probability pC, which is
> > + * the square of the internal PI probability (i.e., have an overall lower
> > + * mark/drop probability). If the qdisc is overloaded, ignore ECT values and
> > + * only drop.
> > + *
> > + * Note that this marking scheme is also applied to L4S packets during
> > + * overload.
> > + * Return true if packet dropping is required in C queue
> > + */
> > +static bool dualpi2_classic_marking(struct dualpi2_sched_data *q,
> > +				    struct sk_buff *skb, u32 prob,
> > +				    bool overload)
> > +{
> > + if (dualpi2_roll(prob) && dualpi2_roll(prob)) {
> > + if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> > + return true;
> > + dualpi2_mark(q, skb);
> > + }
> > + return false;
> > +}
> > +
> > +/* Packets in the L-queue are subject to a marking probability pL given by
> > + * the internal PI probability scaled by the coupling factor.
> > + *
> > + * On overload (i.e., @local_l_prob is >= 100%):
> > + * - if the qdisc is configured to trade losses to preserve latency (i.e.,
> > + * @q->drop_overload), apply classic drops first before marking.
> > + * - otherwise, preserve the "no loss" property of ECN at the cost of queueing
> > + * delay, eventually resulting in taildrop behavior once sch->limit is
> > + * reached.
> > + * Return true if packet dropping is required in L queue
> > + */
> > +static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q,
> > +				      struct sk_buff *skb,
> > +				      u64 local_l_prob, u32 prob,
> > +				      bool overload)
> > +{
> > + if (overload) {
> > + /* Apply classic drop */
> > + if (!q->drop_overload ||
> > + !(dualpi2_roll(prob) && dualpi2_roll(prob)))
> > + goto mark;
> > + return true;
> > + }
> > +
> > + /* We can safely cut the upper 32b as overload==false */
> > + if (dualpi2_roll(local_l_prob)) {
> > + /* Non-ECT packets could have classified as L4S by filters. */
> > + if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> > + return true;
> > +mark:
> > + dualpi2_mark(q, skb);
> > + }
> > + return false;
> > +}
> > +
> > +/* Decide whether a given packet must be dropped (or marked if ECT),
> > + * according to the PI2 probability.
> > + *
> > + * Never mark/drop if we have a standing queue of less than 2 MTUs.
> > + */
> > +static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
> > + struct sk_buff *skb) {
> > + u64 local_l_prob;
> > + bool overload;
> > + u32 prob;
> > +
> > + if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
> > + return false;
> > +
> > + prob = READ_ONCE(q->pi2_prob);
> > + local_l_prob = (u64)prob * q->coupling_factor;
> > + overload = local_l_prob > MAX_PROB;
> > +
> > + switch (dualpi2_skb_cb(skb)->classified) {
> > + case DUALPI2_C_CLASSIC:
> > + return dualpi2_classic_marking(q, skb, prob, overload);
> > + case DUALPI2_C_L4S:
> > + return dualpi2_scalable_marking(q, skb, local_l_prob, prob,
> > + overload);
> > + default: /* DUALPI2_C_LLLL */
> > + return false;
> > + }
> > +}
> > +
> > +static void dualpi2_read_ect(struct sk_buff *skb) {
> > + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> > + int wlen = skb_network_offset(skb);
> > +
> > + switch (skb_protocol(skb, true)) {
> > + case htons(ETH_P_IP):
> > + wlen += sizeof(struct iphdr);
> > + if (!pskb_may_pull(skb, wlen) ||
> > + skb_try_make_writable(skb, wlen))
> > + goto not_ecn;
> > +
> > + cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
> > + break;
> > + case htons(ETH_P_IPV6):
> > + wlen += sizeof(struct ipv6hdr);
> > + if (!pskb_may_pull(skb, wlen) ||
> > + skb_try_make_writable(skb, wlen))
> > + goto not_ecn;
> > +
> > + cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
> > + break;
> > + default:
> > + goto not_ecn;
> > + }
> > + return;
> > +
> > +not_ecn:
> > + /* Non pullable/writable packets can only be dropped hence are
> > + * classified as not ECT.
> > + */
> > + cb->ect = INET_ECN_NOT_ECT;
> > +}
> > +
> > +static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
> > + struct sk_buff *skb) {
> > + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> > + struct tcf_result res;
> > + struct tcf_proto *fl;
> > + int result;
> > +
> > + dualpi2_read_ect(skb);
> > + if (cb->ect & q->ecn_mask) {
> > + cb->classified = DUALPI2_C_L4S;
> > + return NET_XMIT_SUCCESS;
> > + }
> > +
> > + if (TC_H_MAJ(skb->priority) == q->sch->handle &&
> > + TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) {
> > + cb->classified = TC_H_MIN(skb->priority);
> > + return NET_XMIT_SUCCESS;
> > + }
> > +
> > + fl = rcu_dereference_bh(q->tcf_filters);
> > + if (!fl) {
> > + cb->classified = DUALPI2_C_CLASSIC;
> > + return NET_XMIT_SUCCESS;
> > + }
> > +
> > + result = tcf_classify(skb, NULL, fl, &res, false);
> > + if (result >= 0) {
> > +#ifdef CONFIG_NET_CLS_ACT
> > + switch (result) {
> > + case TC_ACT_STOLEN:
> > + case TC_ACT_QUEUED:
> > + case TC_ACT_TRAP:
> > + return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> > + case TC_ACT_SHOT:
> > + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > + }
> > +#endif
> > + cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ?
> > + TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC;
> > + }
> > + return NET_XMIT_SUCCESS;
> > +}
> > +
> > +static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch,
> > + struct sk_buff **to_free) {
> > + struct dualpi2_sched_data *q = qdisc_priv(sch);
> > + struct dualpi2_skb_cb *cb;
> > +
> > + if (unlikely(qdisc_qlen(sch) >= sch->limit) ||
> > + unlikely((u64)q->memory_used + skb->truesize > q->memory_limit)) {
> > + qdisc_qstats_overlimit(sch);
> > + if (skb_in_l_queue(skb))
> > + qdisc_qstats_overlimit(q->l_queue);
> > + return qdisc_drop_reason(skb, sch, to_free,
> > + SKB_DROP_REASON_QDISC_OVERLIMIT);
> > + }
> > +
> > + if (q->drop_early && must_drop(sch, q, skb)) {
> > + qdisc_drop_reason(skb, sch, to_free,
> > + SKB_DROP_REASON_QDISC_OVERLIMIT);
>
> I think it's better to use a different drop reason here, or it will be hard to distinguish between packets dropped here and packets dropped above.
> Possibly SKB_DROP_REASON_QDISC_CONGESTED.
>
Hi Paolo,
Thanks for the feedback.
This will be changed to SKB_DROP_REASON_QDISC_CONGESTED as suggested.
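For reference, the early-drop branch would then look roughly as follows (a
sketch only, assuming the surrounding code stays as in v16):

	if (q->drop_early && must_drop(sch, q, skb)) {
		/* Congestion-based early drop, kept distinct from the
		 * limit/memory drops above, which still use
		 * SKB_DROP_REASON_QDISC_OVERLIMIT.
		 */
		qdisc_drop_reason(skb, sch, to_free,
				  SKB_DROP_REASON_QDISC_CONGESTED);
		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
	}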
> > + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > + }
> > +
> > + cb = dualpi2_skb_cb(skb);
> > + cb->ts = ktime_get_ns();
> > + q->memory_used += skb->truesize;
> > + if (q->memory_used > q->max_memory_used)
> > + q->max_memory_used = q->memory_used;
> > +
> > + if (qdisc_qlen(sch) > q->maxq)
> > + q->maxq = qdisc_qlen(sch);
> > +
> > + if (skb_in_l_queue(skb)) {
> > + /* Only apply the step if a queue is building up */
> > + dualpi2_skb_cb(skb)->apply_step = skb_is_l4s(skb) &&
> > + qdisc_qlen(q->l_queue) >= q->min_qlen_step;
> > + /* Keep the overall qdisc stats consistent */
> > + ++sch->q.qlen;
> > + qdisc_qstats_backlog_inc(sch, skb);
> > + ++q->packets_in_l;
> > + if (!q->l_head_ts)
> > + q->l_head_ts = cb->ts;
>
> This chunk is very hard to read. Please insert a blank line before the 2nd comment and consider introducing a helper to compute the 'apply_step' value.
A new helper function will be added to represent "qdisc_qlen(q->l_queue) >= q->min_qlen_step".
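For example (a sketch only; the helper name below is a placeholder to be
settled in v17):

	static bool dualpi2_apply_step_thresh(struct dualpi2_sched_data *q)
	{
		return qdisc_qlen(q->l_queue) >= q->min_qlen_step;
	}

and in dualpi2_enqueue_skb():

		/* Only apply the step if a queue is building up */
		dualpi2_skb_cb(skb)->apply_step = skb_is_l4s(skb) &&
						  dualpi2_apply_step_thresh(q);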
>
> > + return qdisc_enqueue_tail(skb, q->l_queue);
> > + }
> > + ++q->packets_in_c;
> > + if (!q->c_head_ts)
> > + q->c_head_ts = cb->ts;
> > + return qdisc_enqueue_tail(skb, sch); }
> > +
> > +/* By default, dualpi2 will split GSO skbs into independent skbs and
> > + * enqueue each of those individually. This yields the following benefits,
> > + * at the expense of CPU usage:
> > + * - Finer-grained AQM actions as the sub-packets of a burst no longer share the
> > + * same fate (e.g., the random mark/drop probability is applied individually)
> > + * - Improved precision of the starvation protection/WRR scheduler at dequeue,
> > + * as the size of the dequeued packets will be smaller.
> > + */
> > +static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > + struct sk_buff **to_free) {
> > + struct dualpi2_sched_data *q = qdisc_priv(sch);
> > + int err;
> > +
> > + err = dualpi2_skb_classify(q, skb);
> > + if (err != NET_XMIT_SUCCESS) {
> > + if (err & __NET_XMIT_BYPASS)
> > + qdisc_qstats_drop(sch);
> > + __qdisc_drop(skb, to_free);
> > + return err;
> > + }
> > +
> > + if (q->split_gso && skb_is_gso(skb)) {
> > + netdev_features_t features;
> > + struct sk_buff *nskb, *next;
> > + int cnt, byte_len, orig_len;
> > + int err;
> > +
> > + features = netif_skb_features(skb);
> > + nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> > + if (IS_ERR_OR_NULL(nskb))
> > + return qdisc_drop(skb, sch, to_free);
> > +
> > + cnt = 1;
> > + byte_len = 0;
> > + orig_len = qdisc_pkt_len(skb);
> > + skb_list_walk_safe(nskb, nskb, next) {
> > + skb_mark_not_on_list(nskb);
> > + qdisc_skb_cb(nskb)->pkt_len = nskb->len;
> > + dualpi2_skb_cb(nskb)->classified =
> > + dualpi2_skb_cb(skb)->classified;
>
> Possibly just:
> *qdisc_skb_cb(nskb) = *dualpi2_skb_cb(skb);
However, I would prefer to keep the original code, because it clearly shows
that the classified and ect values come from dualpi2_skb_cb(skb), the pkt_len
value comes from nskb->len, and the ts value is set later in
dualpi2_enqueue_skb().
I will add comments for that; I hope this is clear.
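Roughly as below (a sketch; the exact comment wording will be finalized in
v17):

		/* classified and ect are inherited from the original skb,
		 * pkt_len is taken from the segment itself, and ts is only
		 * stamped later in dualpi2_enqueue_skb().
		 */
		qdisc_skb_cb(nskb)->pkt_len = nskb->len;
		dualpi2_skb_cb(nskb)->classified =
			dualpi2_skb_cb(skb)->classified;
		dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;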
>
> > + dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
> > + err = dualpi2_enqueue_skb(nskb, sch, to_free);
> > + if (err == NET_XMIT_SUCCESS) {
> > + /* Compute the backlog adjustment that needs
> > + * to be propagated in the qdisc tree to reflect
> > + * all new skbs successfully enqueued.
> > + */
> > + ++cnt;
> > + byte_len += nskb->len;
> > + }
> > + }
> > + if (err == NET_XMIT_SUCCESS) {
> > + /* The caller will add the original skb stats to its
> > + * backlog, compensate this.
> > + */
> > + --cnt;
> > + byte_len -= orig_len;
> > + }
> > + qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
> > + consume_skb(skb);
> > + return err;
>
> Using the return value from the last enqueue operation is IMHO inaccurate and confusing. Instead you could return NET_XMIT_SUCCESS if at least one skb is enqueued successfully, or possibly always return NET_XMIT_SUCCESS unconditionally (that would simplify the code a bit).
Indeed, I will replace "if (err == NET_XMIT_SUCCESS)" with "if (cnt > 1)", which corresponds to the former case you mentioned.
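I.e., something like this (a sketch):

		if (cnt > 1) {
			/* At least one segment was enqueued successfully;
			 * the caller will add the original skb stats to its
			 * backlog, so compensate for that here.
			 */
			--cnt;
			byte_len -= orig_len;
		}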
BRs,
Chia-Yu
>
> > + }
> > + return dualpi2_enqueue_skb(skb, sch, to_free); }
> > +
> > +/* Select the queue from which the next packet can be dequeued, ensuring
> > + * that neither queue can starve the other with a WRR scheduler.
> > + *
> > + * The sign of the WRR credit determines the next queue, while the size of
> > + * the dequeued packet determines the magnitude of the WRR credit change.
> > + * If either queue is empty, the WRR credit is kept unchanged.
> > + *
> > + * As the dequeued packet can be dropped later, the caller has to perform
> > + * the qdisc_bstats_update() calls.
> > + */
> > +static struct sk_buff *dequeue_packet(struct Qdisc *sch,
> > + struct dualpi2_sched_data *q,
> > + int *credit_change,
> > + u64 now) {
> > + struct sk_buff *skb = NULL;
> > + int c_len;
> > +
> > + *credit_change = 0;
> > + c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> > + if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection_credit <= 0)) {
> > + skb = __qdisc_dequeue_head(&q->l_queue->q);
> > + WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue));
> > + if (c_len)
> > + *credit_change = q->c_protection_wc;
> > + qdisc_qstats_backlog_dec(q->l_queue, skb);
>
> Please add an empty line here.
>
> > + /* Keep the global queue size consistent */
> > + --sch->q.qlen;
> > + q->memory_used -= skb->truesize;
> > + } else if (c_len) {
> > + skb = __qdisc_dequeue_head(&sch->q);
> > + WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch));
> > + if (qdisc_qlen(q->l_queue))
> > + *credit_change = ~((s32)q->c_protection_wl) + 1;
> > + q->memory_used -= skb->truesize;
> > + } else {
> > + dualpi2_reset_c_protection(q);
> > + return NULL;
> > + }
> > + *credit_change *= qdisc_pkt_len(skb);
> > + qdisc_qstats_backlog_dec(sch, skb);
> > + return skb;
> > +}
> > +
> > +static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb,
> > + u64 now)
> > +{
> > + u64 qdelay = 0;
> > +
> > + if (q->step_in_packets)
> > + qdelay = qdisc_qlen(q->l_queue);
> > + else
> > + qdelay = dualpi2_sojourn_time(skb, now);
> > +
> > + if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step_thresh) {
> > + if (!dualpi2_skb_cb(skb)->ect)
> > + /* Drop this non-ECT packet */
> > + return 1;
>
> Please add brackets and an empty line:
>
> if (!dualpi2_skb_cb(skb)->ect) {
> /* Drop this non-ECT packet */
> return 1;
> }
>
> if (dualpi2_mark(q, skb))
>
>
> /P