lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <009e968cc984b563c375cb5be1999486b05db626.camel@inf.elte.hu>
Date:   Thu, 16 Feb 2023 14:47:11 +0100
From:   Ferenc Fejes <fejes@....elte.hu>
To:     Vladimir Oltean <vladimir.oltean@....com>
Cc:     peti.antal99@...il.com, "andrew@...n.ch" <andrew@...n.ch>,
        Claudiu Manoil <claudiu.manoil@....com>,
        "vivien.didelot@...il.com" <vivien.didelot@...il.com>,
        "vinicius.gomes@...el.com" <vinicius.gomes@...el.com>,
        "shuah@...nel.org" <shuah@...nel.org>,
        "bigeasy@...utronix.de" <bigeasy@...utronix.de>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Yannick Vignon <yannick.vignon@....com>,
        Xiaoliang Yang <xiaoliang.yang_1@....com>,
        "UNGLinuxDriver@...rochip.com" <UNGLinuxDriver@...rochip.com>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "richardcochran@...il.com" <richardcochran@...il.com>,
        "idosch@...dia.com" <idosch@...dia.com>,
        "gerhard@...leder-embedded.com" <gerhard@...leder-embedded.com>,
        "Y.B. Lu" <yangbo.lu@....com>, "jiri@...dia.com" <jiri@...dia.com>,
        "alexandre.belloni@...tlin.com" <alexandre.belloni@...tlin.com>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "kurt@...utronix.de" <kurt@...utronix.de>,
        Rui Sousa <rui.sousa@....com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
        "f.fainelli@...il.com" <f.fainelli@...il.com>
Subject: Re: [PATCH v2 net-next] selftests: forwarding: add Per-Stream
 Filtering and Policing test for Ocelot

Hi!

Sorry for the delay.

On Thu, 2022-05-26 at 09:30 +0000, Vladimir Oltean wrote:
> On Thu, May 26, 2022 at 06:40:22AM +0000, Ferenc Fejes wrote:
> > Hi Vladimir!
> > 
> > On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
> > > On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
> > > > Hi Vladimir!
> > > > 
> > > > On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
> > > > > Hi Ferenc,
> > > > > 
> > > > > On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
> > > > > This is correct. I have been testing only with the offloaded
> > > > > tc-
> > > > > gate
> > > > > action so I did not notice that the software does not act
> > > > > upon
> > > > > the ipv.
> > > > > Your proposal sounds straightforward enough. Care to send a
> > > > > bug
> > > > > fix patch?
> > > > 
> > > > Unfortunately I cant, our company policy does not allow direct 
> > > > open-source contributions :-(
> > > > 
> > > > However I would be more than happy if you can fix it.
> > > 
> > > That's too bad.
> > > 
> > > I have a patch which I am still testing, but you've managed to
> > > captivate
> > > my attention for saying that you are testing 802.1Qch with a
> > > software
> > > implementation of tc-gate.
> > > 
> > > Do you have a use case for this? What cycle times are you
> > > targeting?
> > > How are you ensuring that you are deterministically meeting the
> > > deadlines?
> > 
> > The cycle times I targeted were nowhere near to a realistic TSN
> > scenario:
> > I "generated" ping packets in every 100 msecs and on the ingress
> > port
> > and I marked them with prio 1 for 500ms (gate 1) and prio 2 for
> > another
> > 500ms (gate 2). On the egress port I applied taprio with the same
> > base-
> > time and same 500-500ms cycles but reverse ordered gates (that's
> > the
> > "definition" of the Qch), so while act_gate on the ingress is in
> > gate 1
> > cycle, the taprio kept open gate 2 and gate 1 closed, etc.
> > For "verification" I simply run a tcpdump on the listener machine
> > what
> > I pinged from the talker and eyeballed wether the 5-5 packets
> > bursts
> > shows up arrival timestamps.
> > 
> > > Do you also have a software tc-taprio on the egress device or is
> > > that
> > > at least offloaded?
> > 
> > No, I experimented with the software version, but that worked with
> > my
> > netns tests and on physical machines too (after the IPV patch).
> > 
> > > 
> > > I'm asking these questions because the peristaltic shaper is
> > > primarily
> > > intended to be used on hardware switches. The patch I'm preparing
> > > includes changes to struct sk_buff. I just want to know how well
> > > I'll
> > > be
> > > able to sell these changes to maintainers strongly opposing the
> > > growth
> > > of this structure for an exceedingly small niche :)
> > 
> > Can you tell me about the intention behind the sk_buff changes?
> > Does
> > that required because of some offloading scenario? In my case
> > putting
> > the IPV into the skb->priority was good enough because the taprio
> > using
> > that field by default to select the traffic class for the packet.
> > 
> > Thanks,
> > Ferenc
> > 
> 
> Hopefully this patch explains it:
> 
> -----------------------------[ cut here ]----------------------------
> -
> From c8d33e0d65a4a63884a626dc04c7190a2bbe122b Mon Sep 17 00:00:00
> 2001
> From: Vladimir Oltean <vladimir.oltean@....com>
> Date: Wed, 25 May 2022 16:27:52 +0300
> Subject: [PATCH] net/sched: act_gate: act on Internal Priority Value
> 
> IEEE 802.1Q specifies in Annex T ("Cyclic queuing and forwarding")
> the
> use case for an Internal Priority Value set by the PSFP function.
> Essentially, the intended use is to be able to select a packet's
> egress
> queue squarely based on its arrival time. Essentially, this means
> that a
> packet with the same VLAN PCP 5 can be delivered to TC 7 or 6
> depending
> on whether it was received in an "odd" or "even" cycle. To quote the
> spec, "The use of the IPV allows this direction of frames to outbound
> queues to be independent of the received priority, and also does not
> affect the priority associated with the frame on transmission."
> 
> The problem is that the software implementation of act_gate takes the
> IPV as entry from user space, but does not act on it in its software
> implementation - only offloading drivers do.
> 
> To fix this, we need to keep a current_ipv variable according to the
> gate entry that's currently executed by act_gate, and use this IPV to
> overwrite the skb->priority.
> 
> In fact, a complication arises due to the following clause from
> 802.1Q:
> 
> > 8.6.6.1 PSFP queuing
> > If PSFP is supported (8.6.5.1), and the IPV associated with the
> > stream
> > filter that passed the frame is anything other than the null value,
> > then
> > that IPV is used to determine the traffic class of the frame, in
> > place
> > of the frame's priority, via the Traffic Class Table specified in
> > 8.6.6.
> > In all other respects, the queuing actions specified in 8.6.6 are
> > unchanged. The IPV is used only to determine the traffic class
> > associated with a frame, and hence select an outbound queue; for
> > all
> > other purposes, the received priority is used.

Interesting. In my understanding this indeed something like "modify the
queueing decision while preserve the original PCP value".

> 
> An example of layer that we don't want to see the IPV is the egress-
> qos-map
> of a potential VLAN output device. Instead, the VLAN PCP should still
> be
> populated based on the original skb->priority.
> 
> I'm aware of the existence of a skb_update_prio() function in
> __dev_queue_xmit(), right before passing the skb to the qdisc layer,
> for the use case where net_prio cgroups are used to assign processes
> to
> network priorities on a per-interface basis. But rewriting the
> skb->priority with the skb->ipv there simply wouldn't work, exactly
> because there might be a VLAN in the output path of the skb.

So the problem is currently we use skb->priority with two semantics:
1. for egress-qos-mapping which set the PCP of the VLAN header based on
skb->priority
2. for selecting the traffic class (and the tx queues with that as
well).

> 
> One might suggest: just delay the IPV rewriting in the presence of
> stacked devices until it is actually needed and likely to be used,
> like when dev->num_tc != 0 (which VLAN doesn't have). But there are
> still other uses of skb->priority in the qdisc layer, like skbprio,
> htb etc. From the spec's wording it isn't clear that we want these
> functions to look at the proper packet user priority or at the PSFP
> IPV.
> Probably the former.
> 
> So the only implementation that conforms to these requirements
> seems to be the invasive one, where we introduce a dedicated
> helper for the pattern where drivers and the core ask for
> netdev_get_prio_tc_map(dev, skb->priority). We replace all such
> instances with a helper that selects the TC based on IPV, if the skb
> was filtered by PSFP time gates on ingress, and falls back to
> skb->priority otherwise.
> 
> Fixes: a51c328df310 ("net: qos: introduce a gate control flow
> action")
> Reported-by: Ferenc Fejes <ferenc.fejes@...csson.com>
> Link:
> https://lore.kernel.org/netdev/87o80h1n65.fsf@kurt/T/#meaec9c24b3add9cb11edd411278a3a6ecf01573e
> Signed-off-by: Vladimir Oltean <vladimir.oltean@....com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
>  include/linux/netdevice.h                     | 13 +++++++++++++
>  include/linux/skbuff.h                        | 11 +++++++++++
>  include/net/tc_act/tc_gate.h                  |  1 +
>  net/core/dev.c                                |  4 ++--
>  net/dsa/tag_ocelot.c                          |  2 +-
>  net/sched/act_gate.c                          |  6 ++++++
>  net/sched/sch_taprio.c                        | 12 +++---------
>  8 files changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 77c2e70b0860..719259af9aaa 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -8531,7 +8531,7 @@ static u16 ixgbe_select_queue(struct net_device
> *dev, struct sk_buff *skb,
>         int txq;
>  
>         if (sb_dev) {
> -               u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
> +               u8 tc = skb_get_prio_tc_map(skb, dev);
>                 struct net_device *vdev = sb_dev;
>  
>                 txq = vdev->tc_to_txq[tc].offset;
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 47b59f99b037..8b87d017288f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2366,6 +2366,19 @@ int netdev_set_prio_tc_map(struct net_device
> *dev, u8 prio, u8 tc)
>         return 0;
>  }
>  
> +static inline int skb_get_prio_tc_map(const struct sk_buff *skb,
> +                                     const struct net_device *dev)
> +{
> +       __u32 prio = skb->priority;
> +
> +#if IS_ENABLED(CONFIG_NET_ACT_GATE)
> +       if (skb->use_ipv)
> +               prio = skb->ipv;
> +#endif
> +
> +       return netdev_get_prio_tc_map(dev, prio);
> +}
> +
>  int netdev_txq_to_tc(struct net_device *dev, unsigned int txq);
>  void netdev_reset_tc(struct net_device *dev);
>  int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count,
> u16 offset);
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index da96f0d3e753..b0a463c0bc65 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -913,6 +913,9 @@ typedef unsigned char *sk_buff_data_t;
>   *     @csum_start: Offset from skb->head where checksumming should
> start
>   *     @csum_offset: Offset from csum_start where checksum should be
> stored
>   *     @priority: Packet queueing priority
> + *     @use_ipv: Packet internal priority was altered by PSFP
> + *     @ipv: Internal Priority Value, overrides priority for traffic
> class
> + *             selection
>   *     @ignore_df: allow local fragmentation
>   *     @cloned: Head may be cloned (check refcnt to be sure)
>   *     @ip_summed: Driver fed us an IP checksum
> @@ -1145,6 +1148,10 @@ struct sk_buff {
>         __u8                    slow_gro:1;
>         __u8                    csum_not_inet:1;
>  
> +#ifdef CONFIG_NET_ACT_GATE
> +       __u8                    use_ipv:1;
> +#endif
> +
>  #ifdef CONFIG_NET_SCHED
>         __u16                   tc_index;       /* traffic control
> index */
>  #endif
> @@ -1209,6 +1216,10 @@ struct sk_buff {
>         /* only useable after checking ->active_extensions != 0 */
>         struct skb_ext          *extensions;
>  #endif
> +
> +#ifdef CONFIG_NET_ACT_GATE
> +       __u32                   ipv;
> +#endif
>  };
>  
>  /* if you move pkt_type around you also must adapt those constants
> */
> diff --git a/include/net/tc_act/tc_gate.h
> b/include/net/tc_act/tc_gate.h
> index c8fa11ebb397..b05c2c7d78e4 100644
> --- a/include/net/tc_act/tc_gate.h
> +++ b/include/net/tc_act/tc_gate.h
> @@ -44,6 +44,7 @@ struct tcf_gate {
>         ktime_t                 current_close_time;
>         u32                     current_entry_octets;
>         s32                     current_max_octets;
> +       s32                     current_ipv;
>         struct tcfg_gate_entry  *next_entry;
>         struct hrtimer          hitimer;
>         enum tk_offsets         tk_offset;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 721ba9c26554..956aa227c260 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3194,7 +3194,7 @@ static u16 skb_tx_hash(const struct net_device
> *dev,
>         u16 qcount = dev->real_num_tx_queues;
>  
>         if (dev->num_tc) {
> -               u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
> +               u8 tc = skb_get_prio_tc_map(skb, dev);
>  
>                 qoffset = sb_dev->tc_to_txq[tc].offset;
>                 qcount = sb_dev->tc_to_txq[tc].count;
> @@ -4002,7 +4002,7 @@ EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
>  static int __get_xps_queue_idx(struct net_device *dev, struct
> sk_buff *skb,
>                                struct xps_dev_maps *dev_maps,
> unsigned int tci)
>  {
> -       int tc = netdev_get_prio_tc_map(dev, skb->priority);
> +       int tc = skb_get_prio_tc_map(skb, dev);
>         struct xps_map *map;
>         int queue_index = -1;
>  
> diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c
> index 0d81f172b7a6..036e746f18eb 100644
> --- a/net/dsa/tag_ocelot.c
> +++ b/net/dsa/tag_ocelot.c
> @@ -52,7 +52,7 @@ static void ocelot_xmit_common(struct sk_buff *skb,
> struct net_device *netdev,
>         ocelot_xmit_get_vlan_info(skb, dp, &vlan_tci, &tag_type);
>  
>         qos_class = netdev_get_num_tc(netdev) ?
> -                   netdev_get_prio_tc_map(netdev, skb->priority) :
> skb->priority;
> +                   skb_get_prio_tc_map(skb, netdev) : skb->priority;
>  
>         injection = skb_push(skb, OCELOT_TAG_LEN);
>         prefix = skb_push(skb, OCELOT_SHORT_PREFIX_LEN);
> diff --git a/net/sched/act_gate.c b/net/sched/act_gate.c
> index fd5155274733..9fb248b104f8 100644
> --- a/net/sched/act_gate.c
> +++ b/net/sched/act_gate.c
> @@ -81,6 +81,7 @@ static enum hrtimer_restart gate_timer_func(struct
> hrtimer *timer)
>         gact->current_gate_status = next->gate_state ?
> GATE_ACT_GATE_OPEN : 0;
>         gact->current_entry_octets = 0;
>         gact->current_max_octets = next->maxoctets;
> +       gact->current_ipv = next->ipv;
>  
>         gact->current_close_time = ktime_add_ns(gact-
> >current_close_time,
>                                                 next->interval);
> @@ -140,6 +141,11 @@ static int tcf_gate_act(struct sk_buff *skb,
> const struct tc_action *a,
>                 }
>         }
>  
> +       if (gact->current_ipv >= 0) {
> +               skb->use_ipv = true;
> +               skb->ipv = gact->current_ipv;
> +       }
> +
>         spin_unlock(&gact->tcf_lock);
>  
>         return gact->tcf_action;
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index b9c71a304d39..fb8bc17e38bb 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -201,7 +201,7 @@ static struct sched_entry
> *find_entry_to_transmit(struct sk_buff *skb,
>         s32 cycle_elapsed;
>         int tc, n;
>  
> -       tc = netdev_get_prio_tc_map(dev, skb->priority);
> +       tc = skb_get_prio_tc_map(skb, dev);
>         packet_transmit_time = length_to_duration(q,
> qdisc_pkt_len(skb));
>  
>         *interval_start = 0;
> @@ -509,7 +509,6 @@ static struct sk_buff *taprio_peek_soft(struct
> Qdisc *sch)
>  
>         for (i = 0; i < dev->num_tx_queues; i++) {
>                 struct Qdisc *child = q->qdiscs[i];
> -               int prio;
>                 u8 tc;
>  
>                 if (unlikely(!child))
> @@ -522,9 +521,7 @@ static struct sk_buff *taprio_peek_soft(struct
> Qdisc *sch)
>                 if (TXTIME_ASSIST_IS_ENABLED(q->flags))
>                         return skb;
>  
> -               prio = skb->priority;
> -               tc = netdev_get_prio_tc_map(dev, prio);
> -
> +               tc = skb_get_prio_tc_map(skb, dev);
>                 if (!(gate_mask & BIT(tc)))
>                         continue;
>  
> @@ -579,7 +576,6 @@ static struct sk_buff *taprio_dequeue_soft(struct
> Qdisc *sch)
>         for (i = 0; i < dev->num_tx_queues; i++) {
>                 struct Qdisc *child = q->qdiscs[i];
>                 ktime_t guard;
> -               int prio;
>                 int len;
>                 u8 tc;
>  
> @@ -597,9 +593,7 @@ static struct sk_buff *taprio_dequeue_soft(struct
> Qdisc *sch)
>                 if (!skb)
>                         continue;
>  
> -               prio = skb->priority;
> -               tc = netdev_get_prio_tc_map(dev, prio);
> -
> +               tc = skb_get_prio_tc_map(skb, dev);
>                 if (!(gate_mask & BIT(tc))) {
>                         skb = NULL;
>                         continue;
> -----------------------------[ cut here ]----------------------------
> -


Your solution certainly correct and do the differentiation between the
cases where we have PSFP at ingress or not. However in my understanding
the only purpose of the IPV is the traffic class selection. 

Setting skb->queue_mapping to IPV probably wont work, because of two
reasons:
1. it brings inconsistency with the mqprio/taprio queues and the actual
hw rings the traffic sent
2. some drivers dont check if skb->queue_mapping is bounded, they
expect its smaller than the num_tx_queues.

The 2. might be solvable, but the 1. is more problematic. However with
a helper, we might check if skb->queue_mapping is already set and use
that as a traffic class. Is that possible? I dont really see any other
codepath where that value can be other than zero before the qdisc
layer. That way one flag (use_ipv) might be enough which tells that we
should use the skb->queue_mapping as is (set by act_gate) and preserve
skb->priority.

What do you think? Again, sorry for being late here, but I'm following
the list and see that recently you did major mqprio/taprio fixes and
refactors, so I hope your cache line is hot.

Best,
Ferenc

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ