Message-ID: <c8476530638a5f4381d64db0e024ed49c2db3b02.camel@gmail.com>
Date:   Tue, 07 Feb 2023 08:28:56 -0800
From:   Alexander H Duyck <alexander.duyck@...il.com>
To:     "Nambiar, Amritha" <amritha.nambiar@...el.com>,
        netdev@...r.kernel.org
Cc:     davem@...emloft.net, kuba@...nel.org, edumazet@...gle.com,
        pabeni@...hat.com, Saeed Mahameed <saeed@...nel.org>,
        "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
Subject: Re: Kernel interface to configure queue-group parameters

On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
> Hello,
> 
> We are looking for feedback on the kernel interface to configure 
> queue-group level parameters.
> 
> Queues are primary residents in the kernel and there are multiple 
> interfaces to configure queue-level parameters. For example, tx_maxrate 
> for a transmit queue can be controlled via the sysfs interface. Ethtool 
> is another option to change the RX/TX ring parameters of the specified 
> network device (example, rx-buf-len, tx-push etc.).
> 
> Queue_groups are a set of queues grouped together into a single object. 
> For example, tx_queue_group-0 is a transmit queue_group with index 0 and 
> can have transmit queues say 0-31, similarly rx_queue_group-0 is a 
> receive queue_group with index 0 and can have receive queues 0-31, 
> tx/rx_queue_group_1 may consist of TX and RX queues say 32-127 
> respectively. Currently, upstream drivers for both ice and mlx5 support 
> creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
> 
> At this point, the kernel does not have an abstraction for queue_group. 
> A close equivalent in the kernel is a 'traffic class' which consists of 
> a set of transmit queues. Today, traffic classes are created using TC's 
> mqprio scheduler. Only a limited set of parameters can be configured on 
> each traffic class via mqprio, example priority per traffic class, min 
> and max bandwidth rates per traffic class etc. Mqprio also supports 
> offload of these parameters to the hardware. The parameters set for the 
> traffic class (tx queue_group) is applicable to all transmit queues 
> belonging to the queue_group. However, introducing additional parameters 
> for queue_groups and configuring them via mqprio makes the interface 
> less user-friendly (as the command line gets cumbersome due to the 
> number of qdisc parameters). Although, mqprio is the interface to create 
> transmit queue_groups, and is also the interface to configure and 
> offload certain transmit queue_group parameters, due to these 
> limitations we are wondering if it is worth considering other interface 
> options for configuring queue_group parameters.
> 

I think much of this depends on exactly what functionality we are
talking about. The problem is the Intel use case conflates interrupts
w/ queues w/ the applications themselves since what it is trying to do
is a poor imitation of RDMA being implemented using something akin to
VMDq last I knew.

So for example one of the things you are asking about below is
establishing a minimum rate for outgoing Tx packets. In my mind we
would probably want to use something like mqprio to set that up since
it is Tx rate limiting and if we were to configure it to happen in
software it would need to be handled in the Qdisc layer.

As for the NAPI pollers attribute, that seems like something that
needs further clarification. Are you limiting the number of busy poll
instances that can run on a single queue group? Is there a reason for
doing it per queue group instead of this being something that could be
enabled on a specific set of queues within the group?

> Likewise, receive queue_groups can be created using the ethtool 
> interface as RSS contexts. Next step would be to configure 
> per-rx_queue_group parameters. Based on the discussion in 
> https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
> it looks like ethtool may not be the right interface to configure 
> rx_queue_group parameters (that are unrelated to flow<->queue 
> assignment), example NAPI configurations on the queue_group.
> 
> The key gaps in the kernel to support queue-group parameters are:
> 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
> 2. Offload hooks for TX/RX queue_group parameters depending on the 
> chosen interface.
> 
> Following are the options we have investigated:
> 
> 1. tc-mqprio:
>     Pros:
>     - Already supports creating queue_groups, offload of certain parameters
> 
>     Cons:
>     - Introducing new parameters makes the interface less user-friendly. 
>   TC qdisc parameters are specified at the qdisc creation, larger the 
> number of traffic classes and their respective parameters, lesser the 
> usability.

Yes and no. The TC layer is mostly meant for handling the Tx side of
things. For something like the rate limiting it might make sense since
there is already logic there to do it in mqprio. But if you are trying
to pull in NAPI or RSS attributes then I agree it would hurt usability.

> 2. Ethtool:
>     Pros:
>     - Already creates RX queue_groups as RSS contexts
> 
>     Cons:
>     - May not be the right interface for non-RSS related parameters
> 
>     Example for configuring number of napi pollers for a queue group:
>     ethtool -X <iface> context <context_num> num_pollers <n>

One thing that might make sense would be to look at adding a possible
alias for context that could work with something like DCB or the queue
groups use case. I believe that for DCB there is a similar issue where
the various priorities could have separate RSS contexts, so it might
make sense to look at applying similar logic. Also there has been
talk about trying to do the round robin on SYN type logic. That
might make sense to expose as an hfunc type since it would be overriding
RSS for TCP flows.

The num_pollers can be problematic though as we don't really have
anything like that in ethtool currently. Probably the closest thing I
can think of is interrupt moderation. It depends on whether it has to
be a per-queue-group attribute or if it could be a per-queue attribute.
Specifically I am referring to the -Q option that is currently applied
to the coalescing functions in ethtool.
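
For reference, a minimal sketch of how the existing per-queue option is
driven today, assuming queues 0-3 are the ones of interest. The device
name, the queue range, and the rx-usecs value are placeholders, and
queue_mask() is a hypothetical helper, not part of ethtool. The snippet
only prints the command; drop the echo to apply it.

```shell
# Hypothetical helper: build a hex queue_mask covering queues FIRST..LAST.
queue_mask() {
    first=$1
    last=$2
    count=$(( last - first + 1 ))
    printf '0x%x\n' $(( ((1 << count) - 1) << first ))
}

# Dry run: print a per-queue coalescing command for queues 0-3.
# DEV and the rx-usecs value are placeholders.
DEV=eth0
MASK=$(queue_mask 0 3)
echo "ethtool -Q $DEV queue_mask $MASK --coalesce rx-usecs 50"
```

A per-queue num_pollers style attribute would presumably need a new
sub command alongside the coalescing ones, which does not exist today.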

> 3. sysfs:
>     Pros:
>     - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
>     - Makes it possible to support some existing per-netdev napi 
> parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be 
> per-queue-group parameters.
> 
>     Cons:
>     - Requires introducing new queue_group structures for TX and RX 
> queue groups and references for it, kset references for queue_group in 
> struct net_device
>     - Additional ndo ops in net_device_ops for each parameter for 
> hardware offload.
> 
>     Examples :
>     /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
>     /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate

So min_rate is something already handled in mqprio since it is so DCB
like. You are essentially guaranteeing bandwidth, aren't you? Couldn't
you just define a bw_rlimit shaper for mqprio and then use the existing
bw_rlimit values to define the min_rate?
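
As a rough sketch of that approach, assuming two traffic classes of four
queues each on a device with mqprio channel-mode offload. The device
name, priority map, queue layout, and rates are all placeholders rather
than values from this thread. The snippet only prints the command; run
the printed line as root to apply it.

```shell
# Dry run: print an mqprio command that sets per-traffic-class
# min_rate/max_rate through the bw_rlimit shaper.
# DEV, the map, the queue layout and the rates are placeholders.
DEV=eth0
MQPRIO_CMD="tc qdisc add dev $DEV root mqprio num_tc 2 \
map 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 queues 4@0 4@4 hw 1 mode channel \
shaper bw_rlimit min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit"
echo "$MQPRIO_CMD"
```

Each min_rate/max_rate list carries one value per traffic class, so the
rate applies to every transmit queue in that class.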

As far as adding the queue_groups interface one ugly bit would be that
we would probably need to have links between the queues and these
groups which would start to turn the sysfs into a tangled mess.

The biggest issue I see is that there isn't any sort of sysfs interface
exposed for NAPI which is what you would essentially need to justify
something like this since that is what you are modifying.

> 4. Devlink:
>     Pros:
>     - New parameters can be added without any changes to the kernel or 
> userspace.
> 
>     Cons:
>     - Queue/Queue_group is a function-wide entity, Devlink is for 
> device-wide stuff. Devlink being device centric is not suitable for 
> queue parameters such as rates, NAPI etc.

Yeah, I wouldn't expect something like this to be a good fit.
