Message-ID: <893f1f98-7f64-4f66-3f08-75865a6916c3@intel.com>
Date: Thu, 16 Feb 2023 02:35:35 -0800
From: "Nambiar, Amritha" <amritha.nambiar@...el.com>
To: Alexander H Duyck <alexander.duyck@...il.com>,
<netdev@...r.kernel.org>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <edumazet@...gle.com>,
<pabeni@...hat.com>, Saeed Mahameed <saeed@...nel.org>,
"Samudrala, Sridhar" <sridhar.samudrala@...el.com>
Subject: Re: Kernel interface to configure queue-group parameters
On 2/7/2023 8:28 AM, Alexander H Duyck wrote:
> On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
>> Hello,
>>
>> We are looking for feedback on the kernel interface to configure
>> queue-group level parameters.
>>
>> Queues are primary residents in the kernel and there are multiple
>> interfaces to configure queue-level parameters. For example, tx_maxrate
>> for a transmit queue can be controlled via the sysfs interface. Ethtool
>> is another option to change the RX/TX ring parameters of the specified
>> network device (example, rx-buf-len, tx-push etc.).
>>
>> Queue_groups are a set of queues grouped together into a single object.
>> For example, tx_queue_group-0 is a transmit queue_group with index 0 and
>> can have transmit queues say 0-31, similarly rx_queue_group-0 is a
>> receive queue_group with index 0 and can have receive queues 0-31,
>> tx/rx_queue_group-1 may consist of TX and RX queues 32-127,
>> respectively. Currently, upstream drivers for both ice and mlx5 support
>> creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
>>
>> At this point, the kernel does not have an abstraction for queue_group.
>> A close equivalent in the kernel is a 'traffic class' which consists of
>> a set of transmit queues. Today, traffic classes are created using TC's
>> mqprio scheduler. Only a limited set of parameters can be configured on
>> each traffic class via mqprio, for example, priority per traffic
>> class, min and max bandwidth rates per traffic class, etc. Mqprio also
>> supports offload of these parameters to the hardware. The parameters
>> set for a traffic class (tx queue_group) are applicable to all
>> transmit queues
>> belonging to the queue_group. However, introducing additional parameters
>> for queue_groups and configuring them via mqprio makes the interface
>> less user-friendly (as the command line gets cumbersome due to the
>> number of qdisc parameters). Although mqprio is the interface to
>> create transmit queue_groups, and is also the interface to configure
>> and offload certain transmit queue_group parameters, due to these
>> limitations we are wondering whether it is worth considering other
>> interface options for configuring queue_group parameters.
>>
>
> I think much of this depends on exactly what functionality we are
> talking about. The problem is the Intel use case conflates interrupts
> w/ queues w/ the applications themselves since what it is trying to do
> is a poor imitation of RDMA being implemented using something akin to
> VMDq last I knew.
>
> So for example one of the things you are asking about below is
> establishing a minimum rate for outgoing Tx packets. In my mind we
> would probably want to use something like mqprio to set that up since
> it is Tx rate limiting and if we were to configure it to happen in
> software it would need to be handled in the Qdisc layer.
>
Configuring min and max rates for outgoing TX packets is already
supported in the ice driver using mqprio. The issue is that dynamically
changing the rates per traffic class/queue_group via mqprio is not
straightforward: the "tc qdisc change" command needs all the rates for
all traffic classes again, even for the TCs whose rates are not being
changed.
For example, here's the sample command to configure min and max rates on
4 TX queue groups:
# tc qdisc add dev $iface root mqprio \
      num_tc 4 \
      map 0 1 2 3 \
      queues 2@0 4@2 8@6 16@14 \
      hw 1 mode channel \
      shaper bw_rlimit \
      min_rate 1Gbit 2Gbit 2Gbit 1Gbit \
      max_rate 4Gbit 5Gbit 5Gbit 10Gbit
Now, changing TX min_rate for TC1 to 20 Gbit:
# tc qdisc change dev $iface root mqprio \
      shaper bw_rlimit min_rate 1Gbit 20Gbit 2Gbit 1Gbit
Although this is not a major concern, I was looking for the simplicity
that something like sysfs provides with tx_maxrate for a queue, so that
when there is a large number of TCs, only the ones being changed need to
be dealt with (if we were to have sysfs rates per queue_group).
> As far as the NAPI pollers attribute that seems like something that
> needs further clarification. Are you limiting the number of busy poll
> instances that can run on a single queue group? Is there a reason for
> doing it per queue group instead of this being something that could be
> enabled on a specific set of queues within the group?
>
Yes, we are trying to limit the number of napi instances for the queues
within a queue-group. Some options we could use:
1. A 'num_pollers' attribute at the queue_group level - The initial idea
was to configure the number of poller threads that would be handling the
queues within the queue_group. As an example, a num_pollers value of 4
on a queue_group consisting of 4 queues would imply that there is a
poller per queue. This could also be changed to something like a single
poller for all 4 queues within the group.
2. A poller bitmap for each queue (both TX and RX) - The main concern
with the queue-level bitmaps is that it would still be nice to have a
higher-level queue-group isolation, so that a poller is not shared among
queues belonging to distinct queue-groups. Also, a queue-group level
config would consolidate the mapping of queues and vectors in the driver
in batches, instead of the driver having to update the queue<->vector
mapping in response to each queue's poller configuration.
But we could do away with having these at queue-group level, and instead
use a different method as the third option below:
3. A queues bitmap per napi instance - The default arrangement today is
a 1:1 mapping between queues and interrupt vectors and hence a 1:1
queue<->poller association. If the user could configure one interrupt
vector to serve different queues, those queues could be serviced by the
poller/napi instance for that vector.
One way to do this is to have a bitmap of queues for each IRQ allocated
for the device (similar to the smp_affinity CPU bitmap for a given IRQ).
/sys/class/net/<iface>/device/msi_irqs/ lists all the IRQs associated
with the network interface. If the IRQ can take an additional attribute
like 'queues_affinity' for the IRQs on the network device (use
/sys/class/net/<iface>/device/msi_irqs/N/ since queues_affinity would be
specific to the network subsystem), this would enable a multiple queues
<-> single vector association configurable by the user. The driver would
validate that a queue is not mapped to multiple interrupts. This way an
interrupt can be shared among different queues as configured by the user.
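As a concrete sketch of this option (the 'queues_affinity' attribute and
its path are part of the proposal, not an existing interface), a user
could build the queue bitmap the same way a cpumask is built for
smp_affinity:

```shell
# Hypothetical: attach queues 0-3 of the device to the vector behind
# IRQ 42. queues_affinity does not exist today; it is the proposed
# network-subsystem analogue of /proc/irq/N/smp_affinity.
IFACE=eth0
IRQ=42

# Build a hex bitmap with bits 0..3 set (queues 0-3) -> "f".
MASK=0
for q in 0 1 2 3; do
	MASK=$(( MASK | (1 << q) ))
done
HEXMASK=$(printf '%x' "$MASK")
echo "$HEXMASK"   # prints "f"

# Proposed write (path is hypothetical):
# echo "$HEXMASK" > /sys/class/net/$IFACE/device/msi_irqs/$IRQ/queues_affinity
```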
Another approach is to expose the napi-ids via sysfs and support
per-napi attributes:
/sys/class/net/<iface>/napis/napi<0-N>
Each numbered sub-directory N contains the attributes of that napi. A
'napi_queues' attribute would be a bitmap of the queues associated with
napi-N, enabling a many queues <-> single napi association. For example:
/sys/class/net/<iface>/napis/napi-N/napi_queues
We also plan to introduce an additional attribute for each napi instance
called 'poller_timeout', indicating the timeout in jiffies.
Exposing the napi-ids would also enable moving some existing napi
attributes such as 'threaded' and 'napi_defer_hard_irqs' (which are
currently per netdev) to per-napi-instance scope.
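A usage sketch for the proposed per-napi sysfs attributes (every path
and attribute name here is part of the proposal, not an existing kernel
interface; the HZ value is an assumption for the example):

```shell
# Hypothetical: associate queues 4-7 with napi-3 via the proposed
# napi_queues bitmap (bits 4..7 set -> 0xf0).
IFACE=eth0
# echo f0 > /sys/class/net/$IFACE/napis/napi-3/napi_queues

# The proposed poller_timeout is specified in jiffies; converting a
# 20 ms budget assuming CONFIG_HZ=250 (1 jiffy = 4 ms):
HZ=250
TIMEOUT_MS=20
JIFFIES=$(( TIMEOUT_MS * HZ / 1000 ))
echo "$JIFFIES"   # prints "5"
# echo "$JIFFIES" > /sys/class/net/$IFACE/napis/napi-3/poller_timeout
```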
>> Likewise, receive queue_groups can be created using the ethtool
>> interface as RSS contexts. Next step would be to configure
>> per-rx_queue_group parameters. Based on the discussion in
>> https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
>> it looks like ethtool may not be the right interface to configure
>> rx_queue_group parameters (that are unrelated to flow<->queue
>> assignment), example NAPI configurations on the queue_group.
>>
>> The key gaps in the kernel to support queue-group parameters are:
>> 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
>> 2. Offload hooks for TX/RX queue_group parameters depending on the
>> chosen interface.
>>
>> Following are the options we have investigated:
>>
>> 1. tc-mqprio:
>> Pros:
>> - Already supports creating queue_groups, offload of certain parameters
>>
>> Cons:
>> - Introducing new parameters makes the interface less user-friendly.
>> TC qdisc parameters are specified at qdisc creation; the larger the
>> number of traffic classes and their respective parameters, the lower
>> the usability.
>
> Yes and no. The TC layer is mostly meant for handling the Tx side of
> things. For something like the rate limiting it might make sense since
> there is already logic there to do it in mqprio. But if you are trying
> to pull in NAPI or RSS attributes then I agree it would hurt usability.
>
The TX queue-group parameters supported via mqprio are limited to
priority and min and max rates. I think extending mqprio to a larger set
of TX parameters beyond just rates (say, max burst) would bloat the
command line. But yes, I agree, the TC layer is not the place for NAPI
attributes on TX queues.
>> 2. Ethtool:
>> Pros:
>> - Already creates RX queue_groups as RSS contexts
>>
>> Cons:
>> - May not be the right interface for non-RSS related parameters
>>
>> Example for configuring number of napi pollers for a queue group:
>> ethtool -X <iface> context <context_num> num_pollers <n>
>
> One thing that might make sense would be to look at adding a possible
> alias for context that could work with something like DCB or the queue
> groups use case. I believe that for DCB there is a similar issue where
> the various priorities could have seperate RSS contexts so it might
> make sense to look at applying a similar logic. Also there has been
> talk about trying to do the the round robin on SYN type logic. That
> might make sense to expose as a hfunc type since it would be overriding
> RSS for TCP flows.
>
For the round-robin flow steering of TCP flows (on SYN, by overriding
the RSS hash), the plan was to add a new 'inline_fd' parameter to the
ethtool RSS context. Will look into your suggestion of using an hfunc
type.
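For reference, RSS contexts are already created like this with ethtool;
the 'inline_fd' parameter in the last line is the proposed extension and
does not exist yet:

```shell
# Existing ethtool usage: create a new RSS context spreading flows
# equally over queues 8-15 (prints the new context number).
# ethtool -X eth0 context new start 8 equal 8

# Proposed (hypothetical) extension enabling round-robin-on-SYN
# steering for that context:
# ethtool -X eth0 context 1 inline_fd on
```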
> The num_pollers can be problematic though as we don't really have
> anything like that in ethtool currently. Probably the closest thing I
> can think of is interrupt moderation. It depends on if it has to be a
> per queue group attribute or if it could be a per-queue attribute.
> Specifically I am referring to the -Q option that is currently applied
> to the coalescing functions in ethtool.
>
>> 3. sysfs:
>> Pros:
>> - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
>> - Makes it possible to support some existing per-netdev napi
>> parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
>> per-queue-group parameters.
>>
>> Cons:
>> - Requires introducing new queue_group structures for TX and RX
>> queue groups and references for it, kset references for queue_group in
>> struct net_device
>> - Additional ndo ops in net_device_ops for each parameter for
>> hardware offload.
>>
>> Examples :
>> /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
>> /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
>
> So min_rate is something already handled in mqprio since it is so DCB
> like. You are essentially guaranteeing bandwidth aren't you? Couldn't
> you just define a bw_rlimit shaper for mqprio and then use the existing
> bw_rlimit values to define the min_rate?
>
The ice driver already supports min_rate per queue_group using mqprio. I
was suggesting this in case we happen to have a TX queue_group object,
since dynamically changing rates via mqprio is not handy enough, as I
mentioned above.
> As far as adding the queue_groups interface one ugly bit would be that
> we would probably need to have links between the queues and these
> groups which would start to turn the sysfs into a tangled mess.
>
Agree, maintaining the links between queues and groups is not trivial.
> The biggest issue I see is that there isn't any sort of sysfs interface
> exposed for NAPI which is what you would essentially need to justify
> something like this since that is what you are modifying.
>
Right. Something like /sys/class/net/<iface>/napis/napi<0-N>.
Maybe initially there would be as many napis as queues due to the 1:1
association, but as the queues bitmap is tuned for each napi, only those
napis that have queue[s] associated with them would be exposed.
>> 4. Devlink:
>> Pros:
>> - New parameters can be added without any changes to the kernel or
>> userspace.
>>
>> Cons:
>> - Queue/Queue_group is a function-wide entity; Devlink is for
>> device-wide configuration. Devlink being device-centric is not
>> suitable for queue parameters such as rates, NAPI, etc.
>
> Yeah, I wouldn't expect something like this to be a good fit.