netdev - Re: Kernel interface to configure queue-group parameters

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a9c89f21-6a21-e52f-fcbd-b3b8e7f623d6@intel.com>
Date:   Fri, 24 Feb 2023 01:17:33 -0800
From:   "Nambiar, Amritha" <amritha.nambiar@...el.com>
To:     Alexander H Duyck <alexander.duyck@...il.com>,
        <netdev@...r.kernel.org>
CC:     <davem@...emloft.net>, <kuba@...nel.org>, <edumazet@...gle.com>,
        <pabeni@...hat.com>, Saeed Mahameed <saeed@...nel.org>,
        "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
Subject: Re: Kernel interface to configure queue-group parameters

On 2/19/2023 9:39 AM, Alexander H Duyck wrote:
> On Thu, 2023-02-16 at 02:35 -0800, Nambiar, Amritha wrote:
>> On 2/7/2023 8:28 AM, Alexander H Duyck wrote:
>>> On Mon, 2023-02-06 at 16:15 -0800, Nambiar, Amritha wrote:
>>>> Hello,
>>>>
>>>> We are looking for feedback on the kernel interface to configure
>>>> queue-group level parameters.
>>>>
>>>> Queues are primary residents in the kernel and there are multiple
>>>> interfaces to configure queue-level parameters. For example, tx_maxrate
>>>> for a transmit queue can be controlled via the sysfs interface. Ethtool
>>>> is another option to change the RX/TX ring parameters of the specified
>>>> network device (example, rx-buf-len, tx-push etc.).
>>>>
>>>> Queue_groups are a set of queues grouped together into a single object.
>>>> For example, tx_queue_group-0 is a transmit queue_group with index 0 and
>>>> can have transmit queues say 0-31, similarly rx_queue_group-0 is a
>>>> receive queue_group with index 0 and can have receive queues 0-31,
>>>> tx/rx_queue_group_1 may consist of TX and RX queues say 32-127
>>>> respectively. Currently, upstream drivers for both ice and mlx5 support
>>>> creating TX and RX queue groups via the tc-mqprio and ethtool interfaces.
>>>>
>>>> At this point, the kernel does not have an abstraction for queue_group.
>>>> A close equivalent in the kernel is a 'traffic class' which consists of
>>>> a set of transmit queues. Today, traffic classes are created using TC's
>>>> mqprio scheduler. Only a limited set of parameters can be configured on
>>>> each traffic class via mqprio, example priority per traffic class, min
>>>> and max bandwidth rates per traffic class etc. Mqprio also supports
>>>> offload of these parameters to the hardware. The parameters set for the
>>>> traffic class (tx queue_group) is applicable to all transmit queues
>>>> belonging to the queue_group. However, introducing additional parameters
>>>> for queue_groups and configuring them via mqprio makes the interface
>>>> less user-friendly (as the command line gets cumbersome due to the
>>>> number of qdisc parameters). Although, mqprio is the interface to create
>>>> transmit queue_groups, and is also the interface to configure and
>>>> offload certain transmit queue_group parameters, due to these
>>>> limitations we are wondering if it is worth considering other interface
>>>> options for configuring queue_group parameters.
>>>>
>>>
>>> I think much of this depends on exactly what functionality we are
>>> talking about. The problem is the Intel use case conflates interrupts
>>> w/ queues w/ the applications themselves since what it is trying to do
>>> is a poor imitation of RDMA being implemented using something akin to
>>> VMDq last I knew.
>>>
>>> So for example one of the things you are asking about below is
>>> establishing a minimum rate for outgoing Tx packets. In my mind we
>>> would probably want to use something like mqprio to set that up since
>>> it is Tx rate limiting and if we were to configure it to happen in
>>> software it would need to be handled in the Qdisc layer.
>>>
>>
>> Configuring min and max rates for outgoing TX packets are already
>> supported in ice driver using mqprio. The issue is that dynamically
>> changing the rates per traffic class/queue_group via mqprio is not
>> straightforward, the "tc qdisc change" command will need all the rates
>> for traffic classes again, even for the tcs where rates are not being
>> changed.
>> For example, here's the sample command to configure min and max rates on
>> 4 TX queue groups:
>>
>> # tc qdisc add dev $iface root mqprio \
>>                         num_tc 4        \
>>                         map 0 1 2 3     \
>>                         queues 2@0 4@2 8@6 16@14  \
>>                         hw 1 mode channel \
>>                         shaper bw_rlimit \
>>                         min_rate 1Gbit 2Gbit 2Gbit 1Gbit\
>>                         max_rate 4Gbit 5Gbit 5Gbit 10Gbit
>>
>> Now, changing TX min_rate for TC1 to 20 Gbit:
>>
>> # tc qdisc change dev $iface root mqprio \
>>     shaper bw_rlimit min_rate 1Gbit 20Gbit 2Gbit 1Gbit
>>
>> Although this is not a major concern, I was looking for the simplicity
>> that something like sysfs provides with tx_maxrate for a queue, so that
>> when there are large number of tcs, just the ones that are being changed
>> needs to be dealt with (if we were to have sysfs rates per queue_group).
> 
> So it sounds like there is an interface already, you may just not like
> having to work with it due to the userspace tooling. Perhaps the
> solution would be to look at fixing things up so that the tooling would
> allow you to make changes to individual values. I haven't looked into
> the interface much but is there any way to retrieve the current
> settings from the Qdisc? If so you might be able to just update tc so
> that it would allow incremental updates and fill in the gaps with the
> config it already has.
> 
>>> As far as the NAPI pollers attribute that seems like something that
>>> needs further clarification. Are you limiting the number of busy poll
>>> instances that can run on a single queue group? Is there a reason for
>>> doing it per queue group instead of this being something that could be
>>> enabled on a specific set of queues within the group?
>>>
>>
>> Yes, we are trying to limit the number of napi instances for the queues
>> within a queue-group. Some options we could use:
>>
>> 1. A 'num_poller' attribute on a queue_group level - The initial idea
>> was to configure the number of poller threads that would be handling the
>> queues within the queue_group, as an example, a num_poller value of 4 on
>> a queue_group consisting of 4 queues would imply that there is a poller
>> per queue. This could also be changed to something like a single poller
>> for all 4 queues within the group.
>>
>> 2. A poller bitmap for each queue (both TX and RX) - The main concern
>> with the queue-level maps is that it would still be nice to have a
>> higher level queue-group isolation, so that a poller is not shared among
>> queues belonging to distinct queue-groups. Also, a queue-group level
>> config would consolidate the mapping of queues and vectors in the driver
>> in batches, instead of the driver having to update the queue<->vector
>> mapping in response to each queue's poller configuration.
>>
>> But we could do away with having these at queue-group level, and instead
>> use a different method as the third option below:
>> 3. A queues bitmap per napi instance - So the default arrangement today
>> is 1:1 mapping between queues and interrupt vectors and hence 1:1
>> queue<->poller association. If the user could configure one interrupt
>> vector to serve different queues, these queues can be serviced by the
>> poller/napi instance for the vector.
>> One way to do this is to have a bitmap of queues for each IRQ allocated
>> for the device (similar to smp_affinity CPUs bitmap for the given IRQ).
>> So, /sys/class/net/<iface>/device/msi_irqs/ lists all the IRQs
>> associated with the network interface. If the IRQ can take additional
>> attribute like queues_affinity for the IRQs on the network device (use
>> /sys/class/net/<iface>/device/msi_irqs/N/ since queues_affinity would be
>> specific to the network subsystem), this would enable multiple queues
>> <-> single vector association configurable by the user. The driver would
>> validate that a queue is not mapped multiple interrupts. This way an
>> interrupt can be shared among different queues as configured by the user.
>> Another approach is to expose the napi-ids via sysfs and support
>> per-napi attributes.
>> /sys/class/net/<iface>/napis/napi<0-N>
>> Each numbered sub-directory N contains attributes of that napi. A
>> 'napi_queues' attribute would be a bitmap of queues associated with the
>> napi-N enabling many queues <-> single napi association. Example,
>> /sys/class/net/<iface>/napis/napi-N/napi_queues
>>
>> We also plan to introduce an additional napi attribute for each napi
>> instance called 'poller_timeout' indicating the timeout in jiffies.
>> Exposing the napi-ids would also enable moving some existing napi
>> attributes such as 'threaded' and 'napi_defer_hard_irqs' etc. (which are
>> currently per netdev) to be per napi instance.
> 
> This one will require more thought and discussion as the NAPI instances
> themselves have been something that was largely hidden and not exposed
> to userspace up until now.
> 
> However with that said I am pretty certain sysfs isn't the way to go.
> 
>>>> Likewise, receive queue_groups can be created using the ethtool
>>>> interface as RSS contexts. Next step would be to configure
>>>> per-rx_queue_group parameters. Based on the discussion in
>>>> https://lore.kernel.org/netdev/20221114091559.7e24c7de@kernel.org/,
>>>> it looks like ethtool may not be the right interface to configure
>>>> rx_queue_group parameters (that are unrelated to flow<->queue
>>>> assignment), example NAPI configurations on the queue_group.
>>>>
>>>> The key gaps in the kernel to support queue-group parameters are:
>>>> 1. 'queue_group' abstraction in the kernel for both TX and RX distinctly
>>>> 2. Offload hooks for TX/RX queue_group parameters depending on the
>>>> chosen interface.
>>>>
>>>> Following are the options we have investigated:
>>>>
>>>> 1. tc-mqprio:
>>>>       Pros:
>>>>       - Already supports creating queue_groups, offload of certain parameters
>>>>
>>>>       Cons:
>>>>       - Introducing new parameters makes the interface less user-friendly.
>>>>     TC qdisc parameters are specified at the qdisc creation, larger the
>>>> number of traffic classes and their respective parameters, lesser the
>>>> usability.
>>>
>>> Yes and no. The TC layer is mostly meant for handling the Tx side of
>>> things. For something like the rate limiting it might make sense since
>>> there is already logic there to do it in mqprio. But if you are trying
>>> to pull in NAPI or RSS attributes then I agree it would hurt usability.
>>>
>>
>> The TX queue-group parameters supported via mqprio are limited to
>> priority, min and max rates. I think extending mqprio for a larger set
>> of TX parameters beyond just rates (say max burst) would bloat up the
>> command line. But yes, I agree, the TC layer is not the place for NAPI
>> attributes on TX queues.
> 
> The problem is you are reinventing the wheel. It sounds like this
> mostly does what you are looking for. If you are going to look at
> extending it then you should do so. Otherwise maybe you need to look at
> putting together a new Qdisc instead of creating an entirely new
> infrastructure since Qdisc is how we would deal with implementing
> something like this in software. We shouldn't be bypassing that as we
> should be implementing an equivilent for what we are wanting to do in
> hardware in the software.
> 
>>>> 2. Ethtool:
>>>>       Pros:
>>>>       - Already creates RX queue_groups as RSS contexts
>>>>
>>>>       Cons:
>>>>       - May not be the right interface for non-RSS related parameters
>>>>
>>>>       Example for configuring number of napi pollers for a queue group:
>>>>       ethtool -X <iface> context <context_num> num_pollers <n>
>>>
>>> One thing that might make sense would be to look at adding a possible
>>> alias for context that could work with something like DCB or the queue
>>> groups use case. I believe that for DCB there is a similar issue where
>>> the various priorities could have seperate RSS contexts so it might
>>> make sense to look at applying a similar logic. Also there has been
>>> talk about trying to do the the round robin on SYN type logic. That
>>> might make sense to expose as a hfunc type since it would be overriding
>>> RSS for TCP flows.
>>>
>>
>> For the round robin flow steering of TCP flows (on SYN by overriding RSS
>> hash), the plan was to add a new 'inline_fd' parameter to ethtool rss
>> context. Will look into your suggestion for using hfunc type.
> 
> It sounds like we are generally thinking in the same area so that is a
> good start there.
> 
>>>
>>>> 3. sysfs:
>>>>       Pros:
>>>>       - Ideal to configure parameters such as NAPI/IRQ for Rx queue_group.
>>>>       - Makes it possible to support some existing per-netdev napi
>>>> parameters like 'threaded' and 'napi_defer_hard_irqs' etc. to be
>>>> per-queue-group parameters.
>>>>
>>>>       Cons:
>>>>       - Requires introducing new queue_group structures for TX and RX
>>>> queue groups and references for it, kset references for queue_group in
>>>> struct net_device
>>>>       - Additional ndo ops in net_device_ops for each parameter for
>>>> hardware offload.
>>>>
>>>>       Examples :
>>>>       /sys/class/net/<iface>/queue_groups/rxqg-<0-n>/num_pollers
>>>>       /sys/class/net/<iface>/queue_groups/txqg-<0-n>/min_rate
>>>
>>> So min_rate is something already handled in mqprio since it is so DCB
>>> like. You are essentially guaranteeing bandwidth aren't you? Couldn't
>>> you just define a bw_rlimit shaper for mqprio and then use the existing
>>> bw_rlimit values to define the min_rate?
>>>
>>
>> The ice driver already supports min_rate per queue_group using mqprio. I
>> was suggesting this in case we happen to have a TX queue_group object,
>> since dynamically changing rates via mqprio was not handy enough as I
>> mentioned above.
> 
> Yeah, but based on the description you are rewriting the kernel side
> because you don't like dealing with the userspace tools. Again maybe
> the solution here would be to look at cleaning up the userspace
> interface to add support for reading/retrieving the existing values and
> then updating instead of requiring a complete update every time.
> 
> What we want to avoid is creating new overhead in the kernel where we
> now have yet another way to control Tx rates as each redundant
> interface added is that much more overhead that has to be dealt with
> throughout the Tx path. If we already have a way to do this with mqprio
> lets support just offloading that into hardware rather than adding yet
> another Tx rate control.
> 
>>> As far as adding the queue_groups interface one ugly bit would be that
>>> we would probably need to have links between the queues and these
>>> groups which would start to turn the sysfs into a tangled mess.
>>>
>> Agree, maintaining the links between queues and groups is not trivial.
>>
>>> The biggest issue I see is that there isn't any sort of sysfs interface
>>> exposed for NAPI which is what you would essentially need to justify
>>> something like this since that is what you are modifying.
>>>
>>
>> Right. Something like /sys/class/net/<iface>/napis/napi<0-N>
>> Maybe, initially there would be as many napis as queues due to 1:1
>> association, but as the queues bitmap is tuned for the napi, only those
>> napis that have queue[s] associated with it would be exposed.
> 
> As Jakub already pointed out adding more sysfs is generally frowned
> upon.

Agree. We do not wish to have another interface to control rates and 
that overhead can be avoided. We can continue to use mqprio and 
dynamically change the rates using "tc qdisc change" command.

For exposing napis, can ethtool netlink be an option as I detailed in 
the reply to Jakub?