Message-ID: <49a53cb9-e04d-4afa-86e8-15b975741e4d@nvidia.com>
Date: Tue, 5 Mar 2024 16:27:50 -0800
From: William Tu <witu@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, jiri@...dia.com, bodong@...dia.com,
tariqt@...dia.com, yossiku@...dia.com
Subject: Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor
eswitch attr
On 3/4/24 8:37 PM, Jakub Kicinski wrote:
> On Fri, 1 Mar 2024 03:11:18 +0200 William Tu wrote:
>> Add two eswitch attrs: shrdesc_mode and shrdesc_count.
>>
>> 1. shrdesc_mode: to enable a sharing memory buffer for
>> representor's rx buffer,
> Let's narrow down the terminology. "Shared memory buffer"
> and "shared memory pool" and "shrdesc" all refer to the same
> thing. Let's stick to shared pool?
OK, will use "shared pool".
>> and 2. shrdesc_count: to control the
>> number of buffers in this shared memory pool.
> _default_ number of buffers in shared pool used by representors?
>
> If/when the API to configure shared pools becomes real it will
> presumably take precedence over this default?
Yes, in that case it would take precedence over this default.
>> When using switchdev mode, the representor ports handle the slow path
>> traffic: traffic that can't be offloaded is redirected to the
>> representor port for processing. Memory consumption of the representor
>> ports' rx buffers can grow to several GB when scaling to 1k VF reps.
>> For example, in the mlx5 driver, each RQ, with a typical 1K descriptors,
>> consumes 3MB of DMA memory for packet buffers in WQEs, and with four
>> channels per rep, 1024 reps consume 4 * 3MB * 1024 = 12GB of memory.
>> And since rep ports are for slow path traffic, most of this rx DMA
>> memory is idle.
>>
>> Add shrdesc_mode configuration, allowing multiple representors
>> to share an rx memory buffer pool. When enabled, an individual
>> representor doesn't need to allocate its dedicated rx buffer, but just
>> points its rq to the memory pool. This could make better use of the
>> memory. The shrdesc_count represents the number of rx ring
>> entries, e.g., same meaning as ethtool -g, that are shared across
>> representors. Users adjust it based on the number of reps, total system
>> memory, or performance expectations.
> Can we use bytes as the unit? Like the page pool. Descriptors don't
> mean much to the user.
But what about the unit size? Do we assume unit size = 1 page?
The page pool has:
  order: 2^order pages on allocation
  pool_size: size of the ptr_ring
How about we assume order is 0 and let the user set pool_size (the number
of page-size entries)?
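
To make the unit question concrete, here is a rough sketch of the
arithmetic relating pool_size, order, and bytes. It assumes 4 KiB pages
(the real page size depends on the architecture), and the helper names
are made up for illustration; they are not part of the page pool API.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages
MiB = 1024 * 1024


def pool_bytes(pool_size: int, order: int = 0, page_size: int = PAGE_SIZE) -> int:
    """Bytes covered by pool_size entries of 2^order pages each."""
    return pool_size * (page_size << order)


def pool_size_for_bytes(budget: int, order: int = 0, page_size: int = PAGE_SIZE) -> int:
    """Largest number of 2^order-page entries that fits in a byte budget."""
    return budget // (page_size << order)


# Dedicated buffers, using the figures from the commit message:
# 4 channels * 3MB per RQ * 1024 reps.
dedicated_bytes = 4 * 3 * MiB * 1024

# A shared pool of 1024 order-0 entries, for comparison.
shared_bytes = pool_bytes(1024)
```

With order fixed at 0, a byte value from the user and a pool_size value
differ only by a factor of the page size, so either unit could be exposed
and converted to the other.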
>
>> The two params are also useful for other vendors such as Intel ICE
>> drivers and Broadcom's driver, which also have representor ports for
>> slow path traffic.
>>
>> An example use case:
>> $ devlink dev eswitch show pci/0000:08:00.0
>> pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic \
>> shrdesc-mode none shrdesc-count 0
>> $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
>> shrdesc-mode basic shrdesc-count 1024
>> $ devlink dev eswitch show pci/0000:08:00.0
>> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic \
>> shrdesc-mode basic shrdesc-count 1024
>>
>> Note that new configurations are set at legacy mode, and enabled at
>> switchdev mode.
>> Documentation/netlink/specs/devlink.yaml | 17 ++++++++++
>> include/net/devlink.h | 8 +++++
>> include/uapi/linux/devlink.h | 7 ++++
>> net/devlink/dev.c | 43 ++++++++++++++++++++++++
>> net/devlink/netlink_gen.c | 6 ++--
>> 5 files changed, 79 insertions(+), 2 deletions(-)
> ENODOCS
Will add docs in the next version, thanks.
>> diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
>> index cf6eaa0da821..58f31d99b8b3 100644
>> --- a/Documentation/netlink/specs/devlink.yaml
>> +++ b/Documentation/netlink/specs/devlink.yaml
>> @@ -119,6 +119,14 @@ definitions:
>> name: none
>> -
>> name: basic
>> + -
>> + type: enum
>> + name: eswitch-shrdesc-mode
>> + entries:
>> + -
>> + name: none
>> + -
>> + name: basic
> Do we need this knob?
> Can we not assume that shared-pool-count == 0 means disabled?
Do you mean assume or not assume?
I guess you mean assume, so we use "shared-pool-count == 0" to indicate
that the shared pool is disabled?
That also works, and then we only need to introduce one attribute.
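
A minimal sketch of that single-attribute semantics, just to pin down
the intended behavior. The class, the attribute name shared_pool_count,
and the conversion helper are hypothetical and only model the idea; they
are not devlink code.

```python
from dataclasses import dataclass


@dataclass
class EswitchConfig:
    # Hypothetical single attribute: 0 means the shared pool is disabled.
    shared_pool_count: int = 0

    @property
    def shared_pool_enabled(self) -> bool:
        return self.shared_pool_count > 0


def from_mode_and_count(mode: str, count: int) -> EswitchConfig:
    """Fold the v2 (shrdesc-mode, shrdesc-count) pair into one value."""
    if mode not in ("none", "basic"):
        raise ValueError(f"unknown shrdesc mode: {mode!r}")
    return EswitchConfig(shared_pool_count=count if mode == "basic" else 0)
```

Under this model, mode "none" with any count collapses to 0, and mode
"basic" with a non-zero count keeps the count, so no information that
the two-attribute form could express usefully is lost.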
> We can always add the knob later if needed, right now it's
> just on / off with some less direct names.
>
>> -
>> type: enum
>> name: dpipe-header-id
>> @@ -429,6 +437,13 @@ attribute-sets:
>> name: eswitch-encap-mode
>> type: u8
>> enum: eswitch-encap-mode
>> + -
>> + name: eswitch-shrdesc-mode
>> + type: u8
> u32, netlink rounds sizes up to 4B, anyway
ok, thanks!
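
For reference, a small sketch of why u8 saves nothing on the wire. It
mirrors the kernel's NLA_ALIGN() and nla_total_size() arithmetic in
Python, assuming the usual 4-byte attribute header and 4-byte alignment.

```python
NLA_ALIGNTO = 4  # netlink attributes are padded to 4-byte boundaries
NLA_HDRLEN = 4   # struct nlattr: u16 nla_len + u16 nla_type


def nla_align(length: int) -> int:
    """Round length up to the next multiple of NLA_ALIGNTO."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_total_size(payload: int) -> int:
    """Bytes one attribute occupies in a message: header + payload + padding."""
    return nla_align(NLA_HDRLEN + payload)


u8_attr_size = nla_total_size(1)
u32_attr_size = nla_total_size(4)
```

Both a u8 and a u32 attribute come out at 8 bytes, so the wider type
costs no extra space in the message.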
>
>> + enum: eswitch-shrdesc-mode
>> + -
>> + name: eswitch-shrdesc-count
>> + type: u32
>> -
>> name: resource-list
>> type: nest