Message-ID: <CAHS8izMYF__OsryoH6wyvv8wf57RHWHH8i4z8AggYZVvNqH2TQ@mail.gmail.com>
Date: Wed, 23 Apr 2025 13:02:52 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: davem@...emloft.net, netdev@...r.kernel.org, edumazet@...gle.com,
pabeni@...hat.com, andrew+netdev@...n.ch, horms@...nel.org,
donald.hunter@...il.com, sdf@...ichev.me, dw@...idwei.uk,
asml.silence@...il.com, ap420073@...il.com, jdamato@...tly.com,
dtatulea@...dia.com, michael.chan@...adcom.com
Subject: Re: [RFC net-next 00/22] net: per-queue rx-buf-len configuration
On Mon, Apr 21, 2025 at 3:28 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> Add support for per-queue rx-buf-len configuration.
>
> I'm sending this as RFC because I'd like to ponder the uAPI side
> a little longer but it's good enough for people to work on
> the memory provider side and support in other drivers.
>
Maybe a silly question, but I assume opting into this is optional for
queue API drivers? Or do you need GVE to implement the dependencies of
this soon, because otherwise it blocks your work?
> The direct motivation for the series is that zero-copy Rx queues would
> like to use larger Rx buffers. Most modern high-speed NICs support HW-GRO,
> and can coalesce payloads into pages much larger than the MTU.
> Enabling larger buffers globally is a bit precarious as it exposes us
> to potentially very inefficient memory use. Also allocating large
> buffers may not be easy or cheap under load. Zero-copy queues service
> only select traffic and have pre-allocated memory so the concerns don't
> apply as much.
>
> The per-queue config has to address 3 problems:
> - user API
> - driver API
> - memory provider API
>
> For user API the main question is whether we expose the config via
> ethtool or netdev nl. I picked the latter - via queue GET/SET, rather
> than extending the ethtool RINGS_GET API. I worry slightly that queue
> GET/SET will turn into a monster like SETLINK. OTOH the only per-queue
> setting we have in ethtool which is not going via RINGS_SET is
> IRQ coalescing.
>
> My goal for the driver API was to avoid complexity in the drivers.
> The queue management API has gained two ops, responsible for preparing
> configuration for a given queue, and validating whether the config
> is supported. The validation is used for both NIC-wide and per-queue
> changes. Queue alloc/start ops have a new "config" argument which
> contains the current config for a given queue (we use queue restart
> to apply per-queue settings). Outside of queue reset paths drivers
> can call netdev_queue_config() which returns the config for an arbitrary
> queue. Long story short, I anticipate it being used during ndo_open.
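Just to check my understanding of the driver-facing side, is the rough
shape something like the below? I haven't gone through the patches in
detail yet; the struct layout, the op names, and the
netdev_queue_config() signature here are my guesses from the cover
letter, not what the series actually adds:

/* All names and signatures below are guesses, not the actual series. */
#include <linux/netdevice.h>

struct netdev_queue_config {			/* hypothetical */
	u32	rx_buf_len;
};

/* the two new queue-mgmt ops as I read them (names made up): */
struct example_queue_cfg_ops {
	/* prepare the configuration for queue @idx */
	void	(*ndo_queue_cfg_prepare)(struct net_device *dev, int idx,
					 struct netdev_queue_config *qcfg);
	/* reject unsupported settings, NIC-wide or per-queue */
	int	(*ndo_queue_cfg_validate)(struct net_device *dev, int idx,
					  struct netdev_queue_config *qcfg);
};

/* hypothetical core helper: hand back the effective config for @idx */
void netdev_queue_config(struct net_device *dev, int idx,
			 struct netdev_queue_config *qcfg);

/* e.g. a driver consulting the per-queue config outside of the queue
 * restart path, say from ndo_open:
 */
static void example_setup_rx_rings(struct net_device *dev)
{
	struct netdev_queue_config qcfg;
	int i;

	for (i = 0; i < dev->real_num_rx_queues; i++) {
		netdev_queue_config(dev, i, &qcfg);
		/* size the ring / page_pool for queue i from
		 * qcfg.rx_buf_len
		 */
	}
}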
>
> In the core I extended struct netdev_config with per queue settings.
> All in all this isn't too far from what was there in my "queue API
> prototype" a few years ago. One thing I was hoping to support but
> haven't gotten to is providing the settings at the RSS context level.
> Zero-copy users often depend on RSS for load spreading. It'd be more
> convenient for them to provide the settings per RSS context.
> We may be better off converting the QUEUE_SET netlink op to CONFIG_SET
> and accepting multiple "scopes" (queue, rss context)?
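And in the core I'm picturing roughly this shape, again just guessing
from the description; the per-queue member and its name are
hypothetical:

struct netdev_config {
	/* existing device-wide settings (the HDS bits, IIRC) */
	u32	hds_thresh;
	u8	hds_config;

	/* hypothetical: per-queue overrides, one entry per rx queue,
	 * consulted by netdev_queue_config()
	 */
	struct netdev_queue_config *qcfg;
};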
>
> Memory provider API is a bit tricky. Initially I wasn't sure whether
> the buffer size should be an MP attribute or a device attribute.
> IOW whether it's the device that should be telling the MP what page
> size it wants, or the MP telling the device what page size it has.
I think it needs to be the former. Memory providers will have wildly
differing restrictions with regard to size. I think the dmabuf mp can
already allocate a net_iov of any byte size, and the io_uring mp can
allocate a net_iov of any multiple of PAGE_SIZE. A uniform interface
for MPs to communicate their restrictions to the driver seems difficult
to define. Better for the driver to ask the pp/mp for what it wants,
and the mp can complain if it doesn't support it.

Also this mirrors what we do today with the page_pool_params.order arg
IIRC. You probably want to piggyback off that or rework it.
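Very rough sketch of what I mean: the driver states the size it wants
via page_pool_params at pool creation time (like it does today with
.order), and pool/provider init just fails if the mp bound to that
queue can't supply buffers of that size. The rejection behavior is
hand-wavy and the helper name is made up, error handling trimmed:

#include <linux/mm.h>
#include <net/page_pool/types.h>

static struct page_pool *example_create_pool(struct net_device *dev,
					     int qidx,
					     unsigned int rx_buf_len)
{
	struct page_pool_params pp = {
		.order		= get_order(rx_buf_len), /* what the driver wants */
		.pool_size	= 1024,
		.netdev		= dev,
		.queue_idx	= qidx,
	};

	/* if a memory provider is bound to this queue and can't do
	 * buffers of this size, creation should fail right here
	 */
	return page_pool_create(&pp);
}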
--
Thanks,
Mina