[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20241011184527.16393-1-jdamato@fastly.com>
Date: Fri, 11 Oct 2024 18:44:55 +0000
From: Joe Damato <jdamato@...tly.com>
To: netdev@...r.kernel.org
Cc: mkarsten@...terloo.ca,
skhawaja@...gle.com,
sdf@...ichev.me,
bjorn@...osinc.com,
amritha.nambiar@...el.com,
sridhar.samudrala@...el.com,
willemdebruijn.kernel@...il.com,
edumazet@...gle.com,
Joe Damato <jdamato@...tly.com>,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Breno Leitao <leitao@...ian.org>,
Daniel Jurgens <danielj@...dia.com>,
David Ahern <dsahern@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Donald Hunter <donald.hunter@...il.com>,
Jakub Kicinski <kuba@...nel.org>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Jiri Pirko <jiri@...nulli.us>,
Johannes Berg <johannes.berg@...el.com>,
Jonathan Corbet <corbet@....net>,
Leon Romanovsky <leon@...nel.org>,
linux-doc@...r.kernel.org (open list:DOCUMENTATION),
linux-kernel@...r.kernel.org (open list),
linux-rdma@...r.kernel.org (open list:MELLANOX MLX4 core VPI driver),
Lorenzo Bianconi <lorenzo@...nel.org>,
Michael Chan <michael.chan@...adcom.com>,
Mina Almasry <almasrymina@...gle.com>,
Paolo Abeni <pabeni@...hat.com>,
Saeed Mahameed <saeedm@...dia.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Tariq Toukan <tariqt@...dia.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: [net-next v6 0/9] Add support for per-NAPI config via netlink
Greetings:
Welcome to v6. Minor changes from v5 [1], please see changelog below.
There were no explicit comments from reviewers on the call outs in my
v5, so I'm retaining them from my previous cover letter just in case :)
A few important call outs for reviewers:
1. This revision seems to work (see below for a full walk through). I
think this is the behavior we talked about, but please let me know if
a use case is missing.
2. Re a previous point made by Stanislav regarding "taking over a NAPI
ID" when the channel count changes: mlx5 seems to call napi_disable
followed by netif_napi_del for the old queues and then calls
napi_enable for the new ones. In this RFC, the NAPI ID generation is
deferred to napi_enable. This means we won't end up with two of the
same NAPI IDs added to the hash at the same time.
Can we assume all drivers will napi_disable the old queues before
napi_enable the new ones?
- If yes: we might not need to worry about a NAPI ID takeover
function.
- If no: I'll need to make a change so that the NAPI ID generation
is deferred only for drivers which have opted into the config
space via calls to netif_napi_add_config
3. I made the decision to remove the WARN_ON_ONCE that (I think?)
Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs
!= rxqs);) because this was triggering on every kernel boot with my
mlx5 NIC.
4. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
thinking this is a bit strange. I think it's strange that we might
be short some number of NAPI configs, but it seems like most people
are in favor of this approach, so I've left it.
I'd appreciate thoughts from reviewers on the above items, if at all
possible.
Now, on to the implementation.
Firstly, this implementation moves certain settings to napi_struct so that
they are "per-NAPI", while taking care to respect existing sysfs
parameters which are interface wide and affect all NAPIs:
- NAPI ID
- gro_flush_timeout
- defer_hard_irqs
Furthermore:
- NAPI ID generation and addition to the hash is now deferred to
napi_enable, instead of during netif_napi_add
- NAPIs are removed from the hash during napi_disable, instead of
netif_napi_del.
- An array of "struct napi_config" is allocated in net_device.
IMPORTANT: The above changes affect all network drivers.
Optionally, drivers may opt-in to using their config space by calling
netif_napi_add_config instead of netif_napi_add.
If a driver does this, the NAPI being added is linked with an allocated
"struct napi_config" and the per-NAPI settings (including NAPI ID) are
persisted even as hardware queues are destroyed and recreated.
To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
- mlx5 which is the basis of the examples below
- mlx4 which has TX only NAPIs, just to highlight that case. I have
only compile tested this patch; I don't have this hardware.
- bnxt which I have only compiled tested. I don't have this
hardware.
NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)
Here's how it works when I test it on my mlx5 system:
# start with 2 queues
$ ethtool -l eth4 | grep Combined | tail -1
Combined: 2
First, output the current NAPI settings:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now, set the global sysfs parameters:
$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'
Output current NAPI settings again:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now set NAPI ID 345, via its NAPI ID to specific values:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set \
--json='{"id": 345,
"defer-hard-irqs": 111,
"gro-flush-timeout": 11111}'
None
Now output current NAPI settings again to ensure only NAPI ID 345
changed:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 11111,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now, increase gro-flush-timeout only:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 345,
"gro-flush-timeout": 44444}'
None
Now output the current NAPI settings once more:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 44444,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now set NAPI ID 345 to have gro_flush_timeout of 0:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 345,
"gro-flush-timeout": 0}'
None
Check that NAPI ID 345 has a value of 0:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Change the queue count, ensuring that NAPI ID 345 retains its settings:
$ sudo ethtool -L eth4 combined 4
Check that the new queues have the system wide settings but that NAPI ID
345 remains unchanged:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 347,
'ifindex': 7,
'irq': 529},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now reduce the queue count below where NAPI ID 345 is indexed:
$ sudo ethtool -L eth4 combined 1
Check the output:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:
$ sudo ethtool -L eth4 combined 2
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:
$ sudo ethtool -L eth4 combined 8
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[...]
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Last, but not least, let's try writing the sysfs parameters to ensure
all NAPIs are rewritten:
$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'
Check that worked:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[...]
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 344,
'ifindex': 7,
'irq': 327}]
Thanks,
Joe
[1]: https://lore.kernel.org/netdev/20241009005525.13651-1-jdamato@fastly.com/
v6:
- Added Jakub's Reviewed-by to all commit messages
- Added Eric's Reviewed-by to the commits for which he provided his
review (i.e. all patches except 4, 5, and 6)
- Updated the netlink doc for gro_flush_timeout in patch 4
- Added WARN_ON_ONCE to napi_hash_add_with_id as suggested by Eric
Dumazet to patch 5
v5: https://lore.kernel.org/netdev/20241009005525.13651-1-jdamato@fastly.com/
- Converted from an RFC to a PATCH
- Moved defer_hard_irqs above control-fields comment in napi_struct in
patch 1
- Moved gro_flush_timeout above control-fields comment in napi_struct in
patch 3
- Removed unnecessary idpf changes from patch 3
- Removed unnecessary kdoc in patch 5 for a parameter removed in an
earlier revision
- Removed unnecessary NULL check in patch 5
- Used tooling to autogenerate code for patch 6, which fixed the type and
range of NETDEV_A_NAPI_DEFER_HARD_IRQ.
rfcv4: https://lore.kernel.org/lkml/20241001235302.57609-1-jdamato@fastly.com/
- Updated commit messages of most patches
- Renamed netif_napi_add_storage to netif_napi_add_config in patch 5
- Added a NULL check in netdev_set_defer_hard_irqs and
netdev_set_gro_flush_timeout for netdev->napi_config in patch 5
- Removed the WARN_ON_ONCE suggested in an earlier revision
in alloc_netdev_mqs from patch 5; it triggers every time on my mlx5
machine at boot and needlessly spams the log
- Added a locking adjustment suggested by Stanislav to patch 6 to
protect napi_id in patch 5
- Removed napi_hash_del from netif_napi_del in patch 5. netif_napi_del
calls __netif_napi_del which itself calls napi_hash_del. The
original code thus resulted in two napi_hash_del calls, which is
incorrect.
- Removed the napi_hash_add from netif_napi_add_weight in patch 5.
NAPIs are added to the hash when napi_enable is called, instead.
- Moved the napi_restore_config to the top of napi_enable in patch 5.
- Simplified the logic in __netif_napi_del and removed napi_hash_del.
NAPIs are removed in napi_disable.
- Fixed merge conflicts in patch 6 so it applies cleanly
rfcv3: https://lore.kernel.org/netdev/20240912100738.16567-8-jdamato@fastly.com/T/
- Renamed napi_storage to napi_config
- Reordered patches
- Added defer_hard_irqs and gro_flush_timeout to napi_struct
- Attempt to save and restore settings on napi_disable/napi_enable
- Removed weight as a parameter to netif_napi_add_storage
- Updated driver patches to no longer pass in weight
rfcv2: https://lore.kernel.org/netdev/20240908160702.56618-1-jdamato@fastly.com/
- Almost total rewrite from v1
Joe Damato (9):
net: napi: Make napi_defer_hard_irqs per-NAPI
netdev-genl: Dump napi_defer_hard_irqs
net: napi: Make gro_flush_timeout per-NAPI
netdev-genl: Dump gro_flush_timeout
net: napi: Add napi_config
netdev-genl: Support setting per-NAPI config values
bnxt: Add support for persistent NAPI config
mlx5: Add support for persistent NAPI config
mlx4: Add support for persistent NAPI config to RX CQs
Documentation/netlink/specs/netdev.yaml | 28 ++++++
.../networking/net_cachelines/net_device.rst | 3 +
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +-
drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
include/linux/netdevice.h | 42 +++++++-
include/uapi/linux/netdev.h | 3 +
net/core/dev.c | 96 ++++++++++++++++---
net/core/dev.h | 88 +++++++++++++++++
net/core/net-sysfs.c | 4 +-
net/core/netdev-genl-gen.c | 18 ++++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 57 +++++++++++
tools/include/uapi/linux/netdev.h | 3 +
14 files changed, 326 insertions(+), 25 deletions(-)
base-commit: d677aebd663ddc287f2b2bda098474694a0ca875
--
2.25.1
Powered by blists - more mailing lists