Date: Fri, 7 Jul 2023 11:39:23 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: netdev@...r.kernel.org
Cc: almasrymina@...gle.com,
hawk@...nel.org,
ilias.apalodimas@...aro.org,
edumazet@...gle.com,
dsahern@...il.com,
michael.chan@...adcom.com,
willemb@...gle.com,
Jakub Kicinski <kuba@...nel.org>
Subject: [RFC 00/12] net: huge page backed page_pool
Hi!
This is an "early PoC" at best. It seems to work for a basic
traffic test but there's no uAPI and a lot more general polish
is needed.
The problem we're seeing is that performance of some older NICs
degrades quite a bit when IOMMU is used (in non-passthru mode).
There is a long tail of old NICs deployed, especially in PoPs /
on the edge. From a conversation I had with Eric a few months
ago it sounded like others may have similar issues. So I thought
I'd take a swing at getting page pool to feed drivers huge pages.
1G pages require hooking into early init via CMA but it works
just fine.
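For reference, the early-init reservation boils down to a boot-time
carve-out; on the kernel command line it would look roughly like this
(sizes are purely illustrative, not what the series hard-codes):

```
# illustrative kernel command line - sizes made up
cma=4G                       # reserve a CMA region at early init
hugepagesz=1G hugepages=4    # alternative: preallocate 1G hugetlb pages
```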
I haven't tested this with a real workload, because I'm still
waiting to get my hands on the right machine. But the experiment
with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
In terms of the missing parts - uAPI is definitely needed.
The rough plan would be to add memory config via the netdev
genl family. Should fit nicely there. Have the config stored
in struct net_device. When a page pool is created, get to the
netdev and automatically select the provider without the driver
even knowing. Two problems with that:
 1) if the driver follows the recommended flow of allocating new
    queues before freeing old ones, we will have page pools
    created before the old ones are gone, which means we'd need
    to reserve 2x the number of 1G pages;
 2) there's no callback to the driver to say "I did something
    behind your back, don't worry about it, but recreate your
    queues, please", so the change will not take effect until
    some unrelated reconfiguration, like installing XDP. That
    may be fine in practice but is a bit odd.
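To make the plan concrete, a netdev genl op for this could be sketched
in the YAML netlink spec format - note the op and attribute names below
are hypothetical, only the file format follows
Documentation/netlink/specs/netdev.yaml:

```
# hypothetical spec fragment - op/attr names are made up
name: netdev
operations:
  list:
    -
      name: mem-provider-set
      doc: Configure the page pool memory provider for a netdev.
      do:
        request:
          attributes:
            - ifindex
            - mem-provider    # e.g. basic / huge / huge-1g
```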
Then we get into hand-wavy stuff like - if we can link page
pools to netdevs, we should also be able to export the page pool
stats via the netdev family instead of doing it the ethtool -S..
ekhm.. "way". And if we start storing configs behind the driver's
back, why don't we also store other params, like ring size and
queue count... A lot of potential improvements as we iron out
a new API...
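For context, this is roughly what driver-side page pool setup looks
like today with the existing page_pool API; the point of the plan
above is that the memory provider would be chosen from the netdev
config at this call, with no new fields for the driver to fill in
(kernel-side sketch, not standalone code):

```
/* Current driver-side setup - provider selection would hook in
 * transparently inside page_pool_create().
 */
struct page_pool_params pp = {
	.order		= 0,
	.pool_size	= 1024,
	.nid		= NUMA_NO_NODE,
	.dev		= &pdev->dev,	/* device used for DMA mapping */
	.dma_dir	= DMA_FROM_DEVICE,
	.flags		= PP_FLAG_DMA_MAP,
};
struct page_pool *pool = page_pool_create(&pp);
```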
Live tree: https://github.com/kuba-moo/linux/tree/pp-providers
Jakub Kicinski (12):
net: hack together some page sharing
net: create a 1G-huge-page-backed allocator
net: page_pool: hide page_pool_release_page()
net: page_pool: merge page_pool_release_page() with
page_pool_return_page()
net: page_pool: factor out releasing DMA from releasing the page
net: page_pool: create hooks for custom page providers
net: page_pool: add huge page backed memory providers
eth: bnxt: let the page pool manage the DMA mapping
eth: bnxt: use the page pool for data pages
eth: bnxt: make sure we mark for recycle skbs before freeing them
eth: bnxt: wrap coherent allocations into helpers
eth: bnxt: hack in the use of MEP
Documentation/networking/page_pool.rst | 10 +-
arch/x86/kernel/setup.c | 6 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 154 +++--
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 5 +
drivers/net/ethernet/engleder/tsnep_main.c | 2 +-
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 4 +-
include/net/dcalloc.h | 28 +
include/net/page_pool.h | 36 +-
net/core/Makefile | 2 +-
net/core/dcalloc.c | 615 +++++++++++++++++
net/core/dcalloc.h | 96 +++
net/core/page_pool.c | 625 +++++++++++++++++-
12 files changed, 1478 insertions(+), 105 deletions(-)
create mode 100644 include/net/dcalloc.h
create mode 100644 net/core/dcalloc.c
create mode 100644 net/core/dcalloc.h
--
2.41.0