Date: Fri, 7 Jul 2023 12:45:26 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, hawk@...nel.org, ilias.apalodimas@...aro.org, 
	edumazet@...gle.com, dsahern@...il.com, michael.chan@...adcom.com, 
	willemb@...gle.com
Subject: Re: [RFC 00/12] net: huge page backed page_pool

On Fri, Jul 7, 2023 at 11:39 AM Jakub Kicinski <kuba@...nel.org> wrote:
>
> Hi!
>
> This is an "early PoC" at best. It seems to work for a basic
> traffic test but there's no uAPI and a lot more general polish
> is needed.
>
> The problem we're seeing is that performance of some older NICs
> degrades quite a bit when IOMMU is used (in non-passthru mode).
> There is a long tail of old NICs deployed, especially in
> PoPs / on the edge. From a conversation I had with Eric a few months
> ago it sounded like others may have similar issues. So I thought
> I'd take a swing at getting page pool to feed drivers huge pages.
> 1G pages require hooking into early init via CMA but it works
> just fine.
>
> I haven't tested this with a real workload, because I'm still
> waiting to get my hands on the right machine. But the experiment
> with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
>

Thanks for CCing me, Jakub. I'm working on a proposal for device memory
TCP, and I recently migrated it to sit on top of your pp-provider idea,
so I can share my test results as well. I had my code working on top of
the slightly older API of yours that I found here a few days ago:
https://github.com/kuba-moo/linux/tree/pp-providers

On top of the old API I had all my functionality tests passing and
performance benchmarks hitting ~96.5% line rate (with all data going
straight to device - GPU - memory, which is the point of the
proposal). Of course, once you look at the code you may not like the
approach and I may need to try something else, which is perfectly
fine, but my current implementation is pp-provider based.

I'll look into rebasing my changes on top of this RFC and retesting,
but either way I should be able to share my RFC sometime next week. I
took a quick look at the changes you made here, and I don't think you
changed anything that would break my use case.
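
For context, the provider hook shape my implementation relies on is
roughly the following. I'm paraphrasing from memory of the pp-providers
tree, so treat the names and signatures as illustrative rather than the
final API:

/* Rough sketch of a custom memory provider, paraphrased from my
 * reading of the pp-providers tree; names and signatures are
 * illustrative, not final.
 */
struct pp_memory_provider_ops {
	int	(*init)(struct page_pool *pool);
	void	(*destroy)(struct page_pool *pool);
	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
	bool	(*release_page)(struct page_pool *pool,
				struct page *page);
};

/* For device memory TCP, alloc_pages() hands out pre-mapped device
 * (GPU) memory and release_page() returns it to the device
 * allocator; page_pool's fast paths and the driver stay unchanged.
 */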

> In terms of the missing parts - uAPI is definitely needed.
> The rough plan would be to add memory config via the netdev
> genl family. Should fit nicely there. Have the config stored
> in struct netdevice. When page pool is created get to the netdev
> and automatically select the provider without the driver even
> knowing.

I guess I misunderstood the intent behind the original patches. I
thought you wanted the user to tell the driver what memory provider to
use, and the driver to recreate the page pool with that provider. What
you're saying here sounds much better, and requires fewer changes to
the driver.
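
To make sure I'm reading the flow right, something like the below? All
the field, enum, and ops names here are made up for the example, not
taken from your tree:

/* Hypothetical illustration of provider selection at pool creation;
 * every identifier that is not page_pool core is invented.
 */
static const struct pp_memory_provider_ops *
page_pool_select_provider(struct net_device *netdev)
{
	/* Memory config was set earlier via the netdev genl family
	 * and stashed in struct net_device; the driver creating the
	 * pool never sees it.
	 */
	if (!netdev)
		return NULL;

	switch (netdev->pp_mem_conf) {	/* hypothetical field */
	case NETDEV_PP_MEM_HUGE_1G:
		return &mep_ops;	/* 1G huge page provider */
	case NETDEV_PP_MEM_DEVICE:
		return &devmem_ops;	/* device memory TCP */
	default:
		return NULL;		/* plain page allocator */
	}
}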

>  Two problems with that are - 1) if the driver follows
> the recommended flow of allocating new queues before freeing
> old ones we will have page pools created before the old ones
> are gone, which means we'd need to reserve 2x the number of
> 1G pages; 2) there's no callback to the driver to say "I did
> something behind your back, don't worry about it, but recreate
> your queues, please" so the change will not take effect until
> some unrelated change like installing XDP. Which may be fine
> in practice but is a bit odd.
>

I have the same problem with device memory TCP. I solved it in a
similar way, by doing something else in the driver that triggers
gve_close() & gve_open(). I wonder if the cleanest way to do this is
to call ethtool_ops->reset() or something like that? That was my idea
at least. I haven't tested it, but from reading the code it should do
a gve_close() + gve_open() like I want.
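
Roughly what I have in mind; netdev_restart_queues() is a made-up name
for the glue, while gve_close()/gve_open() and ethtool_ops->reset() are
the real entry points:

/* Sketch only: kick the driver to rebuild its queues (and thus its
 * page pools) after the core changes the memory config. Caller
 * holds rtnl. netdev_restart_queues() does not exist upstream.
 */
static int netdev_restart_queues(struct net_device *dev)
{
	u32 flags = ETH_RESET_ALL;

	/* Preferred: let the driver do a targeted reset. */
	if (dev->ethtool_ops && dev->ethtool_ops->reset)
		return dev->ethtool_ops->reset(dev, &flags);

	/* Fallback: a full down/up also recreates the page pools,
	 * which is what gve_close() + gve_open() amounts to.
	 */
	dev_close(dev);
	return dev_open(dev, NULL);
}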

> Then we get into hand-wavy stuff like - if we can link page
> pools to netdevs, we should also be able to export the page pool
> stats via the netdev family instead of doing it the ethtool -S.. ekhm..
> "way". And if we start storing configs behind driver's back why
> don't we also store other params, like ring size and queue count...
> A lot of potential improvements as we iron out a new API...
>
> Live tree: https://github.com/kuba-moo/linux/tree/pp-providers
>
> Jakub Kicinski (12):
>   net: hack together some page sharing
>   net: create a 1G-huge-page-backed allocator
>   net: page_pool: hide page_pool_release_page()
>   net: page_pool: merge page_pool_release_page() with
>     page_pool_return_page()
>   net: page_pool: factor out releasing DMA from releasing the page
>   net: page_pool: create hooks for custom page providers
>   net: page_pool: add huge page backed memory providers
>   eth: bnxt: let the page pool manage the DMA mapping
>   eth: bnxt: use the page pool for data pages
>   eth: bnxt: make sure we mark for recycle skbs before freeing them
>   eth: bnxt: wrap coherent allocations into helpers
>   eth: bnxt: hack in the use of MEP
>
>  Documentation/networking/page_pool.rst        |  10 +-
>  arch/x86/kernel/setup.c                       |   6 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 154 +++--
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +
>  drivers/net/ethernet/engleder/tsnep_main.c    |   2 +-
>  .../net/ethernet/stmicro/stmmac/stmmac_main.c |   4 +-
>  include/net/dcalloc.h                         |  28 +
>  include/net/page_pool.h                       |  36 +-
>  net/core/Makefile                             |   2 +-
>  net/core/dcalloc.c                            | 615 +++++++++++++++++
>  net/core/dcalloc.h                            |  96 +++
>  net/core/page_pool.c                          | 625 +++++++++++++++++-
>  12 files changed, 1478 insertions(+), 105 deletions(-)
>  create mode 100644 include/net/dcalloc.h
>  create mode 100644 net/core/dcalloc.c
>  create mode 100644 net/core/dcalloc.h
>
> --
> 2.41.0
>


-- 
Thanks,
Mina
