netdev - Re: [RFC 00/12] net: huge page backed page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20230707154503.57cc834e@kernel.org>
Date: Fri, 7 Jul 2023 15:45:03 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Mina Almasry <almasrymina@...gle.com>
Cc: netdev@...r.kernel.org, hawk@...nel.org, ilias.apalodimas@...aro.org,
 edumazet@...gle.com, dsahern@...il.com, michael.chan@...adcom.com,
 willemb@...gle.com
Subject: Re: [RFC 00/12] net: huge page backed page_pool

On Fri, 7 Jul 2023 12:45:26 -0700 Mina Almasry wrote:
> > This is an "early PoC" at best. It seems to work for a basic
> > traffic test but there's no uAPI and a lot more general polish
> > is needed.
> >
> > The problem we're seeing is that performance of some older NICs
> > degrades quite a bit when IOMMU is used (in non-passthru mode).
> > There is a long tail of old NICs deployed, especially in PoPs/
> > /on edge. From a conversation I had with Eric a few months
> > ago it sounded like others may have similar issues. So I thought
> > I'd take a swing at getting page pool to feed drivers huge pages.
> > 1G pages require hooking into early init via CMA but it works
> > just fine.
> >
> > I haven't tested this with a real workload, because I'm still
> > waiting to get my hands on the right machine. But the experiment
> > with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
> 
> Thanks for CCing me Jakub. I'm working on a proposal for device memory
> TCP, and I recently migrated it to be on top of your pp-provider idea
> and I think I can share my test results as well. I had my code working
> on top of your slightly older API I found here a few days ago:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> On top of the old API I had something with all my functionality tests
> passing and performance benchmarking hitting ~96.5% line rate (with
> all data going straight to the device - GPU - memory, which is the
> point of the proposal). Of course, when you look at the code you may
> not like the approach and I may need to try something else, which is
> perfectly fine, but my current implementation is pp-provider based.
> 
> I'll look into rebasing my changes on top of this RFC and retesting,
> but I should share my RFC either way sometime next week maybe. I took
> a quick look at the changes you made here, and I don't think you
> changed anything that would break my use case.

Oh, sorry I didn't realize you were working on top of my changes
already. Yes, the memory provider API should not have changed much.
I mostly reshuffled the MEP code to have both a coherent and
non-coherent buddy allocator since then.

> > In terms of the missing parts - uAPI is definitely needed.
> > The rough plan would be to add memory config via the netdev
> > genl family. Should fit nicely there. Have the config stored
> > in struct netdevice. When page pool is created get to the netdev
> > and automatically select the provider without the driver even
> > knowing.  
> 
> I guess I misunderstood the intent behind the original patches. I
> thought you wanted the user to tell the driver what memory provider to
> use, and the driver to recreate the page pool with that provider. What
> you're saying here sounds much better, and less changes to the driver.
> 
> >  Two problems with that are - 1) if the driver follows
> > the recommended flow of allocating new queues before freeing
> > old ones we will have page pools created before the old ones
> > are gone, which means we'd need to reserve 2x the number of
> > 1G pages; 2) there's no callback to the driver to say "I did
> > something behind your back, don't worry about it, but recreate
> > your queues, please" so the change will not take effect until
> > some unrelated change like installing XDP. Which may be fine
> > in practice but is a bit odd.
> 
> I have the same problem with device memory TCP. I solved it in a
> similar way, doing something else in the driver that triggers
> gve_close() & gve_open(). I wonder if the cleanest way to do this is
> calling ethtool_ops->reset() or something like that? That was my idea
> at least. I haven't tested it, but from reading the code it should do
> a gve_close() + gve_open() like I want.

The prevailing wisdom so far was that close() + open() is not a good
idea. Some NICs will require large contiguous allocations for rings 
and context memory and there's no guarantee that open() will succeed
in prod when memory is fragmented. So you may end up with a close()d
NIC and a failure to open(), and the machine dropping off the net.

But if we don't close() before we open() and the memory provider is
single consumer we'll have problem #1 :(

BTW are you planning to use individual queues in prod? I anticipated
that for ZC we'll need to tie multiple queues into an RSS context, 
and then configure at the level of an RSS context.