Message-ID: <CAHS8izOdqajoPbzPHE-_e1BF+xpVgUphwknPSEqcHDJDcYtKng@mail.gmail.com>
Date: Mon, 10 Jul 2023 10:31:53 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, hawk@...nel.org, ilias.apalodimas@...aro.org,
edumazet@...gle.com, dsahern@...il.com, michael.chan@...adcom.com,
willemb@...gle.com
Subject: Re: [RFC 00/12] net: huge page backed page_pool
On Fri, Jul 7, 2023 at 3:45 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Fri, 7 Jul 2023 12:45:26 -0700 Mina Almasry wrote:
> > > This is an "early PoC" at best. It seems to work for a basic
> > > traffic test but there's no uAPI and a lot more general polish
> > > is needed.
> > >
> > > The problem we're seeing is that performance of some older NICs
> > > degrades quite a bit when IOMMU is used (in non-passthru mode).
> > > There is a long tail of old NICs deployed, especially in PoPs /
> > > on the edge. From a conversation I had with Eric a few months
> > > ago it sounded like others may have similar issues. So I thought
> > > I'd take a swing at getting page pool to feed drivers huge pages.
> > > 1G pages require hooking into early init via CMA but it works
> > > just fine.
> > >
> > > I haven't tested this with a real workload, because I'm still
> > > waiting to get my hands on the right machine. But the experiment
> > > with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
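[Side note on the 1G/CMA part above: to check my own understanding, I
imagine the early-init reservation and the 1G chunk allocation look
roughly like the sketch below. This is purely hypothetical, not the
code in this RFC; hp_cma, hp_reserve_cma() and hp_alloc_1g() are
made-up names.]

#include <linux/cma.h>
#include <linux/mm.h>
#include <linux/sizes.h>

static struct cma *hp_cma;

/* Runs from early init, while memblock is still in charge, so a
 * physically contiguous, 1G-aligned area can still be carved out.
 */
static int __init hp_reserve_cma(void)
{
        /* 4 x 1G, 1G-aligned, anywhere in physical memory. */
        return cma_declare_contiguous(0, SZ_4G, 0, SZ_1G, 0, false,
                                      "pp_huge", &hp_cma);
}

/* Hand one 1G chunk to the huge page memory provider. */
static struct page *hp_alloc_1g(void)
{
        return cma_alloc(hp_cma, SZ_1G >> PAGE_SHIFT,
                         get_order(SZ_1G), false);
}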
> >
> > Thanks for CCing me Jakub. I'm working on a proposal for device memory
> > TCP, and I recently migrated it to be on top of your pp-provider idea
> > and I think I can share my test results as well. I had my code working
> > on top of your slightly older API I found here a few days ago:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > On top of the old API I had something with all my functionality tests
> > passing and performance benchmarking hitting ~96.5% line rate (with
> > all data going straight to the device - GPU - memory, which is the
> > point of the proposal). Of course, when you look at the code you may
> > not like the approach and I may need to try something else, which is
> > perfectly fine, but my current implementation is pp-provider based.
> >
> > I'll look into rebasing my changes on top of this RFC and retesting,
> > but either way I should share my RFC sometime next week. I took
> > a quick look at the changes you made here, and I don't think you
> > changed anything that would break my use case.
>
> Oh, sorry I didn't realize you were working on top of my changes
> already. Yes, the memory provider API should not have changed much.
> I mostly reshuffled the MEP code to have both a coherent and
> non-coherent buddy allocator since then.
>
No worries at all. I don't mind rebasing onto new versions (and
finding out if they work for me).
> > > In terms of the missing parts - uAPI is definitely needed.
> > > The rough plan would be to add memory config via the netdev
> > > genl family. Should fit nicely there. Have the config stored
> > > in struct netdevice. When page pool is created get to the netdev
> > > and automatically select the provider without the driver even
> > > knowing.
> >
> > I guess I misunderstood the intent behind the original patches. I
> > thought you wanted the user to tell the driver what memory provider to
> > use, and the driver to recreate the page pool with that provider. What
> > you're saying here sounds much better, and requires fewer changes to the driver.
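[To check my understanding of the auto-selection idea: I picture the
config living in struct net_device and page_pool_create() consulting
it, along the lines of the sketch below. Everything here is
hypothetical; none of these names exist in your tree or mine.]

enum netdev_mem_provider {
        NETDEV_MP_PAGES = 0,    /* default: plain page allocator */
        NETDEV_MP_HUGE_1G,      /* 1G huge page backed pool */
        NETDEV_MP_DEVMEM,       /* device (e.g. GPU) memory */
};

/* Hypothetical per-netdev config, set via the netdev genl family;
 * the driver never looks at it directly.
 */
struct netdev_mem_cfg {
        enum netdev_mem_provider provider;
};

/* Hypothetical helper called from page_pool_create(): find the
 * netdev's config and pick the provider behind the driver's back.
 */
static enum netdev_mem_provider
page_pool_pick_provider(const struct netdev_mem_cfg *cfg)
{
        return cfg ? cfg->provider : NETDEV_MP_PAGES;
}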
> >
> > > Two problems with that are - 1) if the driver follows
> > > the recommended flow of allocating new queues before freeing
> > > old ones we will have page pools created before the old ones
> > > are gone, which means we'd need to reserve 2x the number of
> > > 1G pages; 2) there's no callback to the driver to say "I did
> > > something behind your back, don't worry about it, but recreate
> > > your queues, please" so the change will not take effect until
> > > some unrelated change like installing XDP. Which may be fine
> > > in practice but is a bit odd.
> >
> > I have the same problem with device memory TCP. I solved it in a
> > similar way, doing something else in the driver that triggers
> > gve_close() & gve_open(). I wonder if the cleanest way to do this is
> > calling ethtool_ops->reset() or something like that? That was my idea
> > at least. I haven't tested it, but from reading the code it should do
> > a gve_close() + gve_open() like I want.
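[Expanding on my own comment just above: the rough shape I had in mind
is below. netdev_trigger_reset() is a made-up name, the caller would
hold rtnl_lock, and I haven't tested any of this.]

#include <linux/errno.h>
#include <linux/ethtool.h>
#include <linux/netdevice.h>

/* Nudge the driver into tearing down and re-creating its queues via
 * its ethtool reset hook after the memory config changed behind it.
 */
static int netdev_trigger_reset(struct net_device *dev)
{
        u32 flags = ETH_RESET_ALL;

        if (!dev->ethtool_ops || !dev->ethtool_ops->reset)
                return -EOPNOTSUPP;

        /* For gve this should boil down to gve_close() + gve_open(). */
        return dev->ethtool_ops->reset(dev, &flags);
}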
>
> The prevailing wisdom so far was that close() + open() is not a good
> idea. Some NICs will require large contiguous allocations for rings
> and context memory and there's no guarantee that open() will succeed
> in prod when memory is fragmented. So you may end up with a close()d
> NIC and a failure to open(), and the machine dropping off the net.
>
> But if we don't close() before we open() and the memory provider is
> single consumer we'll have problem #1 :(
>
> BTW are you planning to use individual queues in prod? I anticipated
> that for ZC we'll need to tie multiple queues into an RSS context,
> and then configure at the level of an RSS context.
Our configuration (a rough sketch of the flow-steering part follows the list):
- We designate a number of RX queues as devmem TCP queues.
- We designate the rest as regular TCP queues.
- We use RSS to steer all incoming traffic to the regular TCP queues.
- We use flow steering to steer specific TCP flows to the devmem TCP queues.
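As an illustration of the flow-steering part, an untested userspace
sketch is below; the ETHTOOL_SRXCLSRLINS ioctl with RX_CLS_LOC_ANY is
just one way to do it, and not every driver accepts the "any location"
value:

#include <arpa/inet.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Steer TCP/IPv4 flows with a given destination port to one of the
 * devmem RX queues. fd is any AF_INET socket, e.g.
 * socket(AF_INET, SOCK_DGRAM, 0). The RSS indirection table is
 * restricted to the regular queues separately (ethtool -X ... equal N).
 */
static int steer_port_to_queue(int fd, const char *ifname,
                               uint16_t dst_port, uint32_t queue)
{
        struct ethtool_rxnfc nfc = { .cmd = ETHTOOL_SRXCLSRLINS };
        struct ifreq ifr = {};

        nfc.fs.flow_type = TCP_V4_FLOW;
        nfc.fs.h_u.tcp_ip4_spec.pdst = htons(dst_port);
        nfc.fs.m_u.tcp_ip4_spec.pdst = htons(0xffff); /* match dst port only */
        nfc.fs.ring_cookie = queue;             /* devmem TCP RX queue */
        nfc.fs.location = RX_CLS_LOC_ANY;       /* let the driver pick a slot */

        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&nfc;
        return ioctl(fd, SIOCETHTOOL, &ifr);
}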
--
Thanks,
Mina