Message-ID: <20250506112012.5779d652@kernel.org>
Date: Tue, 6 May 2025 11:20:12 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: David Howells <dhowells@...hat.com>
Cc: Andrew Lunn <andrew@...n.ch>, Eric Dumazet <edumazet@...gle.com>, "David
S. Miller" <davem@...emloft.net>, David Hildenbrand <david@...hat.com>,
John Hubbard <jhubbard@...dia.com>, Christoph Hellwig <hch@...radead.org>,
willy@...radead.org, netdev@...r.kernel.org, linux-mm@...ck.org, Willem de
Bruijn <willemb@...gle.com>
Subject: Re: Reorganising how the networking layer handles memory
On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> Jakub Kicinski <kuba@...nel.org> wrote:
> > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > it doesn't use page pinning. It needs to use the GUP routines.
> >
> > We end up calling iov_iter_get_pages2(). Is not setting FOLL_PIN
> > a conscious choice, or did nobody care until now?
>
> iov_iter_get_pages*() predates the GUP pinning API (FOLL_PIN), I think.
> There's now an iov_iter_extract_pages() that does the pinning stuff, but you
> have to do a different cleanup, which is why I created a new API call.
>
> iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> non-user iterator (e.g. ITER_BVEC).
FWIW it occurred to me after hitting send that we may not care.
We're talking about Tx, so the user pages are read only for the kernel.
I don't think we have the "child gets the read data" problem?
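For reference, a minimal sketch (not from any posted patches) of the cleanup
asymmetry being discussed - pages obtained with iov_iter_get_pages2() always
carry page refs, while pages obtained with iov_iter_extract_pages() are
pinned only when the iterator is user-backed:

#include <linux/mm.h>
#include <linux/uio.h>

/* iov_iter_get_pages2(): always takes page refs, so always put_page(). */
static void release_got_pages(struct page **pages, unsigned int npages)
{
	while (npages--)
		put_page(pages[npages]);
}

/*
 * iov_iter_extract_pages(): pins user-backed memory (FOLL_PIN) and takes
 * nothing at all for e.g. ITER_BVEC/ITER_KVEC, so ask the iterator how
 * the pages need to be released.
 */
static void release_extracted_pages(struct iov_iter *iter,
				    struct page **pages, unsigned int npages)
{
	if (iov_iter_extract_will_pin(iter))
		unpin_user_pages(pages, npages);
	/* else: nothing was taken for non-user-backed iterators */
}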
> > > (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> > > used with certain memory types (e.g. slab). It takes a ref on whatever
> > > it is given - which is wrong when it should be pinning it instead.
> >
> > s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants
> > a ref to the stack, right? But yes, the networking stack will try to
> > release it.
>
> I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
> needs to be changed.
>
> Christoph Hellwig would like to make it such that the extractor gets
> {physaddr,len} rather than {page,off,len} - so all you, the network layer, see
> is that you've got a span of memory to use as your buffer. How that span of
> memory is managed is the responsibility of whoever called sendmsg() - and they
> need a callback to be able to handle that.
Sure, there may be things to iron out, since data in networking is not
opaque to the stack - we need to handle the firewalling and inspection cases.
All this will likely work well for ZC, but I'm not sure we can "convert"
the whole stack to physaddr+len.
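To make the contrast concrete, a hypothetical sketch; "struct mem_span" is an
invented name, not an existing kernel type, and today's page-based view is
shown only for comparison:

#include <linux/bvec.h>
#include <linux/mm.h>

/*
 * Hypothetical bare span the extractor would hand over under a
 * physaddr+len scheme.  The memory's lifetime stays entirely with
 * whoever called sendmsg().
 */
struct mem_span {
	phys_addr_t	addr;	/* physical address of the data */
	size_t		len;	/* length in bytes */
};

/* Today's page-based description of the same buffer, for comparison. */
static inline struct bio_vec span_as_bvec_today(struct page *page,
						unsigned int off,
						unsigned int len)
{
	struct bio_vec bv = {
		.bv_page	= page,
		.bv_len		= len,
		.bv_offset	= off,
	};

	return bv;
}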
> > TAL at struct ubuf_info
>
> I've looked at it, yes; however, I'm wondering if we can make it more generic
> and usable by regular file DIO and splice as well.
Okay, just keep in mind that we are working on 800Gbps NIC support these
days, and MTU does not grow. So whatever we do - it must be fast fast.
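As a thought experiment only - the names below are invented for illustration
and are not an existing kernel API - a generalised completion descriptor that
MSG_ZEROCOPY-style senders, regular file DIO and splice could all hand down
might look roughly like this:

#include <linux/refcount.h>
#include <linux/types.h>

/*
 * Hypothetical generalisation of the ubuf_info idea: one refcounted
 * descriptor per donated buffer, with a completion hook supplied by
 * the owner of the memory.
 */
struct buf_donor {
	/* Called once every consumer is done with the donated buffer. */
	void		(*complete)(struct buf_donor *donor, bool success);
	refcount_t	refcnt;		/* one ref per in-flight user */
	void		*owner_priv;	/* opaque cookie for the caller */
};

static inline void buf_donor_put(struct buf_donor *donor, bool success)
{
	if (refcount_dec_and_test(&donor->refcnt))
		donor->complete(donor, success);
}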
> Further, we need a way to track pages we've pinned. One way to do that is to
> simply rely on the sk_buff fragment array and keep track of which particular
> bits need putting/unpinning/freeing/kfreeing/etc - but really that should be
> handled by the caller unless it costs too much performance (which it might).
>
> One advantage of delegating it to the caller, though, and having the caller
> keep track of which bits it still needs to hold on to based on the
> transmission completion position, is that we don't need to manage refs/pins
> across sk_buff duplication - let alone what we should do with stuff that's
> kmalloc'd.
>
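If the tracking did end up living alongside the sk_buff fragment array, a
hypothetical sketch of per-fragment ownership tagging (invented names, not an
existing kernel mechanism) could look like:

#include <linux/mm.h>
#include <linux/slab.h>

/* How a given fragment's memory should be released when the skb dies. */
enum frag_ownership {
	FRAG_BORROWED,	/* caller keeps ownership; nothing to do */
	FRAG_GOT_REF,	/* we hold a page ref: put_page() */
	FRAG_PINNED,	/* we hold a GUP pin: unpin_user_page() */
	FRAG_KMALLOCED,	/* we own a slab allocation: kfree() */
};

static void frag_release(enum frag_ownership how, struct page *page,
			 void *kbuf)
{
	switch (how) {
	case FRAG_GOT_REF:
		put_page(page);
		break;
	case FRAG_PINNED:
		unpin_user_page(page);
		break;
	case FRAG_KMALLOCED:
		kfree(kbuf);
		break;
	case FRAG_BORROWED:
		break;
	}
}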
> > > (3) We also pass an optional 'refill' function to sendmsg. As data is
> > > sent, the code that extracts the data will call this to pin more user
> > > bufs (we don't necessarily want to pin everything up front). The
> > > refill function is permitted to sleep to allow the amount of pinned
> > > memory to subside.
> >
> > Why not feed the data as you get the notifications for completion?
>
> Because there are multiple factors that govern the size of the chunks in which
> the refilling is done:
>
> (1) We want to get user pages in batches to reduce the cost of the
> synchronisation MM has to do. Further, the individual spans in the
> batches will be of variable size (folios can be of different sizes, for
> example). The idea of the 'refill' is that we go and refill as each
> batch is transcribed into skbuffs.
>
> (2) We don't want to run extraction too far ahead as that will delay the
> onset of transmission.
>
> (3) We don't want to pin too much at any one time as that builds memory
> pressure and in the worst case will cause OOM conditions.
>
> So we need to balance things - particularly (1) and (2) - and accept that we
> may get multiple refills in order to fill the socket transmission buffer.
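A hypothetical shape for such a refill hook - the structure and signature
below are invented purely to illustrate the idea, not a proposed API:

#include <linux/types.h>
#include <linux/uio.h>

/*
 * Hypothetical refill callback: invoked by the extraction code as each
 * batch of pinned memory is transcribed into skbuffs, asking the caller
 * to pin and describe the next chunk.  Allowed to sleep, so the amount
 * of pinned memory can subside before more is pinned.
 */
struct msg_refill {
	/*
	 * Make more data available in *iter; returns the number of bytes
	 * added, 0 when the caller has nothing further to send, or a
	 * negative error.
	 */
	ssize_t	(*refill)(struct msg_refill *ctx, struct iov_iter *iter,
			  size_t hint);
	void	*private;	/* caller context */
};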
Hard to comment without a concrete workload at hand.
Ideally the interface would be good enough for the application
to dependably drive the transmission in an efficient way.