Message-ID: <1352674.1746625556@warthog.procyon.org.uk>
Date: Wed, 07 May 2025 14:45:56 +0100
From: David Howells <dhowells@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: dhowells@...hat.com, Andrew Lunn <andrew@...n.ch>,
Eric Dumazet <edumazet@...gle.com>,
"David
S. Miller" <davem@...emloft.net>,
David Hildenbrand <david@...hat.com>,
John Hubbard <jhubbard@...dia.com>,
Christoph Hellwig <hch@...radead.org>, willy@...radead.org,
netdev@...r.kernel.org, linux-mm@...ck.org,
Willem de Bruijn <willemb@...gle.com>
Subject: Re: Reorganising how the networking layer handles memory

Jakub Kicinski <kuba@...nel.org> wrote:

> On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> > Jakub Kicinski <kuba@...nel.org> wrote:
> > > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > > it doesn't use page pinning. It needs to use the GUP routines.
> > >
> > > We end up calling iov_iter_get_pages2(). Is it not setting
> > > FOLL_PIN is a conscious choice, or nobody cared until now?
> >
> > iov_iter_get_pages*() predates GUP, I think. There's now an
> > iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> > different cleanup, which is why I created a new API call.
> >
> > iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> > non-user iterator (e.g. ITER_BVEC).
>
> FWIW it occurred to me after hitting send that we may not care.
> We're talking about Tx, so the user pages are read only for the kernel.
> I don't think we have the "child gets the read data" problem?

Worse: if the child alters the data in the buffer to be transmitted after the
fork() (say it calls free() and then malloc()), the alteration can end up in
the data that gets transmitted; if the parent tries the same thing, it will
have no effect.
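
To illustrate the hazard (a rough userspace sketch, not from any patch; it
assumes fd is an already-connected TCP socket with SO_ZEROCOPY enabled):

	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>

	static void demo(int fd)	/* connected TCP socket, SO_ZEROCOPY set */
	{
		size_t len = 1 << 20;
		char *buf = malloc(len);

		memset(buf, 'A', len);
		/* The pages are currently taken with a reference, not a pin. */
		send(fd, buf, len, MSG_ZEROCOPY);

		if (fork() == 0) {
			/* Child scribbles before the completion arrives: per the
			 * above, this can land in the pages still being read for
			 * transmission, whereas the same write in the parent hits
			 * a CoW copy and has no effect on the wire data. */
			memset(buf, 'B', len);
			_exit(0);
		}

		/* The parent would normally wait for the MSG_ERRQUEUE completion
		 * before reusing buf. */
	}
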
> Likely all this will work well for ZC but not sure if we can "convert"
> the stack to phyaddr+len.

Me neither. We also use bio_vec[] to hold lists of memory and then trawl them
to do cleanup, but a conversion to holding {phys,len} would require some sort
of reverse lookup to get back to the pages.
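
Roughly, the difference in cleanup looks like this (illustrative kernel-style
sketch only; struct phys_span is a made-up name):

	#include <linux/bvec.h>
	#include <linux/mm.h>
	#include <linux/types.h>

	/* Today: a bio_vec[] walk can release each page directly. */
	static void cleanup_bvec(struct bio_vec *bv, unsigned int nr, bool pinned)
	{
		unsigned int i;

		for (i = 0; i < nr; i++) {
			if (pinned)
				unpin_user_page(bv[i].bv_page);
			else
				put_page(bv[i].bv_page);
		}
	}

	/* With {phys,len} spans, cleanup first has to find the page again -
	 * assuming the physical address is page-backed at all. */
	struct phys_span {
		phys_addr_t	phys;
		unsigned int	len;
	};

	static void cleanup_spans(struct phys_span *sp, unsigned int nr, bool pinned)
	{
		unsigned int i;

		for (i = 0; i < nr; i++) {
			struct page *page = pfn_to_page(PHYS_PFN(sp[i].phys));

			if (pinned)
				unpin_user_page(page);
			else
				put_page(page);
		}
	}
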
> Okay, just keep in mind that we are working on 800Gbps NIC support these
> days, and MTU does not grow. So whatever we do - it must be fast fast.

Crazy :-)

One thing I've noticed in the io_uring stuff is that it doesn't seem to like
the idea of having an sk_buff pointing to more than one ubuf_info, presumably
because the sk_buff will point to the ubuf_info holding the zerocopyable data.
Is that actually necessary for SOCK_STREAM, though?

My thought for SOCK_STREAM is to have an ordered list of zerocopy source
records on the socket and a completion counter and not tag the skbuffs at all.
records on the socket and a completion counter and not tag the skbuffs at all.
That way, an skbuff can carry data for multiple zerocopy send requests.
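
Something like this, perhaps (entirely hypothetical names, just to show the
shape of it):

	#include <linux/list.h>
	#include <linux/types.h>

	struct zc_source_rec {			/* one zerocopy send request */
		struct list_head	link;		/* ordered by end_seq */
		u64			end_seq;	/* last stream byte this request covers */
		void (*complete)(struct zc_source_rec *rec);
	};

	struct zc_sock_state {
		struct list_head	records;	/* ordered list on the socket */
		u64			completed_seq;	/* completion counter */
	};

	/* Run as the acked stream sequence advances; the skbuffs carry no tags. */
	static void zc_advance(struct zc_sock_state *zs, u64 acked_seq)
	{
		struct zc_source_rec *rec, *tmp;

		zs->completed_seq = acked_seq;
		list_for_each_entry_safe(rec, tmp, &zs->records, link) {
			if (rec->end_seq > acked_seq)
				break;	/* ordered: nothing later is complete yet */
			list_del(&rec->link);
			rec->complete(rec);
		}
	}
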
David