lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1216273.1746539449@warthog.procyon.org.uk>
Date: Tue, 06 May 2025 14:50:49 +0100
From: David Howells <dhowells@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: dhowells@...hat.com, Andrew Lunn <andrew@...n.ch>,
    Eric Dumazet <edumazet@...gle.com>,
    "David
 S. Miller" <davem@...emloft.net>,
    David Hildenbrand <david@...hat.com>,
    John Hubbard <jhubbard@...dia.com>,
    Christoph Hellwig <hch@...radead.org>, willy@...radead.org,
    netdev@...r.kernel.org, linux-mm@...ck.org,
    Willem de Bruijn <willemb@...gle.com>
Subject: Re: Reorganising how the networking layer handles memory

Jakub Kicinski <kuba@...nel.org> wrote:

> > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> >      it doesn't use page pinning.  It needs to use the GUP routines.
> 
> We end up calling iov_iter_get_pages2(). Is it not setting
> FOLL_PIN is a conscious choice, or nobody cared until now?

iov_iter_get_pages*() predates GUP, I think.  There's now an
iov_iter_extract_pages() that does the pinning stuff, but you have to do a
different cleanup, which is why I created a new API call.

iov_iter_extract_pages() also does no pinning at all on pages extracted from a
non-user iterator (e.g. ITER_BVEC).

> 
> >  (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't be
> >      used with certain memory types (e.g. slab).  It takes a ref on whatever
> >      it is given - which is wrong if it should pin this instead.
> 
> s/takes a ref/requires a ref/ ? I mean - the caller implicitly grants 
> a ref  to the stack, right? But yes, the networking stack will try to
> release it.

I mean 'takes' as in skb_append_pagefrags() calls get_page() - something that
needs to be changed.

Christoph Hellwig would like to make it such that the extractor gets
{phyaddr,len} rather than {page,off,len} - so all you, the network layer, see
is that you've got a span of memory to use as your buffer.  How that span of
memory is managed is the responsibility of whoever called sendmsg() - and they
need a callback to be able to handle that.

> TAL at struct ubuf_info

I've looked at it, yes, however, I'm wondering if we can make it more generic
and usable by regular file DIO and splice also.

Further, we need a way to track pages we've pinned.  One way to do that is to
simply rely on the sk_buff fragment array and keep track of which particular
bits need putting/unpinning/freeing/kfreeing/etc - but really that should be
handled by the caller unless it costs too much performance (which it might).

Once advantage of delegating it to the caller, though, and having the caller
keep track of which bits in still needs to hold on to by transmission
completion position is that we don't need to manage refs/pins across sk_buff
duplication - let alone what we should do with stuff that's kmalloc'd.

> >  (3) We also pass an optional 'refill' function to sendmsg.  As data is
> >      sent, the code that extracts the data will call this to pin more user
> >      bufs (we don't necessarily want to pin everything up front).  The
> >      refill function is permitted to sleep to allow the amount of pinned
> >      memory to subside.
> 
> Why not feed the data as you get the notifications for completion?

Because there are multiple factors that govern the size of the chunks in which
the refilling is done:

 (1) We want to get user pages in batches to reduce the cost of the
     synchronisation MM has to do.  Further, the individual spans in the
     batches will be of variable size (folios can be of different sizes, for
     example).  The idea of the 'refill' is that we go and refill as each
     batch is transcribed into skbuffs.

 (2) We don't want to run extraction too far ahead as that will delay the
     onset of transmission.

 (3) We don't want to pin too much at any one time as that builds memory
     pressure and in the worst case will cause OOM conditions.

So we need to balance things - particularly (1) and (2) - and accept that we
may get multiple refils in order to fill the socket transmission buffer.

> >  (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated by
> >      the cleanup function.
> 
> Already the case? :)

This is more a note-to-self, but in what I'm thinking of doing would have the
sendmsg() handler inserting SO_EE_ORIGIN_ZEROCOPY into the socket receive
queue.

David


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ