Message-ID: <aDnTsvbyKCTkZbOR@mini-arch>
Date: Fri, 30 May 2025 08:50:10 -0700
From: Stanislav Fomichev <stfomichev@...il.com>
To: David Howells <dhowells@...hat.com>
Cc: Mina Almasry <almasrymina@...gle.com>, willy@...radead.org,
	hch@...radead.org, Jakub Kicinski <kuba@...nel.org>,
	Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: Device mem changes vs pinning/zerocopy changes

On 05/30, David Howells wrote:
> Hi Mina,
> 
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do.  I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
> 
> So, to summarise what we need to change (you may already know all of this):
> 
>  (*) The refcount in struct page is going to go away.  The sk_buff fragment
>      wrangling code, however, occasionally decides to override the zerocopy
>      mode and grab refs on the pages pointed to by those fragments.  sk_buffs
>      *really* want those page refs - and it does simplify memory handling.
>      But.
> 
>      Anyway, we need to stop taking refs where possible.  A fragment may in
>      future point to a sequence of pages and we would only be getting a ref on
>      one of them.
> 
>  (*) Further, the page struct is intended to be slimmed down to a single typed
>      pointer if possible, so all the metadata in the net_iov struct will have
>      to be separately allocated.
> 
>  (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
>      pages specified by the iterator but we need to stop doing that.  We need
>      to call GUP to take a "pin" instead (and must not take any refs).  The
>      pages we get access to may be folio-type, anon-type, some sort of device
>      type.
> 
>  (*) It would be good to do a batch lookup of user buffers to cut down on the
>      number of page table trawls we do - but, on the other hand, that might
>      generate more page faults upfront.
> 
>  (*) Splice and vmsplice.  If only I could uninvent them...  Anyway, they give
>      us buffers from a pipe - but the buffers come with destructors and should
>      not have refs taken on the pages we might think they have, but use the
>      destructor instead.
> 
>  (*) The intention is to change struct bio_vec to be just physical address and
>      length, with no page pointer.  You'd then use, say, kmap_local_phys() or
>      kmap_local_bvec() to access the contents from the cpu.  We could then
>      revert the fragment pointers to being bio_vecs.
> 
>  (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
>      to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
>      any case, the object lifetime is not managed by refcount.  However, if we
>      had a destructor, this restriction could go away.
> 
> 
> So what I'd like to do is:

[..]

>  (1) Separate fragment lifetime management from sk_buff.  No more wangling of
>      refcounts in the skbuff code.  If you clone an skb, you stick an extra
>      ref on the lifetime management struct, not the page.

For device memory TCP we already have this: net_devmem_dmabuf_binding
is the owner of the frags. When we take a reference on an skb frag, we
reference only this owner, not the individual chunks: __skb_frag_ref ->
get_netmem -> net_devmem_get_net_iov (ref on the binding).

Would it be possible to generalize this to cover the MSG_ZEROCOPY and
splice cases? From what I can tell, this is roughly equivalent to your
net_txbuf.
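
To make the ownership model concrete, here is a minimal userspace sketch
of the idea: fragments hold a ref on a single owner object rather than on
individual pages, and the owner's destructor runs on the last put. This is
not kernel code. All names (frag_owner, frag, frag_attach, frag_clone,
frag_drop, demo_buf) are hypothetical and chosen for illustration only;
frag_owner merely stands in for something like net_devmem_dmabuf_binding
or the proposed net_txbuf.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical owner of a set of fragments (illustration only). */
struct frag_owner {
	atomic_int refs;
	/* Destructor, called once when the last reference is dropped. */
	void (*release)(struct frag_owner *owner);
};

/* A fragment describes a range and points at its owner; it holds no
 * per-page reference. */
struct frag {
	struct frag_owner *owner;
	size_t offset;
	size_t len;
};

/* The creator starts out holding one reference. */
static inline void frag_owner_init(struct frag_owner *o,
				   void (*release)(struct frag_owner *))
{
	atomic_init(&o->refs, 1);
	o->release = release;
}

static inline void frag_owner_get(struct frag_owner *o)
{
	atomic_fetch_add(&o->refs, 1);
}

static inline void frag_owner_put(struct frag_owner *o)
{
	int old = atomic_fetch_sub(&o->refs, 1);

	assert(old > 0);
	if (old == 1 && o->release)
		o->release(o);
}

/* Build a fragment over [offset, offset + len); takes a ref on the owner. */
static inline struct frag frag_attach(struct frag_owner *o, size_t offset,
				      size_t len)
{
	struct frag f = { .owner = o, .offset = offset, .len = len };

	frag_owner_get(o);
	return f;
}

/* Cloning a fragment takes one more ref on the owner, never on a page. */
static inline struct frag frag_clone(const struct frag *f)
{
	frag_owner_get(f->owner);
	return *f;
}

static inline void frag_drop(struct frag *f)
{
	struct frag_owner *o = f->owner;

	f->owner = NULL;
	frag_owner_put(o);
}

/* Example owner type: counts how many times its destructor ran. */
struct demo_buf {
	struct frag_owner owner;	/* must stay the first member */
	int released;
};

static inline void demo_release(struct frag_owner *o)
{
	/* Valid because owner is the first member of struct demo_buf. */
	struct demo_buf *buf = (struct demo_buf *)o;

	buf->released++;
}
```

With this shape, a kmalloc()'d buffer, a pinned user range, or a pipe
buffer could each supply its own release callback (unpin, kfree, pipe
buffer destructor), and the frag code would only ever call get/put on the
owner.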
