Message-ID: <aDnTsvbyKCTkZbOR@mini-arch>
Date: Fri, 30 May 2025 08:50:10 -0700
From: Stanislav Fomichev <stfomichev@...il.com>
To: David Howells <dhowells@...hat.com>
Cc: Mina Almasry <almasrymina@...gle.com>, willy@...radead.org,
hch@...radead.org, Jakub Kicinski <kuba@...nel.org>,
Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: Device mem changes vs pinning/zerocopy changes
On 05/30, David Howells wrote:
> Hi Mina,
>
> I've seen your transmission-side TCP devicemem stuff has just gone in and it
> conflicts somewhat with what I'm trying to do. I think you're working on the
> problem bottom up and I'm working on it top down, so if you're willing to
> collaborate on it...?
>
> So, to summarise what we need to change (you may already know all of this):
>
> (*) The refcount in struct page is going to go away. The sk_buff fragment
> wrangling code, however, occasionally decides to override the zerocopy
> mode and grab refs on the pages pointed to by those fragments. sk_buffs
> *really* want those page refs - and it does simplify memory handling.
> But.
>
> Anyway, we need to stop taking refs where possible. A fragment may in
> future point to a sequence of pages and we would only be getting a ref on
> one of them.
>
> (*) Further, the page struct is intended to be slimmed down to a single typed
> pointer if possible, so all the metadata in the net_iov struct will have
> to be separately allocated.
>
> (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
> pages specified by the iterator but we need to stop doing that. We need
> to call GUP to take a "pin" instead (and must not take any refs). The
> pages we get access to may be folio-type, anon-type, or some sort of
> device type.
>
> (*) It would be good to do a batch lookup of user buffers to cut down on the
> number of page table trawls we do - but, on the other hand, that might
> generate more page faults upfront.
>
> (*) Splice and vmsplice. If only I could uninvent them... Anyway, they give
> us buffers from a pipe - but those buffers come with destructors, so we
> should not take refs on the pages behind them; we should use the
> destructor instead.
>
> (*) The intention is to change struct bio_vec to be just physical address and
> length, with no page pointer. You'd then use, say, kmap_local_phys() or
> kmap_local_bvec() to access the contents from the cpu. We could then
> revert the fragment pointers to being bio_vecs.
>
> (*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
> to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
> any case, the object lifetime is not managed by refcount. However, if we
> had a destructor, this restriction could go away.
>
>
> So what I'd like to do is:
[..]
> (1) Separate fragment lifetime management from sk_buff. No more wrangling of
> refcounts in the skbuff code. If you clone an skb, you stick an extra
> ref on the lifetime management struct, not the page.
For device memory TCP we already have this: net_devmem_dmabuf_binding
is the owner of the frags. And when we reference skb frag we reference
only this owner, not individual chunks: __skb_frag_ref -> get_netmem ->
net_devmem_get_net_iov (ref on the binding).
Will it be possible to generalize this to cover the MSG_ZEROCOPY and splice
cases? From what I can tell, this is somewhat equivalent to your net_txbuf.