Message-ID: <1098853.1749051265@warthog.procyon.org.uk>
Date: Wed, 04 Jun 2025 16:34:25 +0100
From: David Howells <dhowells@...hat.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: dhowells@...hat.com, willy@...radead.org, hch@...radead.org,
Jakub Kicinski <kuba@...nel.org>, Eric Dumazet <edumazet@...gle.com>,
netdev@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: Device mem changes vs pinning/zerocopy changes
Mina Almasry <almasrymina@...gle.com> wrote:
> Hi David! Yes, very happy to collaborate.
:-)
> FWIW, my initial gut feeling is that the work doesn't conflict that much.
> The tcp devmem netmem/net_iov stuff is designed to follow the page stuff,
> and as the usage of struct page changes we're happy moving net_iovs and
> netmems to do the same thing. My read is that it will take a small amount of
> extra work, but there are no in-principle design conflicts, at least AFAICT
> so far.
The problem is more that the code you changed in the current merge window is
also code I'm wanting to change, so merge conflicts will arise.
However, I'm also looking to move the points at which refs are taken/dropped,
which will directly impinge on the design of the code that's currently
upstream.
Would it help if I created some diagrams to show what I'm thinking of?
> I believe the main challenge here is that there are many code paths in
> the net stack that expect to be able to grab a ref on skb frags. See
> all the callers of skb_frag_ref/skb_frag_unref:
>
> tcp_grow_skb, __skb_zcopy_downgrade_managed, __pskb_copy_fclone,
> pskb_expand_head, skb_zerocopy, skb_split, pskb_carve_inside_header,
> pskb_carve_inside_nonlinear, tcp_clone_payload, skb_segment.
Oh, yes, I've come to appreciate that well. A chunk of that can actually go
away, I think.
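
(For reference, this is roughly what the per-frag ref amounts to - a
simplified sketch of the helpers in include/linux/skbuff.h as they stood
before the netmem/net_iov rework, which adds its own variant of the same
pattern:)

	static inline void __skb_frag_ref(skb_frag_t *frag)
	{
		get_page(skb_frag_page(frag));	/* one page ref per frag, per skb */
	}

	static inline void skb_frag_ref(struct sk_buff *skb, int f)
	{
		__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
	}

Every caller in your list above takes (and later drops) refs frag by frag -
which is exactly the pattern I'd like to replace with something shared.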
> I think to accomplish what you're describing we need to modify
> skb_frag_ref to do something else other than taking a reference on the
> page or net_iov. I think maybe taking a reference on the skb itself
> may be acceptable, and the skb can 'guarantee' that the individual
> frags underneath it don't disappear while these functions are
> executing.
Maybe. There is an issue with that, though it may not be insurmountable: if a
userspace process does, say, a MSG_ZEROCOPY send of a page's worth of data over
TCP with a fairly typical MTU of 1500, the data will be split across at least
three skbuffs.
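
(For the arithmetic: with standard 20-byte IPv4 and TCP headers, the MSS for a
1500-byte MTU is 1500 - 40 = 1460 bytes, and 4096/1460 rounds up to 3 - so at
least three skbuffs, more if TCP options eat into the MSS.)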
This would involve making a call into GUP to get a pin - but we'd need a
separate pin for each skbuff and we might (in fact we currently do) end up
calling into GUP thrice to do the address translation and page pinning.
What I want to do is to put this outside of the skbuff so that the GUP pin can be
shared - but if, instead, we attach a pin to each skbuff, we need to get that
extra pin in some way. Now, it may be reasonable to add a "get me an extra
pin for such-and-such a range" thing and store the {physaddr,len} in the
skbuff fragment, but we also have to be careful not to overrun the pin count -
if there's even a pin count per se.
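
To make that concrete, here's a purely hypothetical sketch - none of these
structs or names exist anywhere; only pin_user_pages()/unpin_user_pages() are
real GUP interfaces - of one pin shared by several skbuffs, with each
fragment recording just a {physaddr,len} range:

	/* Hypothetical: one GUP pin covering the whole send, shared. */
	struct net_user_pin {
		refcount_t	usage;		/* one count per skb frag using it */
		struct page	**pages;	/* pinned once via pin_user_pages() */
		unsigned int	nr_pages;
	};

	/* Hypothetical: what an skb fragment might then carry. */
	struct skb_frag_pin {
		struct net_user_pin *pin;	/* shared, refcounted */
		phys_addr_t	physaddr;	/* start of this frag's range */
		unsigned int	len;
	};

	static void net_user_pin_put(struct net_user_pin *p)
	{
		if (refcount_dec_and_test(&p->usage)) {
			/* Drop the single GUP pin for all the pages at once. */
			unpin_user_pages(p->pages, p->nr_pages);
			kfree(p->pages);
			kfree(p);
		}
	}

Sharing would then cost a refcount_inc() on the pin object rather than extra
GUP pins per skbuff, so the page pincount itself is never at risk of overrun.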
> But devmem TCP doesn't get much in the way here, AFAICT. It's really
> the fact that so many of the networking code paths want to obtain page
> refs via skb_frag_ref so there is potentially a lot of code to touch.
Yep.
> But, AFAICT, skb_frag_t needs a struct page inside of it, not just a
> physical address. skb_frags can be mmap'd into userspace for TCP
> zerocopy, see tcp_zerocopy_vm_insert_batch (which is a very old
> feature, it's not a recent change). There may be other call paths in
> the net stack that require a full page, where just a physical address
> won't do (unless somehow we can mmap a physical address into
> userspace).
Yeah - I think this needs very careful consideration and will need some
adjustment. Some of the pages that may, in the future, get zerocopied or
spliced into the socket *really* shouldn't be spliced out into some random
process's address space - and, in fact, may not even have a refcount (say they
come from the slab). Basically, TCP has implemented async vmsplice()...
> Is struct net_txbuf intended to replace struct sk_buff in the tx path
> only? If so, I'm not sure that works.
No - the idea is that it runs on a parallel track to the sk_buff and holds
"references" to the buffer memory. The memory is then divided up amongst a
number of sk_buffs, each of which holds a ref on the first txbuf it uses
memory from.
txbufs would allow us to take and hold a single ref or pin (or even nothing,
just a destructor) on each piece of supplied buffer memory and for that to be
shared between a sequence of skbuffs.
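
Very roughly, and with entirely made-up names (this is a sketch of the shape,
not actual definitions from any tree):

	struct net_txbuf {
		refcount_t	usage;		/* one ref per sk_buff drawing on this */
		void		(*destructor)(struct net_txbuf *txb);
		/* ...a single ref/pin (or nothing) per span of buffer memory... */
	};

	static void skb_attach_txbuf(struct sk_buff *skb, struct net_txbuf *txb)
	{
		refcount_inc(&txb->usage);	/* instead of per-frag get_page() */
		/* ...record txb in the skb so it can be put when the skb is freed... */
	}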
> Currently TX and RX memory share a single data structure (sk_buff), and I
> believe that is critical.
Yep. And we need to keep sk_buff because an sk_buff is more than just a
memory pinning device - it also retains the metadata for a packet.
> ... So I think, maybe, instead of introducing a new struct, you have to make
> the modifications you envision to struct sk_buff itself?
It may be possible. But see above. I want to be able to share pins between
sk_buffs.
> OK, you realize that TX packets can be forwarded to RX. The opposite
> is also true: RX can be forwarded to TX. And it's not just AF_UNIX and
> loopback. Packets can be forwarded via ip forwarding, and tc, and
> probably another half dozen features in the net stack I don't know
> about.
Yeah, I'd noticed that.
> I think you need to modify the existing sk_buff. I think adding
> a new struct and migrating the entire net stack to use that is a bit
> too ambitious. But up to you. Just my 2 cents here.