netdev - Re: GSO with udp_tunnel_xmit_skb

Message-ID: <CAHmME9qYAvewm6m-ex0Ds051mgDPtpmYniHzZtgM+B6seo8Cng@mail.gmail.com>
Date:	Sun, 8 Nov 2015 11:36:53 +0100
From:	"Jason A. Donenfeld" <Jason@...c4.com>
To:	Maciej Żenczykowski <zenczykowski@...il.com>,
	Herbert Xu <herbert@...dor.apana.org.au>
Cc:	Tom Herbert <tom@...bertland.com>, Jiri Benc <jbenc@...hat.com>,
	Netdev <netdev@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: GSO with udp_tunnel_xmit_skb

Hi Maciej,

On Sun, Nov 8, 2015 at 12:40 AM, Maciej Żenczykowski
<zenczykowski@...il.com> wrote:
> This isn't particularly efficient.  This is basically equivalent to doing
> GSO before the superpacket reaches your driver (you might get some
> savings by not bothering to look at the packet headers of the second
> and on packets, but that's most likely minimal savings).

Actually, in my benchmarking, this results in enormous speedups in two places:

- In fact, I do have to examine the header of each incoming packet in
ndo_start_xmit(), and make a potentially expensive calculation on it
(due to the nature of my particular virtual driver). Doing this only
once per superpacket gets me about 100 additional megabits of
bandwidth.
- Before sending the packets with udp_tunnel_xmit_skb(), I can make a
single ip_route_output_flow() call and reuse the rtable/dst_entry
struct for each send, instead of having to recompute it each time.
This winds up getting me around 400 more megabits.
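
The second point can be modelled in userspace. The following is only a
sketch, not kernel code: lookup_route() and send_segment() are
hypothetical stand-ins for ip_route_output_flow() and
udp_tunnel_xmit_skb(), and they do nothing but count calls. In the
kernel, reusing a route across sends also involves reference counting,
which this model ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached route (rtable/dst_entry). */
struct route { unsigned int dst; };

unsigned long lookups; /* number of simulated route lookups */
unsigned long sent;    /* number of simulated segment sends */

/* Stand-in for an expensive lookup; it only counts invocations. */
struct route lookup_route(unsigned int dst)
{
    struct route rt = { dst };
    lookups++;
    return rt;
}

/* Stand-in for the per-segment send. */
void send_segment(const struct route *rt, size_t len)
{
    assert(rt != NULL);
    (void)len;
    sent++;
}

/* One lookup per segment: the cost when every packet arrives alone. */
void xmit_per_segment(unsigned int dst, size_t total, size_t seg)
{
    assert(seg > 0);
    for (size_t off = 0; off < total; off += seg) {
        size_t len = total - off < seg ? total - off : seg;
        struct route rt = lookup_route(dst);
        send_segment(&rt, len);
    }
}

/* One lookup per superpacket, reused for every segment. */
void xmit_per_superpacket(unsigned int dst, size_t total, size_t seg)
{
    assert(seg > 0);
    struct route rt = lookup_route(dst);
    for (size_t off = 0; off < total; off += seg) {
        size_t len = total - off < seg ? total - off : seg;
        send_segment(&rt, len);
    }
}
```

Calling either function with total = 65000 and seg = 1300 sends the same
50 segments; only the number of lookups differs (50 versus 1).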

> In particular you're allocating a new skb and clearing it for each of those
> 1300 byte packets (and deallocating the superpacket skb).  And then you
> are presumably deallocating all those freshly allocated skbs - since
> I'm guessing you are creating new skbs for transmit.
>
> What you really want to do (although of course it's much harder)
> is not call skb_gso_segment() at all for packet formats you know how
> to handle (ideally you can handle anything you claim to be able to
> handle via the features bits) and instead reach directly into the skb
> and grab the right portions of it and handle them directly.  This way
> you only ever have the one incoming skb, but yes it requires
> considerable effort.
>
> This should get you a fair bit of savings.

Yes, I agree wholeheartedly; it would be much nicer to not have to
call skb_gso_segment() at all, and just be able to operate on the
superpacket directly. Unfortunately, I'm not able to do this, because
I'm not simply adding or changing a header on the packet. I'm actually
making a calculation on the full bytes of the packet, which includes
the UDP and IP headers that are only added by skb_gso_segment, and
then I'm playing with ("scrambling" in some way) all of the bytes of
the entire packet. So, I really do need to decompose it into
individual packets, unfortunately.

>
> Are you in control of the receiver?  Can you modify packet format?

Yes, I am in control of the receiver. I suppose I could augment the
protocol to do this kind of reassembly. But that might conflict with
some other design goals, so I don't think that's going to happen.

>
> Theoretically you could manually add the proper headers to each
> of the new packets, and create a chain and send that, although
> honestly I'm not sure if the stack is at all capable of dealing with
> that atm.
>
> Alternatively instead of sending through the stack, put on full ethernet
> headers and send straight to the nic via the nic's xmit function.

My initial prototype did that, actually, simply because I knew how to
build an ethernet frame but I didn't know (yet) how to use the
kernel's various APIs. This wasn't viable in the end, though, because
I do need to run the packets through netfilter and the full stack.

> UFO = UDP Fragmentation Offload = really meaning 'UDP transmit
> checksum offload + IP fragmentation offload'
>
> so when you send that out you get ip fragments of 1 udp packet, not
> many individual udp packets.

Shucks, really? So UFO really only works for single UDP packets?
That's a shame. I had hoped that since all the packets are the same
size, I could set gso_size to that, and then the splitting would take
place on those boundaries precisely. But I guess since the
fragmentation here actually does IP fragmentation, this would run
counter to my goals, since new UDP headers wouldn't be added in the
end. Total bummer.
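
To make the difference concrete, here is a small userspace sketch (not
kernel code) that counts what ends up on the wire in each case. The
struct and function names are made up for illustration, and a 20-byte
IPv4 header without options is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define IP_HDR_LEN  20 /* IPv4 header without options (assumption) */
#define UDP_HDR_LEN 8

struct wire_stats {
    size_t packets;     /* IP packets put on the wire */
    size_t udp_headers; /* how many of them carry a UDP header */
};

/*
 * UFO-style: the payload is one UDP datagram that gets IP-fragmented.
 * Only the first fragment carries the UDP header.
 */
struct wire_stats ufo_fragment(size_t payload, size_t mtu)
{
    assert(mtu >= IP_HDR_LEN + 8);
    /* Every fragment except the last carries a multiple of 8 bytes. */
    size_t per_frag = ((mtu - IP_HDR_LEN) / 8) * 8;
    size_t l4_len = UDP_HDR_LEN + payload;
    struct wire_stats s = { (l4_len + per_frag - 1) / per_frag, 1 };
    return s;
}

/*
 * What was hoped for: every gso_size chunk becomes its own UDP
 * datagram with its own UDP and IP headers.
 */
struct wire_stats udp_segment(size_t payload, size_t gso_size)
{
    assert(gso_size > 0);
    size_t n = (payload + gso_size - 1) / gso_size;
    struct wire_stats s = { n, n };
    return s;
}
```

For 5200 bytes of payload, ufo_fragment(5200, 1500) and
udp_segment(5200, 1300) both yield four IP packets, but the first case
has a single UDP header (one datagram the receiver must reassemble),
while the second has four independent datagrams.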

Wouldn't there be some significant savings from bundling together
several UDP packets meant for the same destination, and sending those
all as one super-packet, so they don't each have to traverse the whole
networking and netfilter stack? I doubt this is a new idea; is there a
reason why it isn't implemented, or, if it was proposed, why it was
rejected?

>
> It is possible some hardware (possibly some intel nics, maybe bnx2x)
> could be tricked into doing udp segmentation with their tcp segmentation
> engine.  Theoretically (based on having glanced at the datasheets) the
> intel nic segmentation is pretty generic, and it would appear at first
> glance that with the right driver hacks (populating the transmit descriptor
> correctly) it could be made to work.  I mention bnx2x because
> they managed to make tcp segmentation work with tunnels,
> so it's possible that the support is generic enough for it to be
> possible (with driver changes).  Who knows.
>
> It may or may not require putting on a fake 20 byte TCP header.
> There's some tunnel spec that basically does that (should be able to find
> an RFC online; perhaps I'm thinking of STT - Stateless Transport Tunneling).
>
> I don't think there is currently any way to setup a linux skb with the
> right metadata for it to just happen though.
>
> It does seem like something that could be potentially worth adding though.

These are glorious dirty tricks. Awesome. It appears, NIC-wise, that
only the neterion driver currently supports UFO natively. I wonder if
the Intel folks will add it to their drivers, since the segmentation
is generic as you said.

Still, though, regardless of NIC support, using superpackets to reduce
the number of skbs that have to traverse the networking stack appears
to be worthwhile. It'd be nice to make this happen for clusters of UDP
packets.


I'm adding Herbert Xu to the CC, who mentioned in another thread:

> I don't see anything fundamentally wrong with your idea.  After
> all what you're describing is the basis of GSO, i.e., letting
> data stay in the form of super-packets for as long as we can.
>
> Of course there's going to be a lot of niggly bits that you'll
> have to sort out to get it to work.

So I wonder if he has any ideas about this too.


Anyway, thanks so much for your insight about this. I really
appreciate the pointers.

Regards,
Jason