netdev - Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves

Message-ID: <CAJaqyWdZTfjeDgUj1Rindufvq=XYMEdQP8gfGZ3i0a4khKAWxA@mail.gmail.com>
Date:   Fri, 9 Jul 2021 17:04:26 +0200
From:   Eugenio Perez Martin <eperezma@...hat.com>
To:     David Woodhouse <dwmw2@...radead.org>
Cc:     Jason Wang <jasowang@...hat.com>, netdev@...r.kernel.org,
        Willem de Bruijn <willemb@...gle.com>,
        "Michael S.Tsirkin" <mst@...hat.com>
Subject: Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let
 tun/tap do it themselves

On Fri, Jul 2, 2021 at 10:08 AM David Woodhouse <dwmw2@...radead.org> wrote:
>
> On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
> > On 2021/7/2 at 1:39 AM, David Woodhouse wrote:
> > >
> > > Right, but the VMM (or the guest, if we're letting the guest choose)
> > > wouldn't have to use it for those cases.
> >
> >
> > I'm not sure I follow. If so, simply writing to TUN directly would work.
>
> As noted, that works nicely for me in OpenConnect; I just write it to
> the tun device *instead* of putting it in the vring. My TX latency is
> now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
> vhost thread, wakes guest).
>

Maybe we can do a small test to see the effect of warming up the userland?
* Make vhost write to the irqfd BEFORE adding the packet to the ring, not after.
* Make userland (I think your selftest would be fine for this) spin
reading used idx until it sees at least one buffer.

I think this introduces races in the general virtio ring management
but should work well for the testing. Any thoughts?

We could also check what happens if userland burns CPU polling
used_idx with notifications disabled, and see if it is worth
continuing to shave latency in that direction :).
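
To make the userland side of that experiment concrete, here is a rough
sketch of the spin loop. It is only an illustration: struct
used_ring_view and spin_for_used() are hypothetical names, the struct
models nothing but the used ring's idx field, and a real test would map
the actual vring and also consume the used elements.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified view of a split-ring "used" ring: idx only. */
struct used_ring_view {
    _Atomic uint16_t idx;   /* advanced by the device side (e.g. vhost) */
};

/*
 * Busy-poll the used index until it moves past last_seen, or give up
 * after max_spins iterations. Returns the number of newly used buffers,
 * or 0 if nothing showed up in time.
 */
static uint16_t spin_for_used(struct used_ring_view *used,
                              uint16_t last_seen,
                              unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        /* Acquire pairs with the device's release when it publishes idx. */
        uint16_t idx = atomic_load_explicit(&used->idx,
                                            memory_order_acquire);
        /* idx is a free-running 16-bit counter, so subtract modulo 2^16. */
        uint16_t pending = (uint16_t)(idx - last_seen);

        if (pending != 0)
            return pending;
    }
    return 0;
}
```

The max_spins bound is there only so that a test cannot hang forever if
the early irqfd write is never followed by a buffer.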

> But it's not clear to me that a VMM could use it. Because the guest has
> already put that packet *into* the vring. Now if the VMM is in the path
> of all wakeups for that vring, I suppose we *might* be able to contrive
> some hackish way to be 'sure' that the kernel isn't servicing it, so we
> could try to 'steal' that packet from the ring in order to send it
> directly... but no. That's awful :)
>
> I do think it'd be interesting to look at a way to reduce the latency
> of the vring wakeup especially for that case of a virtio-net guest with
> a single small packet to send. But realistically speaking, I'm unlikely
> to get to it any time soon except for showing the numbers with the
> userspace equivalent and observing that there's probably a similar win
> to be had for guests too.
>
> In the short term, I should focus on what we want to do to finish off
> my existing fixes. Did we have a consensus on whether to bother
> supporting PI? As I said, I'm mildly inclined to do so just because it
> mostly comes out in the wash as we fix everything else, and making it
> gracefully *refuse* that setup reliably is just as hard.
>
> I think I'll try to make the vhost-net code much more resilient to
> finding that tun_recvmsg() returns a header other than the sock_hlen it
> expects, and see how much still actually needs "fixing" if we can do
> that.
>
>
> > I think the design is to delay the tx checksum as much as possible:
> >
> > 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
> > 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX
> >
> > E.g. if checksum offload is supported on all those paths, we don't need
> > any software checksum at all in the above paths. And if any part is not
> > capable of doing the checksum, it will be done by the networking core
> > before calling the hard_start_xmit of that device.
>
> Right, but in *any* case where the 'device' is going to memcpy the data
> around (like tun_put_user() does), it's a waste of time having the
> networking core do a *separate* pass over the data just to checksum it.
>
> > > > > We could similarly do a partial checksum in tun_get_user() and hand it
> > > > > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> > > >
> > > > I think that is how it is expected to work (via vnet header), see
> > > > virtio_net_hdr_to_skb().
> > >
> > > But only if the "guest" supports it; it doesn't get handled by the tun
> > > device. It *could*, since we already have the helpers to checksum *as*
> > > we copy to/from userspace.
> > >
> > > It doesn't help for me to advertise that I support TX checksums in
> > > userspace because I'd have to do an extra pass for that. I only do one
> > > pass over the data as I encrypt it, and in many block cipher modes the
> > > encryption of the early blocks affects the IV for the subsequent
> > > blocks... so I can't just go back and "fix" the checksum at the start
> > > of the packet, once I'm finished.
> > >
> > > So doing the checksum as the packet is copied up to userspace would be
> > > very useful.
> >
> >
> > I think I get this, but it requires a new TUN feature (and maybe making
> > it userspace-controllable via tun_set_csum()).
>
> I don't think it's visible to userspace at all; it's purely between the
> tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
> regardless of what the user can cope with. And if the user *didn't*
> support checksum offload then tun will transparently do the checksum
> *during* the copy_to_iter() (in either tun_put_user_xdp() or
> tun_put_user()).
>
> Userspace sees precisely what it did before. If it doesn't support
> checksum offload then it gets a pre-checksummed packet just as before.
> It's just that the kernel will do that checksum *while* it's already
> touching the data as it copies it to userspace, instead of in a
> separate pass.
>
> Although actually, for my *benchmark* case with iperf3 sending UDP, I
> spotted in the perf traces that we actually do the checksum as we're
> copying from userspace in the udp_sendmsg() call. There's a check in
> __ip_append_data() which looks to see if the destination device has
> HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
> are definitely use cases which *don't* have that kind of optimisation
> though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
> So I think it's worth looking at.
>
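
For reference, the "checksum while copying" idea discussed above can be
illustrated in plain userspace C. This is only a sketch of the
technique, not the tun code: copy_and_csum() is a hypothetical helper,
and it computes the 16-bit ones' complement sum (as in RFC 1071) over
the bytes in the same loop that copies them, so the data is touched
only once.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes from src to dst and return the
 * folded 16-bit ones' complement sum of the copied data (big-endian
 * words, not inverted). A caller would still invert it and store it in
 * the protocol header.
 */
static uint16_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;

    /* Copy and accumulate two bytes at a time, in a single pass. */
    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

With the example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) this
returns 0xddf2, the same value a separate checksum pass would give, but
without reading the data a second time.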