netdev - Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves

Message-ID: <505413fd-f148-b8d8-425d-69e7dcf53548@redhat.com>
Date:   Fri, 2 Jul 2021 16:50:17 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     David Woodhouse <dwmw2@...radead.org>, netdev@...r.kernel.org
Cc:     Eugenio Pérez <eperezma@...hat.com>,
        Willem de Bruijn <willemb@...gle.com>,
        "Michael S.Tsirkin" <mst@...hat.com>
Subject: Re: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let
 tun/tap do it themselves


On 2021/7/2 at 4:08 PM, David Woodhouse wrote:
> On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
>> On 2021/7/2 at 1:39 AM, David Woodhouse wrote:
>>> Right, but the VMM (or the guest, if we're letting the guest choose)
>>> wouldn't have to use it for those cases.
>>
>> I'm not sure I follow. If so, simply writing to TUN directly would work.
> As noted, that works nicely for me in OpenConnect; I just write it to
> the tun device *instead* of putting it in the vring. My TX latency is
> now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
> vhost thread, wakes guest).


Note that busy polling is also used for KVM to improve latency.
It is enabled by default, if I'm not mistaken.


>
> But it's not clear to me that a VMM could use it. Because the guest has
> already put that packet *into* the vring. Now if the VMM is in the path
> of all wakeups for that vring, I suppose we *might* be able to contrive
> some hackish way to be 'sure' that the kernel isn't servicing it, so we
> could try to 'steal' that packet from the ring in order to send it
> directly... but no. That's awful :)


Yes.


>
> I do think it'd be interesting to look at a way to reduce the latency
> of the vring wakeup especially for that case of a virtio-net guest with
> a single small packet to send. But realistically speaking, I'm unlikely
> to get to it any time soon except for showing the numbers with the
> userspace equivalent and observing that there's probably a similar win
> to be had for guests too.
>
> In the short term, I should focus on what we want to do to finish off
> my existing fixes.


I think so.


> Did we have a consensus on whether to bother
> supporting PI?


Michael, any thoughts on this?


>   As I said, I'm mildly inclined to do so just because it
> mostly comes out in the wash as we fix everything else, and making it
> gracefully *refuse* that setup reliably is just as hard.
>
> I think I'll try to make the vhost-net code much more resilient to
> finding that tun_recvmsg() returns a header other than the sock_hlen it
> expects, and see how much still actually needs "fixing" if we can do
> that.


Let's see how well it goes.


>
>
>> I think the design is to delay the tx checksum as much as possible:
>>
>> 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
>> 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 RX
>>
>> E.g. if checksum offload is supported on every hop of those paths, we
>> don't need any software checksum at all. And if any part is not
>> capable of doing the checksum, it will be done by the networking core
>> before calling the hard_start_xmit of that device.
> Right, but in *any* case where the 'device' is going to memcpy the data
> around (like tun_put_user() does), it's a waste of time having the
> networking core do a *separate* pass over the data just to checksum it.


See below.


>
>>>>> We could similarly do a partial checksum in tun_get_user() and hand it
>>>>> off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
>>>> I think that is how it is expected to work (via vnet header), see
>>>> virtio_net_hdr_to_skb().
>>> But only if the "guest" supports it; it doesn't get handled by the tun
>>> device. It *could*, since we already have the helpers to checksum *as*
>>> we copy to/from userspace.
>>>
>>> It doesn't help for me to advertise that I support TX checksums in
>>> userspace because I'd have to do an extra pass for that. I only do one
>>> pass over the data as I encrypt it, and in many block cipher modes the
>>> encryption of the early blocks affects the IV for the subsequent
>>> blocks... so I can't just go back and "fix" the checksum at the start
>>> of the packet, once I'm finished.
>>>
>>> So doing the checksum as the packet is copied up to userspace would be
>>> very useful.
>>
>> I think I get this, but it requires a new TUN feature (and maybe making
>> it userspace controllable via tun_set_csum()).
> I don't think it's visible to userspace at all; it's purely between the
> tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
> regardless of what the user can cope with. And if the user *didn't*
> support checksum offload then tun will transparently do the checksum
> *during* the copy_to_iter() (in either tun_put_user_xdp() or
> tun_put_user()).
>
> Userspace sees precisely what it did before. If it doesn't support
> checksum offload then it gets a pre-checksummed packet just as before.
> It's just that the kernel will do that checksum *while* it's already
> touching the data as it copies it to userspace, instead of in a
> separate pass.


So I think I get what you meant:

1) Don't disable NETIF_F_HW_CSUM in tun_set_csum() even if userspace
clears TUN_F_CSUM.
2) Use the checksumming iov iterator helpers in tun_put_user() and
tun_put_user_xdp().

It may help performance, since we get better cache locality when
userspace doesn't support checksum offload.

But in this case we need to know whether userspace can do checksum
offload, which we didn't need to track separately before
(NETIF_F_HW_CSUM itself reflected it).

And we probably need to sync with tun_set_offload().
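
To make point 2) concrete, here is a rough userspace sketch of the idea of
checksumming while copying. It is not the kernel's iov iterator code:
copy_and_csum() and inet_csum() are hypothetical names used only for
illustration, and the checksum is the plain 16-bit ones'-complement
Internet checksum computed over a flat buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide running sum into 16 bits (ones'-complement addition). */
uint16_t csum_fold64(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference version: a separate pass over the data just to checksum it. */
uint16_t inet_csum(const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    if (i < len)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)src[i] << 8;
    return (uint16_t)~csum_fold64(sum);
}

/* One pass: each byte is copied and summed while it is already in cache. */
uint16_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return (uint16_t)~csum_fold64(sum);
}
```

Both functions return the same value for the same input; the only
difference is that copy_and_csum() touches the data once instead of twice,
which is where the cache locality win would come from.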


>
> Although actually, for my *benchmark* case with iperf3 sending UDP, I
> spotted in the perf traces that we actually do the checksum as we're
> copying from userspace in the udp_sendmsg() call. There's a check in
> __ip_append_data() which looks to see if the destination device has
> HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
> are definitely use cases which *don't* have that kind of optimisation
> though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
> So I think it's worth looking at.


Yes.
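
For reference, the "delay the checksum as long as possible" behaviour
discussed in this thread can be modelled with a toy sketch like the one
below. Everything in it is a simplified stand-in: DEV_F_HW_CSUM,
CSUM_PARTIAL, CSUM_DONE, struct pkt and both functions are hypothetical
names, not the kernel's definitions. It only shows that a packet with a
pending checksum costs a software pass at the first hop that cannot
offload it, and nothing at hops that can.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit: the device can finish the checksum itself. */
#define DEV_F_HW_CSUM (1u << 0)

/* Hypothetical packet states: checksum still pending, or already filled in. */
enum csum_state { CSUM_DONE, CSUM_PARTIAL };

struct pkt {
    enum csum_state ip_summed;
    unsigned int sw_csum_passes;   /* software checksum passes so far */
};

/* Model of the check made just before handing a packet to a device. */
void xmit_to_dev(struct pkt *p, uint32_t dev_features)
{
    if (p->ip_summed == CSUM_PARTIAL && !(dev_features & DEV_F_HW_CSUM)) {
        p->sw_csum_passes++;       /* the stack must checksum in software */
        p->ip_summed = CSUM_DONE;
    }
}

/* Send one packet with a pending checksum through a chain of devices. */
unsigned int sw_passes_on_path(const uint32_t *features, size_t n)
{
    struct pkt p = { CSUM_PARTIAL, 0 };
    size_t i;

    for (i = 0; i < n; i++)
        xmit_to_dev(&p, features[i]);
    return p.sw_csum_passes;
}
```

With every hop advertising DEV_F_HW_CSUM the count stays at zero; with one
or more incapable hops it is one, paid at the first such hop.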

Thanks