Message-ID: <20170824160748-mutt-send-email-mst@kernel.org>
Date:   Thu, 24 Aug 2017 16:50:31 +0300
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     Koichiro Den <den@...ipeden.com>, Jason Wang <jasowang@...hat.com>,
        virtualization@...ts.linux-foundation.org,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit
 path if no tx napi

On Wed, Aug 23, 2017 at 11:28:24PM -0400, Willem de Bruijn wrote:
> >> > * as a generic solution, if we were to somehow overcome the safety issue,
> >> > tracking the delay and doing a copy once some threshold is reached could
> >> > be an answer, but it's hard for now.
> >> > * so things like the current vhost-net implementation of deciding whether
> >> > or not to do zerocopy beforehand by referring to the zerocopy tx error
> >> > ratio are a point of practical compromise.
> >>
> >> The fragility of this mechanism is another argument for switching to tx napi
> >> as default.
> >>
> >> Is there any more data about the Windows guest issues when completions
> >> are not queued within a reasonable timeframe? What is this timescale, and
> >> do we really need to work around this?
> >
> > I think it's pretty large, many milliseconds.
> >
> > But I wonder what you mean by "work around". Using buffers within a
> > limited time frame sounds like a reasonable requirement to me.
> 
> Vhost-net zerocopy delays completions until the skb is really
> sent.

This is fundamental to any solution: the guest/application cannot
write over a memory buffer as long as hardware might still be reading it.
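
As a userspace analogy (a minimal sketch, assuming the MSG_ZEROCOPY
interface from Willem's recent patches; the socket must have SO_ZEROCOPY
enabled first), the application owns the buffer again only once the
completion shows up on the error queue:

#include <errno.h>
#include <string.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Sketch: send buf with zerocopy, then busy-wait for the completion
 * on the error queue before reusing the buffer.  Real code would
 * poll() and parse the sock_extended_err cmsg instead of spinning.
 */
static void send_and_reuse(int fd, char *buf, size_t len)
{
	char control[128];
	struct msghdr msg = { 0 };

	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		return;

	/* The kernel/NIC may still be reading buf here: hands off. */
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);
	while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 && errno == EAGAIN)
		msg.msg_controllen = sizeof(control);

	/* Completion arrived: only now is it safe to overwrite buf. */
	memset(buf, 0, len);
}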

> Traffic shaping can introduce msec timescale latencies.
> 
> The delay may actually be a useful signal. If the guest does not
> orphan skbs early, TSQ will throttle the socket causing host
> queue build up.
> 
> But, if completions are queued in-order, unrelated flows may be
> throttled as well. Allowing out of order completions would resolve
> this HoL blocking.

We can allow out-of-order completion; no guest that follows the virtio
spec will break. But this won't help in all cases:
- a single slow flow can occupy the whole ring, so you will not
  be able to make any new buffers available for the fast flow
  (see the toy sketch below)
- what the host considers a single flow can be multiple flows for the guest

There are many other examples.
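
To make the first point concrete, here's a toy model (a hypothetical
illustration, not actual vhost or virtio code) of why strictly in-order
harvesting lets one slow entry pin the whole ring:

#include <stdbool.h>

/* Toy model: entries complete in any order, but are reclaimed
 * strictly in order from the head.  If ring[head] belongs to a
 * slow flow, nothing behind it can be reclaimed, so no new
 * buffers can be posted for the fast flows either.
 */
struct entry { bool done; };

static int reclaim_in_order(const struct entry *ring, int head, int size)
{
	int freed = 0;

	while (freed < size && ring[(head + freed) % size].done)
		freed++;
	return freed;	/* 0 whenever the head entry is still in flight */
}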

> > Neither
> > do I see why would using tx interrupts within guest be a work around -
> > AFAIK windows driver uses tx interrupts.
> 
> It does not address completion latency itself. What I meant was
> that in an interrupt-driven model, additional starvation issues,
> such as the potential deadlock raised at the start of this thread,
> or the timer delay observed before packets were orphaned in
> virtio-net in commit b0c39dbdc204, are mitigated.
> 
> Specifically, it breaks the potential deadlock where sockets are
> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
> yet completion handling is blocked waiting for a new packet to
> trigger free_old_xmit_skbs from start_xmit.

This talk of a potential deadlock confuses me - I think you mean we would
deadlock if we did not orphan skbs in !use_napi - is that right?  If you
mean that you can drop the skb orphan and this won't lead to a deadlock as
long as we free skbs upon a tx interrupt, then I agree, for sure.
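
For reference, the !use_napi path in question (abridged from
drivers/net/virtio_net.c in net-next, from memory - check the tree for
the exact code):

static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);
	struct send_queue *sq = &vi->sq[skb_get_queue_mapping(skb)];
	bool use_napi = sq->napi.weight;

	/* Free up any pending old buffers before queueing new ones.
	 * Without tx napi this is the only reclaim point, hence the
	 * dependency on further packets arriving that Willem describes.
	 */
	free_old_xmit_skbs(sq);

	/* ... xmit_skb() and queue-full handling elided ... */

	/* Don't wait up for transmitted skbs to be freed: orphaning
	 * here is what unblocks the sndbuf/TSQ accounting above.
	 */
	if (!use_napi) {
		skb_orphan(skb);
		nf_reset(skb);
	}

	/* ... kick the device, stop the queue if nearly full ... */
	return NETDEV_TX_OK;
}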

> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
> >
> > We don't enable network watchdog on virtio but we could and maybe
> > should.
> 
> Can you elaborate?

The issue is that holding onto buffers for very long times makes guests
think they are stuck. This is fundamentally because, from the guest's
point of view, this is a NIC, so it is supposed to transmit things out in
a timely manner. If the host backs the virtual NIC with something that is
not a NIC, with traffic shaping etc. introducing unbounded latencies,
the guest will be confused.

For example, we could set ndo_tx_timeout within the guest. Then
if a tx queue is stopped for too long, a watchdog would fire.
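
Something along these lines (a sketch only; virtnet_tx_timeout is a
hypothetical name, virtio-net does not wire up this callback today):

/* Hypothetical sketch: the stack calls ndo_tx_timeout once a stopped
 * tx queue has been stuck for longer than dev->watchdog_timeo.
 */
static void virtnet_tx_timeout(struct net_device *dev)
{
	netdev_err(dev, "tx timeout: host holding buffers too long?\n");
	/* e.g. reset the device, or fall back to copy tx */
}

static const struct net_device_ops virtnet_netdev = {
	.ndo_start_xmit = start_xmit,
	/* existing ops elided */
	.ndo_tx_timeout = virtnet_tx_timeout,
};

/* in virtnet_probe():
 *	dev->watchdog_timeo = 5 * HZ;
 */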

We worked around most of the issues by introducing a guest/host
copy. This copy, done by vhost-net, allows us to pretend that
a non-NIC backend (e.g. a qdisc) is a NIC (virtio-net).
This way you can both do traffic shaping in the host with
unbounded latencies and limit latency from the guest's point of view.

The cost is both the data copies and the loss of end-to-end credit accounting.
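
For reference, the "practical compromise" mentioned at the top of the
thread is this heuristic in drivers/vhost/net.c (quoted from memory,
check the tree): keep using zerocopy only while roughly fewer than
1 in 64 recent tx packets have hit a zerocopy error.

static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
{
	/* TX flush waits for outstanding DMAs to be done.
	 * Don't start new DMAs.
	 */
	return !net->tx_flush &&
		net->tx_packets / 64 >= net->tx_zcopy_err;
}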

Changing Linux as a host to limit latencies while not doing copies will
not be an easy task, but that's the only fix that comes to mind.

-- 
MST
