Message-ID: <20240807063041-mutt-send-email-mst@kernel.org>
Date: Wed, 7 Aug 2024 06:37:26 -0400
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Jason Wang <jasowang@...hat.com>
Cc: Heng Qi <hengqi@...ux.alibaba.com>, netdev@...r.kernel.org,
	Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
	Eugenio Pérez <eperezma@...hat.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Subject: Re: [PATCH net-next] virtio_net: Prevent misidentified spurious
 interrupts from killing the irq

On Wed, Aug 07, 2024 at 12:06:16PM +0800, Jason Wang wrote:
> On Tue, Aug 6, 2024 at 9:25 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> >
> > On Tue, Aug 06, 2024 at 11:18:14AM +0800, Jason Wang wrote:
> > > On Mon, Aug 5, 2024 at 2:29 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> > > >
> > > > On Mon, Aug 05, 2024 at 11:26:56AM +0800, Jason Wang wrote:
> > > > > On Fri, Aug 2, 2024 at 9:11 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> > > > > >
> > > > > > On Fri, Aug 02, 2024 at 11:41:57AM +0800, Jason Wang wrote:
> > > > > > > On Thu, Aug 1, 2024 at 9:56 PM Heng Qi <hengqi@...ux.alibaba.com> wrote:
> > > > > > > >
> > > > > > > > Michael has effectively reduced the number of spurious interrupts in
> > > > > > > > commit a7766ef18b33 ("virtio_net: disable cb aggressively") by disabling
> > > > > > > > irq callbacks before cleaning old buffers.
> > > > > > > >
> > > > > > > > But it is still possible that the irq is killed by mistake:
> > > > > > > >
> > > > > > > >   When a delayed tx interrupt arrives, the old buffers have already been
> > > > > > > >   cleaned in other paths (start_xmit and virtnet_poll_cleantx), so the
> > > > > > > >   interrupt is mistakenly identified as spurious in vring_interrupt.
> > > > > > > >
> > > > > > > >   We should refrain from labeling it as a spurious interrupt; otherwise,
> > > > > > > >   note_interrupt may inadvertently kill the legitimate irq.
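
Roughly how that misidentification plays out, as a simplified paraphrase of
vring_interrupt() and the irq core's spurious-interrupt accounting; this is
not the exact kernel code:

/* Simplified paraphrase of drivers/virtio/virtio_ring.c:vring_interrupt(),
 * for illustration only.
 */
irqreturn_t vring_interrupt(int irq, void *_vq)
{
	struct vring_virtqueue *vq = to_vvq(_vq);

	/* If start_xmit()/virtnet_poll_cleantx() already reclaimed every
	 * used buffer, the ring looks idle here, and the delayed but
	 * legitimate tx interrupt is reported as not handled.
	 */
	if (!more_used(vq))
		return IRQ_NONE;

	/* ... normal callback path ... */
	return IRQ_HANDLED;
}

/* The irq core (note_interrupt() in kernel/irq/spurious.c) keeps count of
 * IRQ_NONE returns; if almost every interrupt on the line goes unhandled,
 * it eventually disables ("kills") the irq line.
 */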
> > > > > > >
> > > > > > > I think the evil comes from doing free_old_xmit() in start_xmit(). I
> > > > > > > know it is there for performance, but we may need to make the code
> > > > > > > work correctly instead of adding endless hacks. Personally, I think
> > > > > > > the virtio-net TX path is over-complicated. We probably pay too much
> > > > > > > (e.g. there's netif_tx_lock in the TX NAPI path) to try to "optimize"
> > > > > > > the performance.
> > > > > > >
> > > > > > > How about just not doing free_old_xmit there and doing it solely in the TX NAPI?
> > > > > >
> > > > > > Not getting interrupts is always better than getting interrupts.
> > > > >
> > > > > Not sure. For example, letting one CPU do the transmission without
> > > > > having to deal with the xmit skbs should give us better performance.
> > > >
> > > > Hmm. It's a subtle thing. I suspect that up to a certain limit
> > > > (e.g. a ping-pong test) free_old_xmit will win anyway.
> > >
> > > Not sure I understand here.
> >
> > If you transmit 1 packet and then wait for another one anyway,
> > you are better off just handling the tx interrupt.
> 
> Yes for light load, but probably not for heavy loads like pktgen and others.

If you are extremely busy sending packets, you don't really care
when they are freed, the vq is deep, you have another, idle CPU, and
sending an interrupt does not need a vmexit, then moving the work out
at the cost of an interrupt will be a win.


> >
> >
> > > >
> > > > > > This is not new code; there are no plans to erase it all and start
> > > > > > anew "to make it work correctly". It's widely deployed, and you will
> > > > > > cause performance regressions that are hard to debug.
> > > > >
> > > > > I actually meant the TX NAPI mode: we tried to hold the TX lock in the
> > > > > TX NAPI, which turned out to slow down both the transmission and the
> > > > > NAPI itself.
> > > > >
> > > > > Thanks
> > > >
> > > > We do need to synchronize anyway, though: virtio expects drivers to do
> > > > their own serialization of vq operations.
> > >
> > > Right, but currently add and get need to be serialized, which is a
> > > bottleneck. I don't see any issue with parallelizing them.
> >
> > Do you see this in traces?
> 
> I mean the current virtio_core requires the caller to serialize add/get:
> 
> virtqueue_add() {
>         START_USE()
>         END_USE()
> }
> 
> virtqueue_get() {
>         START_USE()
>         END_USE()
> }
> 
> It seems to be a limitation of the current driver, not of the spec itself,
> which means we can find some way to allow those operations to be executed
> in parallel.
> 
> One example is to use a ptr_ring to maintain a free id list; in the
> in-order case it is not even needed.
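
A rough sketch of that ptr_ring idea, only to illustrate the shape; the
vq_id_pool/vq_id_put/vq_id_get names are hypothetical and not from any
posted patch. ptr_ring (include/linux/ptr_ring.h) keeps producer and
consumer locking separate, which is what would let the submission and
completion paths stop contending on a single vq lock for id management:

/* Illustrative sketch only, not a worked-out patch. */
#include <linux/ptr_ring.h>
#include <linux/types.h>
#include <linux/errno.h>

struct vq_id_pool {
	/* Sized to the vq size at setup time via ptr_ring_init(), so
	 * producing a recycled id can never fail for lack of room.
	 */
	struct ptr_ring free_ids;
};

/* Completion path: recycle the descriptor id that came back with the used
 * buffer. Stored as id + 1 because ptr_ring cannot hold a NULL entry.
 * (Choosing the _bh/_irq locking variant is elided here.)
 */
static void vq_id_put(struct vq_id_pool *p, u16 id)
{
	ptr_ring_produce(&p->free_ids, (void *)(unsigned long)(id + 1));
}

/* Submission path: grab a free id for the next virtqueue_add(). */
static int vq_id_get(struct vq_id_pool *p, u16 *id)
{
	void *v = ptr_ring_consume(&p->free_ids);

	if (!v)
		return -ENOSPC;
	*id = (u16)((unsigned long)v - 1);
	return 0;
}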

All quite tricky.

But again - do you have traces showing contention on tx lock?

Until we do, it's pointless to try to optimize: making changes to the
code and seeing whether performance changes is not a good way to do
this; the system is too chaotic for that.


> >
> > > > You could instead try to move the skbs to some kind of array under the
> > > > tx lock, then free them all up later after unlocking tx.
> > > >
> > > > Can be helpful for batching as well?
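
A minimal sketch of that suggestion, with hypothetical names
(reclaim_old_xmit(), free_old_xmit_deferred()); it assumes the usual
virtio-net tx setup where the buffer token returned by virtqueue_get_buf()
is the skb. Used buffers are pulled off the vq under the tx lock, and the
skbs are only freed after the lock is dropped:

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/virtio.h>
#include <linux/smp.h>

/* Illustrative sketch only. Caller holds the tx lock: just move the
 * completed skbs onto a local list, do not free them yet.
 */
static void reclaim_old_xmit(struct virtqueue *vq, struct sk_buff_head *done)
{
	unsigned int len;
	struct sk_buff *skb;

	while ((skb = virtqueue_get_buf(vq, &len)) != NULL)
		__skb_queue_tail(done, skb);
}

/* Hypothetical caller, standing in for the path that used to call
 * free_old_xmit().
 */
static void free_old_xmit_deferred(struct netdev_queue *txq, struct virtqueue *vq)
{
	struct sk_buff_head done;
	struct sk_buff *skb;

	__skb_queue_head_init(&done);

	__netif_tx_lock(txq, smp_processor_id());
	reclaim_old_xmit(vq, &done);
	__netif_tx_unlock(txq);

	/* Freeing (and any batching of it) happens without the tx lock. */
	while ((skb = __skb_dequeue(&done)) != NULL)
		dev_consume_skb_any(skb);
}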
> > >
> > > It's worth a try and see.
> >
> > Why not.
> >
> > > >
> > > >
> > > > I also always wondered whether it is an issue that free_old_xmit
> > > > just polls the vq until it is empty, without a limit.
> > >
> > > Did you mean schedule a NAPI if free_old_xmit() exceeds the NAPI quota?
> >
> > yes
> >
> > > > napi is supposed to poll until a limit is reached.
> > > > I guess not many people have very deep vqs.
> > >
> > > The current NAPI weight is 64, so I think we can hit it under a stressful workload.
> > >
> > > Thanks
> >
> > Yes, but it's just an arbitrary number. Since we hold the tx lock,
> > we get at most vq-size bufs, so it's limited.
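
A minimal sketch of the budgeted variant discussed above; the name
reclaim_budgeted() and the plumbing are hypothetical. The loop stops after
`budget` buffers and tells the caller whether to hand the rest to the tx
NAPI instead of spinning with the tx lock held:

#include <linux/skbuff.h>
#include <linux/virtio.h>

/* Illustrative sketch only: reclaim at most `budget` used tx buffers.
 * Assumes the buffer token is the skb, as in the plain virtio-net tx path.
 */
static bool reclaim_budgeted(struct virtqueue *vq, int budget)
{
	unsigned int len;
	struct sk_buff *skb;
	int done = 0;

	while (done < budget &&
	       (skb = virtqueue_get_buf(vq, &len)) != NULL) {
		dev_consume_skb_any(skb);
		done++;
	}

	/* Budget exhausted: more used buffers may be pending, so the caller
	 * should schedule the tx NAPI rather than keep polling here. (At
	 * worst this costs one extra, empty NAPI poll.)
	 */
	return done == budget;
}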
> 
> Ok.
> 
> Thanks
> 
> >
> > > >
> > > > --
> > > > MST
> > > >
> >

