linux-kernel - Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

Message-ID: <20170331032231.GA2471@redhat.com>
Date:   Fri, 31 Mar 2017 06:22:31 +0300
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Mike Galbraith <efault@....de>
Cc:     Christoph Hellwig <hch@....de>,
        Thorsten Leemhuis <linux@...mhuis.info>,
        virtio-dev@...ts.oasis-open.org,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        rjones@...hat.com
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared
 interrupts for virtqueues")

On Fri, Mar 31, 2017 at 04:23:35AM +0300, Michael S. Tsirkin wrote:
> On Thu, Mar 30, 2017 at 09:20:35AM +0200, Mike Galbraith wrote:
> > On Thu, 2017-03-30 at 05:10 +0200, Mike Galbraith wrote:
> > 
> > > WRT spin, you should need do nothing more than boot with threadirqs,
> > > that's 100% repeatable here in absolutely virgin source.
> > 
> > No idea why virtqueue_get_buf() in __send_control_msg() fails forever
> > with threadirqs, but marking that vq as being busted (it clearly is)
> > results in one gripe, and a vbox that seemingly cares not one whit that
> > something went missing.  CONFIG_DEBUG_SHIRQ OTOH notices, mutters
> > something that sounds like "idiot" when I hibernate the thing ;-)
> > 
> > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > index e9b7e0b3cabe..831406dae1cb 100644
> > --- a/drivers/char/virtio_console.c
> > +++ b/drivers/char/virtio_console.c
> > @@ -567,6 +567,7 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
> >  	struct scatterlist sg[1];
> >  	struct virtqueue *vq;
> >  	unsigned int len;
> > +	unsigned long deadline = jiffies+1;
> >  
> >  	if (!use_multiport(portdev))
> >  		return 0;
> > @@ -583,9 +584,13 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
> >  
> >  	if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) {
> >  		virtqueue_kick(vq);
> > -		while (!virtqueue_get_buf(vq, &len)
> > -			&& !virtqueue_is_broken(vq))
> > +		while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq)) {
> >  			cpu_relax();
> > +			if (time_after(jiffies, deadline)) {
> > +				trace_printk("Aw crap, I'm stuck.. breaking device\n");
> > +				virtio_break_device(portdev->vdev);
> > +			}
> > +		}
> >  	}
> >  
> >  	spin_unlock(&portdev->c_ovq_lock);
> 
> 
> OK so with your help I was able to reproduce. Surprisingly easy:
> 
> 1. add threadirqs
> 2. add to qemu -device virtio-serial-pci -no-shutdown
> 3. within guest, do echo disk > /sys/power/state
> 
> This produces a warning. Looking deeper into it, I find:
> the device has 64 vqs. This line
> 
>                err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
>                                   vring_interrupt, IRQF_SHARED,
>                                   vp_dev->msix_names[j], vqs[i]);
> 
> fails after assigning interrupts to 33 vqs.
> Is there a limit to how many threaded irqs can share a line?

In fact it fails on the 33rd one, and I see this:

/*
 * Unlikely to have 32 resp 64 irqs sharing one line,
 * but who knows.
 */
if (thread_mask == ~0UL) {
	printk(KERN_ERR "%s +%d\n", __FILE__, __LINE__);
	ret = -EBUSY;
	goto out_mask;
}


I'm not sure why it fails after 32 on 64-bit, but since
virtio devices aren't limited to 32 vqs, it looks like we
should go back to requesting the irq only once for all vqs.
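
One possible explanation for the 32-entry limit on a 64-bit build, offered only as a guess: if the per-action bit is computed as a 32-bit signed int before being OR'ed into the 64-bit mask, bit 31 sign-extends and fills the upper half, so the mask reads as all-ones after 32 sharers. The userspace sketch below models just that arithmetic. The names first_zero_bit() and count_allocations() are made up for illustration and this is not the kernel code.

```c
#include <stdint.h>

/* Index of the first zero bit in mask; the caller guarantees one exists. */
static int first_zero_bit(uint64_t mask)
{
	int bit = 0;

	while (mask & (UINT64_C(1) << bit))
		bit++;
	return bit;
}

/*
 * Count how many sharers get a bit before the mask reads as full.
 * wide_shift == 0 models computing the new bit as a 32-bit signed int
 * (an assumption about the failure, not confirmed here);
 * wide_shift != 0 models computing it in the full 64-bit type.
 */
static int count_allocations(int wide_shift)
{
	uint64_t mask = 0;
	int granted = 0;

	while (mask != UINT64_MAX) {
		int bit = first_zero_bit(mask);
		uint64_t new_bit;

		if (wide_shift) {
			new_bit = UINT64_C(1) << bit;
		} else {
			/*
			 * bit never exceeds 31 on this path: once bit 31 is
			 * handed out the mask is already full. Converting
			 * 0x80000000 to int32_t is implementation-defined;
			 * common compilers yield INT32_MIN, which then
			 * sign-extends to 0xffffffff80000000.
			 */
			int32_t narrow = (int32_t)(UINT32_C(1) << bit);

			new_bit = (uint64_t)(int64_t)narrow;
		}
		mask |= new_bit;
		granted++;
	}
	return granted;
}
```

Calling count_allocations(0) models the narrow shift and count_allocations(1) the full-width one. If the narrow case is what happens here, the 33rd request would see a full mask and get -EBUSY, which would match what I observe.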

Christoph, should I just revert for now, or do you
want to look into a smaller patch for this?

Another question is intx support: that should work,
but it seems to be broken at the moment.


> 
> If so we need to rethink the whole approach.
> 
> Still looking into it.
> 
> Christoph, any idea?
> 
> 
> -- 
> MST