Message-ID: <CAB9dFdvJE0LhQsxdUTKmOxp_q1xF1Bpe9E-dNp1Pxg3T0B1xPQ@mail.gmail.com>
Date: Fri, 17 Apr 2020 21:49:30 -0300
From: Marc Dionne <marc.c.dionne@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
x86@...nel.org
Subject: Re: FreeNAS VM disk access errors, bisected to commit 6f1a4891a592
On Fri, Apr 17, 2020 at 5:19 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> Marc,
>
> Marc Dionne <marc.c.dionne@...il.com> writes:
>
> > Commit 6f1a4891a592 ("x86/apic/msi: Plug non-maskable MSI affinity
> > race") causes Linux VMs hosted on FreeNAS (bhyve hypervisor) to lose
> > access to their disk devices shortly after boot. The disks are ZFS
> > zvols on the host, presented to each VM.
> >
> > Background: I recently updated some Fedora 31 VMs running under the
> > bhyve hypervisor (hosted on a FreeNAS Mini), and they moved to a
> > distro 5.5 kernel (5.5.15). Shortly after reboot, the disks became
> > inaccessible, with every operation failing with EIO. Booting back into
> > a 5.4 kernel, everything was fine. I built a 5.7-rc1 kernel, which
> > showed the same symptoms, and was then able to bisect it down to
> > commit 6f1a4891a592. Note that the symptoms do not occur on every
> > boot, but often enough (roughly 80%) to make bisection possible.
> >
> > Applying a manual revert of 6f1a4891a592 on top of mainline from
> > yesterday gives me a kernel that works fine.
>
> we verified on real hardware and on various hypervisors that the fix
> actually works correctly.
>
> That makes me assume that the staged approach of changing affinity for
> this non-maskable MSI mess makes your particular hypervisor unhappy.
>
> Are there any messages like this:
>
> "do_IRQ: 0.83 No irq handler for vector"

I haven't seen those, although I only have a VNC console, which scrolls
by rather fast.

I did see a report from someone running Ubuntu 18.04 whose log showed
the following after the initial errors:

  do_IRQ: 2.35 No irq handler for vector
  ata1.00: revalidation failed (error=-5)

> in dmesg on the Linux side? If they occur at all, they would appear
> before the disk timeout happens.
>
> I have absolutely zero knowledge about bhyve, so may I suggest talking
> to the bhyve experts about this.

I opened a ticket with iXsystems. I noticed several people reporting
the same problem in their community forums.

Marc