Message-ID: <87d085zwy9.fsf@nanos.tec.linutronix.de>
Date:   Fri, 17 Apr 2020 22:19:58 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Marc Dionne <marc.c.dionne@...il.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Cc:     x86@...nel.org
Subject: Re: FreeNAS VM disk access errors, bisected to commit 6f1a4891a592

Marc,

Marc Dionne <marc.c.dionne@...il.com> writes:

> Commit 6f1a4891a592 ("x86/apic/msi: Plug non-maskable MSI affinity
> race") causes Linux VMs hosted on FreeNAS (bhyve hypervisor) to lose
> access to their disk devices shortly after boot.  The disks are zfs
> zvols on the host, presented to each VM.
>
> Background: I recently updated some fedora 31 VMs running under the
> bhyve hypervisor (hosted on a FreeNAS mini), and they moved to a
> distro 5.5 kernel (5.5.15).  Shortly after reboot, the disks became
> inaccessible with any operation getting EIO errors.  Booting back into
> a 5.4 kernel, everything was fine.  I built a 5.7-rc1 kernel, which
> showed the same symptoms, and was then able to bisect it down to
> commit 6f1a4891a592.  Note that the symptoms do not occur on every
> boot, but often enough (roughly 80%) to make bisection possible.
>
> Applying a manual revert of 6f1a4891a592 on top of mainline from
> yesterday gives me a kernel that works fine.

We verified on real hardware and on various hypervisors that the fix
actually works correctly.

That makes me assume that the staged approach of changing affinity for
this non-maskable MSI mess makes your particular hypervisor unhappy.
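
Just to illustrate what "staged" means here, below is a minimal
user-space sketch (illustrative names only, not the actual kernel code)
of the underlying problem: the MSI message is made of an address
(destination CPU) and data (vector) which cannot be rewritten
atomically, so the device can fire with a half-updated message. The
staged update makes sure the transient combination (old destination,
new vector) has a handler installed before the write; if that reservation
is skipped, you get exactly the kind of "No irq handler for vector"
message asked about below. The CPU and vector numbers are made up.

/*
 * Illustrative sketch, not kernel code: why a non-maskable MSI
 * affinity change has to be staged.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS    4
#define NR_VECTORS 256

struct msi_msg { int dest_cpu; unsigned int vector; };

/* Per-CPU vector table: true means a handler is installed. */
static bool handler[NR_CPUS][NR_VECTORS];

/* What happens if the device samples the message right now. */
static void device_fires(const struct msi_msg *hw)
{
	if (!handler[hw->dest_cpu][hw->vector])
		printf("do_IRQ: %d.%u No irq handler for vector\n",
		       hw->dest_cpu, hw->vector);
	else
		printf("IRQ handled on CPU%d, vector %u\n",
		       hw->dest_cpu, hw->vector);
}

int main(void)
{
	struct msi_msg hw  = { .dest_cpu = 0, .vector = 53 };
	struct msi_msg new = { .dest_cpu = 3, .vector = 83 };

	handler[0][53] = true;		/* old vector on old CPU */
	handler[3][83] = true;		/* new vector on new CPU */

	/* Staging step: reserve the new vector on the *old* CPU too,
	 * so the transient state below still has a handler. */
	handler[0][83] = true;

	hw.vector = new.vector;		/* write data (vector) first */
	device_fires(&hw);		/* transient: old dest, new vector */
	hw.dest_cpu = new.dest_cpu;	/* then write address (dest) */
	device_fires(&hw);		/* final: new dest, new vector */

	return 0;
}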

Are there any messages like this:

 "do_IRQ: 0.83 No irq handler for vector"

in dmesg on the Linux side? If so, do they show up before the disk
timeouts happen?

I have absolutely zero knowledge about bhyve, so may I suggest talking
to the bhyve experts about this.

Thanks,

        tglx
