linux-kernel - Re: Major KVM issues with kernel 4.5 on the host

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160423160429.GL8531@pd.tnic>
Date:	Sat, 23 Apr 2016 18:04:29 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Marc Haber <mh+linux-kernel@...schlus.de>
Cc:	Paolo Bonzini <pbonzini@...hat.com>, linux-kernel@...r.kernel.org,
	kvm ML <kvm@...r.kernel.org>
Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> Yes, but there are two symptoms. The VM either suffers file system
> issues (garbage read from files, or an aborted ext4 journal and
> following ro remount) or it stops dead in its tracks.

Stops dead? What does that mean exactly? Box is wedged solid and it
doesn't react to any key presses?

Because if so, this could really be a DRAM going bad and a correctable
error turning into an uncorrectable. How old is the DRAM in that box?
Judging by your CPU, it should be a couple of years...

> The longest trigger time I have seen was three hours, I tripled that
> to nine hours, that probably was not enough.

So enlarge even more I guess.

> The box reports about one correctable error per week, so I probably
> have a faulty DIMM, but since the issue only surfaces in VMs while the
> host system is in perfect working order...

So it could be that correctable error turns into an uncorrectable one at
some point. But then you should be getting an exception...

> And yes, I am pondering to simply replace the box with an Intel CPU.

Your CPU is fine, from what I've seen so far.

> I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
> 30 "Machine check events logged" since January. How do I see which
> events were logged?

Hmm, you have

[   18.149300] MCE: In-kernel MCE decoding enabled.

that's CONFIG_EDAC_DECODE_MCE, so you should have some "Hardware Error"
lines in dmesg, I'd guess, decoding the errors.

> So you basically select the default for new options.

Yap.

> I go the way of Debian packages since it is easier to handle the
> crypto file systems when the machine is booting up.

As long as you're testing the correct bisection kernels...

> And yes, I think about doing a test reinstall on unencrypted disk to
> find out whether encryption plays a role, but I currently need the
> machine to urgently to take it out of serice for half a month, and,
> again, the host system is in perfect working order, it is just VMs
> that barf.

Yeah, I can't reproduce it here and I have a very similar box to yours
which is otherwise idle, more or less.

Another fact which points to potentially DIMM going bad...

> I check the date of the package I am installing and the date stamp of
> the kernels being installed to /boot. I'm reasonably sure I have that
> under control.

Good.

> ... and if testing a "good" kernel means a day.

Yeah, it is annoying. In a perfect world, we all should have two
identical boxes so that we use one as a workstation and the second for
testing when the first one, the workstation barfs. I should bring that
up with my manager next time... :-)

> And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
> right?

Yap.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.