linux-kernel - Re: [BUG] Guest OSes die simultaneously (bisected)

Message-ID: <b327b546-4a5f-462d-baeb-804a33bd3f6a@redhat.com>
Date: Thu, 4 Jan 2024 17:32:34 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: paulmck@...nel.org
Cc: Sean Christopherson <seanjc@...gle.com>, Like Xu
 <like.xu@...ux.intel.com>, Andi Kleen <ak@...ux.intel.com>,
 Kan Liang <kan.liang@...ux.intel.com>, Luwei Kang <luwei.kang@...el.com>,
 Peter Zijlstra <peterz@...radead.org>, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
 Breno Leitao <leitao@...ian.org>, Arnaldo Carvalho de Melo
 <acme@...nel.org>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On 1/4/24 17:06, Paul E. McKenney wrote:
> Although I am happy to have been able to locate the commit (and even
> happier that Sean spotted the problem and that you quickly pushed the
> fix to mainline!), chasing this consumed a lot of time and systems over
> an embarrassingly large number of months.  As in I first spotted this
> bug in late July.  Despite a number of increasingly complex attempts,
> bisection became feasible only after the buggy commit was backported to
> our internal v5.19 code base.  🙁

Yes, this strikes two sore points.

One is that I have also had the experience of being able to bisect only
with a somewhat more linear history (namely the CentOS Stream 9, aka c9s,
frankenkernel [1]) and not with upstream.  Even if the c9s kernel is not
a fully linear set of commits, there's some benefit from merge commits
that consist of a slightly more curated set of patches, where each merge
commit includes both new features and bugfixes.  Unfortunately, whether
you'll be able to do this with the c9s kernel depends a lot on the
subsystems involved and on the bug.  Both are factors that may or may
not be known in advance.

The other, of course, is testing.  The KVM selftests infrastructure is
meant for this kind of white box problem, but the space of tests that
can be written is so large that there are always too few tests.  It
shines when you have a clear bisection but an unclear fix (in the past I
have had cases where spending two days to write a test led me to writing
a fix in thirty minutes), but boosting the reproducibility is always a
good thing.

> And please understand that I am not casting shade on those who wrote,
> reviewed, and committed that buggy commit.  As in I freely confess that
> I had to stare at Sean's fix for a few minutes before I figured out what
> was going on.

Oh, don't worry about that; rather, I am going to cast shade on those
who did not review the commit, namely me.  I am somewhat obsessed with
Boolean logic and *probably* I would have caught it, or would have asked
to split the use of designated initializers to a separate patch.  Either
of the two could, at least potentially, have saved you quite some time.

> Instead, the point I am trying to make is that carefully
> constructed tests can serve as tireless and accurate code reviewers.
> This won't ever replace actual code review, but my experience indicates
> that it will help find more bugs more quickly and more easily.

TBH this (conflict between virtual addresses on the host and the guest
leading to corruption of the guest) is probably not the kind of
adversarial test that one would have written or suggested right off the
bat.  But it should indeed be written now.

Paolo

[1] https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/