linux-kernel - Re: [BUG] Guest OSes die simultaneously (bisected)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <acb26142-622a-4f5d-aba6-c92601f95fbf@paulmck-laptop>
Date: Thu, 4 Jan 2024 11:23:52 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: Sean Christopherson <seanjc@...gle.com>,
	Like Xu <like.xu@...ux.intel.com>, Andi Kleen <ak@...ux.intel.com>,
	Kan Liang <kan.liang@...ux.intel.com>,
	Luwei Kang <luwei.kang@...el.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
	kvm@...r.kernel.org, Breno Leitao <leitao@...ian.org>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Thu, Jan 04, 2024 at 05:32:34PM +0100, Paolo Bonzini wrote:
> On 1/4/24 17:06, Paul E. McKenney wrote:
> > Although I am happy to have been able to locate the commit (and even
> > happier that Sean spotted the problem and that you quickly pushed the
> > fix to mainline!), chasing this consumed a lot of time and systems over
> > an embarrassingly large number of months.  As in I first spotted this
> > bug in late July.  Despite a number of increasingly complex attempts,
> > bisection became feasible only after the buggy commit was backported to
> > our internal v5.19 code base.  🙁
> 
> Yes, this strikes two sore points.
> 
> One is that I have also experienced being able to bisect only with a
> somewhat more linear history (namely the CentOS Stream 9 aka c9s
> frankenkernel [1]) and not with upstream.  Even if the c9s kernel is not a
> fully linear set of commits, there's some benefit from merge commits that
> consist of slightly more curated set of patches, where each merge commit
> includes both new features and bugfixes.  Unfortunately, whether you'll be
> able to do this with the c9s kernel depends a lot on the subsystems involved
> and on the bug.  Both are factors that may or may not be known in advance.

I guess I am glad that it is not just me.  ;-)

> The other, of course, is testing.  The KVM selftests infrastructure is meant
> for this kind of white box problem, but the space of tests that can be
> written is so large, that there's always too few tests.  It shines when you
> have a clear bisection but an unclear fix (in the past I have had cases
> where spending two days to write a test led me to writing a fix in thirty
> minutes), but boosting the reproducibility is always a good thing.

Agreed, validation never will be perfect, and so improving the test
suite based on production experience is a good thing, as is creating
test cases based on the behavior of important production workloads for
those who run them.

> > And please understand that I am not casting shade on those who wrote,
> > reviewed, and committed that buggy commit.  As in I freely confess that
> > I had to stare at Sean's fix for a few minutes before I figured out what
> > was going on.
> 
> Oh don't worry about that---rather, I am going to cast a shade on those that
> did not review the commit, namely me.  I am somewhat obsessed with Boolean
> logic and *probably* I would have caught it, or would have asked to split
> the use of designated initializers to a separate patch.  Any of the two
> could, at least potentially, have saved you quite some time.

We have all done similar things.  I certainly have!

> > Instead, the point I am trying to make is that carefully
> > constructed tests can serve as tireless and accurate code reviewers.
> > This won't ever replace actual code review, but my experience indicates
> > that it will help find more bugs more quickly and more easily.
> 
> TBH this (conflict between virtual addresses on the host and the guest
> leading to corruption of the guest) is probably not the kind of adversarial
> test that one would have written or suggested right off the bat.  But it
> should be written now indeed.

Very good, looking forward to seeing it!

							Thanx, Paul

> Paolo
> 
> [1]
> https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/
>