linux-kernel - Re: Fwd: About patch bdedff263132

Message-ID: <CAE8KmOyrBmUq-38aVig16mEc5h4jwaTYHRY2rRWvMAn6wmKkAg@mail.gmail.com>
Date: Wed, 3 Jan 2024 13:34:35 +0530
From: Prasad Pandit <ppandit@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: Fwd: About patch bdedff263132 - KVM: x86: Route pending NMIs

Hello Sean,

On Wed, 3 Jan 2024 at 04:30, Sean Christopherson <seanjc@...gle.com> wrote:
> Heh, I don't know that I would describe "412 microseconds" as "indefinitely", but
> it's certainly a long time, especially during boot.

* I said "indefinitely" because the guest never comes out of it. I left
it running overnight and it still did not boot.

> Piecing things together, the issue is I was wrong about the -EAGAIN exit being
> benign.
>
> QEMU responds to the spurious exit by bailing from the vCPU's inner runloop, and
> when that happens, the associated task (briefly) acquires a global mutex, the
> so called BQL (Big QEMU Lock).  I assumed that QEMU would eat the -EAGAIN and do
> nothing interesting, but QEMU interprets the -EAGAIN as "there might be a global
> state change the vCPU needs to handle".
>
> As you discovered, having 9 vCPUs constantly acquiring and releasing a single
> mutex makes for slow going when vCPU0 needs to acquire said mutex, e.g. to do
> emulated MMIO.
>
> Ah, and the other wrinkle is that KVM won't actually yield during KVM_RUN for
> UNINITIALIZED vCPUs, i.e. all those vCPU tasks will stay at 100% utilization even
> though there's nothing for them to do.  That may or may not matter in your case,
> but it would be awful behavior in a setup with oversubscribed vCPUs.
...
> Yeah, that's kinda sorta what's happening, although that comment is about requests
> that are never cleared in *any* path, e.g. violation of that rule causes a vCPU
> to be 100% stuck.

* I see, interesting.
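
To make the -EAGAIN explanation above concrete, here is a toy model of one vCPU thread's run loop. It is only a sketch, not QEMU or KVM code: `bql`, `run_vcpu` and `retry_on_eagain` are hypothetical names, and the fake exit codes stand in for KVM_RUN return values. It counts how often the global lock is taken when every -EAGAIN makes the thread bail out of its inner loop, versus when -EAGAIN is simply retried.

```python
import errno
import threading

# Hypothetical stand-in for QEMU's Big QEMU Lock (BQL); not QEMU's real API.
bql = threading.Lock()


def run_vcpu(exit_codes, retry_on_eagain):
    """Model one vCPU thread's run loop over a list of fake KVM_RUN results.

    Returns the number of times the global lock was taken.
    """
    lock_acquisitions = 0
    for ret in exit_codes:
        if ret == -errno.EAGAIN and retry_on_eagain:
            # Treat the exit as benign and go straight back into the "guest".
            continue
        # Otherwise bail out of the inner loop and take the global lock,
        # as if there might be global state for the vCPU to handle.
        with bql:
            lock_acquisitions += 1
    return lock_acquisitions


# An UNINITIALIZED vCPU that only ever sees spurious -EAGAIN exits.
spurious = [-errno.EAGAIN] * 1000
print(run_vcpu(spurious, retry_on_eagain=False))  # 1000
print(run_vcpu(spurious, retry_on_eagain=True))   # 0
```

With several such threads doing this in a tight loop, the thread that really needs the lock (vCPU0 doing emulated MMIO in the case above) has to compete with all of them for it.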

> I'm not 100% confident there isn't something else going on, e.g. a 400+ microsecond
> wait time is a little odd,

* It could be the vCPU thread's scheduling priority/policy.

> but this is inarguably a KVM regression and I doubt it's worth anyone's time to dig deeper.
> Can you give me a Signed-off-by for this?  I'll write a changelog and post a proper patch.

* I have sent a formal patch to you. Please feel free to edit the
commit/change log as you see fit. Thanks so much.

Thank you.
---
  - Prasad