linux-kernel - Re: [PATCH] x86/CPU/AMD: Ignore invalid reset reason value


Message-ID: <3cc16f7d-c650-43f2-b0ca-d99c427cd69b@amd.com>
Date: Thu, 24 Jul 2025 16:02:34 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Sean Christopherson <seanjc@...gle.com>, Borislav Petkov <bp@...en8.de>
Cc: Yazen Ghannam <yazen.ghannam@....com>, x86@...nel.org,
 linux-kernel@...r.kernel.org, Libing He <libhe@...hat.com>,
 David Arcari <darcari@...hat.com>
Subject: Re: [PATCH] x86/CPU/AMD: Ignore invalid reset reason value

On 7/24/2025 3:58 PM, Sean Christopherson wrote:
> On Wed, Jul 23, 2025, Borislav Petkov wrote:
>> On July 23, 2025 9:34:26 PM GMT+03:00, Yazen Ghannam <yazen.ghannam@....com> wrote:
>>> On Tue, Jul 22, 2025 at 06:56:15PM +0200, Borislav Petkov wrote:
>>>> On Mon, Jul 21, 2025 at 06:11:54PM +0000, Yazen Ghannam wrote:
>>>>> The reset reason value may be "all bits set", e.g. 0xFFFFFFFF. This is a
>>>>> commonly used error response from hardware. This may occur due to a real
>>>>> hardware issue or when running in a VM.
>>>>
>>>> Well, which is it Libing is reporting? VM or a real hw issue?
>>>>
>>>
>>> In this case, it was a VM.
>>>
>>>> If it is a VM, is that -1 the only thing a VMM returns when reading that
>>>> MMIO address or can it be anything?
>>>>
>>>> If latter, you need to check X86_FEATURE_HYPERVISOR.
>>>>
>>>> Same for a real hw issue.
>>>>
>>>> IOW, is -1 the *only* invalid data we can read here or are we playing
>>>> whack-a-mole with it?
>>>>
>>>
>>> I see your point, but I don't think we can know for sure all possible
>>> cases. There are some reserved bits that shouldn't be set. But these
>>> definitions could change in the future.
>>>
>>> And it'd be a pain to try and verify combinations of bits and configs.
>>> Like can bit A and B be set together, or can bit C be set while running
>>> in a VM, or can bit D ever be set on Model Z?
>>>
>>> The -1 (all bits set) is the only "applies to all cases" invalid data,
>>> since this is a common hardware error response. So we can at least check
>>> for this.
>>>
>>> Thanks,
>>> Yazen
>>
>> I think you should check both: HV or -1.
>>
>> HV covers the VM angle as they don't emulate this
> 
> You can't possibly know that.  If there exists a hardware spec of any kind, it's
> fair game for emulation.
> 
>> and we simply should disable this functionality when running as a guest.
>>
>> -1 covers the known-bad hw value.
> 
> And in a guest, -1, i.e. 0xffffffff is all but guaranteed to come from the VMM
> providing PCI master abort semantics for reads to MMIO where no device exists.
> That's about as "architectural" of behavior as you're going to get, so I don't
> see any reason to assume no VMM will ever emulate whatever this feature is.

I don't really understand why there would be any value in a VMM 
emulating this feature.  It's specifically about the reason the hardware 
saw for the last reboot.  Those reasons are *hardware reasons*.  I.e., 
you're never going to see a thermal event as the reason a guest was 
rebooted.

CF9 reset or ACPI power state transition are about all I can envision 
for guest reboot reasons.  And even then, do you really *want* the 
VMM to track the reasons for a guest reboot?
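
For reference, the "HV or -1" check suggested earlier in the thread could be
sketched roughly as below.  This is a standalone illustration, not the patch
under discussion: reset_reason_usable() and RESET_REASON_INVALID are
hypothetical names, and the running_as_guest flag stands in for the
X86_FEATURE_HYPERVISOR check mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All bits set: the common error response for a failed 32-bit read. */
#define RESET_REASON_INVALID UINT32_MAX

/*
 * Hypothetical helper (an assumption, not the kernel's code): decide
 * whether a raw reset reason value is worth decoding at all.
 *
 * running_as_guest is a stand-in for the X86_FEATURE_HYPERVISOR check.
 */
bool reset_reason_usable(uint32_t value, bool running_as_guest)
{
	/* The reasons describe hardware events, so skip them in a guest. */
	if (running_as_guest)
		return false;

	/* -1 is the one "applies to all cases" invalid value. */
	if (value == RESET_REASON_INVALID)
		return false;

	return true;
}

/* Small usage example exercising the three cases from the thread. */
void reset_reason_usable_examples(void)
{
	assert(!reset_reason_usable(0xFFFFFFFFu, false)); /* bad read on bare metal */
	assert(!reset_reason_usable(0x00000001u, true));  /* any value in a guest */
	assert(reset_reason_usable(0x00000001u, false));  /* plausible value on hw */
}
```

Note that the sketch deliberately does not validate reserved bits or bit
combinations, since those definitions could change, as Yazen pointed out.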