lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 2 Dec 2022 15:56:49 +0100
From:   Juergen Gross <jgross@...e.com>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     linux-kernel@...r.kernel.org, x86@...nel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v5 13/16] x86: decouple PAT and MTRR handling

On 02.12.22 15:33, Kirill A. Shutemov wrote:
> On Fri, Dec 02, 2022 at 02:39:58PM +0100, Juergen Gross wrote:
>> On 02.12.22 14:27, Kirill A. Shutemov wrote:
>>> On Fri, Dec 02, 2022 at 06:56:47AM +0100, Juergen Gross wrote:
>>>> On 02.12.22 00:57, Kirill A. Shutemov wrote:
>>>>> On Thu, Dec 01, 2022 at 05:33:28PM +0100, Juergen Gross wrote:
>>>>>> On 01.12.22 17:26, Kirill A. Shutemov wrote:
>>>>>>> On Wed, Nov 02, 2022 at 08:47:10AM +0100, Juergen Gross wrote:
>>>>>>>> Today PAT is usable only with MTRR being active, with some nasty tweaks
>>>>>>>> to make PAT usable when running as Xen PV guest, which doesn't support
>>>>>>>> MTRR.
>>>>>>>>
>>>>>>>> The reason for this coupling is, that both, PAT MSR changes and MTRR
>>>>>>>> changes, require a similar sequence and so full PAT support was added
>>>>>>>> using the already available MTRR handling.
>>>>>>>>
>>>>>>>> Xen PV PAT handling can work without MTRR, as it just needs to consume
>>>>>>>> the PAT MSR setting done by the hypervisor without the ability and need
>>>>>>>> to change it. This in turn has resulted in a convoluted initialization
>>>>>>>> sequence and wrong decisions regarding cache mode availability due to
>>>>>>>> misguiding PAT availability flags.
>>>>>>>>
>>>>>>>> Fix all of that by allowing to use PAT without MTRR and by reworking
>>>>>>>> the current PAT initialization sequence to match better with the newly
>>>>>>>> introduced generic cache initialization.
>>>>>>>>
>>>>>>>> This removes the need of the recently added pat_force_disabled flag, so
>>>>>>>> remove the remnants of the patch adding it.
>>>>>>>>
>>>>>>>> Signed-off-by: Juergen Gross <jgross@...e.com>
>>>>>>>
>>>>>>> This patch breaks boot for TDX guest.
>>>>>>>
>>>>>>> Kernel now tries to set CR0.CD which is forbidden in TDX guest[1] and
>>>>>>> causes #VE:
>>>>>>>
>>>>>>> 	tdx: Unexpected #VE: 28
>>>>>>> 	VE fault: 0000 [#1] PREEMPT SMP NOPTI
>>>>>>> 	CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-00015-gadfe7512e1d0 #2646
>>>>>>> 	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>>>>>> 	RIP: 0010:native_write_cr0 (arch/x86/kernel/cpu/common.c:427)
>>>>>>> 	 Call Trace:
>>>>>>> 	  <TASK>
>>>>>>> 	 ? cache_disable (arch/x86/include/asm/cpufeature.h:173 arch/x86/kernel/cpu/cacheinfo.c:1085)
>>>>>>> 	 ? cache_cpu_init (arch/x86/kernel/cpu/cacheinfo.c:1132 (discriminator 3))
>>>>>>> 	 ? setup_arch (arch/x86/kernel/setup.c:1079)
>>>>>>> 	 ? start_kernel (init/main.c:279 (discriminator 3) init/main.c:477 (discriminator 3) init/main.c:960 (discriminator 3))
>>>>>>> 	 ? load_ucode_bsp (arch/x86/kernel/cpu/microcode/core.c:155)
>>>>>>> 	 ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358)
>>>>>>> 	  </TASK>
>>>>>>>
>>>>>>> Any suggestion how to fix it?
>>>>>>>
>>>>>>> [1] Section 10.6.1. "CR0", https://cdrdv2.intel.com/v1/dl/getContent/733568
>>>>>>
>>>>>> What was the solution before?
>>>>>>
>>>>>> I guess MTRR was disabled, so there was no PAT, too?
>>>>>
>>>>> Right:
>>>>>
>>>>> Linus' tree:
>>>>>
>>>>> [    0.002589] last_pfn = 0x480000 max_arch_pfn = 0x10000000000
>>>>> [    0.003976] Disabled
>>>>> [    0.004452] x86/PAT: MTRRs disabled, skipping PAT initialization too.
>>>>> [    0.005856] CPU MTRRs all blank - virtualized system.
>>>>> [    0.006915] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
>>>>>
>>>>> tip/master:
>>>>>
>>>>> [    0.003443] last_pfn = 0x20b8e max_arch_pfn = 0x10000000000
>>>>> [    0.005220] Disabled
>>>>> [    0.005818] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
>>>>> [    0.007752] tdx: Unexpected #VE: 28
>>>>>
>>>>> The dangling "Disabled" comes mtrr_bp_init().
>>>>>
>>>>>
>>>>>> If this is the case, you can go the same route as Xen PV guests do.
>>>>>
>>>>> Any reason X86_FEATURE_HYPERVISOR cannot be used instead of
>>>>> X86_FEATURE_XENPV there?
>>>>>
>>>>> Do we have any virtualized platform that supports it?
>>>>
>>>> Yes, of course. Any hardware virtualized guest should be able to use it,
>>>> obviously TDX guests are the first ones not being able to do so.
>>>>
>>>> And above dmesg snipplets are showing rather nicely that not disabling
>>>> PAT completely should be a benefit for TDX guests, as all caching modes
>>>> would be usable (the PAT MSR seems to be initialized quite fine).
>>>>
>>>> Instead of X86_FEATURE_XENPV we could introduce something like
>>>> X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for
>>>> TDX guests.
>>>
>>> Technically, the MSR is writable on TDX. But it seems there's no way to
>>> properly change it, following the protocol of changing on MP systems.
>>
>> Why not? I don't see why it is possible in a non-TDX system, but not within
>> a TDX guest.
> 
> Because the protocol you described below requires setting CR0.CD which is
> not allowed in TDX guest and causes #VE.

Hmm, yes, seems to be a valid reason. :-)

> 
>>> Although, I don't quite follow what role cache disabling playing on system
>>> with self-snoop support. Hm?
>>
>> It is the recommended way to do it. See SDM Vol. 3 Chapter 11 ("Memory Cache
>> Control"):
>>
>> The operating system is responsible for insuring that changes to a PAT entry
>> occur in a manner that maintains the consistency of the processor caches and
>> translation lookaside buffers (TLB). This is accomplished by following the
>> procedure as specified in Section 11.11.8, “MTRR Considerations in MP Systems,
>> ”for changing the value of an MTRR in a multiple processor system. It requires
>> a specific sequence of operations that includes flushing the processors caches
>> and TLBs.
>>
>> And the sequence for the MTRRs is:
>>
>> 1. Broadcast to all processors to execute the following code sequence.
>> 2. Disable interrupts.
>> 3. Wait for all processors to reach this point.
>> 4. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1
>>     and the NW flag to 0.)
>> 5. Flush all caches using the WBINVD instructions. Note on a processor that
>>     supports self-snooping, CPUID feature flag bit 27, this step is unnecessary.
>> 6. If the PGE flag is set in control register CR4, flush all TLBs by clearing
>>     that flag.
>> 7. If the PGE flag is clear in control register CR4, flush all TLBs by executing
>>     a MOV from control register CR3 to another register and then a MOV from that
>>     register back to CR3.
>> 8. Disable all range registers (by clearing the E flag in register MTRRdefType).
>>     If only variable ranges are being modified, software may clear the valid bits
>>     for the affected register pairs instead.
>> 9. Update the MTRRs.
>> 10. Enable all range registers (by setting the E flag in register MTRRdefType).
>>      If only variable-range registers were modified and their individual valid
>>      bits were cleared, then set the valid bits for the affected ranges instead.
>> 11. Flush all caches and all TLBs a second time. (The TLB flush is required for
>>      Pentium 4, Intel Xeon, and P6 family processors. Executing the WBINVD
>>      instruction is not needed when using Pentium 4, Intel Xeon, and P6 family
>>      processors, but it may be needed in future systems.)
>> 12. Enter the normal cache mode to re-enable caching. (Set the CD and NW flags
>>      in control register CR0 to 0.)
>> 13. Set PGE flag in control register CR4, if cleared in Step 6 (above).
>> 14. Wait for all processors to reach this point.
>> 15. Enable interrupts.
>>
>> So cache disabling is recommended.
> 
> Yeah, I read that.
> 
> But the question is what kind of scenario cache disabling is actually
> prevents if self-snoop is supported? In this case cache stays intact (no
> WBINVD). The next time a cache line gets accessed with different caching
> mode the old line gets snooped, right?
> 
> Would it be valid to avoid touching CR0.CD if X86_FEATURE_SELFSNOOP?
> 

That's a question for the Intel architects, I guess.

I'd just ask them how to setup PAT in TDX guests. Either they need to
change the recommended setup sequence, or the PAT support bit needs to
be cleared IMO.


Juergen

Download attachment "OpenPGP_0xB0DE9DD628BF132F.asc" of type "application/pgp-keys" (3099 bytes)

Download attachment "OpenPGP_signature" of type "application/pgp-signature" (496 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ