lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 5 Dec 2022 08:40:06 +0100
From:   Juergen Gross <jgross@...e.com>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     linux-kernel@...r.kernel.org, x86@...nel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v5 13/16] x86: decouple PAT and MTRR handling

On 02.12.22 15:56, Juergen Gross wrote:
> On 02.12.22 15:33, Kirill A. Shutemov wrote:
>> On Fri, Dec 02, 2022 at 02:39:58PM +0100, Juergen Gross wrote:
>>> On 02.12.22 14:27, Kirill A. Shutemov wrote:
>>>> On Fri, Dec 02, 2022 at 06:56:47AM +0100, Juergen Gross wrote:
>>>>> On 02.12.22 00:57, Kirill A. Shutemov wrote:
>>>>>> On Thu, Dec 01, 2022 at 05:33:28PM +0100, Juergen Gross wrote:
>>>>>>> On 01.12.22 17:26, Kirill A. Shutemov wrote:
>>>>>>>> On Wed, Nov 02, 2022 at 08:47:10AM +0100, Juergen Gross wrote:
>>>>>>>>> Today PAT is usable only with MTRR being active, with some nasty tweaks
>>>>>>>>> to make PAT usable when running as Xen PV guest, which doesn't support
>>>>>>>>> MTRR.
>>>>>>>>>
>>>>>>>>> The reason for this coupling is, that both, PAT MSR changes and MTRR
>>>>>>>>> changes, require a similar sequence and so full PAT support was added
>>>>>>>>> using the already available MTRR handling.
>>>>>>>>>
>>>>>>>>> Xen PV PAT handling can work without MTRR, as it just needs to consume
>>>>>>>>> the PAT MSR setting done by the hypervisor without the ability and need
>>>>>>>>> to change it. This in turn has resulted in a convoluted initialization
>>>>>>>>> sequence and wrong decisions regarding cache mode availability due to
>>>>>>>>> misguiding PAT availability flags.
>>>>>>>>>
>>>>>>>>> Fix all of that by allowing to use PAT without MTRR and by reworking
>>>>>>>>> the current PAT initialization sequence to match better with the newly
>>>>>>>>> introduced generic cache initialization.
>>>>>>>>>
>>>>>>>>> This removes the need of the recently added pat_force_disabled flag, so
>>>>>>>>> remove the remnants of the patch adding it.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Juergen Gross <jgross@...e.com>
>>>>>>>>
>>>>>>>> This patch breaks boot for TDX guest.
>>>>>>>>
>>>>>>>> Kernel now tries to set CR0.CD which is forbidden in TDX guest[1] and
>>>>>>>> causes #VE:
>>>>>>>>
>>>>>>>>     tdx: Unexpected #VE: 28
>>>>>>>>     VE fault: 0000 [#1] PREEMPT SMP NOPTI
>>>>>>>>     CPU: 0 PID: 0 Comm: swapper Not tainted 
>>>>>>>> 6.1.0-rc1-00015-gadfe7512e1d0 #2646
>>>>>>>>     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
>>>>>>>> 02/06/2015
>>>>>>>>     RIP: 0010:native_write_cr0 (arch/x86/kernel/cpu/common.c:427)
>>>>>>>>      Call Trace:
>>>>>>>>       <TASK>
>>>>>>>>      ? cache_disable (arch/x86/include/asm/cpufeature.h:173 
>>>>>>>> arch/x86/kernel/cpu/cacheinfo.c:1085)
>>>>>>>>      ? cache_cpu_init (arch/x86/kernel/cpu/cacheinfo.c:1132 
>>>>>>>> (discriminator 3))
>>>>>>>>      ? setup_arch (arch/x86/kernel/setup.c:1079)
>>>>>>>>      ? start_kernel (init/main.c:279 (discriminator 3) init/main.c:477 
>>>>>>>> (discriminator 3) init/main.c:960 (discriminator 3))
>>>>>>>>      ? load_ucode_bsp (arch/x86/kernel/cpu/microcode/core.c:155)
>>>>>>>>      ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358)
>>>>>>>>       </TASK>
>>>>>>>>
>>>>>>>> Any suggestion how to fix it?
>>>>>>>>
>>>>>>>> [1] Section 10.6.1. "CR0", https://cdrdv2.intel.com/v1/dl/getContent/733568
>>>>>>>
>>>>>>> What was the solution before?
>>>>>>>
>>>>>>> I guess MTRR was disabled, so there was no PAT, too?
>>>>>>
>>>>>> Right:
>>>>>>
>>>>>> Linus' tree:
>>>>>>
>>>>>> [    0.002589] last_pfn = 0x480000 max_arch_pfn = 0x10000000000
>>>>>> [    0.003976] Disabled
>>>>>> [    0.004452] x86/PAT: MTRRs disabled, skipping PAT initialization too.
>>>>>> [    0.005856] CPU MTRRs all blank - virtualized system.
>>>>>> [    0.006915] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
>>>>>>
>>>>>> tip/master:
>>>>>>
>>>>>> [    0.003443] last_pfn = 0x20b8e max_arch_pfn = 0x10000000000
>>>>>> [    0.005220] Disabled
>>>>>> [    0.005818] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
>>>>>> [    0.007752] tdx: Unexpected #VE: 28
>>>>>>
>>>>>> The dangling "Disabled" comes mtrr_bp_init().
>>>>>>
>>>>>>
>>>>>>> If this is the case, you can go the same route as Xen PV guests do.
>>>>>>
>>>>>> Any reason X86_FEATURE_HYPERVISOR cannot be used instead of
>>>>>> X86_FEATURE_XENPV there?
>>>>>>
>>>>>> Do we have any virtualized platform that supports it?
>>>>>
>>>>> Yes, of course. Any hardware virtualized guest should be able to use it,
>>>>> obviously TDX guests are the first ones not being able to do so.
>>>>>
>>>>> And above dmesg snipplets are showing rather nicely that not disabling
>>>>> PAT completely should be a benefit for TDX guests, as all caching modes
>>>>> would be usable (the PAT MSR seems to be initialized quite fine).
>>>>>
>>>>> Instead of X86_FEATURE_XENPV we could introduce something like
>>>>> X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for
>>>>> TDX guests.
>>>>
>>>> Technically, the MSR is writable on TDX. But it seems there's no way to
>>>> properly change it, following the protocol of changing on MP systems.
>>>
>>> Why not? I don't see why it is possible in a non-TDX system, but not within
>>> a TDX guest.
>>
>> Because the protocol you described below requires setting CR0.CD which is
>> not allowed in TDX guest and causes #VE.
> 
> Hmm, yes, seems to be a valid reason. :-)
> 
>>
>>>> Although, I don't quite follow what role cache disabling playing on system
>>>> with self-snoop support. Hm?
>>>
>>> It is the recommended way to do it. See SDM Vol. 3 Chapter 11 ("Memory Cache
>>> Control"):
>>>
>>> The operating system is responsible for insuring that changes to a PAT entry
>>> occur in a manner that maintains the consistency of the processor caches and
>>> translation lookaside buffers (TLB). This is accomplished by following the
>>> procedure as specified in Section 11.11.8, “MTRR Considerations in MP Systems,
>>> ”for changing the value of an MTRR in a multiple processor system. It requires
>>> a specific sequence of operations that includes flushing the processors caches
>>> and TLBs.
>>>
>>> And the sequence for the MTRRs is:
>>>
>>> 1. Broadcast to all processors to execute the following code sequence.
>>> 2. Disable interrupts.
>>> 3. Wait for all processors to reach this point.
>>> 4. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1
>>>     and the NW flag to 0.)
>>> 5. Flush all caches using the WBINVD instructions. Note on a processor that
>>>     supports self-snooping, CPUID feature flag bit 27, this step is unnecessary.
>>> 6. If the PGE flag is set in control register CR4, flush all TLBs by clearing
>>>     that flag.
>>> 7. If the PGE flag is clear in control register CR4, flush all TLBs by executing
>>>     a MOV from control register CR3 to another register and then a MOV from that
>>>     register back to CR3.
>>> 8. Disable all range registers (by clearing the E flag in register MTRRdefType).
>>>     If only variable ranges are being modified, software may clear the valid 
>>> bits
>>>     for the affected register pairs instead.
>>> 9. Update the MTRRs.
>>> 10. Enable all range registers (by setting the E flag in register MTRRdefType).
>>>      If only variable-range registers were modified and their individual valid
>>>      bits were cleared, then set the valid bits for the affected ranges instead.
>>> 11. Flush all caches and all TLBs a second time. (The TLB flush is required for
>>>      Pentium 4, Intel Xeon, and P6 family processors. Executing the WBINVD
>>>      instruction is not needed when using Pentium 4, Intel Xeon, and P6 family
>>>      processors, but it may be needed in future systems.)
>>> 12. Enter the normal cache mode to re-enable caching. (Set the CD and NW flags
>>>      in control register CR0 to 0.)
>>> 13. Set PGE flag in control register CR4, if cleared in Step 6 (above).
>>> 14. Wait for all processors to reach this point.
>>> 15. Enable interrupts.
>>>
>>> So cache disabling is recommended.
>>
>> Yeah, I read that.
>>
>> But the question is what kind of scenario cache disabling is actually
>> prevents if self-snoop is supported? In this case cache stays intact (no
>> WBINVD). The next time a cache line gets accessed with different caching
>> mode the old line gets snooped, right?
>>
>> Would it be valid to avoid touching CR0.CD if X86_FEATURE_SELFSNOOP?
>>
> 
> That's a question for the Intel architects, I guess.
> 
> I'd just ask them how to setup PAT in TDX guests. Either they need to
> change the recommended setup sequence, or the PAT support bit needs to
> be cleared IMO.

I've forwarded the question to Intel, BTW.

Another question to you: where does the initial PAT MSR value come from?
I guess from UEFI?


Juergen

Download attachment "OpenPGP_0xB0DE9DD628BF132F.asc" of type "application/pgp-keys" (3099 bytes)

Download attachment "OpenPGP_signature" of type "application/pgp-signature" (496 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ