linux-kernel - Re: [PATCH v5 13/16] x86: decouple PAT and MTRR handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221202143316.mtjz6dghecshldk2@box.shutemov.name>
Date:   Fri, 2 Dec 2022 17:33:16 +0300
From:   "Kirill A. Shutemov" <kirill@...temov.name>
To:     Juergen Gross <jgross@...e.com>
Cc:     linux-kernel@...r.kernel.org, x86@...nel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v5 13/16] x86: decouple PAT and MTRR handling

On Fri, Dec 02, 2022 at 02:39:58PM +0100, Juergen Gross wrote:
> On 02.12.22 14:27, Kirill A. Shutemov wrote:
> > On Fri, Dec 02, 2022 at 06:56:47AM +0100, Juergen Gross wrote:
> > > On 02.12.22 00:57, Kirill A. Shutemov wrote:
> > > > On Thu, Dec 01, 2022 at 05:33:28PM +0100, Juergen Gross wrote:
> > > > > On 01.12.22 17:26, Kirill A. Shutemov wrote:
> > > > > > On Wed, Nov 02, 2022 at 08:47:10AM +0100, Juergen Gross wrote:
> > > > > > > Today PAT is usable only with MTRR being active, with some nasty tweaks
> > > > > > > to make PAT usable when running as Xen PV guest, which doesn't support
> > > > > > > MTRR.
> > > > > > > 
> > > > > > > The reason for this coupling is, that both, PAT MSR changes and MTRR
> > > > > > > changes, require a similar sequence and so full PAT support was added
> > > > > > > using the already available MTRR handling.
> > > > > > > 
> > > > > > > Xen PV PAT handling can work without MTRR, as it just needs to consume
> > > > > > > the PAT MSR setting done by the hypervisor without the ability and need
> > > > > > > to change it. This in turn has resulted in a convoluted initialization
> > > > > > > sequence and wrong decisions regarding cache mode availability due to
> > > > > > > misguiding PAT availability flags.
> > > > > > > 
> > > > > > > Fix all of that by allowing to use PAT without MTRR and by reworking
> > > > > > > the current PAT initialization sequence to match better with the newly
> > > > > > > introduced generic cache initialization.
> > > > > > > 
> > > > > > > This removes the need of the recently added pat_force_disabled flag, so
> > > > > > > remove the remnants of the patch adding it.
> > > > > > > 
> > > > > > > Signed-off-by: Juergen Gross <jgross@...e.com>
> > > > > > 
> > > > > > This patch breaks boot for TDX guest.
> > > > > > 
> > > > > > Kernel now tries to set CR0.CD which is forbidden in TDX guest[1] and
> > > > > > causes #VE:
> > > > > > 
> > > > > > 	tdx: Unexpected #VE: 28
> > > > > > 	VE fault: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > 	CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-00015-gadfe7512e1d0 #2646
> > > > > > 	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > > > > 	RIP: 0010:native_write_cr0 (arch/x86/kernel/cpu/common.c:427)
> > > > > > 	 Call Trace:
> > > > > > 	  <TASK>
> > > > > > 	 ? cache_disable (arch/x86/include/asm/cpufeature.h:173 arch/x86/kernel/cpu/cacheinfo.c:1085)
> > > > > > 	 ? cache_cpu_init (arch/x86/kernel/cpu/cacheinfo.c:1132 (discriminator 3))
> > > > > > 	 ? setup_arch (arch/x86/kernel/setup.c:1079)
> > > > > > 	 ? start_kernel (init/main.c:279 (discriminator 3) init/main.c:477 (discriminator 3) init/main.c:960 (discriminator 3))
> > > > > > 	 ? load_ucode_bsp (arch/x86/kernel/cpu/microcode/core.c:155)
> > > > > > 	 ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358)
> > > > > > 	  </TASK>
> > > > > > 
> > > > > > Any suggestion how to fix it?
> > > > > > 
> > > > > > [1] Section 10.6.1. "CR0", https://cdrdv2.intel.com/v1/dl/getContent/733568
> > > > > 
> > > > > What was the solution before?
> > > > > 
> > > > > I guess MTRR was disabled, so there was no PAT, too?
> > > > 
> > > > Right:
> > > > 
> > > > Linus' tree:
> > > > 
> > > > [    0.002589] last_pfn = 0x480000 max_arch_pfn = 0x10000000000
> > > > [    0.003976] Disabled
> > > > [    0.004452] x86/PAT: MTRRs disabled, skipping PAT initialization too.
> > > > [    0.005856] CPU MTRRs all blank - virtualized system.
> > > > [    0.006915] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
> > > > 
> > > > tip/master:
> > > > 
> > > > [    0.003443] last_pfn = 0x20b8e max_arch_pfn = 0x10000000000
> > > > [    0.005220] Disabled
> > > > [    0.005818] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > [    0.007752] tdx: Unexpected #VE: 28
> > > > 
> > > > The dangling "Disabled" comes mtrr_bp_init().
> > > > 
> > > > 
> > > > > If this is the case, you can go the same route as Xen PV guests do.
> > > > 
> > > > Any reason X86_FEATURE_HYPERVISOR cannot be used instead of
> > > > X86_FEATURE_XENPV there?
> > > > 
> > > > Do we have any virtualized platform that supports it?
> > > 
> > > Yes, of course. Any hardware virtualized guest should be able to use it,
> > > obviously TDX guests are the first ones not being able to do so.
> > > 
> > > And above dmesg snipplets are showing rather nicely that not disabling
> > > PAT completely should be a benefit for TDX guests, as all caching modes
> > > would be usable (the PAT MSR seems to be initialized quite fine).
> > > 
> > > Instead of X86_FEATURE_XENPV we could introduce something like
> > > X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for
> > > TDX guests.
> > 
> > Technically, the MSR is writable on TDX. But it seems there's no way to
> > properly change it, following the protocol of changing on MP systems.
> 
> Why not? I don't see why it is possible in a non-TDX system, but not within
> a TDX guest.

Because the protocol you described below requires setting CR0.CD which is
not allowed in TDX guest and causes #VE.

> > Although, I don't quite follow what role cache disabling playing on system
> > with self-snoop support. Hm?
> 
> It is the recommended way to do it. See SDM Vol. 3 Chapter 11 ("Memory Cache
> Control"):
> 
> The operating system is responsible for insuring that changes to a PAT entry
> occur in a manner that maintains the consistency of the processor caches and
> translation lookaside buffers (TLB). This is accomplished by following the
> procedure as specified in Section 11.11.8, “MTRR Considerations in MP Systems,
> ”for changing the value of an MTRR in a multiple processor system. It requires
> a specific sequence of operations that includes flushing the processors caches
> and TLBs.
> 
> And the sequence for the MTRRs is:
> 
> 1. Broadcast to all processors to execute the following code sequence.
> 2. Disable interrupts.
> 3. Wait for all processors to reach this point.
> 4. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1
>    and the NW flag to 0.)
> 5. Flush all caches using the WBINVD instructions. Note on a processor that
>    supports self-snooping, CPUID feature flag bit 27, this step is unnecessary.
> 6. If the PGE flag is set in control register CR4, flush all TLBs by clearing
>    that flag.
> 7. If the PGE flag is clear in control register CR4, flush all TLBs by executing
>    a MOV from control register CR3 to another register and then a MOV from that
>    register back to CR3.
> 8. Disable all range registers (by clearing the E flag in register MTRRdefType).
>    If only variable ranges are being modified, software may clear the valid bits
>    for the affected register pairs instead.
> 9. Update the MTRRs.
> 10. Enable all range registers (by setting the E flag in register MTRRdefType).
>     If only variable-range registers were modified and their individual valid
>     bits were cleared, then set the valid bits for the affected ranges instead.
> 11. Flush all caches and all TLBs a second time. (The TLB flush is required for
>     Pentium 4, Intel Xeon, and P6 family processors. Executing the WBINVD
>     instruction is not needed when using Pentium 4, Intel Xeon, and P6 family
>     processors, but it may be needed in future systems.)
> 12. Enter the normal cache mode to re-enable caching. (Set the CD and NW flags
>     in control register CR0 to 0.)
> 13. Set PGE flag in control register CR4, if cleared in Step 6 (above).
> 14. Wait for all processors to reach this point.
> 15. Enable interrupts.
> 
> So cache disabling is recommended.

Yeah, I read that.

But the question is what kind of scenario cache disabling is actually
prevents if self-snoop is supported? In this case cache stays intact (no
WBINVD). The next time a cache line gets accessed with different caching
mode the old line gets snooped, right?

Would it be valid to avoid touching CR0.CD if X86_FEATURE_SELFSNOOP?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov