lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <47bd7e63-501b-28a2-97f4-723d73cf0c53@suse.com>
Date:   Tue, 12 Jul 2022 07:49:25 +0200
From:   Juergen Gross <jgross@...e.com>
To:     Chuck Zmudzinski <brchuckz@...scape.net>,
        Borislav Petkov <bp@...en8.de>, Jan Beulich <jbeulich@...e.com>
Cc:     Andrew Lutomirski <luto@...nel.org>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        lkml <linux-kernel@...r.kernel.org>,
        "xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>
Subject: Re: [PATCH] x86/PAT: have pat_enabled() properly reflect state when
 running on e.g. Xen

On 11.07.22 19:41, Chuck Zmudzinski wrote:
> On 7/11/2022 10:31 AM, Juergen Gross wrote:
>> On 11.07.22 16:18, Chuck Zmudzinski wrote:
>>> On 7/5/22 12:14 PM, Borislav Petkov wrote:
>>>> On Tue, Jul 05, 2022 at 05:56:36PM +0200, Jan Beulich wrote:
>>>>> Re-using pat_disabled like you do in your suggestion below won't
>>>>> work, because mtrr_bp_init() calls pat_disable() when MTRRs
>>>>> appear to be disabled (from the kernel's view). The goal is to
>>>>> honor "nopat" without honoring any other calls to pat_disable().
>>>>
>>>> Actually, the current goal is to adjust Xen dom0 because:
>>>>
>>>> 1. it uses the PAT code
>>>>
>>>> 2. but then it does something special and hides the MTRRs
>>>>
>>>> which is not something real hardware does.
>>>>
>>>> So this one-off thing should be prominent, visible and not get in the
>>>> way.
>>>
>>> Hello,
>>>
>>> I have spent a fair amount of time this past weekend
>>> investigating this regression and what might have caused
>>> it and I also have done several tests on my Xen workstation
>>> that exhibits the regression to verify my understanding
>>> of the problem and also raise other problematic points.
>>>
>>> I think, in addition to immediately fixing the regression by
>>> fixing the now five-year-old commit that is the root cause
>>> of the recently exposed regression, as discussed in my
>>> earlier message which proposes a simple patch to fix that
>>> five-year-old broken commit,
>>>
>>> https://lore.kernel.org/lkml/63583497-152f-5880-4c8f-d47e2a81f5a6@netscape.net/
>>>
>>> there needs to be a decision about how best to deal with
>>> this very aptly described "one-off Xen thing" in the future.
>>>
>>> One problem in x86/mm/pat/memtype.c is the fact that there
>>> exist two functions, pat_init(), and init_cache_modes(), that
>>> do more or less the same thing, so that when one of those
>>> functions needs to be updated, the other also needs to
>>> be updated. In the past, when changes to the pat_enable
>>> and pat_disable functions and variables were made, all
>>> too often, the change was only applied to pat_init() and not
>>> to init_cache_modes() and the one-off Xen case was simply
>>> not addressed and dealt with properly.
>>>
>>> So I propose the following:
>>>
>>> 1) Identify those things that always need to be done in
>>> either pat_init() or init_cache_modes() and try to combine
>>> those things into a single function so that changes will
>>> be applied for both cases. We sort of have that now with
>>> __init_cache_modes(), but it is not really good enough to
>>> prevent the regressions we sometimes get in Xen PV
>>> domains such as the current regression we have with
>>> Xen Dom0 and certain particular Intel graphics devices.
>>>
>>> 2) If it is not possible to do what is proposed in 1), at least
>>> re-factor the code to make it very clear that whenever
>>> either pat_init() needs to be adjusted or init_cache_modes()
>>> needs to be adjusted, care must be taken to ensure that
>>> all cases are properly dealt with, including the
>>> one-off Xen case of enabling PAT with MTRR disabled,
>>> Currently, AIUI, all one really needs to know is that
>>> init_cached_modes() handles two special cases:
>>> First, when PAT is disabled for any reason, including when
>>> the "nopat" boot option is set, and second, the one-off
>>> Xen case of MTRR being disabled but PAT being enabled.
>>>
>>> I am trying to do number 2 with a patch series I am
>>> working on. It is up to the experts to decide if it
>>> is possible or even desirable to improve the code so
>>> that any changes to the caching configuration are more
>>> likely to be properly implemented for all supported platforms
>>> by successfully combining the two functions pat_init() and
>>> init_cache_modes() into a single function. The new function
>>> would probably need to be renamed and include bits from
>>> mtrr_bp_enabled, etc. for it to function properly.
>>>
>>> If anyone wants to argue that it is not desirable to try
>>> combine pat_init() and init_cache_modes() into a single
>>> function, I think the burden of proof rests on that
>>> person to demonstrate why it is good and clean
>>> coding practice to continue to have them separate
>>> and independent from each other in the code when
>>> they both essentially do the same thing in the end, which
>>> is call __init_cache_modes() and determine the PAT
>>> state, and also probably the MTRR state.
>>
>> I think the proper solution would be to make PAT and MTRR independent
>> from each other with some additional restructuring on the PAT side:
>>
>> Today PAT can't be used without MTRR being available, unless MTRR is at
>> least configured via CONFIG_MTRR and the system is running as Xen PV
>> guest. In this case PAT is automatically available via the hypervisor,
>> but the PAT MSR can't be modified by the kernel and MTRR is disabled.
>>
>> As an additional complexity the availability of PAT can't be queried
>> via pat_enabled() in the Xen PV case, as the lack of MTRR will set PAT
>> to be disabled.
> 
> My testing indicates your presumption is not correct, or at least my
> testing on my Xen workstation makes me doubt that your assertion that
> PAT can't be queried vi pat_enabled() in the Xen PV case is true.

I was referring to the current (regressed) state.

> 
> Let me begin a discussion by posing a question about the following
> code in init_cache_modes(), a function implemented in
> arch/x86/mm/pat/memtypes.c, and which is called from
> arch/x86/kernel/setup.c but only effective if pat_cm_initialized
> has not already been set by pat_init(), as is obvious from the
> immediate return if pat_cm_initialized has already been set.
> 
> Now in the Xen PV case, pat_cm_initialized is not set because
> pat_init(), where it is normally set, was skipped due to the fact
> that MTRRs are disabled in Xen PV domains. I verified that
> init_cache_modes() is activated in the Xen PV Dom0 case by my
> testing. So, I ask, why cannot the code that is present here in
> init_cache_modes() be used to test for PAT in the Xen PV Dom0
> environment and if the test is positive, why cannot pat_bp_enabled
> be set which will cause pat_enabled() to return true in the Xen
> PV environment? In fact, I used the additional 'else if' block to
> set pat_bp_enabled to true in the Xen PV environment for the
> case when boot_cpu_has(X86_FEATURE_PAT) is true and PAT
> MSR does not return 0, that, is, when pat !=0 after calling
> rdmsrl(MSR_IA32_CR_PAT, pat) for the boot CPU. All the existing
> comments in this function indicate this is a valid way to test
> for the existence of PAT in the Xen PV Dom0 environment.

I agree this is a possibility to fix the regression.

In the long run I still think that my idea is the better one, as it
allows to run without CONFIG_MTRR even on bare metal without losing
the PAT functionality.

> 
> Moreover... (please move to the bottom of the code snippet
> for more information about my tests in the Xen PV environment...)
> 
> void init_cache_modes(void)
> {
>      u64 pat = 0;
> 
>      if (pat_cm_initialized)
>          return;
> 
>      if (boot_cpu_has(X86_FEATURE_PAT)) {
>          /*
>           * CPU supports PAT. Set PAT table to be consistent with
>           * PAT MSR. This case supports "nopat" boot option, and
>           * virtual machine environments which support PAT without
>           * MTRRs. In specific, Xen has unique setup to PAT MSR.
>           *
>           * If PAT MSR returns 0, it is considered invalid and emulates
>           * as No PAT.
>           */
>          rdmsrl(MSR_IA32_CR_PAT, pat);
>      }
> 
>      if (!pat) {
>          /*
>           * No PAT. Emulate the PAT table that corresponds to the two
>           * cache bits, PWT (Write Through) and PCD (Cache Disable).
>           * This setup is also the same as the BIOS default setup.
>           *
>           * PTE encoding:
>           *
>           *       PCD
>           *       |PWT  PAT
>           *       ||    slot
>           *       00    0    WB : _PAGE_CACHE_MODE_WB
>           *       01    1    WT : _PAGE_CACHE_MODE_WT
>           *       10    2    UC-: _PAGE_CACHE_MODE_UC_MINUS
>           *       11    3    UC : _PAGE_CACHE_MODE_UC
>           *
>           * NOTE: When WC or WP is used, it is redirected to UC- per
>           * the default setup in __cachemode2pte_tbl[].
>           */
>          pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
>                PAT(4, WB) | PAT(5, WT) | PAT(6, UC_MINUS) | PAT(7, UC);
>      }
> 
>      else if (!pat_bp_enabled) {
>          /*
>           * In some environments, specifically Xen PV, PAT
>           * initialization is skipped because MTRRs are
>           * disabled even though PAT is available. In such
>           * environments, set PAT to initialized and enabled to
>           * correctly indicate to callers of pat_enabled() that
>           * PAT is available and prevent PAT from being disabled.
>           */
>          pat_bp_enabled = true;
>          pr_info("x86/PAT: PAT enabled by init_cache_modes\n");
>      }
> 
>      __init_cache_modes(pat);
> }
> 
> This function, patched with the extra 'else if' block, fixes the
> regression on my Xen worksatation, and the pr_info message
> "x86/PAT: PAT enabled by init_cache_modes" appears in the logs
> when running this patched kernel in my Xen Dom0. This means
> that in the Xen PV environment on my Xen Dom0 workstation,
> rdmsrl(MSR_IA32_CR_PAT, pat) successfully tested for the presence
> of PAT on the virtual CPU that Xen exposed to the Linux kernel on my
> Xen Dom0 workstation. At least that is what I think my tests prove.

Yes, your fix is correct AFAICS.

> So why is this not a valid way to test for the existence of
> PAT in the Xen PV environment? Are the existing comments
> in init_cache_modes() about supporting both the case when
> the "nopat" boot option is set and the specific case of Xen and
> MTRR disabled wrong? My testing confirms those comments are
> correct.
> 
>> This leads to some drivers believing that not all cache
>> modes are available, resulting in failures or degraded functionality
>> (the current regression).
> 
> I also did some tests to try to verify that fast WC caching mode
> was used as requested by the i915 Intel graphics driver on my
> Xen Dom0 workstation and so far I cannot prove that the WC caching
> mode was not used, and I have some custom logs from the
> Linux kernel i915 driver that indicates my Linux Xen Dom0 is able to
> use the fast WC caching mode, so long as the bug that causes
> pat_enabled() to return a false negative is fixed.
> 
> So my testing seems to contradict rather than confirm your claim
> that "the availability of PAT can't be queried via pat_enabled() in
> the Xen PV case." I think my patch successfully does what you
> claim cannot be done, and I also think the existing comments in
> init_cache_modes() indicates that my patch is successfully testing
> for PAT in the Xen PV case.
> 
>>
>> The same applies to a kernel built with no MTRR support: it won't
>> allow to use the PAT MSR, even if there is no technical reason for
>> that, other than setting up PAT on all cpus the same way (which is a
>> requirement of the processor's cache management) is relying on some
>> MTRR specific code.
> 
> After reading the code extensively, I am curious about testing
> and even successfully compiling a kernel with options like
> CONFIG_MTRR and CONFIG_PAT disabled. But my testing indicates
> such problems could be debugged and fixed if they manifest
> using the current code which, AFAICT, appears to be mostly
> designed and tested for the case when both PAT and MTRR
> are enabled on modern x86 hardware. I would expect many
> bugs for configurations that deviate too far from that paradigm
> on Linux. This is the problem with Xen's unique configuration
> that enables PAT but disables MTRR. I think the problem can
> be solved, it is just that there is not enough care taken to
> ensure that these outlier cases will work as well as the common
> configuration that Linux supports well, which is with both
> MTRR and PAT enabled on x86.

On bare metal PAT can't work without CONFIG_MTRR, as setting the PAT
MSR uniformly across cpus can't be done (the code is not available
without CONFIG_MTRR).

> 
>>
>> All of that should be fixable by:
>>
>> - moving the function needed by PAT from MTRR specific code one level
>>     up
>> - adding a PAT indirection layer supporting the 3 cases "no or disabled
>>     PAT", "PAT under kernel control", and "PAT under Xen control"
> 
> I am not sure the right way to divide the cases is "PAT under
> kernel control" and "PAT under Xen control." I also looked at
> the code in the Xen hypervisor and it does allow for Dom0
> to use a Xen hypercall provided by the xenctrl library to enable
> PAT WC caching mode, and I ran some tests for evidence that

I guess you mean XEN_DMOP_pin_memory_cacheattr. This is meant to
be used for device emulation.

> my Xen Dom0 was making such hypercalls to support the WC
> caching mode, but I have not yet been able to verify that the
> WC caching mode is enabled by a hypercall from Dom0 to Xen,
> although by reading the code in the hypervisor and xenctrl it
> seems to be possible. I just don't think the i915 Linux kernel
> driver uses such hypercalls, and I am not sure that my Xen
> Dom0 could be using the fast WC caching mode because
> "PAT is under Xen control" instead of "under kernel control"
> unless the kernel is relying on Xen to provide PAT via Xen
> hypercalls. The successful call of my Xen Dom0 to rdmsrl

The PAT MSR is setup by the hypervisor to support all possible
caching modes. Dom0 just needs to set the correct bits in a
page table entry to select the wanted caching mode.

The reason why there is no hypercall to modify the PAT MSR is,
that it is required that the MSR has the same value across all
cpus on the system. This precludes the possibility that any
guest could change it, as this would change it for all guests.
This all is true only for PV guests, of course, as HVM or PVH
guests get a virtualized PAT MSR via the hardware virtualization
feature.

> which obtained a valid configuration of PAT for the virtual
> boot CPU exposed by Xen to the kernel indicates at least that
> the kernel can detect, if not control, what the supported PAT
> caching modes are. Of course you are absolutely correct that
> the kernel cannot directly control the PAT caching modes in the
> Xen PV environment, at least not without possibly resorting to
> some Xen hypercalls that are most likely not currently in use
> by the kernel.

There is such hypercall. The hypercall you are thinking of is
only changing the caching mode for a specific set of pages by
modifying the related EPT/NPT entries.

> I would have to do much more testing to really determine
> what is happening and how my Xen Dom0 is really
> configuring the PAT caching modes and specifically, if
> it is actually using the fast WC caching mode as requested
> by the i915 Intel graphics driver.

A PV dom0 can only adapt itself to the PAT setting the Xen
hypervisor has done.

> 
>> - removing the dependency of PAT on MTRR
>>
>> I'd be up for trying this, but right now I haven't got the time to do
>> it. I might be able to start working on that in September.
> 
> Please let me know if/when you start on it, I would be interested
> to try out what you come up with on my Xen Dom0.

Will do that.

> 
>>
>> Meanwhile I think either Jan's patch or the simple one of Chuck
>> should be applied.
> 
> I agree.

Happy to hear that. :-)


Juergen

Download attachment "OpenPGP_0xB0DE9DD628BF132F.asc" of type "application/pgp-keys" (3099 bytes)

Download attachment "OpenPGP_signature" of type "application/pgp-signature" (492 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ