Message-ID: <3ca084a9-768e-a6f5-ace4-cd347978dec7@netscape.net>
Date: Fri, 20 May 2022 09:33:55 -0400
From: Chuck Zmudzinski <brchuckz@...scape.net>
To: Jan Beulich <jbeulich@...e.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Jani Nikula <jani.nikula@...ux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
Rodrigo Vivi <rodrigo.vivi@...el.com>,
Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>,
David Airlie <airlied@...ux.ie>,
Daniel Vetter <daniel@...ll.ch>,
xen-devel@...ts.xenproject.org, x86@...nel.org,
linux-kernel@...r.kernel.org, intel-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org, Juergen Gross <jgross@...e.com>
Subject: Re: [PATCH 2/2] x86/pat: add functions to query specific cache mode
availability
On 5/20/2022 5:41 AM, Jan Beulich wrote:
> On 20.05.2022 10:30, Chuck Zmudzinski wrote:
>> On 5/20/2022 2:59 AM, Chuck Zmudzinski wrote:
>>> On 5/20/2022 2:05 AM, Jan Beulich wrote:
>>>> On 20.05.2022 06:43, Chuck Zmudzinski wrote:
>>>>> On 5/4/22 5:14 AM, Juergen Gross wrote:
>>>>>> On 04.05.22 10:31, Jan Beulich wrote:
>>>>>>> On 03.05.2022 15:22, Juergen Gross wrote:
>>>>>>>
>>>>>>> ... these uses there are several more. You say nothing on why
>>>>>>> those want
>>>>>>> leaving unaltered. When preparing my earlier patch I did inspect them
>>>>>>> and came to the conclusion that these all would also better
>>>>>>> observe the
>>>>>>> adjusted behavior (or else I couldn't have left pat_enabled() as the
>>>>>>> only predicate). In fact, as said in the description of my earlier
>>>>>>> patch, in
>>>>>>> my debugging I did find the use in i915_gem_object_pin_map() to be
>>>>>>> the
>>>>>>> problematic one, which you leave alone.
>>>>>> Oh, I missed that one, sorry.
>>>>> That is why your patch would not fix my Haswell unless
>>>>> it also touches i915_gem_object_pin_map() in
>>>>> drivers/gpu/drm/i915/gem/i915_gem_pages.c
>>>>>
>>>>>> I wanted to be rather defensive in my changes, but I agree at least
>>>>>> the
>>>>>> case in arch_phys_wc_add() might want to be changed, too.
>>>>> I think your approach needs to be more aggressive so it will fix
>>>>> all the known false negatives introduced by bdd8b6c98239
>>>>> such as the one in i915_gem_object_pin_map().
>>>>>
>>>>> I looked at Jan's approach and I think it would fix the issue
>>>>> with my Haswell as long as I don't use the nopat option. I
>>>>> really don't have a strong opinion on that question, but I
>>>>> think the nopat option as a Linux kernel option, as opposed
>>>>> to a hypervisor option, should only affect the kernel, and
>>>>> if the hypervisor provides the pat feature, then the kernel
>>>>> should not override that,
>>>> Hmm, why would the kernel not be allowed to override that? Such
>>>> an override would affect only the single domain where the
>>>> kernel runs; other domains could take their own decisions.
>>>>
>>>> Also, for the sake of completeness: "nopat" used when running on
>>>> bare metal has the same bad effect on system boot, so there
>>>> pretty clearly is an error cleanup issue in the i915 driver. But
>>>> that's orthogonal, and I expect the maintainers may not even care
>>>> (but tell us "don't do that then").
>> Actually I just did a test with the last official Debian kernel
>> build of Linux 5.16, that is, a kernel before bdd8b6c98239 was
>> applied. In fact, the nopat option does *not* break the i915 driver
>> in 5.16. That is, with the nopat option, the i915 driver loads
>> normally on both the bare metal and on the Xen hypervisor.
>> That means your presumption (and the presumption of
>> the author of bdd8b6c98239) that the "nopat" option was
>> being observed by the i915 driver is incorrect. Setting "nopat"
>> had no effect on my system with Linux 5.16. So after doing these
>> tests, I am against the aggressive approach of breaking the i915
>> driver with the "nopat" option because prior to bdd8b6c98239,
>> nopat did not break the i915 driver. Why break it now?
> Because that, in my understanding, is the purpose of "nopat"
> (not breaking the driver of course - that's a driver bug -, but
> having an effect on the driver).
I wouldn't call it a driver bug, but an incorrect configuration of the
kernel by the user. I presume the i915 driver requires X86_FEATURE_PAT.
If so, the driver should refuse to disable the feature when the user
asks for that with the nopat option, and should instead warn the user
that it did not honor the request.
In any case, my test did not verify that when nopat is set in Linux 5.16,
the thread takes the same code path as when nopat is not set,
so I am not totally sure that the reason nopat does not break the
i915 driver in 5.16 is that static_cpu_has(X86_FEATURE_PAT)
returns true even when nopat is set. I could test it with a custom
log message in 5.16 if that is necessary.
Are you saying it was wrong for static_cpu_has(X86_FEATURE_PAT)
to return true in 5.16 when the user requests nopat? I think having it
return false in that case just permits a bad configuration to break
the driver, which a well-written operating system should not allow.
The i915 driver was, in my opinion, correctly ignoring the nopat
option in 5.16 because that option is not compatible with the hardware
the i915 driver is trying to initialize and set up at boot time. At
least that is my understanding now, but I will need to test it on 5.16
to be sure I understand it correctly.
Also, AFAICT, your patch would break the driver on my box when the
nopat option is set, and would fix the regression introduced by
bdd8b6c98239 only when nopat is not set. So your patch would
introduce a regression relative to Linux 5.16 and earlier for the
case when nopat is set on my box. I think your point would
be that it is not a regression if it is an incorrect user configuration.
My response is that a well-written driver should refuse to honor
the incorrect configuration requested by the user and instead
warn the user that it did not honor the incorrect kernel option.
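
To make the policy I am arguing for concrete, here is a minimal
user-space sketch in C. It is not kernel code and it is not what the
i915 driver does today. struct pat_inputs and driver_uses_pat() are
hypothetical names invented for this illustration, and the sketch
rests on my unverified presumption that the driver requires PAT
whenever the hardware provides it.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the boot-time inputs; not a kernel structure. */
struct pat_inputs {
    bool hw_has_pat;      /* CPU or hypervisor advertises PAT */
    bool user_set_nopat;  /* "nopat" was given on the kernel command line */
};

/*
 * Hypothetical policy: a driver that needs PAT keeps using it whenever
 * the hardware provides it, and warns instead of honoring "nopat".
 * Returns true if the driver would use PAT; *warned reports whether a
 * warning was printed.
 */
static bool driver_uses_pat(const struct pat_inputs *in, bool *warned)
{
    *warned = false;

    /* Without hardware support there is nothing to use. */
    if (!in->hw_has_pat)
        return false;

    /* The user asked for nopat: warn, but do not honor the request. */
    if (in->user_set_nopat) {
        fprintf(stderr,
                "driver: PAT is required; not honoring the nopat option\n");
        *warned = true;
    }

    return true;
}
```

With hw_has_pat and user_set_nopat both true, this returns true and
sets *warned, which is the behavior I would expect from a driver that
cannot work without PAT.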
I am only presuming what your patch would do on my box based
on what I learned about this problem from my debugging. I can
also test your patch on my box to verify that my understanding of
it is correct.
I also have not yet verified Juergen's patch will not fix it, but
I am almost certain it will not unless it is expanded so it also
touches i915_gem_object_pin_map() with the fix. I plan to test
his patch, but expanded so it touches that function also.
I also plan to test your patch with and without nopat and report the
results in the thread where you posted your patch. Hopefully
by tomorrow I will have the results.
Chuck