Message-ID: <167f6083-4a79-4527-a0c3-3df74ae5d15d@amd.com>
Date: Fri, 26 Jan 2024 13:47:03 -0600
From: Mario Limonciello <mario.limonciello@....com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
 "Rafael J . Wysocki" <rjw@...ysocki.net>, linux-pci@...r.kernel.org,
 linux-acpi@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86/pci: Stop requiring ECAM to be declared in E820,
 ACPI or EFI

On 1/26/2024 13:29, Bjorn Helgaas wrote:
> On Fri, Jan 26, 2024 at 12:32:34PM -0600, Mario Limonciello wrote:
>> On 1/25/2024 18:35, Bjorn Helgaas wrote:
>>> On Wed, Jan 17, 2024 at 11:53:50AM -0600, Mario Limonciello wrote:
>>>> On 12/15/2023 16:03, Mario Limonciello wrote:
>>>>> commit 7752d5cfe3d1 ("x86: validate against acpi motherboard resources")
>>>>> introduced checks for ensuring that MCFG table also has memory region
>>>>> reservations to ensure no conflicts were introduced from a buggy BIOS.
>> ...
> 
>>>> Any thoughts on this version since our last conversation on V1?
>>>
>>> Just to let you know that I'm not ignoring this, and I'm trying to
>>> formulate a useful response.
>>
>> Thanks, I had been wondering.
>>
>> FYI - we've also added another place to make noise about this ECAM
>> issue in AMDGPU.  This warning should go into 6.9:
>>
>> https://lore.kernel.org/amd-gfx/20240110101319.695169-1-Jun.Ma2@amd.com/
> 
> Looks similar to the PCI core warning here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?id=v6.7#n1134
> 
> The comment says it doesn't work for devices on the root bus, though.
> Maybe it could be made to work there as well?

IMO it's not loud enough either.

I think it's better to keep both; here's my logic:

If someone hits the problem that prompted this series, the first thing
they notice is problems "with the GPU".  They'll probably start looking
at the kernel log for ERR and WARN messages related to the GPU.

> 
>>> mmconfig-shared.c has grown into an
>>> extremely complicated mess and is a continual source of problems, so
>>> I'd really like to untangle it a tiny bit if we can.
>>>
>>> One thing is that per spec, ACPI motherboard resources are the ONLY
>>> way to reserve ECAM space.  I'd like to reclaim a little clarity about
>>> that and separate out the E820 and EFI stuff as secondary hacks.  But
>>> there's an insane amount of history that got us here.
>>
>> I guess you know better than anyone here.  But if my idea is
>> actually viable then the E820 and EFI stuff turn into "information
>> only".
> 
> That would definitely be a good thing.  I would like it if that were
> more obvious from reading the code because I spend waaay too much time
> staring at that labyrinth.
> 
>>> The problem we have to avoid is assigning a BAR that overlaps ECAM.
>>> We assign BARs for lots of reasons.  dGPU and resizable BARs are
>>> examples but there are others, like hotplug and SR-IOV.  The fact that
>>> Windows works is a red herring because it allocates BARs differently.
>>
>> Have we actually observed a case that assigning the BAR overlaps
>> ECAM region thus far or it's hypothetical?
> 
> Yes, it has happened.  There's an example in the commit log here:
> https://git.kernel.org/linus/070909e56a7d ("x86/pci: Reserve ECAM if
> BIOS didn't include it in PNP0C02 _CRS")

But in that case, if a full ECAM reservation had been made from
MMCFG instead, Linux wouldn't have tried to put the BAR on top of that space.
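
To make the overlap condition concrete, here is a minimal userspace sketch
of the check being discussed. It is not the kernel's code: the struct and
function names are hypothetical, and it assumes only that ECAM takes 1 MiB
of address space per bus (32 devices x 8 functions x 4 KiB) starting from
the base address for bus 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type for illustration; not a kernel structure. */
struct range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/*
 * Address range covered by ECAM for buses start_bus..end_bus, given the
 * base address that corresponds to bus 0.  Each bus takes 1 MiB.
 */
static struct range ecam_range(uint64_t base, unsigned int start_bus,
			       unsigned int end_bus)
{
	struct range r;

	assert(start_bus <= end_bus && end_bus <= 255);
	r.start = base + ((uint64_t)start_bus << 20);
	r.end = base + (((uint64_t)end_bus + 1) << 20) - 1;
	return r;
}

/* Two inclusive ranges overlap if each one starts before the other ends. */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start <= b.end && b.start <= a.end;
}
```

With a made-up base of 0xe0000000 and buses 0-255, the reserved range is
0xe0000000-0xefffffff, so a candidate BAR at 0xe8000000 would be rejected
while one at 0xf0000000 would not.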

> 
>>> And of course, if there's any way to solve this safely without
>>> adding yet another kernel parameter, that would be vastly
>>> preferable.
>>
>> The kernel isn't static though; something we could do is keep the
>> parameter around a year or two to get the bug feedback loop of
>> people testing it and then rip it out if nothing comes up.
> 
> Yeah.  It's pretty hard to remove those options though.  For example,
> "pci=routeirq" was added in the pre-git era and probably isn't
> necessary, but how do we know nobody uses it?

Detect that it's in use and log a message at notice level or higher, like this:

"pci=routeirq has been deprecated and is planned to be removed from the
kernel on YY/ZZZZ.  If you need this for your system to work, please
send an email to linux-pci@...r.kernel.org"

If you give it ~2 years, that gives enough time to get through about
2 LTS kernels.  People who need it by then but chose not to report it
still have several LTS kernels to fall back on.
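
A rough userspace sketch of that idea, for illustration only. The table,
the struct and the function name are hypothetical and nothing here is
taken from the kernel's actual "pci=" parsing; the removal date stays a
placeholder as in the message text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table of deprecated "pci=" sub-options. */
struct deprecated_opt {
	const char *name;
	const char *removal;	/* placeholder date, not a real schedule */
};

static const struct deprecated_opt deprecated_opts[] = {
	{ "routeirq", "YY/ZZZZ" },
};

/*
 * If opt is in the deprecated table, write the notice text into buf and
 * return true; otherwise leave buf untouched and return false.
 */
static bool deprecation_notice(const char *opt, char *buf, size_t len)
{
	size_t i;

	assert(opt != NULL && buf != NULL && len > 0);
	for (i = 0; i < sizeof(deprecated_opts) / sizeof(deprecated_opts[0]); i++) {
		if (strcmp(opt, deprecated_opts[i].name) == 0) {
			snprintf(buf, len,
				 "pci=%s has been deprecated and is planned to be removed from the kernel on %s.",
				 opt, deprecated_opts[i].removal);
			return true;
		}
	}
	return false;
}
```

In a real implementation the text would go to the kernel log rather than
a caller-supplied buffer; the buffer just keeps the sketch testable.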
