lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2c361e7d-0ee6-cad7-a3d3-a49cd6db3f20@linux.intel.com>
Date: Fri, 30 May 2025 09:55:27 +0300 (EEST)
From: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
To: Tudor Ambarus <tudor.ambarus@...aro.org>
cc: Bjorn Helgaas <bhelgaas@...gle.com>, linux-pci@...r.kernel.org, 
    Michał Winiarski <michal.winiarski@...el.com>, 
    Igor Mammedov <imammedo@...hat.com>, LKML <linux-kernel@...r.kernel.org>, 
    Mika Westerberg <mika.westerberg@...ux.intel.com>, 
    William McVicker <willmcvicker@...gle.com>
Subject: Re: [PATCH 24/25] PCI: Perform reset_resource() and build fail list
 in sync

On Wed, 28 May 2025, Tudor Ambarus wrote:
> On 5/28/25 12:39 PM, Tudor Ambarus wrote:
> > On 5/28/25 12:22 PM, Tudor Ambarus wrote:
> >> On 5/6/25 4:53 PM, Ilpo Järvinen wrote:
> >>> On Tue, 6 May 2025, Tudor Ambarus wrote:
> >>>
> >>>> Hi!
> >>>>
> >>>> On 12/16/24 5:56 PM, Ilpo Järvinen wrote:
> >>>>> Resetting resource is problematic as it prevent attempting to allocate
> >>>>> the resource later, unless something in between restores the resource.
> >>>>> Similarly, if fail_head does not contain all resources that were reset,
> >>>>> those resource cannot be restored later.
> >>>>>
> >>>>> The entire reset/restore cycle adds complexity and leaving resources
> >>>>> into reseted state causes issues to other code such as for checks done
> >>>>> in pci_enable_resources(). Take a small step towards not resetting
> >>>>> resources by delaying reset until the end of resource assignment and
> >>>>> build failure list (fail_head) in sync with the reset to avoid leaving
> >>>>> behind resources that cannot be restored (for the case where the caller
> >>>>> provides fail_head in the first place to allow restore somewhere in the
> >>>>> callchain, as is not all callers pass non-NULL fail_head).
> >>>>>
> >>>>> The Expansion ROM check is temporarily left in place while building the
> >>>>> failure list until the upcoming change which reworks optional resource
> >>>>> handling.
> >>>>>
> >>>>> Ideally, whole resource reset could be removed but doing that in a big
> >>>>> step would make the impact non-tractable due to complexity of all
> >>>>> related code.
> >>>>>
> >>>>> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
> >>>> I'm hitting the BUG_ON(!list_empty(&add_list)); in
> >>>> pci_assign_unassigned_bus_resources() [1] with 6.15-rc5 and the the
> >>>> pixel6 downstream pcie driver.
> >>>>
> >>>> I saw the thread where "a34d74877c66 PCI: Restore assigned resources
> >>>> fully after release" fixes things for some other cases, but it's not the
> >>>> case here.
> >>>>
> >>>> Reverting the following patches fixes the problem:
> >>>> a34d74877c66 PCI: Restore assigned resources fully after release
> >>>> 2499f5348431 PCI: Rework optional resource handling
> >>>> 96336ec70264 PCI: Perform reset_resource() and build fail list in sync
> >>> So it's confirmed that you needed to revert also this last commit 
> >>> 96336ec70264, not just the rework change?
> >> I needed to revert 96336ec70264 as well otherwise the build fails.
> >>>> In the working case the add_list list is empty throughout the entire
> >>>> body of pci_assign_unassigned_bus_resources().
> >>>>
> >>>> In the failing case __pci_bus_size_bridges() leaves the add_list not
> >>>> empty and __pci_bus_assign_resources() does not consume the list, thus
> >>>> the BUG_ON. The failing case contains an extra print that's not shown
> >>>> when reverting the blamed commits:
> >>>> [   13.951185][ T1101] pcieport 0000:00:00.0: bridge window [mem
> >>>> 0x00100000-0x001fffff] to [bus 01-ff] add_size 100000 add_align 100000
> >>>>
> >>>> I've added some prints trying to describe the code path, see
> >>>> https://paste.ofcode.org/Aeu2YBpLztc49ZDw3uUJmd#
> >>>>
> >>>> Failing case:
> >>>> [   13.944231][ T1101] pci 0000:01:00.0: [144d:a5a5] type 00 class
> >>>> 0x000000 PCIe Endpoint
> >>>> [   13.944412][ T1101] pci 0000:01:00.0: BAR 0 [mem
> >>>> 0x00000000-0x000fffff 64bit]
> >>>> [   13.944532][ T1101] pci 0000:01:00.0: ROM [mem 0x00000000-0x0000ffff
> >>>> pref]
> >>>> [   13.944649][ T1101] pci 0000:01:00.0: enabling Extended Tags
> >>>> [   13.944844][ T1101] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
> >>>> [   13.945015][ T1101] pci 0000:01:00.0: 15.752 Gb/s available PCIe
> >>>> bandwidth, limited by 8.0 GT/s PCIe x2 link at 0000:00:00.0 (capable of
> >>>> 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
> >>>> [   13.950616][ T1101] __pci_bus_size_bridges: before pbus_size_mem.
> >>>> list empty? 1
> >>>> [   13.950784][ T1101] pbus_size_mem: 2. list empty? 1
> >>>> [   13.950886][ T1101] pbus_size_mem: 1 list empty? 0
> >>>> [   13.950982][ T1101] pbus_size_mem: 3. list empty? 0
> >>>> [   13.951082][ T1101] pbus_size_mem: 4. list empty? 0
> >>>> [   13.951185][ T1101] pcieport 0000:00:00.0: bridge window [mem
> >>>> 0x00100000-0x001fffff] to [bus 01-ff] add_size 100000 add_align 100000
> >>>> [   13.951448][ T1101] __pci_bus_size_bridges: after pbus_size_mem. list
> >>>> empty? 0
> >>>> [   13.951643][ T1101] pci_assign_unassigned_bus_resources: before
> >>>> __pci_bus_assign_resources -> list empty? 0
> >>>> [   13.951924][ T1101] pcieport 0000:00:00.0: bridge window [mem
> >>>> 0x40000000-0x401fffff]: assigned
> >>>> [   13.952248][ T1101] pci_assign_unassigned_bus_resources: after
> >>>> __pci_bus_assign_resources -> list empty? 0
> >>>> [   13.952634][ T1101] ------------[ cut here ]------------
> >>>> [   13.952818][ T1101] kernel BUG at drivers/pci/setup-bus.c:2514!
> >>>> [   13.953045][ T1101] Internal error: Oops - BUG: 00000000f2000800 [#1]
> >>>>  SMP
> >>>> ...
> >>>> [   13.976086][ T1101] Call trace:
> >>>> [   13.976206][ T1101]  pci_assign_unassigned_bus_resources+0x110/0x114 (P)
> >>>> [   13.976462][ T1101]  pci_rescan_bus+0x28/0x48
> >>>> [   13.976628][ T1101]  exynos_pcie_rc_poweron
> >>>>
> >>>> Working case:
> >>>> [   13.786961][ T1120] pci 0000:01:00.0: [144d:a5a5] type 00 class
> >>>> 0x000000 PCIe Endpoint
> >>>> [   13.787136][ T1120] pci 0000:01:00.0: BAR 0 [mem
> >>>> 0x00000000-0x000fffff 64bit]
> >>>> [   13.787280][ T1120] pci 0000:01:00.0: ROM [mem 0x00000000-0x0000ffff
> >>>> pref]
> >>>> [   13.787541][ T1120] pci 0000:01:00.0: enabling Extended Tags
> >>>> [   13.787808][ T1120] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
> >>>> [   13.787988][ T1120] pci 0000:01:00.0: 15.752 Gb/s available PCIe
> >>>> bandwidth, limited by 8.0 GT/s PCIe x2 link at 0000:00:00.0 (capable of
> >>>> 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
> >>>> [   13.795279][ T1120] __pci_bus_size_bridges: before pbus_size_mem.
> >>>> list empty? 1
> >>>> [   13.795408][ T1120] pbus_size_mem: 2. list empty? 1
> >>>> [   13.795495][ T1120] pbus_size_mem: 2. list empty? 1
> >>>> [   13.795577][ T1120] __pci_bus_size_bridges: after pbus_size_mem. list
> >>>> empty? 1
> >>>> [   13.795692][ T1120] pci_assign_unassigned_bus_resources: before
> >>>> __pci_bus_assign_resources -> list empty? 1
> >>>> [   13.795849][ T1120] pcieport 0000:00:00.0: bridge window [mem
> >>>> 0x40000000-0x401fffff]: assigned
> >>>> [   13.796072][ T1120] pci_assign_unassigned_bus_resources: after
> >>>> __pci_bus_assign_resources -> list empty? 1
> >>>> [   13.796662][ T1120] cpif: s5100_poweron_pcie: DBG: MSI sfr not set
> >>>> up, yet(s5100_pdev is NULL)
> >>>> [   13.796666][ T1120] cpif: register_pcie: s51xx_pcie_init start
> >>>>
> >>>>
> >>>> Any hints are welcomed. Thanks,
> >>>> ta
> >>> Hi and thanks for the report.
> >> Hi! Thanks for the help. I've been out of office for the last 2 weeks,
> >> sorry for the delayed reply.
> >>
> >>> The interesting part occurs inside reassign_resources_sorted() where most 
> >>> items are eliminated from realloc_head by the list_del().
> >>>
> >>> My guess is that somehow, the change in 96336ec70264 from !res->flags
> >>> to the more complicated check somehow causes this. If the new check 
> >>> doesn't match and subsequently, no match is found from the head list, the 
> >>> loop will do continue and not remove the entry from realloc_head.
> >> I added a print right there and it seems it's something else. See below.
> >>> But it's hard to confirm without knowing what that resources realloc_head 
> >>> contains. Perhaps if you print the resources that are processed around 
> >>> that part of the code in reassign_resources_sorted(), comparing the log 
> >>> from the reverted code with the non-working case might help to understand 
> >>> what is different there and why. To understand better what is in the head 
> >>> list, it would be also useful to know from which device the resources were 
> >>> added into the head list in pdev_sort_resources().
> >>>
> >> I added the suggested prints
> >> (https://paste.ofcode.org/DgmZGGgS6D36nWEzmfCqMm) on top of v6.15 with
> >> the downstream PCIe pixel driver and I obtain the following. Note that
> >> all added prints contain "tudor" for differentiation.
> >>
> >> [   15.211179][ T1107] pci 0001:01:00.0: [144d:a5a5] type 00 class
> >> 0x000000 PCIe Endpoint
> >> [   15.212248][ T1107] pci 0001:01:00.0: BAR 0 [mem
> >> 0x00000000-0x000fffff 64bit]
> >> [   15.212775][ T1107] pci 0001:01:00.0: ROM [mem 0x00000000-0x0000ffff
> >> pref]
> >> [   15.213195][ T1107] pci 0001:01:00.0: enabling Extended Tags
> >> [   15.213720][ T1107] pci 0001:01:00.0: PME# supported from D0 D3hot
> >> D3cold
> >> [   15.214035][ T1107] pci 0001:01:00.0: 15.752 Gb/s available PCIe
> >> bandwidth, limited by 8.0 GT/s PCIe x2 link at 0001:00:00.0 (capable of
> >> 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
> >> [   15.222286][ T1107] pci 0001:01:00.0: tudor: 1: pbus_size_mem: BAR 0
> >> [mem 0x00000000-0x000fffff 64bit] list empty? 1
> >> [   15.222813][ T1107] pci 0001:01:00.0: tudor: 1: pbus_size_mem: ROM
> >> [mem 0x00000000-0x0000ffff pref] list empty? 1
> >> [   15.224429][ T1107] pci 0001:01:00.0: tudor: 2: pbus_size_mem: ROM
> >> [mem 0x00000000-0x0000ffff pref] list empty? 0
> >> [   15.224750][ T1107] pcieport 0001:00:00.0: bridge window [mem
> >> 0x00100000-0x001fffff] to [bus 01-ff] add_size 100000 add_align 100000
> >>
> >> [   15.225393][ T1107] tudor : pci_assign_unassigned_bus_resources:
> >> before __pci_bus_assign_resources -> list empty? 0
> >> [   15.225594][ T1107] pcieport 0001:00:00.0: tudor:
> >> pdev_sort_resources: bridge window [mem 0x00100000-0x001fffff] resource
> >> added in head list
> >> [   15.226078][ T1107] pcieport 0001:00:00.0: bridge window [mem
> >> 0x40000000-0x401fffff]: assigned
> >> [   15.226419][ T1107] tudor : pci_assign_unassigned_bus_resources:
> >> after __pci_bus_assign_resources -> list empty? 0
> >> [   15.226442][ T1107] ------------[ cut here ]------------
> >> [   15.227587][ T1107] kernel BUG at drivers/pci/setup-bus.c:2522!
> >> [   15.227813][ T1107] Internal error: Oops - BUG: 00000000f2000800 [#1]
> >>  SMP
> >> ...
> >> [   15.251570][ T1107] Call trace:
> >> [   15.251690][ T1107]  pci_assign_unassigned_bus_resources+0x110/0x114 (P)
> >> [   15.251945][ T1107]  pci_rescan_bus+0x28/0x48
> >>
> >> I obtain the following output when using the same prints adapted
> >> (https://paste.ofcode.org/37w7RnKkPaCxyNhi5yhZPbZ) and with the blamed
> >> commits reverted:
> >> a34d74877c66 PCI: Restore assigned resources fully after release
> >> 2499f5348431 PCI: Rework optional resource handling
> >> 96336ec70264 PCI: Perform reset_resource() and build fail list in sync
> >>
> >> [   15.200456][ T1102] pci 0000:01:00.0: [144d:a5a5] type 00 class
> >> 0x000000 PCIe Endpoint
> >> [   15.200632][ T1102] pci 0000:01:00.0: BAR 0 [mem
> >> 0x00000000-0x000fffff 64bit]
> >> [   15.200755][ T1102] pci 0000:01:00.0: ROM [mem 0x00000000-0x0000ffff
> >> pref]
> >> [   15.200876][ T1102] pci 0000:01:00.0: enabling Extended Tags
> >> [   15.201075][ T1102] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
> >> [   15.201254][ T1102] pci 0000:01:00.0: 15.752 Gb/s available PCIe
> >> bandwidth, limited by 8.0 GT/s PCIe x2 link at 0000:00:00.0 (capable of
> >> 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
> >> [   15.206555][ T1102] pci 0000:01:00.0: tudor: 1: pbus_size_mem: BAR 0
> >> [mem 0x00000000-0x000fffff 64bit] list empty? 1
> >> [   15.206737][ T1102] pci 0000:01:00.0: tudor: 1: pbus_size_mem: ROM
> >> [mem 0x00000000-0x0000ffff pref] list empty? 1
> >> [   15.206901][ T1102] tudor : pci_assign_unassigned_bus_resources:
> >> before __pci_bus_assign_resources -> list empty? 1
> >> [   15.207072][ T1102] pcieport 0000:00:00.0: tudor:
> >> pdev_sort_resources: bridge window [mem 0x00100000-0x002fffff] resource
> >> added in head list
> >> [   15.207396][ T1102] pcieport 0000:00:00.0: bridge window [mem
> >> 0x40000000-0x401fffff]: assigned
> >> [   15.208165][ T1102] tudor : pci_assign_unassigned_bus_resources:
> >> after __pci_bus_assign_resources -> list empty? 1
> >> [   15.208783][ T1102] cpif: s5100_poweron_pcie: DBG: MSI sfr not set
> >> up, yet(s5100_pdev is NULL)
> >> [   15.208786][ T1102] cpif: register_pcie: s51xx_pcie_init start
> > 
> > I see my email client split the lines for the prints making the output
> > very hard to read. Added the output here too:
> > https://paste.ofcode.org/AEfjASQW8Z2jbMak5VkmpJ
> 
> With the following change things get back to how they were before
> 2499f5348431:
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 5247370010aa..1589dd8afa69 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1214,9 +1214,10 @@ static int pbus_size_mem(struct pci_bus *bus,
> unsigned long mask,
>  				__func__, r_name, r, list_empty(realloc_head));
> 
>  			/* Put SRIOV requested res to the optional list */
> -			if (realloc_head && pci_resource_is_optional(dev, i)) {
> +			if (realloc_head && pci_resource_is_iov(i)) {
>  				add_align = max(pci_resource_alignment(dev, r), add_align);
> -				add_to_list(realloc_head, dev, r, 0, 0 /* Don't care */);
> +				resource_set_size(r, 0);
> +				add_to_list(realloc_head, dev, r, r_size, 0 /* Don't care */);
>  				children_add_size += r_size;
>  				pci_info(dev, "tudor: 2: %s: %s %pR list empty? %d\n",
>  					__func__, r_name, r, list_empty(realloc_head));

Okay, thanks for the additional data point, but I don't think the problem 
is at this site. That removal of that resource_set_size(r, 0) is crucial 
piece in my changes (and as is, it even doesn't extend far enough and we 
can end up lose resources in some situations when their sizes are reset 
like that).

This entire resource fitting and assignment code as whole is very complex 
spanning many functions. The steps are roughly these:

1) calculate sizing in advance (which the above code relates to)
2) collect all resources we want to assign in a PCI bus subtree
3) opportunistically assign with the optional sizes
4) if everything was assigned, skip steps 5-7.
5) release some/all from step 2 assignments
6) assign only required resources
7) reassign optional parts (enlargen to include the optional sizes) what 
   we can
8) As a final sanity check, realloc_head is checked to have been fully
   consumed by the earlier steps (~ no unchecked work todo remains)

In some cases, kernel iterates the entire process a few times (but 
that mostly occurs when pci=realloc is given on the kernel cmdline).

Now, as discussed in my other replay, it looks at some point the algorithm 
assigned the bridge window resource successfully, but it still remained in 
the realloc_head for some reason if I parsed all the information from your 
debug patch right. The resource should have been removed from realloc_head 
before __assign_resources_sorted() exits if it got assigned, if it 
wasn't, that's the bug we're looking for. (Alternatively, it could have 
been that the assignment was never tried for the particular resource, 
but it now looks that might not be the case here).

-- 
 i.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ