Message-ID: <85A10B5D-67E5-44F5-886F-CA9D5E7EBFAF@oracle.com>
Date: Thu, 3 Jul 2025 20:34:18 +0000
From: Himanshu Madhani <himanshu.madhani@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)



> On Jul 3, 2025, at 13:21, Thomas Gleixner <tglx@...utronix.de> wrote:
> 
> On Thu, Jul 03 2025 at 18:32, Himanshu Madhani wrote:
>> On Jul 3, 2025, at 11:27, Himanshu Madhani <himanshu.madhani@...cle.com> wrote:
>> Git-bisect point to this merge commit
>> 
>> commit 6376c0770656f3bdf7f411faf068371b6932aeca
>> Merge: 5e8bbb2caa4e 29857e6f4e30
>> Author: Linus Torvalds <torvalds@...ux-foundation.org>
>> Date:   Tue May 27 09:01:26 2025 -0700
>> 
>>   Merge tag 'timers-clocksource-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> 
>>   Pull clocksource updates from Thomas Gleixner:
>>    "Updates for clocksource/clockevent drivers:
>> 
>>      - The final conversion of text formatted device tree binding to
>>        schemas
>> 
>>      - A new driver fot the System Timer Module on S32G NXP SoCs
>> 
>>      - A new driver fot the Econet HPT timer
>> 
>>      - The usual improvements and device tree binding updates"
> 
> That obviously does not make sense, so your bisect got side ways.
> 
>> Following further in this commit, I only see this following series
>> that had changes which may or may not be related to hang.
>> 
>> https://lore.kernel.org/all/20250429065337.117370076@linutronix.de/
> 
> They are not. There is a hint in both backtraces:
> 
>> [  514.305717]  schedule_preempt_disabled+0x15/0x30
>> [  514.360954]  __mutex_lock.constprop.0+0x4be/0x8a0
>> [  514.417232]  msi_domain_get_virq+0xcc/0x110
>> [  514.467279]  pci_msix_write_tph_tag+0x3c/0x100
> 
> and
> 
>> [  525.930478]  schedule_preempt_disabled+0x15/0x30
>> [  525.985718]  __mutex_lock.constprop.0+0x4be/0x8a0
>> [  526.041993]  msi_domain_get_virq+0xcc/0x110
>> [  526.092031]  pci_msix_write_tph_tag+0x3c/0x100
> 
> pci_msix_write_tph_tag() is the function which ends up trying to lock
> the mutex and gets stuck. This function was introduced with commit
> 
>  d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")
> 
> and the subsequent commit
> 
>  71296eae5887 ("PCI/TPH: Replace the broken MSI-X control word update")
> 
> flipped the TPH code over to use that.
> 
> The problem is obvious and if you would have enabled
> CONFIG_PROVE_LOCKING then you would have got the reason presented on a
> silver tablet in dmesg. I encourage you to do so nevertheless.
> 
Great tip on this. I’ll keep that in mind for future debugging efforts. 

> I definitely screwed that one up in the most stupid way.
> 
> As I had no idea how to exercise that code path I did not test it. It
> seems this code is not really tested by any of the CI stuff either
> before it hits Linus tree and as some folks start testing only post rc1
> it takes some time to surface :( 
> 
> The fix is as obvious as the problem. See uncompiled and untested patch
> below. If it solves the problem, which it should, feel free to take it
> and create a proper patch with changelog and Fixes tag yourself (Adding
> Suggested-by: Thomas ... is good enough). Otherwise let me know, and I
> take care of it in my copious spare time :)
> 

Sure. I’ll get this tested in our test bed and report back in a couple of days.

> Thanks,
> 
>        tglx
> ---
> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index 6ede55a7c5e6..eb26f3816922 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -934,10 +934,11 @@ int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int index, u16 tag)
>  	if (!pdev->msix_enabled)
>  		return -ENXIO;
>  
> -	guard(msi_descs_lock)(&pdev->dev);
>  	virq = msi_get_virq(&pdev->dev, index);
>  	if (!virq)
>  		return -ENXIO;
> +
> +	guard(msi_descs_lock)(&pdev->dev);
>  	/*
>  	 * This is a horrible hack, but short of implementing a PCI
>  	 * specific interrupt chip callback and a huge pile of



-- 
Himanshu Madhani	Oracle Linux Engineering
