Message-ID: <318cc8da-f8d2-4307-866e-8c302dacf094@amd.com>
Date:   Tue, 7 Nov 2023 15:27:08 -0600
From:   Mario Limonciello <mario.limonciello@....com>
To:     Takashi Sakamoto <o-takashi@...amocchi.jp>,
        Linux kernel regressions list <regressions@...ts.linux.dev>,
        a.mark.broadworth@...il.com, matthias.schrumpf@...enet.de,
        LKML <linux-kernel@...r.kernel.org>, aros@....com,
        bagasdotme@...il.com,
        "open list:PCI SUBSYSTEM" <linux-pci@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

+linux-pci / Bjorn
On 11/7/2023 06:17, Takashi Sakamoto wrote:
> Hi Mario,
> 
> Thanks for the report.
> 
> I apologize for the inconvenience you and your reporters are facing;
> however, I have to say that the problem appears to be specific to AMD
> Ryzen machines.

Unfortunately I don't have this 1394 hardware myself.  I was looking at
a completely unrelated issue on Bugzilla, noticed this report come up in
my search, and wanted to make sure it's on your radar as the author,
since it has lingered a while.

> 
> I've already received a similar report [1] and have been investigating
> it for the last few weeks, which gave me some insight. Please take
> a look at my short report about it in the PR to Linus for 6.7-rc1:
> https://lore.kernel.org/lkml/20231105144852.GA165906@workstation.local/
> 
> I can confirm that I have been able to reproduce the problem on an AMD
> Ryzen machine. However, it's important to note that I have not observed
> the problem on the following systems:

Any chance you (or anyone with the issue) have serial console output
available?  I think it would be really good to look at the circumstances
surrounding the reboot.

> 
> * Intel machines (Sandy Bridge and Skylake generations)
> * AMD machines predating Ryzen (Sempron 145)
> * Machines using different 1394 OHCI hardware from other vendors such as
>    TI
> * VIA VT6307 connected directly to a PCI slot (i.e. without the
>    PCIe/PCI bridge in question)
> 
> Currently, I have not been able to obtain any useful debug output from
> the Linux system or any hardware error reports when the system reboots.
> It seems that the system reboots spontaneously. My assumption at this
> point is that AMD Ryzen machines detect a specific hardware error,
> triggered by a quirk of Ryzen machines in combination with the Asmedia
> ASM1083/1085 and VIA VT6306/6307/6308, which leads to a power reset.
> 

Recent kernels have enabled PCI AER.  Could that be a factor here?

> I genuinely appreciate your assistance in debugging this elusive
> hardware issue. If a workaround specific to this AMD Ryzen quirk is
> required in the PCI driver for 1394 OHCI hardware, I'm willing to apply
> it. However, I think it is preferable to figure out the reboot mechanism
> first.

Does the BIOS on these machines enable a watchdog timer?  If so, I'd
suggest disabling that as a starting point.

How about compiling the driver as a module, blacklisting it with
modprobe.blacklist on the kernel command line, and loading it later?
Can you trigger the fault/reboot this way?  If so, it at least rules out
some conditions that happen during a race at boot.

Looking more closely at the change, I would guess the fault is 
specifically in get_cycle_time().  I can see that the VIA devices do set
QUIRK_CYCLE_TIMER which will cause additional reads.
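
To illustrate why that quirk means more register traffic, here is a
rough Python model (not the driver code) of a read-until-consistent
loop.  The function name read_cycle_timer, the read_reg callback, the
retry limit and the consistency check are all assumptions made for
illustration; the real logic lives in get_cycle_time() and should be
checked there.

```python
def read_cycle_timer(read_reg, quirk, max_retries=20):
    """Model of a cycle timer read; returns (value, number_of_reads).

    read_reg is a hypothetical callback standing in for one MMIO read
    of the cycle timer register.
    """
    reads = 1
    c2 = read_reg()
    if quirk:
        # Assumed quirk path: keep three successive samples and re-read
        # until they advance monotonically at a similar rate.
        c1 = c2
        c2 = read_reg()
        reads += 1
        for _ in range(max_retries + 1):
            c0, c1 = c1, c2
            c2 = read_reg()
            reads += 1
            d01, d12 = c1 - c0, c2 - c1
            if d01 > 0 and d12 > 0 and d01 < 2 * d12 and d12 < 2 * d01:
                break
    return c2, reads
```

In this model a well-behaved register costs one read without the quirk
and at least three with it, and every inconsistent sample adds another
read.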

Another guess worth looking at is whether iommu=pt or amd_iommu=off
helps.

If either of those helps, it could point to a problem with
get_cycle_time() and the IOMMU.  The older systems you mentioned working
probably didn't enable the IOMMU by default, but most AMD Ryzen systems do.
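
When testing these, it helps to confirm that the booted kernel really
saw the options.  Below is a small Python sketch that parses a kernel
command line string and reports which of the suggestions above are in
effect.  The helper name suggested_options is hypothetical, and the
module name firewire_ohci in the example is my assumption about which
module to blacklist.

```python
def suggested_options(cmdline):
    """Report which of the debugging options appear on a kernel command line."""
    params = {}
    for token in cmdline.split():
        # Split "key=value" tokens; bare flags get an empty value.
        key, _, value = token.partition("=")
        params[key] = value
    blacklist = params.get("modprobe.blacklist", "")
    return {
        "iommu_passthrough": params.get("iommu") == "pt",
        "amd_iommu_off": params.get("amd_iommu") == "off",
        "blacklisted_modules": [m for m in blacklist.split(",") if m],
    }


# Example command line; firewire_ohci is an assumed module name.
example = "root=/dev/sda1 ro iommu=pt modprobe.blacklist=firewire_ohci"
result = suggested_options(example)
```

On a running system the input would be the contents of /proc/cmdline.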

> 
> On Mon, Nov 06, 2023 at 02:14:39PM -0600, Mario Limonciello wrote:
>> Hi,
>>
>> I recently came across a kernel bugzilla that bisected a boot problem [1]
>> introduced in kernel 6.5 to this change.
>>
>> commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb (HEAD -> dcadfd7f7c7)
>> Author: Takashi Sakamoto <o-takashi@...amocchi.jp>
>> Date:   Tue May 30 08:12:40 2023 +0900
>>
>>      firewire: core: use union for callback of transaction completion
>>
>> Removing the firewire card from the system fixes it for both reporters
>> (CC'ed).
>>
>> As the author of this change, can you please take a look at it?
>>
>> Thanks,
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217993
> 
> 
> [1] https://bugzilla.suse.com/show_bug.cgi?id=1215436
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=217994
> 
> Thanks
> 
> Takashi Sakamoto
