Message-ID: <929214ba-f432-542b-4867-7b8e58cf4290@amd.com>
Date: Wed, 2 Oct 2024 19:32:41 +0530
From: Basavaraj Natikar <bnatikar@....com>
To: Richard Shaw <hobbes1069@...il.com>,
 Linux regressions mailing list <regressions@...ts.linux.dev>
Cc: linux-kernel-bugs@...ontech.com,
 Basavaraj Natikar <Basavaraj.Natikar@....com>, Jiri Kosina
 <jkosina@...e.com>, linux-input@...r.kernel.org,
 Benjamin Tissoires <benjamin.tissoires@...hat.com>,
 akshata.mukundshetty@....com, LKML <linux-kernel@...r.kernel.org>,
 Skyler <skpu@...me>, linux-btrfs <linux-btrfs@...r.kernel.org>,
 "Limonciello, Mario" <Mario.Limonciello@....com>
Subject: Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults /
 btrfs on-disk corruption [Was: .../ btrfs going read-only]


On 10/2/2024 6:19 PM, Richard Shaw wrote:
> On Wed, Oct 2, 2024 at 7:30 AM Linux regression tracking (Thorsten 
> Leemhuis) <regressions@...mhuis.info> wrote:
>
>     >> Basavaraj Natikar, I noticed a report about a regression in
>     >> bugzilla.kernel.org that appears to be caused by a change of yours:
>     >>
>     >> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>     >> [v6.9-rc1]
>     >>
>     >> As many (most?) kernel developers don't keep an eye on the bug tracker,
>     >> I decided to write this mail. To quote from
>     >> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>     >>
>     >>> I am getting bad page map errors on kernel version 6.9 or newer.
>     >>> They always appear within a few minutes of the system being on, if
>     >>> not immediately upon booting. My system is a Dell Inspiron 7405.
>     > [...]
>     >>> [   23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>     >>> [   23.580724] rfkill: input handler enabled
>     >>> [   25.652067] rfkill: input handler disabled
>     >
>     >>> [   34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>     >>> [   34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>     >
>     > No sensors detected - do we all have that in common?
>
Yes. On all of the systems that show this issue, no sensor is supported.

>
> My last log was with 6.11.0-debug[1] and found this:
>
> [   40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, 
> sensors not enabled is 0
> [   40.178904] kernel: pcie_mp2_amd 0000:04:00.7: 
> amd_sfh_hid_client_init failed err -95
> [   43.913688] kernel: Oops: general protection fault, probably for 
> non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI

Since I am unable to reproduce this issue, I have attached a debug patch to
the bug report. Could you please try it?

Thanks,
--
Basavaraj

>
> Interestingly the first OOPS was right after the amd_sfh tried to load 
> (if I'm interpreting the above correctly).
>
>     >> See the ticket for more details and the bisection result. Skyler, the
>     >> reporter (CCed), later also added:
>     >>
>     >>> Occasionally I will not get the usual bad page map error, but
>     >>> instead some BTRFS errors followed by the file system going read-only.
>     >>
>     >> Note, we had an earlier regression caused by this change reported by
>     >> Chris Hixon that maybe was not solved completely:
>     >>
>     >> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@hixontech.com/
>     >
>     > This looks like the same issue I reported.
>
>     And sounds a lot like what Richard sees, who also sees disk corruption
>     with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>
> <snip>
>
>     > I still encounter errors with every kernel/patch I've tested. I've
>     > blacklisted the amd_sfh module as a workaround, but when the module is
>     > inserted, a crash similar to those reported will happen soon after the
>     > (45 second?) detection/initialization timeout. It seems to affect
>     > whatever part of the kernel next becomes active. I've had disk
>     > corruption as well, when BTRFS is affected by the memory corruption,
>
>     Skyler, did you see btrfs disk corruption as well, just like Chris and
>     Richard did?
>
>
> Yes, most of the time the btrfs write checker catches the problem but 
> not always. I've had to reinstall F40 3 times while debugging this 
> issue for uncorrectable errors. When I run the debug kernel I think it 
> brings the system to a halt so fast it doesn't have time to write the 
> corruption to disk.
>
>     From what I see it seems all three of you are using Fedora. Wonder if
>     that is a coincidence.
>
>
> Possibly. Can't say there isn't some patch we're using that's helping 
> cause or expose the issue but Fedora tends to run the newest packages 
> (including the Linux kernel) so can sometimes be the early warning 
> system for other distros.
>
> Thanks,
> Richard
>
> [1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=2049688
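
For anyone comparing their own logs against the ones quoted above, here is a
minimal sketch (not part of the debug patch, and not an official tool) that
scans kernel log lines for the "amd_sfh_hid_client_init failed" message and
reports any "Oops:" that follows within a short window. The function name
find_oops_after_sfh_failure and the 10 second default window are assumptions
chosen for illustration; the sample lines are the ones quoted in this thread.

```python
import re

# Matches a kernel log timestamp such as "[   40.178904]" and captures the
# timestamp plus the remainder of the line.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")


def find_oops_after_sfh_failure(lines, window=10.0):
    """Return (failure_time, oops_time) pairs where an Oops follows an
    amd_sfh init failure within `window` seconds (window is an assumption)."""
    failures = []
    pairs = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines without a kernel timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if "amd_sfh_hid_client_init failed" in msg:
            failures.append(ts)
        elif "Oops:" in msg:
            for f in failures:
                if 0 <= ts - f <= window:
                    pairs.append((f, ts))
    return pairs


# Sample lines taken from the log excerpt quoted in this thread.
sample = [
    "[   40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0",
    "[   40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95",
    "[   43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI",
]

if __name__ == "__main__":
    for failure_ts, oops_ts in find_oops_after_sfh_failure(sample):
        print(f"amd_sfh failure at {failure_ts}s, Oops at {oops_ts}s")
```

A match only shows that the two events are close in time; it does not by
itself prove that the driver caused the fault.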

