Message-ID: <929214ba-f432-542b-4867-7b8e58cf4290@amd.com>
Date: Wed, 2 Oct 2024 19:32:41 +0530
From: Basavaraj Natikar <bnatikar@....com>
To: Richard Shaw <hobbes1069@...il.com>,
Linux regressions mailing list <regressions@...ts.linux.dev>
Cc: linux-kernel-bugs@...ontech.com,
Basavaraj Natikar <Basavaraj.Natikar@....com>, Jiri Kosina
<jkosina@...e.com>, linux-input@...r.kernel.org,
Benjamin Tissoires <benjamin.tissoires@...hat.com>,
akshata.mukundshetty@....com, LKML <linux-kernel@...r.kernel.org>,
Skyler <skpu@...me>, linux-btrfs <linux-btrfs@...r.kernel.org>,
"Limonciello, Mario" <Mario.Limonciello@....com>
Subject: Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults /
btrfs on-disk corruption [Was: .../ btrfs going read-only]
On 10/2/2024 6:19 PM, Richard Shaw wrote:
> On Wed, Oct 2, 2024 at 7:30 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@...mhuis.info> wrote:
>
> >> Basavaraj Natikar, I noticed a report about a regression in
> >> bugzilla.kernel.org <http://bugzilla.kernel.org> that appears
> to be caused by a change of yours:
> >>
> >> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is
> available")
> >> [v6.9-rc1]
> >>
> >> As many (most?) kernel developers don't keep an eye on the bug
> tracker,
> >> I decided to write this mail. To quote from
> >> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
> >>
> >>> I am getting bad page map errors on kernel version 6.9 or newer.
> >>> They always appear within a few minutes of the system being on, if
> >>> not immediately upon booting. My system is a Dell Inspiron 7405.
> > [...]
> >>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
> >>> [ 23.580724] rfkill: input handler enabled
> >>> [ 25.652067] rfkill: input handler disabled
> >
> >>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
> >>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
> >
> > No sensors detected - do we all have that in common?
>
On all the systems that show this issue, no sensor is supported.
>
> My last log was with 6.11.0-debug[1] and found this:
>
> [ 40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0
> [ 40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95
> [ 43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI
Since I am unable to reproduce this issue, I have attached a debug patch to the bug report.
Could you please try it?
Thanks,
--
Basavaraj
>
> Interestingly the first OOPS was right after the amd_sfh tried to load
> (if I'm interpreting the above correctly).
>
> >> See the ticket for more details and the bisection result.
> Skyler, the
> >> reporter (CCed), later also added:
> >>
> >>> Occasionally I will not get the usual bad page map error, but
> >>> instead some BTRFS errors followed by the file system going
> read-only.
> >>
> >> Note, we had an earlier regression caused by this change
> reported by
> >> Chris Hixon that maybe was not solved completely:
> >>
> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@hixontech.com/
> >
> > This looks like the same issue I reported.
>
> And sounds a lot like what Richard sees, who also sees disk corruption
> with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>
> <snip>
>
> > I still encounter errors with every kernel/patch I've tested.
> I've blacklisted
> > the amd_sfh module as a workaround, but when the module is
> inserted, a crash
> > similar to those reported will happen soon after the (45 second?)
> > detection/initialization timeout. It seems to affect whatever
> part of the
> > kernel next becomes active. I've had disk corruption as well,
> when BTRFS is
> > affected by the memory corruption,
>
> Skyler, did you see btrfs disk corruption as well, just like Chris and
> Richard did?
>
>
> Yes, most of the time the btrfs write checker catches the problem but
> not always. I've had to reinstall F40 3 times while debugging this
> issue for uncorrectable errors. When I run the debug kernel I think it
> brings the system to a halt so fast it doesn't have time to write the
> corruption to disk.
>
> From what I see it seems all three of you are using Fedora. Wonder if
> that is a coincidence.
>
>
> Possibly. Can't say there isn't some patch we're using that's helping
> cause or expose the issue but Fedora tends to run the newest packages
> (including the Linux kernel) so can sometimes be the early warning
> system for other distros.
>
> Thanks,
> Richard
>
> [1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=2049688