

Message-ID: <7e3682f8-ec11-40b0-898f-f3729d6f110b@bell.net>
Date: Mon, 2 Dec 2024 14:45:48 -0500
From: John David Anglin <dave.anglin@...l.net>
To: matoro <matoro_mailinglist_kernel@...oro.tk>
Cc: Magnus Lindholm <linmag7@...il.com>,
 Linux Parisc <linux-parisc@...r.kernel.org>, deller@...nel.org,
 Deller <deller@....de>, Sam James <sam@...too.org>,
 Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [bisected] ext4 corruption on parisc since 6.12

On 2024-12-02 10:31 a.m., matoro wrote:
> On 2024-12-02 09:54, John David Anglin wrote:
>> On 2024-12-02 1:30 a.m., Magnus Lindholm wrote:
>>> On Mon, Dec 2, 2024 at 5:55 AM matoro
>>> <matoro_mailinglist_kernel@...oro.tk> wrote:
>>>> Hmm, this is my config, also on an rp3440:
>>>>
>>>> #
>>>> # Timers subsystem
>>>> #
>>>> CONFIG_HZ_PERIODIC=y
>>>> # CONFIG_NO_HZ_IDLE is not set
>>>> # CONFIG_NO_HZ is not set
>>>> # CONFIG_HIGH_RES_TIMERS is not set
>>>> # end of Timers subsystem
>>>>
>>>> lindholm can confirm on their hardware/config.  Maybe you can try that and
>>>> see if you can reproduce?  I will try your config as well.
>>> Hi, I'm on a HPC8000 "parisc64 PA8800 (Mako) 9000/785/C8000". I can confirm
>>> that building a kernel CONFIG_SMP=n will mitigate this problem.
>>> I haven't messed around with the config in the Timer subsystem so in my case the
>>> parameters suggested are unset. (my config looks like matoros)
>> The clockevent driver was tested on both rp3440 and c8000, and some other SMP machines.
>> Helge knows details.  I have used it on rp3440 and c8000.
>>
>> I would try my settings.  The primary reason in switching to the clockevent drivers was to
>> improve clock resolution.  The best resolution with the old drivers was 1 ms at 1000 HZ.
>> This caused problems with various package tests.  If config is the issue, probably
>> CONFIG_HIGH_RES_TIMERS needs to be forced when clockevent drivers are used.
>>
>> Almost every other system uses the clockevent drivers.  So, there was a risk that parisc would
>> become unsupported.
>>
>> I wonder if this could be caused by dead RTC battery.  Did you check output of date command?
>> Maybe a dead RTC battery interacts badly with clockevent drivers.
>>
>> I run ntp on all my machines.
>>
>> Which files have bad dates? Is this really an ext4 file system issue, or is it just that the system has
>> a bad clock?
>>
>> Dave
>
> The files that have bad dates seem to be the ones /init on this system touches at early boot.  See the output here: 
> https://paste.matoro.tk/8cq8omg
>
> When booted into the bad kernel, date(1) works and displays the correct time.  I'm using chrony for time syncing as well.
>
> After switching to the config specified above, boot hangs before even getting to userspace with the following output:
>
> [   12.473410] 0000:e0:01.1: ttyS2 at MMIO 0xfffffffff4050038 (irq = 73, base_baud = 115200) is a 16550A
> [   12.757386] sym0: <1010-66> rev 0x1 at pci 0000:20:01.0 irq 70
> [   12.761419] sym0: PA-RISC Firmware, ID 7, Fast-80, LVD, parity checking
> [   12.885367] sym0: SCSI BUS has been reset.
> [   12.889389] scsi host0: sym-2.2.3
> [   13.053380] sym1: <1010-66> rev 0x1 at pci 0000:20:01.1 irq 71
> [   13.055515] sym1: PA-RISC Firmware, ID 7, Fast-80, LVD, parity checking
> [   13.165367] sym1: SCSI BUS has been reset.
> [   13.169388] scsi host1: sym-2.2.3
> [   13.208927] rtc-generic rtc-generic: registered as rtc0
> [   13.281367] rtc-generic rtc-generic: setting system clock to 2024-12-02T07:17:02 UTC (1733123822)
> [   13.281367] NET: Registered PF_INET6 protocol family
> [   13.281367] Segment Routing with IPv6
> [   13.281367] In-situ OAM (IOAM) with IPv6
> [   13.281367] registered taskstats version 1
> [   13.281367] Unstable clock detected, switching default tracing clock to "global"
> [   13.281367] If you want to keep using the local clock, then add:
> [   13.281367]   "trace_clock=local"
> [   13.281367] on the kernel command line
>
> At the end there the clock seems to stop progressing forward, as there are several real-time seconds that elapse in between messages with the 
> same timestamp.  So I'm completely unable to boot with this config at all.
I don't see the "Unstable clock detected" message.

I also have these in my config:
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GENERIC_SCHED_CLOCK=y
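
To compare configs across machines, something like the following might help. It is only a minimal sketch: the helper name read_options is made up for illustration, and it assumes the usual .config text format ("CONFIG_X=y" or "# CONFIG_X is not set"). The sample text is the timer fragment quoted earlier in this thread.

```python
import re

def read_options(config_text, names):
    """Map each requested option to its value, or None if unset or absent.

    Hypothetical helper; assumes standard kernel .config syntax.
    """
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            found[m.group(1)] = m.group(2)
            continue
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            found[m.group(1)] = None
    return {name: found.get(name) for name in names}

# Timer fragment as posted earlier in the thread.
SAMPLE = """\
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
"""

WANTED = [
    "CONFIG_HZ_PERIODIC",
    "CONFIG_HIGH_RES_TIMERS",
    "CONFIG_HAVE_UNSTABLE_SCHED_CLOCK",
    "CONFIG_GENERIC_SCHED_CLOCK",
]

for name, value in read_options(SAMPLE, WANTED).items():
    print(f"{name}: {value if value is not None else 'not set / absent'}")
```

Running this against the full .config from each machine should show which of the options discussed in this thread differ.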

The clock seems to get stuck here:
[   13.281367] rtc-generic rtc-generic: setting system clock to 2024-12-02T07:17:02 UTC (1733123822)

On mx3210, the clock continues to increment:
[    1.995462] rtc-generic rtc-generic: registered as rtc0
[    2.003158] rtc-generic rtc-generic: setting system clock to 2024-12-01T15:23:25 UTC (1733066605)
[    2.003719] IR JVC protocol handler initialized
[    2.004109] IR MCE Keyboard/mouse protocol handler initialized
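
A rough way to spot the point where the printk timestamp stops advancing in a saved console log is sketched below. The function name find_stalls and the min_repeat threshold are assumptions for illustration only; two adjacent messages can legitimately share a timestamp, so only longer runs are reported. The sample lines are taken from the two logs in this thread.

```python
import re

# Matches the "[   13.281367]" prefix, also when the line is quoted with "> ".
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")

def find_stalls(lines, min_repeat=3):
    """Return (timestamp, count) for each run of at least min_repeat
    consecutive log lines that carry the same timestamp."""
    stalls = []
    prev, count = None, 0
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # skip lines without a timestamp
        ts = m.group(1)
        if ts == prev:
            count += 1
        else:
            if prev is not None and count >= min_repeat:
                stalls.append((prev, count))
            prev, count = ts, 1
    if prev is not None and count >= min_repeat:
        stalls.append((prev, count))
    return stalls

HUNG = [
    "[   13.208927] rtc-generic rtc-generic: registered as rtc0",
    "[   13.281367] rtc-generic rtc-generic: setting system clock to 2024-12-02T07:17:02 UTC (1733123822)",
    "[   13.281367] NET: Registered PF_INET6 protocol family",
    "[   13.281367] Segment Routing with IPv6",
    "[   13.281367] In-situ OAM (IOAM) with IPv6",
]

GOOD = [
    "[    1.995462] rtc-generic rtc-generic: registered as rtc0",
    "[    2.003158] rtc-generic rtc-generic: setting system clock to 2024-12-01T15:23:25 UTC (1733066605)",
    "[    2.003719] IR JVC protocol handler initialized",
    "[    2.004109] IR MCE Keyboard/mouse protocol handler initialized",
]

print("hung log stalls:", find_stalls(HUNG))
print("mx3210 log stalls:", find_stalls(GOOD))
```

This only looks at the printed timestamps, so it says nothing about why the clock stopped; it just marks where in the log to start looking.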

Dave

-- 
John David Anglin  dave.anglin@...l.net