linux-kernel - Re: [bisected] ext4 corruption on parisc since 6.12

Message-ID: <71fae3d3a9bd816ea268eb73c152b564@matoro.tk>
Date: Sun, 01 Dec 2024 23:55:29 -0500
From: matoro <matoro_mailinglist_kernel@...oro.tk>
To: John David Anglin <dave.anglin@...l.net>
Cc: Linux Parisc <linux-parisc@...r.kernel.org>, deller@...nel.org, Deller
 <deller@....de>, linmag7@...il.com, Sam James <sam@...too.org>, Linux Kernel
 Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [bisected] ext4 corruption on parisc since 6.12

Hmm, this is my config, also on an rp3440:

#
# Timers subsystem
#
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
# end of Timers subsystem

lindholm can confirm on their hardware/config.  Maybe you can try that and 
see if you can reproduce?  I will try your config as well.

On 2024-12-01 20:47, John David Anglin wrote:
> I haven't seen any file system corruption on rp3440 with several weeks of
> running with clock events.  I just started running 6.12.1 today though.
> 
> I have the following timer config:
> 
> # Timers subsystem
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ is not set
> CONFIG_HIGH_RES_TIMERS=y
> # end of Timers subsystem
> 
> There was some concern about this change on systems where the CPU timers
> aren't synchronized.  What systems do you see this on?
> 
> Dave
> 
> On 2024-12-01 7:26 p.m., matoro wrote:
>> Hi Helge, after booting 6.12, another user (CC'd) and I both observed our 
>> ext4 filesystems being immediately corrupted in the same manner.
>> 
>> Every file that is read or written will have its access/modify times set to 
>> 2446-05-10 18:38:55.0000, which is the maximum ext4 timestamp.  The 32-bit 
>> userspace doesn't seem to be able to handle this at all, as every further 
>> stat() call will error with "Value too large for defined data type".  
>> Unfortunately, simply rolling back to kernel 6.11 is insufficient to 
>> recover, as the filesystem corruption is persistent, and the errors come 
>> from userspace attempting to read the modified files.  I was able to 
>> recover with a command like:  find / -newermt 2446-01-01 -o -newerct 
>> 2446-01-01 -o -newerat 2446-01-01 | xargs touch -h
>> 
>> Luckily, lindholm was able to bisect and identified the culprit commit:  
>> b5ff52be891347f8847872c49d7a5c2fa29400a7 ("parisc: Convert to generic 
>> clockevents").  Some other comments from the discussion:
>> 
>> 17:20:37 <awilfox> would be curious if keeping that patch + CONFIG_SMP=n 
>> fixes it
>> 17:20:44 <awilfox> this doesn't look necessarily correct on MP machines
>> 17:23:56 <awilfox> time_keeper_id is now unused; the old code specifically 
>> marked the clocksource as unstable on MP machines despite having per_cpu 
>> before
>> 17:24:11 <awilfox> and now it seems to imply CLOCK_EVT_FEAT_PERCPU is 
>> enough to work around it
>> 17:24:13 <awilfox> maybe it isn't
>> 
>> Thanks!
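
Editorial aside on the timestamp quoted in the report: the value 2446-05-10 18:38:55 is consistent with the upper limit of ext4's extended inode timestamps, which (as far as I understand the on-disk format) store seconds as a signed 32-bit field plus 2 extra epoch bits, i.e. 34 bits in total. The sketch below reproduces that arithmetic. The function names and the UTC-4 offset used to match the reported local time are assumptions made for illustration; nothing here is taken from the kernel sources or from the reporter's system.

```python
from datetime import datetime, timedelta, timezone

# Assumption: ext4 extended timestamps use a signed 32-bit seconds field
# extended by 2 epoch bits, for 34 bits of range starting at -2**31.
EXT4_SECONDS_BITS = 34
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ext4_max_seconds() -> int:
    """Largest seconds-since-epoch value representable under the assumption above."""
    return -(2 ** 31) + 2 ** EXT4_SECONDS_BITS - 1


def to_datetime(seconds: int, utc_offset_hours: int = 0) -> datetime:
    """Convert epoch seconds to an aware datetime without relying on the platform time_t."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return (EPOCH + timedelta(seconds=seconds)).astimezone(tz)


def fits_in_time32(seconds: int) -> bool:
    """True if the value fits a signed 32-bit time_t, as used by older 32-bit userspace."""
    return -(2 ** 31) <= seconds <= 2 ** 31 - 1


if __name__ == "__main__":
    limit = ext4_max_seconds()
    print(limit)
    print(to_datetime(limit).isoformat())      # UTC
    print(to_datetime(limit, -4).isoformat())  # hypothetical UTC-4 local time
    print(fits_in_time32(limit))
```

Under these assumptions the limit does not fit a signed 32-bit time_t, which would be consistent with the "Value too large for defined data type" errors from stat() described in the report.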