linux-kernel - Re: [PATCH] mm/page-writeback: Consolidate wb_thresh bumping logic into __wb_calc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2xndprbkr5k5qer4zb6ov35fa5ym7c36q6mcyapdh22ypqxivh@ahuvuqs47yd4>
Date: Thu, 16 Jan 2025 15:56:57 +0100
From: Jan Kara <jack@...e.cz>
To: Guenter Roeck <linux@...ck-us.net>
Cc: Jan Kara <jack@...e.cz>, Jim Zhao <jimzhao.ai@...il.com>, 
	akpm@...ux-foundation.org, willy@...radead.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH] mm/page-writeback: Consolidate wb_thresh bumping logic
 into __wb_calc_thresh

On Wed 15-01-25 08:41:43, Guenter Roeck wrote:
> On 1/15/25 08:07, Jan Kara wrote:
> > On Tue 14-01-25 07:01:08, Guenter Roeck wrote:
> > > On 1/14/25 05:19, Jan Kara wrote:
> > > > On Mon 13-01-25 15:05:25, Guenter Roeck wrote:
> > > > > On Thu, Nov 21, 2024 at 06:05:39PM +0800, Jim Zhao wrote:
> > > > > > Address the feedback from "mm/page-writeback: raise wb_thresh to prevent
> > > > > > write blocking with strictlimit"(39ac99852fca98ca44d52716d792dfaf24981f53).
> > > > > > The wb_thresh bumping logic is scattered across wb_position_ratio,
> > > > > > __wb_calc_thresh, and wb_update_dirty_ratelimit. For consistency,
> > > > > > consolidate all wb_thresh bumping logic into __wb_calc_thresh.
> > > > > > 
> > > > > > Reviewed-by: Jan Kara <jack@...e.cz>
> > > > > > Signed-off-by: Jim Zhao <jimzhao.ai@...il.com>
> > > > > 
> > > > > This patch triggers a boot failure with one of my 'sheb' boot tests.
> > > > > It is seen when trying to boot from flash (mtd). The log says
> > > > > 
> > > > > ...
> > > > > Starting network: 8139cp 0000:00:02.0 eth0: link down
> > > > > udhcpc: started, v1.33.0
> > > > > EXT2-fs (mtdblock3): error: ext2_check_folio: bad entry in directory #363: : directory entry across blocks - offset=0, inode=27393, rec_len=3072, name_len=2
> > > > > udhcpc: sending discover
> > > > > udhcpc: sending discover
> > > > > udhcpc: sending discover
> > > > > EXT2-fs (mtdblock3): error: ext2_check_folio: bad entry in directory #363: : directory entry across blocks - offset=0, inode=27393, rec_len=3072, name_len=2
> > > > 
> > > > Thanks for report! Uh, I have to say I'm very confused by this. It is clear
> > > > than when ext2 detects the directory corruption (we fail checking directory
> > > > inode 363 which is likely /etc/init.d/), the boot fails in interesting
> > > > ways. What is unclear is how the commit can possibly cause ext2 directory
> > > > corruption.  If you didn't verify reverting the commit fixes the issue, I'd
> > > > be suspecting bad bisection but that obviously isn't the case :-)
> > > > 
> > > > Ext2 is storing directory data in the page cache so at least it uses the
> > > > subsystem which the patch impacts but how writeback throttling can cause
> > > > ext2 directory corruption is beyond me. BTW, do you recreate the root
> > > > filesystem before each boot? How exactly?
> > > 
> > > I use pre-built root file systems. For sheb, they are at
> > > https://github.com/groeck/linux-build-test/tree/master/rootfs/sheb
> > 
> > Thanks. So the problematic directory is /usr/share/udhcpc/ where we
> > read apparently bogus metadata at the beginning of that directory.
> > 
> > > I don't think this is related to ext2 itself. Booting an ext2 image from
> > > ata/ide drive works.
> > 
> > Interesting this is specific to mtd. I'll read the patch carefully again if
> > something rings a bell.
> > 
> 
> Interesting. Is there some endianness issue, by any chance ? I only see the problem
> with sheb (big endian), not with sh (little endian). I'd suspect that it is an
> emulation bug, but it is odd that the problem did not show up before.

So far I don't have a good explanation. Let me write down here the facts,
maybe it will trigger the aha effect.

1) Ext2 stores the metadata in little endian ordering. We observe the
problem with the first directory entry in the folio. Both entry->rec_len
(16-bit) and entry->inode (32-bit) appear to be seen in wrong endianity

2) The function that fails is ext2_check_folio(). We kmap_local() the folio
in ext2_get_folio(), then in ext2_check_folio() we do:

	ext2_dirent *p;

	p = (ext2_dirent *)(kaddr + 0);
	rec_len = ext2_rec_len_from_disk(p->rec_len);
	^^^ value 3072 == 0x0c00 seen here instead of correct 0x000c
	this value is invalid so we go to:
	ext2_error(sb, __func__, "bad entry in directory #%lu: : %s - "
                        "offset=%llu, inode=%lu, rec_len=%d, name_len=%d",
                        dir->i_ino, error, folio_pos(folio) + offs,
                        (unsigned long) le32_to_cpu(p->inode),
                        rec_len, p->name_len);

	Here rec_len is printed so we see the wrong value. Also
le32_to_cpu(p->inode) is printed which also shows number with swapped byte
ordering (the message contains inode number 27393 == 0x00006b01 but the
proper inode number is 363 == 0x0000016b). This actually releals more about
the problem because only the two bytes were swapped in the inode number
although we always treat it as 32-bit entity. So this would indeed point
more at some architectural issue rather than a problem in the filesystem /
MM.

Note that to get at this point in the boot we must have correctly
byteswapped many other directory entries in the filesystem. So the problem
must be triggered by some parallel activity happening in the system or
something like that.

3) The problem appears only with MTD storage, not with IDE/SATA on the same
system + filesystem image. It it unclear how the storage influences the
reproduction, rather than that it influences timing of events in the
system.

4) The problem reliably happens with "mm/page-writeback: Consolidate wb_thresh
bumping logic into __wb_calc_thresh", not without it. All this patch does
is that it possibly changes a limit at which processes dirtying pages in
the page cache get throttled. Thus there are fairly limited opportunities
for how it can cause damage (I've checked for possible UAF issues or memory
corruption but I don't really see any such possibility there, it is just
crunching numbers from the mm counters and takes decision based on the
result). This change doesn't have direct on the directory ext2 code. The only
thing it does is that it possibly changes code alignment of ext2 code if it
gets linked afterwards into vmlinux image (provided ext2 is built in). Another
possibility is again that it changes timing of events in the system due to
differences in throttling of processes dirtying page cache.

So at this point I don't have a better explanation than blame the HW. What
really tipped my conviction in this direction is the 16-bit byteswap in a
32-bit entity. Hence I guess I'll ask Andrew to put Jim's patch back into
tree if you don't object.
								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR