linux-kernel - Re: Crash with PREEMPT

Message-ID: <Y2U+Je+LICO2HkNY@linutronix.de>
Date:   Fri, 4 Nov 2022 17:30:29 +0100
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Jan Kara <jack@...e.cz>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mel Gorman <mgorman@...e.de>
Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 2022-11-03 12:54:44 [+0100], Jan Kara wrote:
> Hello,
Hi,

> I was tracking down the following crash with 6.0 kernel with
> patch-6.0.5-rt14.patch applied:
> 
> [ T6611] ------------[ cut here ]------------
> [ T6611] kernel BUG at fs/inode.c:625!

seems like an off-by-one ;)

> The machine is aarch64 architecture, kernel config is attached. I have seen
> the crashes also with 5.14-rt kernel so it is not a new thing. The crash is
> triggered relatively reliably (on two different aarch64 machines) by our
> performance testing framework when running dbench benchmark against an XFS
> filesystem.

Different aarch64 machines as in different SoCs, or the same CPU twice?
And no trouble on x86-64, I guess?

> Now originally I thought this is some problem with XFS or writeback code
> but after debugging this for some time I don't think that anymore.
> clear_inode() complains about inode->i_wb_list being non-empty. In fact
> looking at the list_head, I can see it is corrupted. In all the occurrences
> of the problem ->prev points back to the list_head itself but ->next points
> to some list_head that used to be part of the sb->s_inodes_wb list (or
> actually that list spliced in wait_sb_inodes() because I've seen a pointer to
> the stack as ->next pointer as well).

So you assume a delete and an add operation running in parallel?

> This is not just some memory ordering issue with the check in
> clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in
> clear_inode(), the problem still reproduces.

What about dropping the list_empty() check in sb_mark_inode_writeback()
and sb_clear_inode_writeback() so that the check operation always
happens within the locked section? Either way, missing an add/delete
should result in consistent pointers.
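
To make the suggestion concrete, here is a minimal userspace sketch of
"check only inside the locked section". It is not the kernel code: the
pthread mutex stands in for sb->s_inode_wblist_lock, and struct sb_sketch,
struct inode_sketch and the helper names are hypothetical, made up for
illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	init_list_head(e);
}

/* Hypothetical stand-ins for the superblock list and its lock. */
struct sb_sketch {
	pthread_mutex_t wblist_lock;	/* plays the role of s_inode_wblist_lock */
	struct list_head inodes_wb;	/* plays the role of s_inodes_wb */
};

struct inode_sketch { struct list_head i_wb_list; };

static void sb_sketch_init(struct sb_sketch *sb)
{
	pthread_mutex_init(&sb->wblist_lock, NULL);
	init_list_head(&sb->inodes_wb);
}

/* No unlocked list_empty() pre-check: the test happens under the lock. */
static void mark_inode_writeback(struct sb_sketch *sb, struct inode_sketch *inode)
{
	pthread_mutex_lock(&sb->wblist_lock);
	if (list_empty(&inode->i_wb_list))
		list_add_tail(&inode->i_wb_list, &sb->inodes_wb);
	pthread_mutex_unlock(&sb->wblist_lock);
}

static void clear_inode_writeback(struct sb_sketch *sb, struct inode_sketch *inode)
{
	pthread_mutex_lock(&sb->wblist_lock);
	if (!list_empty(&inode->i_wb_list))
		list_del_init(&inode->i_wb_list);
	pthread_mutex_unlock(&sb->wblist_lock);
}
```

With the check under the lock, repeated mark or clear calls are idempotent
and the pointers stay consistent, at the price of taking the lock every time.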

> If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to
> raw_spinlock_t, the problem disappears.
> 
> Finally, I'd note that the list is modified from three places which makes
> audit relatively simple. sb_mark_inode_writeback(),
> sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold
> sb->s_inode_wblist_lock when modifying the list. So at this point I'm at a
> loss as to what could be causing this. As unlikely as it seems to me, I've started
> wondering whether it is not some subtle issue with RT spinlocks on aarch64
> possibly in combination with interrupts (because sb_clear_inode_writeback()
> may be called from an interrupt).

This should be modified from a threaded interrupt, so interrupts and
preemption should be enabled at this point.
If preemption and/or interrupts are disabled at some point, then
CONFIG_DEBUG_ATOMIC_SLEEP should complain about it.

spinlock_t and raw_spinlock_t differ slightly in terms of locking.
rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
while the actual lock is modified via cmpxchg. 
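
A rough userspace sketch of the fast-path/slow-path split, using C11
atomics. This is not the rt_mutex implementation: struct lock_sketch, the
force_slow flag (standing in for the effect of CONFIG_DEBUG_RT_MUTEXES)
and the trylock-only semantics are assumptions made for illustration. A
real slow path would also enqueue a waiter and block.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct lock_sketch {
	_Atomic uintptr_t owner;	/* 0 when unlocked, else an owner token */
	pthread_mutex_t wait_lock;	/* stand-in for rt_mutex_base::wait_lock */
	int slow_path_taken;		/* counter, only for the demonstration */
};

static void lock_sketch_init(struct lock_sketch *l)
{
	atomic_init(&l->owner, 0);
	pthread_mutex_init(&l->wait_lock, NULL);
	l->slow_path_taken = 0;
}

/* Fast path: a single compare-and-exchange with acquire ordering. */
static bool lock_sketch_try_fast(struct lock_sketch *l, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->owner, &expected, me,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/*
 * With force_slow set, every attempt first takes wait_lock; the owner
 * field is still updated via compare-and-exchange while it is held.
 */
static bool lock_sketch_trylock(struct lock_sketch *l, uintptr_t me, bool force_slow)
{
	bool ok;

	if (!force_slow)
		return lock_sketch_try_fast(l, me);

	pthread_mutex_lock(&l->wait_lock);
	l->slow_path_taken++;
	ok = lock_sketch_try_fast(l, me);
	pthread_mutex_unlock(&l->wait_lock);
	return ok;
}

static void lock_sketch_unlock(struct lock_sketch *l)
{
	atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

The point of forcing the slow path is that it changes which memory
operations surround the acquisition, which can help tell apart an ordering
problem in the fast path from a problem elsewhere.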

> Any ideas?
> 
> 								Honza

Sebastian