linux-kernel - Re: Crash with PREEMPT

Message-ID: <20221107135636.biouna36osqc4rik@quack3>
Date:   Mon, 7 Nov 2022 14:56:36 +0100
From:   Jan Kara <jack@...e.cz>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Jan Kara <jack@...e.cz>, LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mel Gorman <mgorman@...e.de>
Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Fri 04-11-22 17:30:29, Sebastian Andrzej Siewior wrote:
> On 2022-11-03 12:54:44 [+0100], Jan Kara wrote:
> > Hello,
> Hi,
> 
> > I was tracking down the following crash with 6.0 kernel with
> > patch-6.0.5-rt14.patch applied:
> > 
> > [ T6611] ------------[ cut here ]------------
> > [ T6611] kernel BUG at fs/inode.c:625!
> 
> seems like an off-by-one ;)
> 
> > The machine is aarch64 architecture, kernel config is attached. I have seen
> > the crashes also with 5.14-rt kernel so it is not a new thing. The crash is
> > triggered relatively reliably (on two different aarch64 machines) by our
> > performance testing framework when running dbench benchmark against an XFS
> > filesystem.
> 
> different aarch64 machines as in different SoC? Or the same CPU twice.
> And no trouble on x86-64 I guess?

The same CPU, it appears, just different machines. The problem never
happened on x86-64, that is correct. /proc/cpuinfo from the two machines
is:

processor	: 0
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

...

There are 80 CPUs in total in the machine.


> > Now originally I thought this is some problem with XFS or writeback code
> > but after debugging this for some time I don't think that anymore.
> > clear_inode() complains about inode->i_wb_list being non-empty. In fact
> > looking at the list_head, I can see it is corrupted. In all the occurrences
> > of the problem ->prev points back to the list_head itself but ->next points
> > to some list_head that used to be part of the sb->s_inodes_wb list (or
> > actually that list spliced in wait_sb_inodes() because I've seen a pointer to
> > the stack as ->next pointer as well).
> 
> so you assume a delete and add operation in parallel?

Yes, I assume sb_clear_inode_writeback() was deleting inode from the list
while wait_sb_inodes() was doing list_move_tail() operation on the same list.
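
To make the suspected race concrete, here is a minimal userspace sketch of one
hypothetical interleaving. It is not kernel code: struct list_head and the
helpers are simplified reimplementations, the function name is made up, and
the order of the stores is an assumption chosen for illustration. It replays,
in a single thread, what could happen if the stores of a list_del_init() and
a list_move_tail() on the same entry were not serialized by the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = new;
	new->next = head;
	new->prev = prev;
	prev->next = new;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/*
 * Replays one assumed interleaving of "A" (list_del_init on entry) and
 * "B" (list_move_tail of entry to dst) as if no lock serialized them.
 * Returns true if entry ends up with ->prev pointing at itself while
 * ->next points at another list_head.
 */
static bool interleaved_del_and_move_corrupts(void)
{
	struct list_head src, dst, entry;

	init_list_head(&src);
	init_list_head(&dst);
	list_add_tail(&entry, &src);
	assert(!list_empty(&src));

	/* Both sides sample the entry's neighbours before any store. */
	struct list_head *a_prev = entry.prev, *a_next = entry.next;
	struct list_head *b_prev = entry.prev, *b_next = entry.next;

	/* A: unlink the entry from src. */
	a_next->prev = a_prev;
	a_prev->next = a_next;

	/* B: unlink again, using its stale view of the neighbours. */
	b_next->prev = b_prev;
	b_prev->next = b_next;

	/* A: first store of the re-initialisation. */
	entry.next = &entry;

	/* B: add the entry to the tail of dst. */
	struct list_head *tail = dst.prev;
	dst.prev = &entry;
	entry.next = &dst;
	entry.prev = tail;
	tail->next = &entry;

	/* A: second store of the re-initialisation. */
	entry.prev = &entry;

	return entry.prev == &entry && entry.next == &dst &&
	       !list_empty(&entry);
}
```

In this replay the entry ends with ->prev pointing at itself and ->next
pointing at the destination head, so a list_empty() check on it fails. That
only shows such a state is reachable without mutual exclusion; it says
nothing about why the exclusion would fail.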

> > This is not just some memory ordering issue with the check in
> > clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in
> > clear_inode(), the problem still reproduces.
> 
> What about dropping the list_empty() check in sb_mark_inode_writeback()
> and sb_clear_inode_writeback() so that the check operation always
> happens within the locked section? Either way, missing an add/delete
> should result in consistent pointers.

I've tested removing the list_empty() checks from sb_mark_inode_writeback()
and sb_clear_inode_writeback() but it changed nothing. The corruption
still happened.

> > If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to
> > raw_spinlock_t, the problem disappears.
> > 
> > Finally, I'd note that the list is modified from three places which makes
> > audit relatively simple. sb_mark_inode_writeback(),
> > sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold
> > sb->s_inode_wblist_lock when modifying the list. So at this point I'm at a
> > loss as to what could be causing this. As unlikely as it seems to me, I've started
> > wondering whether it is not some subtle issue with RT spinlocks on aarch64
> > possibly in combination with interrupts (because sb_clear_inode_writeback()
> > may be called from an interrupt).
> 
> This should be modified from a threaded interrupt so interrupts and
> preemption should be enabled at this point.
> If preemption and/or interrupts are disabled at some point then
> CONFIG_DEBUG_ATOMIC_SLEEP should complain about it.

I see.

> spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> while the actual lock is modified via cmpxchg. 

So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
stops happening as well. So do you suspect some bug in the CPU itself?
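
For readers following along, the fast path / slow path split described above
can be sketched in userspace with C11 atomics. This is only a rough model
under loud assumptions: struct sketch_lock, sketch_trylock() and the
force_slow_path flag are hypothetical names, the atomic_flag merely stands in
for the raw_spinlock_t wait_lock, and the slow path here gives up instead of
queueing and sleeping as a real rt_mutex would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a lock with a cmpxchg fast path. */
struct sketch_lock {
	_Atomic(uintptr_t) owner;	/* 0 means unlocked, else an owner id */
	atomic_flag wait_lock;		/* stand-in for the raw wait_lock */
	unsigned long slow_path_hits;	/* only touched under wait_lock */
};

static void sketch_lock_init(struct sketch_lock *l)
{
	atomic_init(&l->owner, 0);
	atomic_flag_clear(&l->wait_lock);
	l->slow_path_hits = 0;
}

/* Try to move owner from 0 to me with acquire ordering on success. */
static bool sketch_cmpxchg_acquire(struct sketch_lock *l, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->owner, &expected,
			me, memory_order_acquire, memory_order_relaxed);
}

/*
 * Fast path: a single cmpxchg with acquire semantics.
 * Slow path: take wait_lock first, then modify the owner word via cmpxchg.
 * force_slow_path models a debug option that disables the fast path.
 */
static bool sketch_trylock(struct sketch_lock *l, uintptr_t me,
			   bool force_slow_path)
{
	if (!force_slow_path && sketch_cmpxchg_acquire(l, me))
		return true;

	while (atomic_flag_test_and_set_explicit(&l->wait_lock,
						 memory_order_acquire))
		;	/* spin until wait_lock is free */

	l->slow_path_hits++;
	bool got = sketch_cmpxchg_acquire(l, me);

	atomic_flag_clear_explicit(&l->wait_lock, memory_order_release);
	return got;
}

/* Release: owner goes from me back to 0 with release ordering. */
static void sketch_unlock(struct sketch_lock *l, uintptr_t me)
{
	uintptr_t expected = me;
	bool ok = atomic_compare_exchange_strong_explicit(&l->owner,
			&expected, 0, memory_order_release,
			memory_order_relaxed);

	assert(ok);
	(void)ok;
}
```

With force_slow_path set, every acquisition goes through wait_lock, which
adds extra ordering around the cmpxchg. In the uncontended fast path the only
ordering comes from the acquire and release on the owner word itself.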

								Honza

-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR