linux-kernel - Re: Crash with PREEMPT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221201123752.po5z2qpmitafuzhn@quack3>
Date:   Thu, 1 Dec 2022 13:37:52 +0100
From:   Jan Kara <jack@...e.cz>
To:     Pierre Gondois <pierre.gondois@....com>
Cc:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Will Deacon <will@...nel.org>, Jan Kara <jack@...e.cz>,
        Waiman Long <longman@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mel Gorman <mgorman@...e.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Catalin Marinas <catalin.marinas@....com>
Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Wed 30-11-22 18:20:27, Pierre Gondois wrote:
> 
> On 11/28/22 16:58, Sebastian Andrzej Siewior wrote:
> > How about this?
> > 
> > - The fast path is easy…
> > 
> > - The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
> >    I made that one _acquire so that it is visible by the unlocker forcing
> >    everyone into slow path.
> > 
> > - If the lock is acquired, then the owner is written via
> >    rt_mutex_set_owner(). This happens under wait_lock so it is
> >    serialized and so a WRITE_ONCE() is used to write the final owner. I
> >    replaced it with a cmpxchg_acquire() to have the owner there.
> >    Not sure if I shouldn't make this as you put it:
> > |   e.g. by making use of dependency ordering where it already exists.
> >    The other (locking) CPU needs to see the owner not only the WAITER
> >    bit. I'm not sure if this could be replaced with smp_store_release().
> > 
> > - After the whole operation completes, fixup_rt_mutex_waiters() cleans
> >    the WAITER bit and I kept the _acquire semantic here. Now looking at
> >    it again, I don't think that needs to be done since that shouldn't be
> >    set here.
> > 
> > - There is rtmutex_spin_on_owner() which (as the name implies) spins on
> >    the owner as long as it active. It obtains it via READ_ONCE() and I'm
> >    not sure if any memory barrier is needed. Worst case is that it will
> >    spin while owner isn't set if it observers a stale value.
> > 
> > - The unlock path first clears the waiter bit if there are no waiters
> >    recorded (via simple assignments under the wait_lock (every locker
> >    will fail with the cmpxchg_acquire() and go for the wait_lock)) and
> >    then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
> >    Should there be a wait, it will just store the WAITER bit with
> >    smp_store_release() (setting the owner is NULL but the WAITER bit
> >    forces everyone into the slow path).
> > 
> > - Added rt_mutex_set_owner_pi() which does simple assignment. This is
> >    used from the futex code and here everything happens under a lock.
> > 
> > - I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
> >    *think* want to observe a real waiter and not something stale.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> 
> 
> Hello,
> Just to share some debug attempts, I tried Sebastian's patch and could not
> reproduce the error. While trying to understand the solution, I could not
> reproduce the error if I only took the changes made to
> mark_rt_mutex_waiters(), or to rt_mutex_set_owner_pi(). I am not sure I
> understand why this would be a rt-mutex issue.
> 
> Without Sebastian's patch, to try adding some synchronization around the
> 'i_wb_list', I did the following:
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 443f83382b9b..42ce1f7f8aef 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1271,10 +1271,10 @@ void sb_clear_inode_writeback(struct inode *inode)
>         struct super_block *sb = inode->i_sb;
>         unsigned long flags;
> -       if (!list_empty(&inode->i_wb_list)) {
> +       if (!list_empty_careful(&inode->i_wb_list)) {

In my debug attempts I've actually completely removed this unlocked check
and the corruption still triggered.

>                 spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> -               if (!list_empty(&inode->i_wb_list)) {
> -                       list_del_init(&inode->i_wb_list);
> +               if (!list_empty_careful(&inode->i_wb_list)) {
> +                       list_del_init_careful(&inode->i_wb_list);

This shouldn't be needed, at least once unlocked checks are removed. Also
even if they stay, the list should never get corrupted because all the
modifications are protected by the spinlock. This is why we eventually
pointed to the rt_mutex as the problem.

It may be possible that your change adds enough memory ordering so that the
missing ordering in rt_mutex does not matter anymore. 

> diff --git a/fs/inode.c b/fs/inode.c
> index b608528efd3a..fbe6b4fe5831 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -621,7 +621,7 @@ void clear_inode(struct inode *inode)
>         BUG_ON(!list_empty(&inode->i_data.private_list));
>         BUG_ON(!(inode->i_state & I_FREEING));
>         BUG_ON(inode->i_state & I_CLEAR);
> -       BUG_ON(!list_empty(&inode->i_wb_list));
> +       BUG_ON(!list_empty_careful(&inode->i_wb_list));
>         /* don't need i_lock here, no concurrent mods to i_state */
>         inode->i_state = I_FREEING | I_CLEAR;
>  }
> 
> I never stepped on the:
>   BUG_ON(!list_empty_careful(&inode->i_wb_list))
> statement again, but got the dump at [2]. I also regularly end-up
> with the following endless logs when trying other things, when rebooting:
> 
> EXT4-fs (nvme0n1p3): sb orphan head is 2840597
> sb_info orphan list:
>   inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
>   inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
>   ...

Looks like another list corruption - in this case ext4 list of inodes that
are undergoing truncate.

> Also, Jan said that the issue was reproducible on 'two different aarch64
> machines', cf [1]. Would it be possible to know which one ?

Both had the same Ampere CPU. Full cpuinfo is here:

https://lore.kernel.org/all/20221107135636.biouna36osqc4rik@quack3/

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR