linux-kernel - [PATCH v2 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20230825181033.504534-1-bigeasy@linutronix.de>
Date:   Fri, 25 Aug 2023 20:10:27 +0200
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Cc:     bigeasy@...utronix.de, tglx@...utronix.de, boqun.feng@...il.com,
        bristot@...hat.com, bsegall@...gle.com, dietmar.eggemann@....com,
        jstultz@...gle.com, juri.lelli@...hat.com, longman@...hat.com,
        mgorman@...e.de, mingo@...hat.com, rostedt@...dmis.org,
        swood@...hat.com, vincent.guittot@...aro.org, vschneid@...hat.com,
        will@...nel.org
Subject: [PATCH v2 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()

Hi!

This is basically the 'same' patches as send earlier by PeterZ here:
	https://lore.kernel.org/all/20230815110121.117752409@infradead.org/

Here is a diff vs the original submission, which includes what has been
discussed on the list, only patch 5 has been changed:

|5:  2b0bb1b525cf1 ! 5:  91f7c083c398d locking/rtmutex: Use rt_mutex specific scheduler helpers
|    @@ Commit message
|         Link: https://lore.kernel.org/r/20230815111430.421408298@infradead.org
|         Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
|     
|    + ## kernel/futex/pi.c ##
|    +@@
|    + // SPDX-License-Identifier: GPL-2.0-or-later
|    + 
|    + #include <linux/slab.h>
|    ++#include <linux/sched/rt.h>
|    + #include <linux/sched/task.h>
|    + 
|    + #include "futex.h"
|    +@@ kernel/futex/pi.c: int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
|    + 		goto no_block;
|    + 	}
|    + 
|    ++	/*
|    ++	 * Must be done before we enqueue the waiter, here is unfortunately
|    ++	 * under the hb lock, but that *should* work because it does nothing.
|    ++	 */
|    ++	rt_mutex_pre_schedule();
|    ++
|    + 	rt_mutex_init_waiter(&rt_waiter);
|    + 
|    + 	/*
|    +@@ kernel/futex/pi.c: int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
|    + 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
|    + 		ret = 0;
|    + 
|    ++	/*
|    ++	 * Waiter is unqueued.
|    ++	 */
|    ++	rt_mutex_post_schedule();
|    + no_block:
|    + 	/*
|    + 	 * Fixup the pi_state owner and possibly acquire the lock if we
|    +
|      ## kernel/locking/rtmutex.c ##
|     @@ kernel/locking/rtmutex.c: static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
|      		raw_spin_unlock_irq(&lock->wait_lock);
|    @@ kernel/locking/rwbase_rt.c: static int __sched __rwbase_read_lock(struct rwbase_
|      	return ret;
|      }
|      
|    +@@ kernel/locking/rwbase_rt.c: static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
|    + 	/* Force readers into slow path */
|    + 	atomic_sub(READER_BIAS, &rwb->readers);
|    + 
|    ++	rt_mutex_pre_schedule();
|    ++
|    + 	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
|    + 	if (__rwbase_write_trylock(rwb))
|    + 		goto out_unlock;
|    +@@ kernel/locking/rwbase_rt.c: static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
|    + 		if (rwbase_signal_pending_state(state, current)) {
|    + 			rwbase_restore_current_state();
|    + 			__rwbase_write_unlock(rwb, 0, flags);
|    ++			rt_mutex_post_schedule();
|    + 			trace_contention_end(rwb, -EINTR);
|    + 			return -EINTR;
|    + 		}
|    +@@ kernel/locking/rwbase_rt.c: static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
|    + 
|    + out_unlock:
|    + 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
|    ++	rt_mutex_post_schedule();
|    + 	return 0;
|    + }
|    + 
|     
|      ## kernel/locking/rwsem.c ##
|     @@ kernel/locking/rwsem.c: static inline void __downgrade_write(struct rw_semaphore *sem)

I just stuffed this into latest -RT. As for the futex_lock_pi() change,
I wouldn't do the "whole" rt_mutex_pre_schedule() thingy since for futex
as it has to be a nop because no I/O is done. So something like
|  rt_mutex_pre_schedule_futex()
|  {
|    lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
|    WARN_ON_ONCE(tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER));
|    SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
|    WARN_ON_ONCE(tsk->plug);
|  }

but then why over complicate things. Speaking of over complicating, this
triggered:
| ------------[ cut here ]------------
| WARNING: CPU: 6 PID: 2155 at kernel/locking/spinlock_rt.c:40 rt_spin_lock+0x5a/0xf0
| Modules linked in:
| CPU: 6 PID: 2155 Comm: futex_requeue_p Not tainted 6.5.0-rc7-rt3+ #272
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
| RIP: 0010:rt_spin_lock+0x5a/0xf0
| Call Trace:
|  <TASK>
|  ? rt_spin_lock+0x5a/0xf0
|  futex_wait_requeue_pi+0x3a4/0x5c0
|  do_futex+0x152/0x190
|  __x64_sys_futex+0x110/0x1b0
|  do_syscall_64+0x44/0x90
|  entry_SYSCALL_64_after_hwframe+0x6f/0xd9
| RIP: 0033:0x7f8a79ed34f9
| ---[ end trace 0000000000000000 ]---

and is
| static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
| {
|         lockdep_assert(!current->pi_blocked_on);

Here, rt_mutex_wait_proxy_lock() returned with -EINTR, didn't acquire
the lock. Later rt_mutex_cleanup_proxy_lock() would clean
->pi_blocked_on but that happens after 
|                 /* Current is not longer pi_blocked_on */
|                 spin_lock(q.lock_ptr);

and that comment is accurate for r != 0. This isn't a regression from
before, I just didn't spot it earlier.

Sebastian