Message-ID: <ce60b576-b8bd-49da-90b1-a84c7cb0eb9e@redhat.com>
Date: Tue, 8 Oct 2024 09:21:34 -0400
From: Waiman Long <llong@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>, Waiman Long <llong@...hat.com>
Cc: Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
Boqun Feng <boqun.feng@...il.com>, linux-kernel@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>, Luis Goncalves <lgoncalv@...hat.com>,
Chunyu Hu <chuhu@...hat.com>
Subject: Re: [PATCH] locking/rtmutex: Always use trylock in rt_mutex_trylock()
On 10/8/24 3:38 AM, Peter Zijlstra wrote:
> On Mon, Oct 07, 2024 at 11:54:54AM -0400, Waiman Long wrote:
>> On 10/7/24 11:33 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 07, 2024 at 11:23:32AM -0400, Waiman Long wrote:
>>>
>>>>> Is the problem that:
>>>>>
>>>>> sched_tick()
>>> raw_spin_lock(&rq->__lock);
>>>>> task_tick_mm_cid()
>>>>> task_work_add()
>>>>> kasan_save_stack()
>>>>> idiotic crap while holding rq->__lock ?
>>>>>
>>>>> Because afaict that is completely insane. And has nothing to do with
>>>>> rtmutex.
>>>>>
>>>>> We are not going to change rtmutex because instrumentation shit is shit.
>>>> Yes, it is KASAN that causes the page allocation while holding
>>>> rq->__lock. Maybe we can blame KASAN for this. It is actually not a
>>>> problem for a non-PREEMPT_RT kernel because only trylock is used there.
>>>> However, trylock is not used all the way down when rt_spin_trylock() is
>>>> used with a PREEMPT_RT kernel.
>>> It has nothing to do with trylock, and everything to do with scheduler
>>> locks being special.
>>>
>>> But even so, trying to squirrel a spinlock inside a raw_spinlock is
>>> dodgy at the best of times, yes it mostly works, but should be avoided
>>> whenever possible.
>>>
>>> And instrumentation just doesn't count.
>>>
>>>> This is certainly a problem that we need to fix as there
>>>> may be other similar cases not involving rq->__lock lurking somewhere.
>>> There cannot be, lock order is:
>>>
>>> rtmutex->wait_lock
>>> task->pi_lock
>>> rq->__lock
>>>
>>> Trying to subvert that order gets you a splat, any other:
>>>
>>> raw_spin_lock(&foo);
>>> spin_trylock(&bar);
>>>
>>> will 'work', despite probably not being a very good idea.
>>>
>>> Any case involving the scheduler locks needs to be eradicated, not
>>> worked around.
>> OK, I will see what I can do to work around this issue.
> Something like the completely untested below might just work.
The real problem is the occasional need to allocate new pages to expand
the stack buffer in the stack depot, which takes an additional lock.
Fortunately, there is a kasan_record_aux_stack_noalloc() variant that
prevents that. Below is my proposed solution, which is less restrictive.
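
To illustrate the difference, here is a stand-alone mock-up of my
understanding (only the two helper names are real kernel symbols; the
internals below are an assumption, reduced to the one detail that matters
here, namely whether the stack depot may grow its pool):

/*
 * Mock-up only, NOT the mm/kasan sources.  The _noalloc variant tells
 * the stack depot that it must not allocate a new pool page, so it can
 * never end up in the page allocator under rq->__lock.
 */
#include <stdbool.h>
#include <stdio.h>

/* Stub for saving a stack trace into the stack depot; "can_alloc"
 * decides whether a new pool page may be allocated when the current
 * pool is full. */
static void save_stack_mock(void *addr, bool can_alloc)
{
	printf("record aux stack for %p, %s grow the depot pool\n",
	       addr, can_alloc ? "may" : "must not");
}

static void kasan_record_aux_stack_mock(void *addr)
{
	save_stack_mock(addr, true);	/* may end up in the page allocator */
}

static void kasan_record_aux_stack_noalloc_mock(void *addr)
{
	save_stack_mock(addr, false);	/* reuses existing pool space or drops the stack */
}

int main(void)
{
	int dummy;

	kasan_record_aux_stack_mock(&dummy);
	kasan_record_aux_stack_noalloc_mock(&dummy);
	return 0;
}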
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index cf5e7e891a77..2964171856e0 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -14,11 +14,14 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
}
enum task_work_notify_mode {
- TWA_NONE,
+ TWA_NONE = 0,
TWA_RESUME,
TWA_SIGNAL,
TWA_SIGNAL_NO_IPI,
TWA_NMI_CURRENT,
+
+ TWA_FLAGS = 0xff00,
+ TWAF_NO_ALLOC = 0x0100,
};
static inline bool task_work_pending(struct task_struct *task)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..0259301e572e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10458,7 +10458,9 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
return;
if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
return;
- task_work_add(curr, work, TWA_RESUME);
+
+ /* No page allocation under rq lock */
+ task_work_add(curr, work, TWA_RESUME | TWAF_NO_ALLOC);
}
void sched_mm_cid_exit_signals(struct task_struct *t)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 5d14d639ac71..c969f1f26be5 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -55,15 +55,26 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
enum task_work_notify_mode notify)
{
struct callback_head *head;
+ int flags = notify & TWA_FLAGS;
+ notify &= ~TWA_FLAGS;
if (notify == TWA_NMI_CURRENT) {
if (WARN_ON_ONCE(task != current))
return -EINVAL;
if (!IS_ENABLED(CONFIG_IRQ_WORK))
return -EINVAL;
} else {
- /* record the work call stack in order to print it in KASAN reports */
- kasan_record_aux_stack(work);
+ /*
+ * Record the work call stack in order to print it in KASAN
+ * reports.
+ *
+ * Note that stack allocation can fail if TWAF_NO_ALLOC flag
+ * is set and new page is needed to expand the stack buffer.
+ */
+ if (flags & TWAF_NO_ALLOC)
+ kasan_record_aux_stack_noalloc(work);
+ else
+ kasan_record_aux_stack(work);
}
head = READ_ONCE(task->task_works);
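
A side note on the encoding, shown as a stand-alone user-space mock-up
rather than kernel code: the flag bits live in the high byte of the notify
argument, so existing callers that pass a bare TWA_* value are unaffected,
and task_work_add() just masks the flags off before looking at the notify
mode.

#include <stdio.h>

/* Mirrors the enum from the patch above; illustration only. */
enum task_work_notify_mode {
	TWA_NONE = 0,
	TWA_RESUME,
	TWA_SIGNAL,
	TWA_SIGNAL_NO_IPI,
	TWA_NMI_CURRENT,

	TWA_FLAGS = 0xff00,	/* bits reserved for modifier flags */
	TWAF_NO_ALLOC = 0x0100,	/* don't allocate pages when recording the KASAN stack */
};

int main(void)
{
	int notify = TWA_RESUME | TWAF_NO_ALLOC;	/* what task_tick_mm_cid() now passes */
	int flags = notify & TWA_FLAGS;			/* -> TWAF_NO_ALLOC */

	notify &= ~TWA_FLAGS;				/* -> plain TWA_RESUME again */
	printf("notify=%d flags=%#x\n", notify, flags);
	return 0;
}

Running it prints "notify=1 flags=0x100", i.e. the plain TWA_RESUME mode
plus the TWAF_NO_ALLOC modifier.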