Message-ID: <13d683ed-793c-b502-44ff-f28114d9386b@redhat.com>
Date:   Sun, 7 Nov 2021 10:24:31 -0500
From:   Waiman Long <longman@...hat.com>
To:     Hillf Danton <hdanton@...a.com>
Cc:     马振华 <mazhenhua@...omi.com>,
        peterz <peterz@...radead.org>, mingo <mingo@...hat.com>,
        will <will@...nel.org>, "boqun.feng" <boqun.feng@...il.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On 11/7/21 04:01, Hillf Danton wrote:
> On Sat, 6 Nov 2021 23:25:38 -0400 Waiman Long wrote:
>> On 11/6/21 08:39, 马振华 wrote:
>>> Dear longman,
>>>
>>> Recently, I found an issue where the rwsem count becomes negative.
>>> It always happens when a task tries to take the lock with
>>> __down_write_killable() and is then killed.
>>>
>>> The issue happens like this:
>>>
>>>              CPU2                               CPU4
>>>      task A [reader]                      task B [writer]
>>>      down_read_killable() [locked]
>>>      sem->count = 0x100
>>>                                    down_write_killable()
>>>                                    sem->count = 0x102 [wait list not empty]
>>>      up_read()
>>>      sem->count = 0x2
>>>                                    SIGKILL received
>>>      down_read_killable()
>>>      sem->count = 0x102 [wait list not empty]
>>>                                    goto out_nolock:
>>>                                    list_del(&waiter.list);
>>>                                    wait list is now empty
>>>                                    sem->count -= RWSEM_FLAG_HANDOFF
>>>                                    sem->count = 0xFE
>>>                                    list_empty(&sem->wait_list) is TRUE
>>>                                    sem->count &= ~RWSEM_FLAG_WAITERS
>>>                                    sem->count = 0xFC
>>>      up_read()
>>>      sem->count -= 0x100
>>>      sem->count = 0xFFFFFFFFFFFFFFFC
>>>      DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
>>>
>>> So sem->count becomes negative after the writer is killed.
>>> I think that if the RWSEM_FLAG_HANDOFF bit was never set, we should not clear it.
>> Thanks for reporting this possible race condition.
>>
>> However, I am still trying to figure out how it is possible to set
>> wstate to WRITER_HANDOFF without actually setting the handoff bit as
>> well. The statement sequence should be as follows:
>>
>> wstate = WRITER_HANDOFF;
>> raw_spin_lock_irq(&sem->wait_lock);
>> if (rwsem_try_write_lock(sem, wstate))
>>         break;
>> raw_spin_unlock_irq(&sem->wait_lock);
>>    :
>> if (signal_pending_state(state, current))
>>         goto out_nolock;
>>
>> The rwsem_try_write_lock() function will make sure that we either
>> acquire the lock with the handoff bit cleared or set the handoff bit.
>> This should be done before we actually check for a pending signal. I
>> do think that it is
> Given that WRITER_HANDOFF does not guarantee that RWSEM_FLAG_HANDOFF is
> set in rwsem_try_write_lock(), the flag should be cleared only by its
> setter to avoid a count underflow.
>
> Hillf
>
>> probably safer to use atomic_long_andnot to clear the handoff bit
>> instead of using atomic_long_add().

I did have a tentative patch to address this issue which is somewhat 
similar to your approach. However, I would like to further investigate 
the exact mechanics of the race condition to make sure that I won't miss 
a latent bug somewhere else in the rwsem code.
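
To make the arithmetic in the reported trace concrete, here is a quick
standalone sketch (user-space C, my own illustration rather than kernel
code; the flag values match the #defines in kernel/locking/rwsem.c). It
reproduces the underflow caused by subtracting an unset handoff bit and
shows why an andnot-style clear is harmless:

#include <stdio.h>

#define RWSEM_FLAG_WAITERS	(1L << 1)
#define RWSEM_FLAG_HANDOFF	(1L << 2)
#define RWSEM_READER_BIAS	(1L << 8)

int main(void)
{
	/* reader holds the lock, a writer is queued: 0x102 */
	long count = RWSEM_READER_BIAS | RWSEM_FLAG_WAITERS;

	/* out_nolock: subtract a handoff bit that was never set */
	count -= RWSEM_FLAG_HANDOFF;	/* 0xFE, reader count corrupted */
	count &= ~RWSEM_FLAG_WAITERS;	/* 0xFC */
	count -= RWSEM_READER_BIAS;	/* up_read: 0xFFFFFFFFFFFFFFFC */
	printf("buggy count = %#lx\n", (unsigned long)count);

	/* an andnot only clears the bit; if it is unset, nothing changes */
	long fixed = RWSEM_READER_BIAS | RWSEM_FLAG_WAITERS;
	fixed &= ~RWSEM_FLAG_HANDOFF;	/* still 0x102 */
	printf("fixed count = %#lx\n", (unsigned long)fixed);
	return 0;
}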

-Longman

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..20be967620c0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -347,7 +347,8 @@ enum rwsem_wake_type {
  enum writer_wait_state {
      WRITER_NOT_FIRST,    /* Writer is not first in wait list */
      WRITER_FIRST,        /* Writer is first in wait list     */
-    WRITER_HANDOFF        /* Writer is first & handoff needed */
+    WRITER_NEED_HANDOFF,    /* Writer is first & handoff needed */
+    WRITER_HANDOFF        /* Writer is first & handoff set */
  };

  /*
@@ -532,11 +533,11 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
   * race conditions between checking the rwsem wait list and setting the
   * sem->count accordingly.
   *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * If wstate is WRITER_NEED_HANDOFF, it will make sure that either the handoff
   * bit is set or the lock is acquired with handoff bit cleared.
   */
  static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
-                    enum writer_wait_state wstate)
+                    enum writer_wait_state *wstate)
  {
      long count, new;

@@ -546,13 +547,13 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
      do {
          bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);

-        if (has_handoff && wstate == WRITER_NOT_FIRST)
+        if (has_handoff && *wstate == WRITER_NOT_FIRST)
              return false;

          new = count;

          if (count & RWSEM_LOCK_MASK) {
-            if (has_handoff || (wstate != WRITER_HANDOFF))
+            if (has_handoff || (*wstate != WRITER_NEED_HANDOFF))
                  return false;

              new |= RWSEM_FLAG_HANDOFF;
@@ -569,8 +570,10 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
       * We have either acquired the lock with handoff bit cleared or
       * set the handoff bit.
       */
-    if (new & RWSEM_FLAG_HANDOFF)
+    if (new & RWSEM_FLAG_HANDOFF) {
+        *wstate = WRITER_HANDOFF;
          return false;
+    }

      rwsem_set_owner(sem);
      return true;
@@ -1083,7 +1086,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
      /* wait until we successfully acquire the lock */
      set_current_state(state);
      for (;;) {
-        if (rwsem_try_write_lock(sem, wstate)) {
+        if (rwsem_try_write_lock(sem, &wstate)) {
              /* rwsem_try_write_lock() implies ACQUIRE on success */
              break;
          }
@@ -1138,7 +1141,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
               */
              if ((wstate == WRITER_FIRST) && (rt_task(current) ||
                  time_after(jiffies, waiter.timeout))) {
-                wstate = WRITER_HANDOFF;
+                wstate = WRITER_NEED_HANDOFF;
                  lockevent_inc(rwsem_wlock_handoff);
                  break;
              }
@@ -1159,7 +1162,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
      list_del(&waiter.list);

      if (unlikely(wstate == WRITER_HANDOFF))
-        atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);
+        atomic_long_andnot(RWSEM_FLAG_HANDOFF, &sem->count);

      if (list_empty(&sem->wait_list))
          atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-- 
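
The structural point of the patch: rwsem_try_write_lock() now takes
wstate by pointer and upgrades it from WRITER_NEED_HANDOFF to
WRITER_HANDOFF only when it has actually set RWSEM_FLAG_HANDOFF in
sem->count. The out_nolock path therefore clears the bit only when it
is known to be set, and does so with atomic_long_andnot(), which cannot
borrow from the reader count even if the bit were somehow clear.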

