Message-ID: <874ipqc234.fsf@oracle.com>
Date: Mon, 15 Dec 2025 23:34:07 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-arch <linux-arch@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Linux Power Management <linux-pm@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>,
        Arnd Bergmann <arnd@...db.de>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mark Rutland <mark.rutland@....com>,
        harisokn@...zon.com,
        Christoph Lameter <cl@...two.org>,
        Alexei Starovoitov <ast@...nel.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        zhenglifeng1@...wei.com,
        xueshuai@...ux.alibaba.com,
        joao.m.martins@...cle.com,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        konrad.wilk@...cle.com
Subject: Re: [PATCH v8 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout()


Alexei Starovoitov <alexei.starovoitov@...il.com> writes:

> On Sun, Dec 14, 2025 at 8:51 PM Ankur Arora <ankur.a.arora@...cle.com> wrote:
>>
>>  /**
>>   * resilient_queued_spin_lock_slowpath - acquire the queued spinlock
>>   * @lock: Pointer to queued spinlock structure
>> @@ -415,7 +415,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
>>          */
>>         if (val & _Q_LOCKED_MASK) {
>>                 RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
>> -               res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, timeout_err, _Q_LOCKED_MASK) < 0);
>> +               smp_cond_load_acquire_timeout(&lock->locked, !VAL,
>> +                                             (timeout_err = clock_deadlock(lock, _Q_LOCKED_MASK, &ts)),
>> +                                             ts.duration);
>
> I'm pretty sure we already discussed this and pointed out that
> this approach is not acceptable.

Where? I don't see any mail on this at all.

In any case, your technical point below is quite reasonable. It's better
to lead with that instead of peremptorily declaring what you find
acceptable and what you don't.

> We cannot call ktime_get_mono_fast_ns() first.
> That's why RES_CHECK_TIMEOUT() exists and it does
> if (!(ts).spin++)
> before doing the first check_timeout() that will do ktime_get_mono_fast_ns().
> Above is a performance critical lock acquisition path where
> pending is spinning on the lock word waiting for the owner to
> release the lock.
> Adding unconditional ktime_get_mono_fast_ns() will destroy
> performance for quick critical section.

Yes, that makes sense.

This is not ideal, but how about something like this:

  #define smp_cond_load_relaxed_timeout(ptr, cond_expr,                 \
                                      time_expr_ns, timeout_ns)         \
  ({                                                                    \
        typeof(ptr) __PTR = (ptr);                                      \
        __unqual_scalar_typeof(*ptr) VAL;                               \
        u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;                   \
        s64 __timeout = (s64)timeout_ns;                                \
        s64 __time_now, __time_end = 0;                                 \
                                                                        \
        for (;;) {                                                      \
                VAL = READ_ONCE(*__PTR);                                \
                if (cond_expr)                                          \
                        break;                                          \
                cpu_poll_relax(__PTR, VAL, __timeout);                  \
                if (++__n < __spin)                                     \
                        continue;                                       \
                /* Only sample the time once every __spin polls. */    \
                __time_now = (s64)(time_expr_ns);                       \
                if (unlikely(__time_end == 0))                          \
                        __time_end = __time_now + __timeout;            \
                __timeout = __time_end - __time_now;                    \
                /* Non-positive time, or the deadline has passed. */    \
                if (__time_now <= 0 || __timeout <= 0) {                \
                        VAL = READ_ONCE(*__PTR);                        \
                        break;                                          \
                }                                                       \
                __n = 0;                                                \
        }                                                               \
                                                                        \
        (typeof(*ptr))VAL;                                              \
  })
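
For concreteness, roughly how the call site from the quoted hunk could
sit on top of this, using ktime_get_mono_fast_ns() as the time
expression and the acquire-ordered wrapper from $SUBJECT. It only
covers the timing side (lockval is just a local for illustration, and
the deadlock checking currently folded into the time expression is
left out), so it's an illustration rather than the actual patch:

  if (val & _Q_LOCKED_MASK) {
          u8 lockval;

          RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
          lockval = smp_cond_load_acquire_timeout(&lock->locked, !VAL,
                                                  ktime_get_mono_fast_ns(),
                                                  ts.duration);
          /* Lock byte still set after ts.duration: report a timeout. */
          if (lockval)
                  timeout_err = -ETIMEDOUT;
  }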

The problem with this version is that there's no way to know how much
time has passed in the first cpu_poll_relax(). So, for the waiting case,
this has a built-in overshoot of up to __timeout with WFET, ~100us with
WFE, and ~2us with x86 cpu_relax().

Of course we could specify a small __timeout value for the first
iteration, which would help WFET.
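
Something like this, where SMP_TIMEOUT_FIRST_POLL_NS is a made-up name
for a small cap on how long the very first cpu_poll_relax() may wait,
so WFET can't sleep past a deadline we haven't computed yet:

  #define smp_cond_load_relaxed_timeout(ptr, cond_expr,                 \
                                      time_expr_ns, timeout_ns)         \
  ({                                                                    \
        typeof(ptr) __PTR = (ptr);                                      \
        __unqual_scalar_typeof(*ptr) VAL;                               \
        u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;                   \
        s64 __timeout = (s64)timeout_ns;                                \
        /* Cap polls until the deadline is known. */                    \
        s64 __poll = min_t(s64, __timeout, SMP_TIMEOUT_FIRST_POLL_NS);  \
        s64 __time_now, __time_end = 0;                                 \
                                                                        \
        for (;;) {                                                      \
                VAL = READ_ONCE(*__PTR);                                \
                if (cond_expr)                                          \
                        break;                                          \
                cpu_poll_relax(__PTR, VAL, __poll);                     \
                if (++__n < __spin)                                     \
                        continue;                                       \
                __time_now = (s64)(time_expr_ns);                       \
                if (unlikely(__time_end == 0))                          \
                        __time_end = __time_now + __timeout;            \
                __timeout = __time_end - __time_now;                    \
                /* Deadline known; poll for the full remainder. */      \
                __poll = __timeout;                                     \
                if (__time_now <= 0 || __timeout <= 0) {                \
                        VAL = READ_ONCE(*__PTR);                        \
                        break;                                          \
                }                                                       \
                __n = 0;                                                \
        }                                                               \
                                                                        \
        (typeof(*ptr))VAL;                                              \
  })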

But that is more complexity. Anyway, let me take another look at this
tomorrow.

Opinions?

--
ankur
