Message-ID: <87fr7q9lr2.ffs@tglx>
Date: Wed, 28 Jan 2026 09:36:49 +0100
From: Thomas Gleixner <tglx@...nel.org>
To: Yuwen Chen <ywen.chen@...mail.com>
Cc: akpm@...ux-foundation.org, andrealmeid@...lia.com,
 bigeasy@...utronix.de, colin.i.king@...il.com, dave@...olabs.net,
 dvhart@...radead.org, edliaw@...gle.com, justinstitt@...gle.com,
 kernel-team@...roid.com, licayy@...mail.com, linux-kernel@...r.kernel.org,
 linux-kselftest@...r.kernel.org, luto@....edu, mingo@...hat.com,
 morbo@...gle.com, nathan@...nel.org, ndesaulniers@...gle.com,
 peterz@...radead.org, shuah@...nel.org, usama.anjum@...labora.com,
 wakel@...gle.com, ywen.chen@...mail.com
Subject: Re: [PATCH v2] selftests/futex: fix the failed futex_requeue test
 issue

On Wed, Jan 28 2026 at 11:29, Yuwen Chen wrote:
> On Tue, 27 Jan 2026 19:30:31 +0100, Thomas Gleixner wrote:
>> Extremely high?
>> 
>> The main thread waits for 10000us aka. 10 seconds to allow the waiter
>> thread to reach futex_wait().
>> 
>> If anything is extreme then it's the 10 seconds wait, not the
>> requirements. Please write factual changelogs and not fairy tales.
>
> 10,000 us is equal to 10 ms. On a specific ARM64 platform, it's quite
> common for this test case to fail when there is a 10-millisecond waiting
> time.

Sorry. Somehow my tired brain converted microseconds to milliseconds.

But looking at it again with an awake brain: your change does not address
the underlying problem at all. It just papers over it to the extent that it
can no longer be observed. Assume the following situation:

   CPU0                                 CPU1
   pthread_create()
   ...                                  run new thread
                                        --> preemption
   for (i = 0; i < 100; i++) {
       if (waiting_on_futex)
           break;
       usleep(100);
   }

   -> fail
       
As this still sleeps only 10ms in total, it just works by chance for you,
but there is no guarantee that it works under a wide range of scenarios. So
this needs to increase the total wait time to, let's say, 1 second, which is
fine because the wait check terminates the loop once the other thread has
reached the wait condition.
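
A minimal sketch of what that could look like (the waiting_on_futex flag is
taken from the diagram above; the helper name and the numbers are
illustrative only, not the actual patch):

  #include <stdbool.h>
  #include <unistd.h>

  /* Set by the waiter thread on its way towards futex_wait(). */
  extern volatile bool waiting_on_futex;

  /* Poll for up to ~1s; bail out early once the waiter is ready. */
  static bool wait_for_waiter(void)
  {
          int i;

          for (i = 0; i < 10000; i++) {   /* 10000 * 100us = 1s */
                  if (waiting_on_futex)
                          return true;
                  usleep(100);
          }
          return false;
  }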

>> That's a known issue for all futex selftests when the test system is
>> under extreme load. That's why there is a gracious 10 seconds timeout,
>> which is annoyingly long already.
>> 
>> Also why is this special for the requeue_single test case?
>> 
>> It's exactly the same issue for all futex selftests including the multi
>> waiter one in the very same file, no?
>
> Yes, this is a common phenomenon. However, for ease of illustration, only
> the requeue_single case is listed here.

Sure, but why are you then implementing it per case instead of making it a
generally usable facility and fixing up _all_ problematic cases which rely
on the sleep in one go?
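
To sketch the direction (the helper name and its placement are made up,
there is no such facility in the futex selftests today), the flag-plus-poll
pattern could live in a shared header and replace the bare sleep in every
affected test:

  #include <stdbool.h>
  #include <unistd.h>

  /*
   * Wait up to timeout_us for *flag to become true, polling every 100us.
   * Each test's waiter thread sets *flag on its way into futex_wait().
   */
  static inline bool futextest_wait_for_flag(volatile bool *flag,
                                             unsigned int timeout_us)
  {
          unsigned int slept = 0;

          while (!*flag) {
                  if (slept >= timeout_us)
                          return false;
                  usleep(100);
                  slept += 100;
          }
          return true;
  }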

>> Why do you need an atomic store here?
>> 
>> pthread_barrier_wait() is a full memory barrier already, no?
>
> Yes, there's no need to use atomics here. However, in the kernel,
> WRITE_ONCE() and READ_ONCE() would typically be used. Since they are not
> readily available here, atomics were adopted instead.

You don't need READ/WRITE_ONCE() at all as there is no concurrency. The
waiter thread writes before invoking pthread_barrier_wait() so the
control thread _cannot_ read concurrently. Ergo there is no need for any
of this voodoo.
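
In other words, the ordering being described is (names are illustrative; the
point is that the store (1) and the load (4) are separated by the barrier
pair, so plain accesses are sufficient):

  #include <pthread.h>
  #include <stdbool.h>
  #include <stddef.h>

  static pthread_barrier_t barrier;
  static bool waiter_ready;                      /* plain bool, no atomics */

  static void *waiterfn(void *arg)
  {
          (void)arg;
          waiter_ready = true;                   /* (1) plain store          */
          pthread_barrier_wait(&barrier);        /* (2) full memory barrier  */
          /* ... proceed towards futex_wait() ... */
          return NULL;
  }

  static bool control_sees_waiter(void)
  {
          pthread_barrier_wait(&barrier);        /* (3) pairs with (2)       */
          return waiter_ready;                   /* (4) cannot race with (1) */
  }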

>> What's wrong with reading /proc/$PID/wchan ?
>> 
>> It's equally unreliable as /proc/$PID/stat because both can return the
>> desired state _before_ the thread reaches the inner workings of the test
>> related sys_futex(... WAIT).
>
> Is it possible for the waiterfn to enter the sleep state between the
> pthread_barrier_wait function and the futex_wait function?

No, but it can reach sleep state _before_ even reaching the thread
function. pthread_barrier_wait() itself can result in a futex_wait() too
if the control thread did not reach pthread_barrier_wait() before, but
that's harmless because then the control thread will wake the waiter
thread _before_ checking the state of the waiter.
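
For reference, a wchan based check would look roughly like this (the helper
is made up and the exact symbol reported by wchan varies between kernel
versions); as noted above it cannot tell the barrier's futex_wait() apart
from the one the test actually cares about:

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* True if the thread with the given tid is blocked in a futex_wait variant. */
  static bool tid_in_futex_wait(pid_t tid)
  {
          char path[64], wchan[64] = "";
          FILE *f;

          snprintf(path, sizeof(path), "/proc/%d/task/%d/wchan",
                   (int)getpid(), (int)tid);
          f = fopen(path, "r");
          if (!f)
                  return false;
          if (!fgets(wchan, sizeof(wchan), f))
                  wchan[0] = '\0';
          fclose(f);

          /* Matches both the barrier's futex_wait and the test's. */
          return strstr(wchan, "futex_wait") != NULL;
  }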

> If so, would checking the call stack be a solution?

To make it even more complex and convoluted?

Thanks,

        tglx
