Message-ID: <20210719162418.GA28003@zipoli.concurrent-rt.com>
Date:   Mon, 19 Jul 2021 12:24:18 -0400
From:   Joe Korty <joe.korty@...current-rt.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Lee Jones <lee.jones@...aro.org>
Cc:     Steven Rostedt <rostedt@...dmis.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: [BUG] 4.4.262: infinite loop in futex_unlock_pi (EAGAIN loop)

[BUG] 4.4.262: infinite loop in futex_unlock_pi (EAGAIN loop)

   [ replicator, attached ]
   [ workaround patch that crudely clears the loop, attached ]
   [ 4.4.256 does _not_ have this problem, 4.4.262 is known to have it ]

When a certain, secure-site application is run on 4.4.262, it locks up and
is unkillable.  Crash(8) and sysrq backtraces show that the application
is looping in the kernel in futex_unlock_pi.

Between 4.4.256 and .257, 4.4 got this 4.12 patch backported into it:

   73d786b ("[PATCH] futex: Rework inconsistent rt_mutex/futex_q state")

This patch has the following comment:

   The only problem is that this breaks RT timeliness guarantees. That
   is, consider the following scenario:

      T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

        CPU0

        T1
          lock_pi()
          queue_me()  <- Waiter is visible
   
        preemption

        T2
          unlock_pi()
            loops with -EAGAIN forever

   Which is undesirable for PI primitives. Future patches will rectify
   this.

This describes the situation exactly.  To prove it, we developed a little
kernel patch that, on loop detection, puts a message into the kernel log for
just the first occurrence, keeps a count of the number of occurrences seen
since boot, and tries to break out of the loop via usleep_range(1000,1000).
Note that the patch is not really needed for replication.  It merely shows,
by 'fixing' the problem, that it really is the EAGAIN loop that triggers
the lockup.
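
The idea behind that instrumentation can be modelled in user space as follows. This is only a sketch under stated assumptions, not the attached kernel patch: try_unlock() is a hypothetical stand-in for the operation that keeps returning -EAGAIN, the threshold parameter is an assumption, and usleep(1000) stands in for the kernel's usleep_range(1000,1000).

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Number of suspected loops seen since program start. */
static unsigned long eagain_loops;

/*
 * Hypothetical stand-in for the failing operation: returns -EAGAIN
 * while *remaining > 0 (decrementing it each time), then returns 0.
 */
static int try_unlock(int *remaining)
{
    if (*remaining > 0) {
        (*remaining)--;
        return -EAGAIN;
    }
    return 0;
}

/*
 * Retry until try_unlock() succeeds. Every `threshold` consecutive
 * -EAGAIN results are treated as one detected loop: log a message for
 * the first detection only, bump the counter, and sleep briefly so a
 * preempted lower-priority task gets a chance to run.
 */
static int unlock_with_loop_detection(int *remaining, int threshold)
{
    static int warned;
    int spins = 0;
    int ret;

    while ((ret = try_unlock(remaining)) == -EAGAIN) {
        if (++spins < threshold)
            continue;
        if (!warned) {
            warned = 1;
            fprintf(stderr, "EAGAIN loop detected\n");
        }
        eagain_loops++;
        usleep(1000);   /* user-space analogue of usleep_range() */
        spins = 0;
    }
    return ret;
}
```

The sleep is what breaks the livelock in the model: it takes the spinning caller off the CPU, which is the effect the workaround relies on when the other task is pinned to the same CPU at a lower priority.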

Along with this patch, we submit a replicator.  Running the replicator
with the patch applied shows that 4.4.256 does not have the problem, while
4.4.267 and the latest 4.4, 4.4.275, do.  In addition, 4.9.274 (tested
w/o the patch) does not have the problem.

This pattern suggests that some futex fixup patch was backported to 4.9
but failed to make it to 4.4.

Acknowledgements: My colleague, Scott Shaffer, performed the crash/sysrq
analysis that found the futex_unlock_pi loop, and he raised the suspicion
that commit 73d786b might be the cause.

Signed-off-by: Joe Korty <joe.korty@...current-rt.com>


View attachment "futex-unlock-pi-eagain-hack" of type "text/plain" (3153 bytes)

View attachment "1-1.c" of type "text/plain" (3997 bytes)

View attachment "posixtest.h" of type "text/plain" (465 bytes)