Message-ID: <alpine.DEB.2.21.1911081223110.26566@nanos.tec.linutronix.de>
Date: Fri, 8 Nov 2019 12:51:53 +0100 (CET)
From: Thomas Gleixner <tglx@...utronix.de>
To: Florian Weimer <fweimer@...hat.com>
cc: LKML <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Darren Hart <darren@...art.com>,
Yi Wang <wang.yi59@....com.cn>,
Yang Tao <yang.tao172@....com.cn>,
Oleg Nesterov <oleg@...hat.com>,
Carlos O'Donell <carlos@...hat.com>,
Alexander Viro <viro@...iv.linux.org.uk>
Subject: Re: [patch 00/12] futex: Cure robust/PI futex exit races
On Fri, 8 Nov 2019, Thomas Gleixner wrote:
> On Fri, 8 Nov 2019, Florian Weimer wrote:
> > > On Fri, 8 Nov 2019, Florian Weimer wrote:
> > > Unpatched 5.4-rc6:
> > >
> > > FAIL: nptl/tst-thread-affinity-pthread
> > > original exit status 1
> > > info: Detected CPU set size (in bits): 225
> > > info: Maximum test CPU: 255
> > > error: pthread_create for thread 253 failed: Resource temporarily unavailable
> >
> > Huh. Reverting your patches (at commit 26bc672134241a080a83b2ab9aa8abede8d30e1c)
> > fixes the test for me.
> >
> > > TBH, the futex changes have absolutely nothing to do with that resource
> > > fail.
> >
> > I suspect that there are some changes to task exit latency, which
> > triggers the latent resource management bug.
>
> Right, and depending on which hardware you run, this changes. On the big
> testbox I use the failure is also bouncing around between thread 252 and
> 254.
That was just an assumption, and it turns out to be completely wrong.
The failure itself is expected, but the failure output of that test is
totally bonkers. Tracing shows that clone is not failing at all:
ld-linux.so.2-26694 [060] .... 6477.924785: sys_enter: NR 120 (3d0f00, f7cda424, f7cdaba8, ff819790, f7cdaba8, f7edd000)
ld-linux.so.2-26694 [060] .... 6477.924867: sys_exit: NR 120 = 26695
...
ld-linux.so.2-26694 [191] .... 6477.985139: sys_enter: NR 120 (3d0f00, fef27424, fef27ba8, ff819790, fef27ba8, f7edd000)
ld-linux.so.2-26694 [191] .... 6477.985220: sys_exit: NR 120 = 27203
That's a total of 509 threads created. And then right after that:
ld-linux.so.2-26694 [191] .... 6477.985221: sys_enter: NR 192 (0, 801000, 0, 20022, ffffffff, 0)
ld-linux.so.2-26694 [191] .... 6477.985222: sys_exit: NR 192 = -12
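For reference when decoding the raw numbers: in the 32-bit x86 syscall
table NR 120 is clone and NR 192 is mmap2, and -12 is -ENOMEM. That's
easy to sanity check against the uapi headers on an x86 box (assuming
kernel headers are installed):

#include <stdio.h>
#include <errno.h>
#include <asm/unistd_32.h>	/* 32-bit x86 syscall numbers */

int main(void)
{
	/* Should print: clone = 120, mmap2 = 192, ENOMEM = 12 */
	printf("clone = %d, mmap2 = %d, ENOMEM = %d\n",
	       __NR_clone, __NR_mmap2, ENOMEM);
	return 0;
}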
mmap2 fails with ENOMEM, which is not really surprising. The map length is
0x801000, which means that the 509 threads started so far have already
consumed

  509 * 0x801000 == 4073.99 MB == 3.9785 GB

So the next mmap2 in a 32-bit process fails for pretty obvious reasons,
and rightfully so.
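The 0x801000 map length is presumably glibc's default 8 MiB thread stack
plus one 4 KiB guard page (the stack/guard split is my assumption; the
length itself is straight from the trace). The arithmetic checks out:

#include <stdio.h>

int main(void)
{
	/* 0x801000 = 8 MiB stack + 4 KiB guard page (assumed default) */
	unsigned long long len = 0x801000ULL;
	unsigned long long total = 509ULL * len;

	printf("%llu bytes = %.2f MB = %.4f GB\n", total,
	       total / (1024.0 * 1024.0),
	       total / (1024.0 * 1024.0 * 1024.0));
	return 0;
}

which prints 4271886336 bytes = 4073.99 MB = 3.9785 GB, i.e. more or less
the entire usable address space of a 32-bit process.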
pthread_create() returns EAGAIN while the underlying problem is ENOMEM,
which causes this bonkers output:

  error: pthread_create for thread 253 failed: Resource temporarily unavailable

There is nothing temporary about it. The process has exhausted its address
space.
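A minimal sketch of that failure mode, independent of the affinity test
(build with -m32 -pthread so the address space actually runs out; the
exact thread count at which it fails depends on stack size and layout):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
	(void)arg;
	pause();		/* park the thread, keep its stack mapped */
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int i, err;

	for (i = 0; ; i++) {
		err = pthread_create(&tid, NULL, idle_thread, NULL);
		if (err) {
			/* The stack mmap failed with ENOMEM, but the
			   caller sees EAGAIN: "temporarily unavailable" */
			printf("thread %d: %s (err = %d)\n",
			       i, strerror(err), err);
			return 0;
		}
	}
}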
That test's output is strange anyway:
info: Detected CPU set size (in bits): 225
info: Maximum test CPU: 255
Interesting how it fits 256 CPUs into a cpuset with a size of 225 bits.
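Affinity mask sizes come in byte multiples, so a detected size of 225
bits cannot fall out of the usual probing scheme in the first place.
Presumably the test does something along these lines (a sketch of the
standard probing loop, not the test's actual code):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* Grow the dynamically sized cpu_set_t until the kernel stops
	   rejecting it with EINVAL; the resulting size is always a
	   multiple of 8 bits. */
	for (int cpus = 64; cpus <= 65536; cpus *= 2) {
		cpu_set_t *set = CPU_ALLOC(cpus);
		size_t size = CPU_ALLOC_SIZE(cpus);

		if (!set)
			return 1;
		if (sched_getaffinity(0, size, set) == 0) {
			printf("mask fits in %zu bytes (%zu bits)\n",
			       size, size * 8);
			CPU_FREE(set);
			return 0;
		}
		CPU_FREE(set);
	}
	return 1;
}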
/me goes back to stare into iopl().
Thanks,
tglx