Date:   Wed, 23 Nov 2016 14:00:46 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Nicolai Hähnle <nhaehnle@...il.com>
Cc:     linux-kernel@...r.kernel.org,
        Nicolai Hähnle <Nicolai.Haehnle@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Chris Wilson <chris@...is-wilson.co.uk>,
        Maarten Lankhorst <maarten.lankhorst@...onical.com>,
        dri-devel@...ts.freedesktop.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] locking/ww_mutex: Fix a deadlock affecting ww_mutexes

On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> From: Nicolai Hähnle <Nicolai.Haehnle@....com>
> 
> Fix a race condition involving 4 threads and 2 ww_mutexes as indicated in
> the following example. Acquire context stamps are ordered like the thread
> numbers, i.e. thread #1 should back off when it encounters a mutex locked
> by thread #0 etc.
> 
> Thread #0    Thread #1    Thread #2    Thread #3
> ---------    ---------    ---------    ---------
>                                        lock(ww)
>                                        success
>              lock(ww')
>              success
>                           lock(ww)
>              lock(ww)        .
>                 .            .         unlock(ww) part 1
> lock(ww)        .            .            .
> success         .            .            .
>                 .            .         unlock(ww) part 2
>                 .         back off
> lock(ww')       .
>    .            .
> (stuck)      (stuck)
> 
> Here, unlock(ww) part 1 is the part that sets lock->base.count to 1
> (without being protected by lock->base.wait_lock), meaning that thread #0
> can acquire ww in the fast path or, much more likely, the medium path
> in mutex_optimistic_spin. Since lock->base.count == 0, thread #0 then
> won't wake up any of the waiters in ww_mutex_set_context_fastpath.
> 
> Then, unlock(ww) part 2 wakes up _only_the_first_ waiter of ww. This is
> thread #2, since waiters are added at the tail. Thread #2 wakes up and
> backs off since it sees ww owned by a context with a lower stamp.
> 
> Meanwhile, thread #1 is never woken up, and so it won't back off its lock
> on ww'. So thread #0 gets stuck waiting for ww' to be released.
> 
> This patch fixes the deadlock by waking up all waiters in the slow path
> of ww_mutex_unlock.
> 
> We have an internal test case for amdgpu which continuously submits
> command streams from tens of threads, where all command streams reference
> hundreds of GPU buffer objects with a lot of overlap in the buffer lists
> between command streams. This test reliably caused a deadlock, and while I
> haven't completely confirmed that it is exactly the scenario outlined
> above, this patch does fix the test case.
> 
> v2:
> - use wake_q_add
> - add additional explanations
> 
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Chris Wilson <chris@...is-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@...onical.com>
> Cc: dri-devel@...ts.freedesktop.org
> Cc: stable@...r.kernel.org
> Reviewed-by: Christian König <christian.koenig@....com> (v1)
> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@....com>

Completely and utterly fails to apply; I think this patch is based on
code prior to the mutex rewrite.

Please rebase on tip/locking/core.

Also, is this a regression, or has this been a 'feature' of the ww_mutex
code from early on?
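
For readers following the quoted scenario, here is a rough toy model of the wake-up step in "unlock(ww) part 2". It is only a sketch and not kernel code: the names Waiter, unlock_slowpath and thread0_gets_ww_prime are made up for illustration, and it assumes a woken waiter backs off (releasing whatever else it holds) as soon as it sees an owner with a lower stamp. It compares waking only the first waiter with waking all of them, as the patch description proposes.

```python
from dataclasses import dataclass, field


@dataclass
class Waiter:
    """Hypothetical stand-in for a thread sleeping on ww's wait list."""
    stamp: int                                # acquire-context stamp; lower wins
    held: set = field(default_factory=set)    # other locks this thread holds
    awake: bool = False


def unlock_slowpath(waiters, owner_stamp, wake_all):
    """Model of 'unlock(ww) part 2'.

    Wakes either the first waiter or all waiters. A woken waiter that sees
    an owner with a lower stamp backs off and drops the locks it holds.
    Returns the set of lock names released this way.
    """
    released = set()
    to_wake = waiters if wake_all else waiters[:1]
    for w in to_wake:
        w.awake = True
        if owner_stamp < w.stamp:             # owner is older: back off
            released |= w.held
            w.held = set()
    return released


def thread0_gets_ww_prime(wake_all):
    """Replay the quoted scenario; True if thread #0 can take ww'."""
    t2 = Waiter(stamp=2)                      # queued first, holds nothing
    t1 = Waiter(stamp=1, held={"ww'"})        # queued second, holds ww'
    waiters = [t2, t1]                        # waiters are added at the tail
    # Thread #0 (stamp 0) already owns ww via the fast/medium path.
    released = unlock_slowpath(waiters, owner_stamp=0, wake_all=wake_all)
    return "ww'" in released


if __name__ == "__main__":
    print("wake first only:", thread0_gets_ww_prime(wake_all=False))
    print("wake all       :", thread0_gets_ww_prime(wake_all=True))
```

In this model, waking only the head of the list reaches thread #2, which holds nothing, so ww' is never released; waking every waiter also reaches thread #1, which then backs off and drops ww'. The real slow path is more involved (woken tasks retry the lock under wait_lock), so this only illustrates the ordering argument.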