lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 8 May 2019 07:22:13 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Rik van Riel <riel@...riel.com>
Cc:     linux-xfs@...r.kernel.org, kernel-team@...com,
        linux-kernel@...r.kernel.org,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        David Chinner <dchinner@...hat.com>
Subject: Re: [PATCH] fs,xfs: fix missed wakeup on l_flush_wait

On Tue, May 07, 2019 at 01:05:28PM -0400, Rik van Riel wrote:
> The code in xlog_wait uses the spinlock to make adding the task to
> the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
> with respect to the waker.
> 
> Doing the wakeup after releasing the spinlock opens up the following
> race condition:
> 
> - add task to wait queue
> 
> -                                      wake up task
> 
> - set task state to UNINTERRUPTIBLE
> 
> Simply moving the spin_unlock to after the wake_up_all results
> in the waker not being able to see a task on the waitqueue before
> it has set its state to UNINTERRUPTIBLE.

Yup, seems like an issue. Good find, Rik.

So, what problem is this actually fixing? Was it noticed by
inspection, or is it actually manifesting on production machines?
If it is manifesting IRL, what are the symptoms (e.g. hang running
out of log space?) and do you have a test case or any way to
exercise it easily?

And, FWIW, did you check all the other xlog_wait() users for the
same problem?

> The lock ordering of taking the waitqueue lock inside the l_icloglock
> is already used inside xlog_wait; it is unclear why the waker was doing
> things differently.

Historical, most likely, and the wakeup code has changed in years
gone by and a race condition that rarely manifests is unlikely to be
noticed.

....

Yeah, the conversion from counting semaphore outside the iclog lock
to use wait queues in 2008 introduced this bug. The wait queue
addition was moved inside the lock, the wakeup left outside. So:

Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")

Apart from the commit message, the change looks fine.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ