linux-kernel - Re: [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150811214822.GA20596@dastard>
Date:	Wed, 12 Aug 2015 07:48:22 +1000
From:	Dave Chinner <david@...morbit.com>
To:	"Wilcox, Matthew R" <matthew.r.wilcox@...el.com>
Cc:	Boaz Harrosh <boaz@...xistor.com>, Jan Kara <jack@...e.cz>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Davidlohr Bueso <dbueso@...e.de>
Subject: Re: [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap_lock

On Tue, Aug 11, 2015 at 04:51:22PM +0000, Wilcox, Matthew R wrote:
> The race that you're not seeing is page fault vs page fault.  Two
> threads each attempt to store a byte to different locations on the
> same page.  With a read-mutex to exclude truncates, each thread
> calls ->get_block.  One of the threads gets back a buffer marked
> as BH_New and calls memset() to clear the page.  The other thread
> gets back a buffer which isn't marked as BH_New and simply inserts
> the mapping, returning to userspace, which stores the byte ...
> just in time for the other thread's memset() to write a zero over
> the top of it.

So, this is not a truncate race that the XFS MMAPLOCK solves.

However, that doesn't mean that the DAX code needs to add locking to
solve it. The race here is caused by block initialisation being
unserialised after a ->get_block call allocates the block (which the
filesystem serialises via internal locking). Hence two simultaneous
->get_block calls to the same block is guaranteed to have the DAX
block initialisation race with the second ->get_block call that says
the block is already allocated.

IOWs, the way to handle this is to have the ->get_block call handle
the block zeroing for new blocks instead of doing it after the fact
in the generic DAX code where there is no fine-grained serialisation
object available. By calling dax_clear_blocks() in the ->get_block
callback, the filesystem can ensure that the second racing call will
only make progress once the block has been fully initialised by the
first call.

IMO the fix is - again - to move the functionality into the
filesystem where we already have the necessary exclusion in place to
avoid this race condition entirely.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/