Date:   Mon, 10 Sep 2018 17:18:49 -0500
From:   Eric Sandeen <sandeen@...deen.net>
To:     Jan Kara <jack@...e.cz>, Ross Zwisler <zwisler@...nel.org>
Cc:     linux-nvdimm@...ts.01.org, darrick.wong@...cle.com,
        lczerner@...hat.com, linux-ext4 <linux-ext4@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Theodore Ts'o <tytso@....edu>
Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On 8/7/18 3:45 AM, Jan Kara wrote:
> On Fri 27-07-18 10:28:51, Ross Zwisler wrote:
>> + fsdevel and the xfs list.
>>
>> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
>> <ross.zwisler@...ux.intel.com> wrote:
>>> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
>>>> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
>>>>> Changes since v3:
>>>>>  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
>>>>>    that the {ext4,xfs}_break_layouts() calls have the same meaning.
>>>>>    (Dave, Darrick and Jan)
>>>>
>>>> What about the occasional WARN_ON_ONCE hits you mention below? Were you
>>>> able to hunt them down?
>>>
>>> The root cause of this issue is that while the ei->i_mmap_sem provides
>>> synchronization between ext4_break_layouts() and page faults, it doesn't
>>> synchronize us with the direct I/O path.  This exact same issue exists
>>> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
>>>
>>> This allows the direct I/O path to do I/O and raise & lower page->_refcount
>>> while we're executing a truncate/hole punch.  This leads to us trying to free
>>> a page with an elevated refcount.
>>>
>>> Here's one instance of the race:
>>>
>>> CPU 0                                   CPU 1
>>> -----                                   -----
>>> ext4_punch_hole()
>>>   ext4_break_layouts() # all pages have refcount=1
>>>
>>>                                         ext4_direct_IO()
>>>                                           ... lots of layers ...
>>>                                           follow_page_pte()
>>>                                             get_page() # elevates refcount
>>>
>>>   truncate_pagecache_range()
>>>    ... a few layers ...
>>>    dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
>>>
> 
> So this is a very different race from the one below, and it should be
> impossible to hit. This race is exactly the reason why
> dax_layout_busy_page() has an unmap_mapping_range() call to force GUP to
> fault, which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the
> race.
> 
>>> A similar race occurs when the refcount is being dropped while we're running
>>> ext4_break_layouts(), and this is the one that my test was actually hitting:
>>>
>>> CPU 0                                   CPU 1
>>> -----                                   -----
>>>                                         ext4_direct_IO()
>>>                                           ... lots of layers ...
>>>                                           follow_page_pte()
>>>                                             get_page()
>>>                                             # elevates refcount of page X
>>> ext4_punch_hole()
>>>   ext4_break_layouts() # two pages, X & Y, have refcount == 2
>>>     __wait_var_event() # called for page X
>>>
>>>                                           __put_devmap_managed_page()
>>>                                           # drops refcount of X to 1
>>>
>>>    # ___wait_var_event() checks X's refcount in "if (condition)" and breaks.
>>>    # We never actually called ext4_wait_dax_page(), so 'retry' in
>>>    # ext4_break_layouts() is still false.  We exit the do/while loop in
>>>    # ext4_break_layouts(), never attempting to wait on page Y, which still
>>>    # has an elevated refcount of 2.
>>>
>>>   truncate_pagecache_range()
>>>    ... a few layers ...
>>>    dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
>>>
>>> This second race can be fixed with the patch at the end of this mail,
>>> which I think should go in, unless there is a benefit to the current retry
>>> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
>>> With this patch applied I've been able to run my unit test through
>>> thousands of iterations, where it used to fail consistently within 10 or
>>> so.
>>>
>>> Even so, I wonder if the real solution is to add synchronization between
>>> the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
>>> this should be handled?
>>>
>>> --- >8 ---
>>>
>>> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
>>> From: Ross Zwisler <zwisler@...nel.org>
>>> Date: Wed, 25 Jul 2018 16:16:05 -0600
>>> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
>>>
>>> If the refcount of a page is lowered between the time that it is returned
>>> by dax_layout_busy_page() and when the refcount is again checked in
>>> ext4_break_layouts() => ___wait_var_event(), the waiting function
>>> ext4_wait_dax_page() will never be called.  This means that
>>> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
>>> looping and never check the refcount of other pages in this inode.
>>>
>>> Instead, always continue looping as long as dax_layout_busy_page() gives us
>>> a page which it found with an elevated refcount.
>>>
>>> Note that this works around the race exposed by my unit test, but I think
>>> that there is another race that needs to be addressed, probably with
>>> additional synchronization added between direct I/O and
>>> {ext4,xfs}_break_layouts().
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
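[Editor's note: the diff itself is not quoted in this message. From the
commit description above, a sketch of the resulting ext4_break_layouts()
loop would look roughly like the following -- hedged kernel-style C, not
the verbatim patch, with locking assertions and the exact
ext4_wait_dax_page() signature possibly differing from the real tree:]

```c
static int ext4_break_layouts(struct inode *inode)
{
	struct page *page;
	int error;

	do {
		page = dax_layout_busy_page(inode->i_mapping);
		if (!page)
			return 0;

		/* No 'retry' flag: as long as the scan above returns a
		 * busy page, wait for its refcount to drop and then
		 * rescan the whole inode, so page Y is never skipped. */
		error = ___wait_var_event(&page->_refcount,
				atomic_read(&page->_refcount) == 1,
				TASK_INTERRUPTIBLE, 0, 0,
				ext4_wait_dax_page(EXT4_I(inode)));
	} while (error == 0);

	return error;
}
```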
> 
> OK, this is a good catch and the patch looks good. You can add:
> 
> Reviewed-by: Jan Kara <jack@...e.cz>
> 
> Also please post this fix officially to Ted to include it in his tree (I
> can see that he has all your other patches queued for the merge window).

Did these ever get on Ted's radar?  I don't see them upstream yet.

Thanks,
-Eric
