linux-kernel - Re: [PATCH] dax: fix radix tree insertion race

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Tue, 11 Apr 2017 10:29:42 +0200
From:   Jan Kara <jack@...e.cz>
To:     Ross Zwisler <ross.zwisler@...ux.intel.com>
Cc:     Jan Kara <jack@...e.cz>, Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christoph Hellwig <hch@....de>,
        Dan Williams <dan.j.williams@...el.com>,
        Matthew Wilcox <mawilcox@...rosoft.com>,
        linux-fsdevel@...r.kernel.org, linux-nvdimm@...ts.01.org
Subject: Re: [PATCH] dax: fix radix tree insertion race

On Mon 10-04-17 14:34:29, Ross Zwisler wrote:
> On Mon, Apr 10, 2017 at 03:41:11PM +0200, Jan Kara wrote:
> > On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> > > While running generic/340 in my test setup I hit the following race. It can
> > > happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
> > > 
> > > Thread 1				Thread 2
> > > --------				--------
> > > dax_iomap_pmd_fault()
> > >   grab_mapping_entry()
> > >     spin_lock_irq()
> > >     get_unlocked_mapping_entry()
> > >     'entry' is NULL, can't call lock_slot()
> > >     spin_unlock_irq()
> > >     radix_tree_preload()
> > > 					dax_iomap_pmd_fault()
> > > 					  grab_mapping_entry()
> > > 					    spin_lock_irq()
> > > 					    get_unlocked_mapping_entry()
> > > 					    ...
> > > 					    lock_slot()
> > > 					    spin_unlock_irq()
> > > 					  dax_pmd_insert_mapping()
> > > 					    <inserts a PMD mapping>
> > >     spin_lock_irq()
> > >     __radix_tree_insert() fails with -EEXIST
> > >     <fall back to 4k fault, and die horribly
> > >      when inserting a 4k entry where a PMD exists>
> > > 
> > > The issue is that we have to drop mapping->tree_lock while calling
> > > radix_tree_preload(), but since we didn't have a radix tree entry to lock
> > > (unlike in the pmd_downgrade case) we have no protection against Thread 2
> > > coming along and inserting a PMD at the same index.  For 4k entries we
> > > handled this with a special-case response to -EEXIST coming from the
> > > __radix_tree_insert(), but this doesn't save us for PMDs because the
> > > -EEXIST case can also mean that we collided with a 4k entry in the radix
> > > tree at a different index, but one that is covered by our PMD range.
> > > 
> > > So, correctly handle both the 4k and 2M collision cases by explicitly
> > > re-checking the radix tree for an entry at our index once we reacquire
> > > mapping->tree_lock.
> > > 
> > > This patch has made it through a clean xfstests run with the current
> > > v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
> > > loop.  It used to fail within the first 10 iterations.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
> > > Cc: <stable@...r.kernel.org>    [4.10+]
> > 
> > The patch looks good to me (and I can see Andrew already sent it to Linus),
> > I'm just worndering where did things actually go wrong? I'd expect we would
> > return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and then do PTE fault
> > for the address which should just work out fine...
> 
> Yep, that's what I thought as well, and I think it does work for processes
> which have separate page tables.  The second process will do a 4k fault (just
> as it would have if it had a VMA that was smaller than 2MiB, for example), map
> the 4k page into its page table and just dirty the 2MiB DAX entry in the radix
> tree.  I've tested this case manually in the past.
> 
> I think the error case that I was seeing was for threads that share page
> tables.  In that case the 2nd thread falls back to PTEs, but there is already
> a PMD in the page table from the first fault.  When we try and insert a PTE
> over the PMD we get the following BUG:

Ouch, right. We have to be really careful so that radix tree entries are
consistent with what we have in page tables so that our locking works...
Thanks for explanation.

								Honza

> 
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: do_raw_spin_trylock+0x5/0x40
> PGD 8d6ee0067
> PUD 8db6e8067
> PMD 0
> 
> Oops: 0000 [#1] PREEMPT SMP
> Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm [last unloaded: scsi_debug]
> CPU: 2 PID: 25323 Comm: holetest Not tainted 4.11.0-rc4 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> task: ffff880095492a00 task.stack: ffffc90014048000
> RIP: 0010:do_raw_spin_trylock+0x5/0x40
> RSP: 0000:ffffc9001404bb60 EFLAGS: 00010296
> RAX: ffff880095492a00 RBX: 0000000000000018 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffffc9001404bb80 R08: 0000000000000001 R09: 0000000000000000
> R10: ffff880095492a00 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff8808d5fe4220 R14: ffff88004c3e3c80 R15: 8000000000000025
> FS:  00007f7ed7dff700(0000) GS:ffff8808de400000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000008d86f6000 CR4: 00000000001406e0
> Call Trace:
>  ? _raw_spin_lock+0x49/0x80
>  ? __get_locked_pte+0x16b/0x1d0
>  __get_locked_pte+0x16b/0x1d0
>  insert_pfn.isra.68+0x3a/0x100
>  vm_insert_mixed+0x64/0x90
>  dax_iomap_fault+0xa41/0x1680
>  ext4_dax_huge_fault+0xa9/0xd0
>  ext4_dax_fault+0x10/0x20
>  __do_fault+0x20/0x130
>  __handle_mm_fault+0x9b3/0x1190
>  handle_mm_fault+0x169/0x370
>  ? handle_mm_fault+0x47/0x370
>  __do_page_fault+0x28f/0x590
>  trace_do_page_fault+0x58/0x2c0
>  do_async_page_fault+0x2c/0x90
>  async_page_fault+0x28/0x30
> RIP: 0033:0x4014b2
> RSP: 002b:00007f7ed7dfef20 EFLAGS: 00010216
> RAX: 00007f7ec6c00400 RBX: 0000000000010000 RCX: 0000000001c00000
> RDX: 0000000000001c01 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 00007f7ed7dff700 R08: 00007f7ed7dff700 R09: 00007f7ed7dff700
> R10: 00007f7ed7dff9d0 R11: 0000000000000202 R12: 00007f7ec6c00000
> R13: 00007ffe3ffb5b60 R14: 0000000000000400 R15: 00007f7ed7dff700
> Code: 30 84 ee 81 48 89 df e8 4a fe ff ff eb 89 89 c6 48 89 df e8 7e e7 ff ff eb 8c 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 55 48 89 e5 85 c0 75 2b ba 01 00 00 00 f0 0f b1 17 85 c0
> RIP: do_raw_spin_trylock+0x5/0x40 RSP: ffffc9001404bb60
> CR2: 0000000000000000
> ---[ end trace 75d38250d89b67cd ]---
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR