linux-kernel - Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140108180202.GL27046@suse.de>
Date:	Wed, 8 Jan 2014 18:02:02 +0000
From:	Mel Gorman <mgorman@...e.de>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>,
	Darren Hart <dvhart@...ux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>
Subject: Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs
 prep_new_page() race

On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote:
> On 01/08, Mel Gorman wrote:
> >
> > On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> > >
> > > 	get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > > 	compound_lock(). In theory this page_head can be already freed and
> > > 	reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > > 	get_page_unless_zero() can succeed right after set_page_refcounted(),
> > > 	and compound_lock() can race with the non-atomic __SetPageHead() in
> > > 	prep_compound_page().
> > >
> > This patch is putting a write barrier in the page allocator fast path and
> > that is going to be a leading cause of Sad Face. We already have seen
> > large regressions before when write barriers were introduced to the page
> > allocator paths for cpusets.  Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
> > does not really address the issue.
> 
> As you already mentioned in another email, smp_wmb() is mostly nop. On
> x86_64 at least.

Which sometimes means that it'll just take longer for someone to find it
and bitch about it.

> Although perhaps it would be nice to have
> 
> 	static inline void atomic_store_release(atomic_t *v, int i)
> 	{
> 		smp_store_release(&v->counter, i);
> 	}
> 
> > > Yes, but thp can access this page_head via stale pointer, tail->first_page,
> > > if it races with split_huge_page_refcount().
> >
> > To justify the introduction of a performance regression we need to be 100%
> > sure this race actually exists
> 
> See below. But let me remind that I never looked at this code before,
> I can be easily wrong.
> 
> > and not just theoretical.
> 
> It is theoretical anyway, I guess.
> 
> > For futex, the THP page (and the tail) must have been discovered via
> > the page tables in which case the page tables are temporarily preventing
> > the page being freed to the allocator.
> 
> Yes. But, for example, get_futex_key() does
> 
> 	if (unlikely(PageTail(page))) {
> 		put_page(page);
> 
> why this put_page() can't race with _split? If nothing else, another thread
> can unmap the part of this vma.
> 

The race is not prevented but that does not mean it matters. Basic
scenario where a split starts after the PageTail check but before the
put_page in get_futex_key

CPU A
get_futex_key
  -> fast gup, page table removing prevents parallel unmap and free
    -> gup_huge_pmd (arch/x86/mm/gup.c at least)
      -> get_huge_page_tail (increment page tail _map_count)
      -> get_huge_page_multiple (increment ref on head page)
  -> Check PageTail
					CPU B
					split_huge_page_to_list
					  -> split_huge_page_refcount
					     spin_lock_irq(lru_lock)
					     compound_lock
  -> put_page(tail_page)
    ->put_compound_page
       looks up head page
       takes reference unless zero
       compound_lock (block)
					     complete split
					     compound_unlock
       check PageTail

This put_page blocks on the compound lock, finds the page is no longer a
PageTail as the split barriers correctly and backs out gracefully. So sure,
splits can race but the case is cared for.

The parallel unmap is prevented by get_huge_page_multiple in the gup path
and held in place until put_page_compound frees it later.

I still don't see the case where a page gets freed to the page allocator
that causes weird problems here. Unfortunately, I also recognise I have
tunnel vision because subconsciously I don't *want* to see a problem here
that justifies adding a write barrier. Andrea may stomp all over this
reasoning in which case we'll get a good comment for the smp_wmb :/

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/