linux-kernel - Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131216201952.GE21218@redhat.com>
Date:	Mon, 16 Dec 2013 21:19:52 +0100
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>,
	Darren Hart <dvhart@...ux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Mel Gorman <mgorman@...e.de>
Subject: Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)

On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> On 12/13, Andrea Arcangeli wrote:
> >
> > On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote:
> > >
> > > I'll try to make v2 based on -mm and your suggestions.
> >
> > Ok great!
> 
> Yes, it would be great, but I need your help again ;)
> 
> 
> Let me quote the pseudo-code you sent me:
> 
> 	put_compound_tail(page) {
> 		page_head = compound_trans_head(page);
> 		if (!__compound_tail_refcounted(page_head)) {
> 			...
> 			return ...;
> 		}
> 
> 		flags = compound_lock_irqsave(page_head);
> 		...
> 
> Sure, put_compound_tail() should be the simplified version of
> put_compound_page() which doesn't dec page_head->_count, this is clear.
> 
> But afaics, compound_lock_irqsave() above looks unsafe without
> get_page_unless_zero(page_head) ? If we race with _split, page_head
> can be freed and compound_lock() can race with, say, free_pages_check()
> which plays with page->flags ?
> 
> So it seems that put_compound_tail() should also do get/put(head) like
> put_compound_page() does, and this probably means we should factor out
> the common code somehow.
> 

Yes it was supposed to do the get_page_unless_zero before the
compound_lock.

> Or I missed something?
> 
> 
> 
> OTOH, I can't really understand
> 
> 	if (likely(page != page_head && get_page_unless_zero(page_head)))
> 
> in __get_page_tail() and put_compound_page().
> 
> First of all, is it really possible that page == compound_trans_head(page)?

It is possible if it become a regular page because split_huge_page
run in between.

compound_trans_head guarantees us it got us to a safe head page. And
the idea was to do get_page_unless_zero only on head pages and never
on tails to be safer/simpler. Same goes for the compound_lock that
follows still on the page_head where "page != page_head".

If you like, you can try to optimize away the check and reuse the
PageTail branch inside the compound_lock but I'm not sure if it's
worth it.

> We already verified that PG_tail was set. Of course this bit can be cleared
> and ->first_page can be a dangling pointer but it can never be changed to
> point to this page? (in fact, afaics it can be changed at all as long as we
> have a reference, but this doesn't matter).

If PG_tail gets cleared compound_trans_head returns "page".

> And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> 
> For example, suppose that page_head was already freed and then re-allocated
> as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> race with the non-atomic prep_compound_page()->__SetPageHead() ?

Yes. We need to change to:

	if (order && (gfp_flags & __GFP_COMP))
		prep_compound_page(page, order);
	smp_wmb();
	/* as the compound_lock can be taken after it's refcounted */
	set_page_refcounted(page);

__SetPageHead uses bts asm insn so literally only a "lock" prefix is
missing in a assembly instruction. So the race window is incredibly
small, but it must be fixed indeed. This also puts set_page_refcounted
as the last action of buffered_rmqueue so there shouldn't be any other
issues like this left in the page allocation code.

Can you reorder set_page_refcount in your v2?

I wonder if arch_alloc_page needs refcount 1, it sets the page as
stable on s390. the other way around is to move prep_compound_page
before set_page_refcounted (though I think if we can, keeping the
refcounted at the very last with a comment is preferable).

> Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
> confusing imho. I think this is correct, we already checked PageAnon() so it
> can't be a thp page. But probably this needs a comment and __basepage_index()
> should do BUG_ON(!PageHuge()).

This looks a bug if you apply the patches to add THP in pagecache, and
BUG_ON(!PageHuge) would also break THP in pagecache later (though
safer than silently broken like now).

It'd safer to read the basepage_index in a local variable just before
doing the put_compund_tail but I agree it's not going to make a
difference right now.

But the whole idea of working with the head of the THP despite the
address points to a tail, looks flakey if there is a split (THP in
pagecache). I mean chances are reading basepage_index(page) before
releasing the tail pin is not enough.

The Anon path looks sane for all cases, as it doesn't pretend to
return any information on the actual mapping and it limits itself to
do the PageAnon check on the actual head that has PageAnon set, which
is still valid thing to do even after a split, and in fact required if
a split didn't take place yet as the tail "page" isn't anon but just a
tail.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/