Message-ID: <20150821121028.GB12016@node.dhcp.inet.fi>
Date:	Fri, 21 Aug 2015 15:10:28 +0300
From:	"Kirill A. Shutemov" <kirill@...temov.name>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Hugh Dickins <hughd@...gle.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Dave Hansen <dave.hansen@...el.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.cz>,
	David Rientjes <rientjes@...gle.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Christoph Lameter <cl@...ux.com>
Subject: Re: [PATCHv3 4/5] mm: make compound_head() robust

On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com> wrote:
> 
> > Hugh has pointed out that a compound_head() call can be unsafe in some
> > contexts. Here's one example:
> > 
> > 	CPU0					CPU1
> > 
> > isolate_migratepages_block()
> >   page_count()
> >     compound_head()
> >       !!PageTail() == true
> > 					put_page()
> > 					  tail->first_page = NULL
> >       head = tail->first_page
> > 					alloc_pages(__GFP_COMP)
> > 					   prep_compound_page()
> > 					     tail->first_page = head
> > 					     __SetPageTail(p);
> >       !!PageTail() == true
> >     <head == NULL dereferencing>
> > 
> > The race is purely theoretical. I don't think it's possible to trigger it
> > in practice. But who knows.
> > 
> > We can fix the race by changing how we encode PageTail() and compound_head()
> > within struct page so that both can be updated in one shot.
> > 
> > The patch introduces page->compound_head into third double word block in
> > front of compound_dtor and compound_order. That means it shares storage
> > space with:
> > 
> >  - page->lru.next;
> >  - page->next;
> >  - page->rcu_head.next;
> >  - page->pmd_huge_pte;
> > 
> > That's too long a list to be absolutely sure, but it looks like nobody uses
> > bit 0 of the word. It can be used to encode PageTail(). And if the bit is
> > set, the rest of the word is a pointer to the head page.
> 
> So nothing else which participates in the union in the "Third double
> word block" is allowed to use bit zero of the first word.

Correct.

> Is this really true?  For example if it's a slab page, will that page
> ever be inspected by code which is looking for the PageTail bit?

+Christoph.

What we know for sure is that the space is not used in tail pages, otherwise
it would collide with the current compound_dtor.

For head/small pages it gets trickier. I convinced myself that it should
be safe this way:

All fields it shares space with are pointers (with the possible exception of
pmd_huge_pte, see below) to objects with sizeof() > 1. I think it's
reasonable to expect that bit 0 in such pointers would be clear due to
alignment. We do the same for page->mapping.
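
To make the encoding and the alignment argument concrete, here is a minimal
userspace sketch. It is not the kernel code: struct fake_page and the helper
names are made up for illustration, and a real implementation would need an
appropriate once-only read of the word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: only the word under discussion. */
struct fake_page {
	uintptr_t compound_head;	/* bit 0 set => tail page, rest => head pointer */
};

/* Publish "is tail" and "where is head" with a single store. */
static void set_compound_head(struct fake_page *tail, struct fake_page *head)
{
	/* The object is larger than one byte, so alignment keeps bit 0 clear. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head | 1;
}

static int page_is_tail(const struct fake_page *page)
{
	return (int)(page->compound_head & 1);
}

static struct fake_page *page_head(struct fake_page *page)
{
	/* One read yields both the tag and the pointer, so they cannot disagree. */
	uintptr_t word = page->compound_head;

	if (word & 1)
		return (struct fake_page *)(word - 1);
	return page;
}
```

Because the tag and the head pointer live in the same word, a reader never
sees "tail bit set" paired with a stale or NULL head pointer.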

On pmd_huge_pte: it's a pgtable_t, which on most architectures is a typedef
for struct page *. That should not create any conflicts. On some architectures
it's pte_t *, which is fine too. On arc it's the virtual address of the page
in the form of an unsigned long. It should work.

The worry I have about pmd_huge_pte is that some new architecture may
choose to implement pgtable_t as a pfn, and that will collide on bit 0. :-/

We can address this worry by shifting pmd_huge_pte to the second word in
the double word block. But I'm not sure if we should.
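
A small userspace sketch of that collision, under the assumption that a
hypothetical architecture stores a raw pfn in the first word. The function
names and the page shift are illustrative only, not taken from any real
architecture.

```c
#include <assert.h>
#include <stdint.h>

/* What a PageTail()-style check would conclude from the raw first word. */
static int word_looks_like_tail(uintptr_t first_word)
{
	return (int)(first_word & 1);
}

/* pgtable_t stored as a page-aligned address: low bits are always zero. */
static uintptr_t pgtable_as_address(uintptr_t pfn, unsigned int page_shift)
{
	return pfn << page_shift;
}

/* Hypothetical pgtable_t stored as a bare pfn: bit 0 is set for odd pfns. */
static uintptr_t pgtable_as_pfn(uintptr_t pfn)
{
	return pfn;
}
```

With an address-like representation the check never fires, while a bare pfn
would produce a false-positive tail page for every odd pfn.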

And of course there's a chance that these fields are used in ways that don't
match their types. I didn't find such cases, but I can't guarantee that they
don't exist.

I tested the patched kernel with all three slab allocators and was not able
to crash it under trinity. More testing is required.

> Anyway, this is quite subtle and there's a risk that people will
> accidentally break it later on.  I don't think the patch puts
> sufficient documentation in place to prevent this.

I would appreciate suggestions on the place and form of the documentation.

> And even documentation might not be enough to prevent accidents.

The only thing I can propose is a VM_BUG_ON() in PageTail() and
compound_head() which would ensure that page->compound_head points to a
place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.

Do you consider this helpful?
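
As a rough illustration of the check described above, here is a userspace
sketch. It is not the proposed kernel code: struct fake_page,
FAKE_MAX_ORDER_NR_PAGES and its value are assumptions, and the sketch assumes
page structures sit in one contiguous array.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_MAX_ORDER_NR_PAGES 1024	/* assumed value, for illustration only */

/* Hypothetical stand-in for struct page: only the word under discussion. */
struct fake_page {
	uintptr_t compound_head;	/* bit 0 set => tail page */
};

/* Returns 1 if the word is consistent with the invariant, 0 otherwise. */
static int compound_head_sane(const struct fake_page *page)
{
	uintptr_t word = page->compound_head;
	uintptr_t self = (uintptr_t)page;
	uintptr_t head;

	if (!(word & 1))
		return 1;	/* not marked as tail: nothing to check */

	head = word - 1;	/* strip the tag to recover the claimed head */

	/* The head must precede this page by fewer than the maximum entries. */
	return head < self &&
	       (self - head) / sizeof(*page) < FAKE_MAX_ORDER_NR_PAGES;
}
```

A VM_BUG_ON()-style assertion would fire when this predicate returns 0, which
catches some, though not all, accidental uses of bit 0 by other union members.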

> >
> > ...
> >
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -120,7 +120,12 @@ struct page {
> >  		};
> >  	};
> >  
> > -	/* Third double word block */
> > +	/*
> > +	 * Third double word block
> > +	 *
> > +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> > +	 * for non-tail pages.
> > +	 */
> >  	union {
> >  		struct list_head lru;	/* Pageout list, eg. active_list
> >  					 * protected by zone->lru_lock !
> > @@ -143,6 +148,7 @@ struct page {
> >  						 */
> >  		/* First tail page of compound page */
> >  		struct {
> > +			unsigned long compound_head; /* If bit zero is set */
> 
> I think the comments around here should have more details and should
> be louder!

I'm always bad when it comes to documentation. Is this enough?

	/*
	 * Third double word block
	 *
	 * WARNING: bit 0 of the first word encodes PageTail(). That means
	 * the other users of the storage space MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
-- 
 Kirill A. Shutemov