linux-kernel - Re: [PATCH] thp: tail page refcounting fix #2

Message-ID: <20110826161048.GE23870@redhat.com>
Date:	Fri, 26 Aug 2011 18:10:48 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Michel Lespinasse <walken@...gle.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
	Minchan Kim <minchan.kim@...il.com>,
	Johannes Weiner <jweiner@...hat.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Shaohua Li <shaohua.li@...el.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: Re: [PATCH] thp: tail page refcounting fix #2

On Thu, Aug 25, 2011 at 11:24:36PM -0700, Michel Lespinasse wrote:
> I had never heard before of locked instructions being necessary when a
> straight assignment would do what we want, but after reading the errata
> you listed, I'm not so sure anymore. Given that, I think the version with
> just one single atomic add is good enough.

spin_unlock sometimes adds the lock prefix too, for that reason. So
I feel safer that way.

> (there are also 511 consecutive atomic_sub calls on the head page _count,
> which could just as well be coalesced into a single one at the end of the
> tail page loop).

That should be safe. But I'm not in the mood to micro-optimize
__split_huge_page_refcount after you found I forgot the
get_page_unless_zero needed to keep the page->flags stable (they're
overwritten by the time the head page is freed, which is why we need it).
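
As a rough illustration of the coalescing suggested in the quoted text,
here is a userspace toy model, not the kernel code. struct toy_page, its
fields, HPAGE_NR and the toy_* functions are hypothetical stand-ins, and
C11 atomics stand in for the kernel's atomic_t helpers. The sketch adds up
the per-tail amounts and subtracts them from the head count once, after
the tail loop.

```c
#include <stdatomic.h>

#define HPAGE_NR 512	/* assumed: base pages per huge page (1 head + 511 tails) */

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct toy_page {
	atomic_int count;	/* plays the role of _count */
	int tail_refs;		/* references pinned through this tail page */
};

/* One atomic subtraction on the head for every tail page. */
static void toy_split_per_tail(struct toy_page *pages)
{
	for (int i = 1; i < HPAGE_NR; i++) {
		atomic_fetch_sub(&pages[0].count, pages[i].tail_refs);
		atomic_fetch_add(&pages[i].count, pages[i].tail_refs);
	}
}

/* Same end result, but the head is touched once, after the tail loop. */
static void toy_split_coalesced(struct toy_page *pages)
{
	int total = 0;

	for (int i = 1; i < HPAGE_NR; i++) {
		atomic_fetch_add(&pages[i].count, pages[i].tail_refs);
		total += pages[i].tail_refs;
	}
	atomic_fetch_sub(&pages[0].count, total);
}

/* Build a toy huge page, split it, and return the remaining head count. */
static int toy_demo(int coalesced)
{
	static struct toy_page pages[HPAGE_NR];
	int head = 1;	/* the head's own reference */

	for (int i = 1; i < HPAGE_NR; i++) {
		pages[i].tail_refs = i % 3;
		atomic_store(&pages[i].count, 0);
		head += pages[i].tail_refs;
	}
	atomic_store(&pages[0].count, head);

	if (coalesced)
		toy_split_coalesced(pages);
	else
		toy_split_per_tail(pages);
	return atomic_load(&pages[0].count);
}
```

In this model both variants leave the head with only its own reference.
Whether the intermediate head count matters to concurrent readers in the
real function is a separate question that the sketch does not address.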

> I think your current __get_page_tail() is unsafe when it takes the
> compound lock on the head page, because there is no refcount held on it.
> If the THP page gets broken up before we get the compound lock, the head
> page could get freed. But it looks like you could fix that by doing
> get_page_unless_zero on the head, and you should end up with something
> very much like the put_page() function, which I find incredibly tricky
> but seems to be safe.

Correct, it's enough, and we need it for the same reason it is in
put_page. There is no new fundamental problem with this approach,
just an implementation mistake. At least it could not have introduced
a regression compared to the previous code.
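
To make the ordering concrete, here is a minimal userspace sketch of the
pattern, under loud assumptions: struct ref_page, its fields and the toy_*
functions are hypothetical, C11 atomics replace the kernel primitives, and
the compound lock is only marked by a comment rather than implemented. It
shows the head reference being taken with an "increment unless zero"
operation before anything else is touched, and dropped again if the page
turns out not to be a tail anymore.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct ref_page {
	atomic_int count;	/* plays the role of _count */
	atomic_int tail_refs;	/* tail references (kept in _mapcount in the patch) */
	atomic_bool is_tail;	/* plays the role of PageTail() */
	struct ref_page *head;	/* head page of the compound page */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool toy_get_unless_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}

/* Pin a tail page: first pin the head, then recheck the tail state. */
static bool toy_get_page_tail(struct ref_page *page)
{
	struct ref_page *head = page->head;

	if (!toy_get_unless_zero(&head->count))
		return false;	/* head already on its way to being freed */

	/* The real code would hold the compound lock around this check. */
	if (!atomic_load(&page->is_tail)) {
		atomic_fetch_sub(&head->count, 1);	/* lost the race with a split */
		return false;
	}
	atomic_fetch_add(&page->tail_refs, 1);
	return true;
}
```

Because the lock is elided, the tail check here is not actually serialized
against a split; the sketch only illustrates why the head must be pinned
before it is looked at.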

> I would suggest moving get_page_foll() and __get_page_tail_foll() to
> mm/internal.h so that people writing code outside of mm/ don't get confused
> about which get_page() version they must call.

Good idea. Those are for MM-internal use only; only follow_page is
allowed to call them.

> In __get_page_tail(), you could add a VM_BUG_ON(page_mapcount(page) <= 0)
> to reflect the fact that get_page() callers are expected to have already
> gotten a reference on the page through a gup call.

So I could put it just before calling __get_page_tail_foll().

I don't see a way anybody could call get_page on a tail page without
having called gup on it first. So I think it's correct. Any
pfn-scanning code like your working set estimation code has to use
get_page_unless_zero and that will never succeed anymore for tail
pages.

> (not your fault, you just moved that code) The comment above
> reset_page_mapcount() and page_mapcount() mentions that _count starts from -1.
> This does not seem to be accurate anymore - as you see page_count() just
> returns the _count value without adding 1. I guess you could just remove
> ', like _count,' from the comment and that'd make it accurate :)

The comment talks about _mapcount not _count. page_mapcount still adds
1 to _mapcount and _mapcount really still starts from -1.
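
A tiny userspace sketch of that offset convention, for readers who have
not seen it. struct map_page and the toy_* names are hypothetical, and C11
atomics stand in for the kernel's atomic_t; only the "stored value starts
at -1, accessor adds 1" behaviour is taken from the discussion above.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the _mapcount field of struct page. */
struct map_page {
	atomic_int mapcount;
};

/* The stored value starts from -1, meaning "not mapped anywhere". */
static void toy_reset_page_mapcount(struct map_page *page)
{
	atomic_store(&page->mapcount, -1);
}

/* The accessor adds 1, so callers see a count that starts from 0. */
static int toy_page_mapcount(struct map_page *page)
{
	return atomic_load(&page->mapcount) + 1;
}

/* Record one more user of the page. */
static void toy_inc_mapcount(struct map_page *page)
{
	atomic_fetch_add(&page->mapcount, 1);
}
```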

> The use of _mapcount to store tail page counts should probably be
> documented somewhere - probably in mm_types.h where _mapcount is
> defined, and/or before the page_mapcount accessor function. Or, there
> could be a tail_page_count() accessor function for that so that it's
> evident in all call sites that we're accessing a refcount and not a mapcount:
> 
> static inline int tail_page_count(struct page *page)
> {
> 	VM_BUG_ON(!PageTail(page));
> 	return page_mapcount(page);
> }
> 
> 
> (probably for another commit) I'm not too comfortable with having several
> arch-specific fast gup functions knowing details about how page counts
> are implemented. Linus's tree also adds such support in sparc arch
> (and it doesn't even seem to be correct as it increments the head count
> but not the tail count). This should probably be cleaned up sometime by
> moving such details into generic inline helper functions.
> 
> 
> Besides these comments, overall I like the change a lot & I'm especially
> happy to see get_page() work in all cases again :)

Glad to hear :).

Thanks a lot for pointing out the missing get_page_unless_zero(). I'll
post a #3 version soon with that bit fixed.

I'm undecided whether tail_page_count is needed. The only benefit would
be being able to grep for tail_page_count and see the few call sites;
maybe that makes it worth it. I doubt the VM_BUG_ON is necessary there,
considering it's easy to review the call sites and they're so few. It'd
also need to go into internal.h, I guess.