linux-kernel - Re: [PATCH v2] mm: incorporate read-only pages into transparent huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150126151906.GS11755@redhat.com>
Date:	Mon, 26 Jan 2015 16:19:06 +0100
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Vlastimil Babka <vbabka@...e.cz>
Cc:	Ebru Akagunduz <ebru.akagunduz@...il.com>, linux-mm@...ck.org,
	akpm@...ux-foundation.org, kirill@...temov.name, mhocko@...e.cz,
	mgorman@...e.de, rientjes@...gle.com, sasha.levin@...cle.com,
	hughd@...gle.com, hannes@...xchg.org, linux-kernel@...r.kernel.org,
	riel@...hat.com
Subject: Re: [PATCH v2] mm: incorporate read-only pages into transparent huge
 pages

Hi,

On Mon, Jan 26, 2015 at 08:36:06AM +0100, Vlastimil Babka wrote:
> >  - Add fast path optimistic check to
> >    __collapse_huge_page_isolate()
> 
> My interpretation is that the optimistic check is in khugepaged_scan_pmd() while
> in __collapse_huge_page_isolate() it's protected by lock, as Andrea suggested?

Correct, __collapse_huge_page_isolate is the "accurate" check that was
missing in v1. The optimistic check was the one in khugepaged_scan_pmd
and it was already present.

> > @@ -2168,9 +2168,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  		VM_BUG_ON_PAGE(!PageAnon(page), page);
> >  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> >  
> > -		/* cannot use mapcount: can't collapse if there's a gup pin */
> > -		if (page_count(page) != 1)
> > -			goto out;
> >  		/*
> >  		 * We can do it before isolate_lru_page because the
> >  		 * page can't be freed from under us. NOTE: PG_lock
> > @@ -2179,6 +2176,31 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  		 */
> >  		if (!trylock_page(page))
> >  			goto out;
> > +
> > +		/*
> > +		 * cannot use mapcount: can't collapse if there's a gup pin.
> > +		 * The page must only be referenced by the scanned process
> > +		 * and page swap cache.
> > +		 */
> > +		if (page_count(page) != 1 + !!PageSwapCache(page)) {
> > +			unlock_page(page);
> > +			goto out;
> > +		}
> > +		if (!pte_write(pteval)) {
> > +			if (++ro > khugepaged_max_ptes_none) {
> > +				unlock_page(page);
> > +				goto out;
> 
> So just for completeness, as I said later for v1 I think this can leave us with
> read-only VMA: consider ro == 256 and none == 256, referenced can still be >0
> (up to 256). I think that the check for referenced that follows this for loop
> should also check if (ro + none < HPAGE_PMD_NR).

The moment "ro" becomes 256 or "none" becomes 256, we immediately goto out.

	if (likely(referenced))
		return 1;
out:
	release_pte_pages(pte, _pte);
	return 0;

"out" is past the referenced check and we fail the collapse
immediately.

If ro is < 255 then "none" is also < 255 (if pte_write is true, then
pte_none cannot be true).

Overall I don't see how we could collapse in readonly vma and where
the bug is for this case, but I may be overlooking something obvious.

> > +			}
> > +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> > +				unlock_page(page);
> > +				goto out;
> > +			}
> > +			/*
> > +			 * Page is not in the swap cache, and page count is
> > +			 * one (see above). It can be collapsed into a THP.
> > +			 */
> 
> I would still put the VM_BUG_ON(page_count(page) != 1) here as I suggested
> previously. Even more so that I think it would have been able to catch the
> problem that Andrea pointed out in v1.

This is subtle, but we can't do VM_BUG_ON because it's ok if the VM
comes before us in another CPU, and takes a pin on the page to isolate
the page.

In short if you changed the current upstream code like below:

		/* cannot use mapcount: can't collapse if there's a gup pin */
		if (page_count(page) != 1)
			goto out;
		VM_BUG_ON(page_count(page) != 1);
		trylock...

the VM_BUG_ON could fire as a false positive. It's not even related to
the trylock_page, it's related to the LRU lock and isolate_lru_page
that the VM could run on a different CPU and it's not a bug.

The VM is free to pin the page. We only need to be sure there are no
GUP-pins after we blocked all variants of GUP and the check is enough
for that.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/