Message-ID: <20120809214203.GG10459@redhat.com>
Date: Thu, 9 Aug 2012 23:42:03 +0200
From: Andrea Arcangeli <aarcange@...hat.com>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: mingo@...nel.org, riel@...hat.com, oleg@...hat.com, pjt@...gle.com,
akpm@...ux-foundation.org, torvalds@...ux-foundation.org,
tglx@...utronix.de, Lee.Schermerhorn@...com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 04/19] mm, thp: Preserve pgprot across huge page split
On Tue, Jul 31, 2012 at 09:12:08PM +0200, Peter Zijlstra wrote:
> If we marked a THP with our special PROT_NONE protections, ensure we
> don't lose them over a split.
>
> Collapse seems to always allocate a new (huge) page which should
> already end up on the new target node so losing protections there
> isn't a problem.
This looks like an optimization too, as it removes a few branches.
If you hadn't introduced an unnecessary goto, the actual change would
have been more readable and the patch much smaller. (You could have
cleaned it up with a later patch if you disliked the coding style that
tried to avoid unnecessary gotos; a rough sketch of what I mean follows
below.)
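For illustration only, an untested sketch of how the same pgprot change
could have been applied while keeping the original "if (pmd)" block (the
loop body is the one from your patch; only the goto and the reindentation
of the surrounding code go away):

	if (pmd) {
		pgprot_t prot = pmd_pgprot(*pmd);

		pgtable = get_pmd_huge_pte(mm);
		pmd_populate(mm, &_pmd, pgtable);

		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
		     i++, haddr += PAGE_SIZE) {
			pte_t *pte, entry;

			BUG_ON(PageCompound(page + i));
			/* inherit the huge pmd's protections (e.g. PROT_NONE) */
			entry = mk_pte(page + i, prot);
			entry = pte_mkdirty(entry);
			if (!pmd_young(*pmd))
				entry = pte_mkold(entry);
			pte = pte_offset_map(&_pmd, haddr);
			BUG_ON(!pte_none(*pte));
			set_pte_at(mm, haddr, pte, entry);
			pte_unmap(pte);
		}

		/*
		 * ... the smp_wmb()/pmd_mknotpresent()/flush_tlb_range()/
		 * pmd_populate() sequence stays as in the patch ...
		 */
	}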
The s/barrier()/ACCESS_ONCE()/ change I'll merge into my tree as a
separate commit, as it's not related to sched-numa.
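For reference (not part of the patch): ACCESS_ONCE() in
<linux/compiler.h> expands to

	(*(volatile typeof(x) *)&(x))

so

	unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);

performs the single, non-reread load through a volatile cast at the load
itself, instead of relying on a plain load followed by barrier().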
>
> Cc: Rik van Riel <riel@...hat.com>
> Cc: Paul Turner <pjt@...gle.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> ---
> arch/x86/include/asm/pgtable.h | 1
> mm/huge_memory.c | 104 +++++++++++++++++++----------------------
> 2 files changed, 50 insertions(+), 55 deletions(-)
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -350,6 +350,7 @@ static inline pgprot_t pgprot_modify(pgp
> }
>
> #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
> +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
>
> #define canon_pgprot(p) __pgprot(massage_pgprot(p))
>
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1353,64 +1353,60 @@ static int __split_huge_page_map(struct
> int ret = 0, i;
> pgtable_t pgtable;
> unsigned long haddr;
> + pgprot_t prot;
>
> spin_lock(&mm->page_table_lock);
> pmd = page_check_address_pmd(page, mm, address,
> PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> - if (pmd) {
> - pgtable = get_pmd_huge_pte(mm);
> - pmd_populate(mm, &_pmd, pgtable);
> -
> - for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> - i++, haddr += PAGE_SIZE) {
> - pte_t *pte, entry;
> - BUG_ON(PageCompound(page+i));
> - entry = mk_pte(page + i, vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> - if (!pmd_write(*pmd))
> - entry = pte_wrprotect(entry);
> - else
> - BUG_ON(page_mapcount(page) != 1);
> - if (!pmd_young(*pmd))
> - entry = pte_mkold(entry);
> - pte = pte_offset_map(&_pmd, haddr);
> - BUG_ON(!pte_none(*pte));
> - set_pte_at(mm, haddr, pte, entry);
> - pte_unmap(pte);
> - }
> + if (!pmd)
> + goto unlock;
>
> - smp_wmb(); /* make pte visible before pmd */
> - /*
> - * Up to this point the pmd is present and huge and
> - * userland has the whole access to the hugepage
> - * during the split (which happens in place). If we
> - * overwrite the pmd with the not-huge version
> - * pointing to the pte here (which of course we could
> - * if all CPUs were bug free), userland could trigger
> - * a small page size TLB miss on the small sized TLB
> - * while the hugepage TLB entry is still established
> - * in the huge TLB. Some CPU doesn't like that. See
> - * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> - * Erratum 383 on page 93. Intel should be safe but is
> - * also warns that it's only safe if the permission
> - * and cache attributes of the two entries loaded in
> - * the two TLB is identical (which should be the case
> - * here). But it is generally safer to never allow
> - * small and huge TLB entries for the same virtual
> - * address to be loaded simultaneously. So instead of
> - * doing "pmd_populate(); flush_tlb_range();" we first
> - * mark the current pmd notpresent (atomically because
> - * here the pmd_trans_huge and pmd_trans_splitting
> - * must remain set at all times on the pmd until the
> - * split is complete for this pmd), then we flush the
> - * SMP TLB and finally we write the non-huge version
> - * of the pmd entry with pmd_populate.
> - */
> - set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> - flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> - pmd_populate(mm, pmd, pgtable);
> - ret = 1;
> + prot = pmd_pgprot(*pmd);
> + pgtable = get_pmd_huge_pte(mm);
> + pmd_populate(mm, &_pmd, pgtable);
> +
> + for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> +
> + BUG_ON(PageCompound(page+i));
> + entry = mk_pte(page + i, prot);
> + entry = pte_mkdirty(entry);
> + if (!pmd_young(*pmd))
> + entry = pte_mkold(entry);
> + pte = pte_offset_map(&_pmd, haddr);
> + BUG_ON(!pte_none(*pte));
> + set_pte_at(mm, haddr, pte, entry);
> + pte_unmap(pte);
> }
> +
> + smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
> + /*
> + * Up to this point the pmd is present and huge.
> + *
> + * If we overwrite the pmd with the not-huge version, we could trigger
> + * a small page size TLB miss on the small sized TLB while the hugepage
> + * TLB entry is still established in the huge TLB.
> + *
> + * Some CPUs don't like that. See
> + * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
> + * on page 93.
> + *
> + * Thus it is generally safer to never allow small and huge TLB entries
> + * for overlapping virtual addresses to be loaded. So we first mark the
> + * current pmd not present, then we flush the TLB and finally we write
> + * the non-huge version of the pmd entry with pmd_populate.
> + *
> + * The above needs to be done under the ptl because pmd_trans_huge and
> + * pmd_trans_splitting must remain set on the pmd until the split is
> + * complete. The ptl also protects against concurrent faults due to
> + * making the pmd not-present.
> + */
> + set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + pmd_populate(mm, pmd, pgtable);
> + ret = 1;
> +
> +unlock:
> spin_unlock(&mm->page_table_lock);
>
> return ret;
> @@ -2241,9 +2237,7 @@ static int khugepaged_wait_event(void)
> static void khugepaged_do_scan(struct page **hpage)
> {
> unsigned int progress = 0, pass_through_head = 0;
> - unsigned int pages = khugepaged_pages_to_scan;
> -
> - barrier(); /* write khugepaged_pages_to_scan to local stack */
> + unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
>
> while (progress < pages) {
> cond_resched();
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/