Message-ID: <Pine.LNX.4.64.0808141210200.4398@blonde.site>
Date:	Thu, 14 Aug 2008 12:55:46 +0100 (BST)
From:	Hugh Dickins <hugh@...itas.com>
To:	Nick Piggin <npiggin@...e.de>
cc:	Linux Memory Management List <linux-mm@...ck.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [rfc][patch] mm: dirty page accounting race fix

On Thu, 14 Aug 2008, Nick Piggin wrote:

> Hi,
> 
> Wonder if you would be kind enough to verify this? It solves my issues (well
> still running after several hours whereas it previously would lock up in a
> few seconds).
> 
> Thanks,
> Nick
> 
> ---
> There is a race with dirty page accounting where a page may not properly
> be accounted for.
> 
> clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.
> 
> page_mkclean walks the rmaps for that page, and for each one it cleans and
> write protects the pte if it was dirty. It uses page_check_address to find the
> pte. That function has a shortcut to avoid the ptl if the pte is not
> present. Unfortunately, other code can transiently switch the pte from
> present to not-present and back again while holding the page table lock --
> that must not cause page_mkclean to ignore the pte, because it may be dirty.
> 
> For example, do_wp_page calls ptep_clear_flush_notify before marking the
> pte writable and dirty. do_wp_page does subsequently set the page dirty, but
> that can happen before clear_page_dirty_for_io() calls TestClearPageDirty.
> There may also be other code which does similar things, and/or architectures
> which implement other pte manipulation functions using a clear-set sequence,
> I haven't looked thoroughly. But it should be fixed.
> 
> The consequence of the bug is loss of data integrity (msync) and loss of
> dirty page accounting accuracy. XIP's __xip_unmap is probably also unreliable, which
> can lead to data corruption.
> 
> Fix this by always taking ptl to check the pte in page_check_address.
> 
> It's possible to retain this optimization for page_referenced and
> try_to_unmap.
> 
> Signed-off-by: Nick Piggin <npiggin@...e.de>
> ---

Thanks for the inlined patch.

> Index: linux-2.6/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/rmap.h
> +++ linux-2.6/include/linux/rmap.h
> @@ -102,7 +102,7 @@ int try_to_unmap(struct page *, int igno
>   * Called from mm/filemap_xip.c to unmap empty zero page
>   */
>  pte_t *page_check_address(struct page *, struct mm_struct *,
> -				unsigned long, spinlock_t **);
> +				unsigned long, spinlock_t **, int);
>  
>  /*
>   * Used by swapoff to help locate where page is expected in vma.
> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c
> +++ linux-2.6/mm/filemap_xip.c
> @@ -185,7 +185,7 @@ __xip_unmap (struct address_space * mapp
>  		address = vma->vm_start +
>  			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> -		pte = page_check_address(page, mm, address, &ptl);
> +		pte = page_check_address(page, mm, address, &ptl, 1);
>  		if (pte) {
>  			/* Nuke the page table entry. */
>  			flush_cache_page(vma, address, pte_pfn(*pte));
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -224,10 +224,14 @@ unsigned long page_address_in_vma(struct
>  /*
>   * Check that @page is mapped at @address into @mm.
>   *
> + * If @synch is false, page_check_address may perform a racy check to avoid
> + * the page table lock when the pte is not present (helpful when reclaiming
> + * highly shared pages).
> + *
>   * On success returns with pte mapped and locked.
>   */
>  pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> -			  unsigned long address, spinlock_t **ptlp)
> +			  unsigned long address, spinlock_t **ptlp, int synch)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> @@ -249,7 +253,7 @@ pte_t *page_check_address(struct page *p
>  
>  	pte = pte_offset_map(pmd, address);
>  	/* Make a quick check before getting the lock */
> -	if (!pte_present(*pte)) {
> +	if (!synch && !pte_present(*pte)) {
>  		pte_unmap(pte);
>  		return NULL;
>  	}
> @@ -281,7 +285,7 @@ static int page_referenced_one(struct pa
>  	if (address == -EFAULT)
>  		goto out;
>  
> -	pte = page_check_address(page, mm, address, &ptl);
> +	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
>  		goto out;
>  
> @@ -450,7 +454,7 @@ static int page_mkclean_one(struct page 
>  	if (address == -EFAULT)
>  		goto out;
>  
> -	pte = page_check_address(page, mm, address, &ptl);
> +	pte = page_check_address(page, mm, address, &ptl, 1);
>  	if (!pte)
>  		goto out;
>  
> @@ -697,7 +701,7 @@ static int try_to_unmap_one(struct page 
>  	if (address == -EFAULT)
>  		goto out;
>  
> -	pte = page_check_address(page, mm, address, &ptl);
> +	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
>  		goto out;
>  
> 

I'm not against this if it really turns out to be the answer,
it's simple enough and it sounds like a very good find; but
I'm still not convinced that you've got to the bottom of it.

Am I confused, or is your "do_wp_page calls ptep_clear_flush_notify"
example a very bad one?  The page it's dealing with there doesn't
go back into the page table (its COW does), and the dirty_accounting
case doesn't even get down there, it's dealt with in the reuse case
above, which uses ptep_set_access_flags.  Now, I think that one may
well behave as you suggest on some arches (though it's extending
permissions not restricting them, so maybe not); but please check
that out and improve your example.

Even if it does, it's not clear to me that your fix is the answer.
That may well be because the whole of dirty page accounting grew too
subtle for me!  But holding the page table lock on one pte of the
page doesn't guarantee much about the integrity of the whole dance:
do_wp_page does its set_page_dirty_balance for this case, you'd
need to spell out the bad sequence more to convince me.

Sorry, that may be a lot of work, to get it through my skull!
And I may be lazily asking you to do my thinking for me.

But I got a bit distracted: mprotect's change_pte_range is
traditionally where the pte_modify operation has been split up into
stages on some arches, that really can be restricting permissions
and needs to tread carefully.  Now I go to look there, I see its
		/*
		 * Avoid taking write faults for pages we know to be
		 * dirty.
		 */
		if (dirty_accountable && pte_dirty(ptent))
			ptent = pte_mkwrite(ptent);

and get rather worried: isn't that likely to be giving write permission
to a pte in a vma we are precisely taking write permission away from?
That's a different issue, of course; but perhaps it's even relevant.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
