lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <alpine.LSU.2.11.1404161642020.11154@eggly.anvils>
Date:	Wed, 16 Apr 2014 17:15:37 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Johannes Weiner <hannes@...xchg.org>
cc:	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>, linux-mm@...ck.org
Subject: Re: [3.14+] kernel BUG at mm/filemap.c:1347!

On Wed, 16 Apr 2014, Johannes Weiner wrote:
> Subject: [patch] mm: filemap: update find_get_pages_tag() to deal with shadow
>  entries
> 
> Dave Jones reports the following crash when find_get_pages_tag() runs
> into an exceptional entry:
> 
> kernel BUG at mm/filemap.c:1347!
> RIP: 0010:[<ffffffffb815aeab>]  [<ffffffffb815aeab>] find_get_pages_tag+0x1cb/0x220
> Call Trace:
>  [<ffffffffb815ad16>] ? find_get_pages_tag+0x36/0x220
>  [<ffffffffb8168511>] pagevec_lookup_tag+0x21/0x30
>  [<ffffffffb81595de>] filemap_fdatawait_range+0xbe/0x1e0
>  [<ffffffffb8159727>] filemap_fdatawait+0x27/0x30
>  [<ffffffffb81f2fa4>] sync_inodes_sb+0x204/0x2a0
>  [<ffffffffb874d98f>] ? wait_for_completion+0xff/0x130
>  [<ffffffffb81fa5b0>] ? vfs_fsync+0x40/0x40
>  [<ffffffffb81fa5c9>] sync_inodes_one_sb+0x19/0x20
>  [<ffffffffb81caab2>] iterate_supers+0xb2/0x110
>  [<ffffffffb81fa864>] sys_sync+0x44/0xb0
>  [<ffffffffb875c4a9>] ia32_do_call+0x13/0x13
> 
> 1343                         /*
> 1344                          * This function is never used on a shmem/tmpfs
> 1345                          * mapping, so a swap entry won't be found here.
> 1346                          */
> 1347                         BUG();
> 
> After 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
> cache radix trees") this comment and BUG() are out of date because
> exceptional entries can now appear in all mappings - as shadows of
> recently evicted pages.
> 
> However, as Hugh Dickins notes,
> 
>   "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
>    any other PAGECACHE_TAG_*) to appear on an exceptional entry.
> 
>    I expect it comes down to an occasional race in RCU lookup of the
>    radix_tree: lacking absolute synchronization, we might sometimes
>    catch an exceptional entry, with the tag which really belongs with
>    the unexceptional entry which was there an instant before."
> 
> And indeed, not only is the tree walk lockless, the tags are also read
> in chunks, one radix tree node at a time.  There is plenty of time for
> page reclaim to swoop in and replace a page that was already looked up
> as tagged with a shadow entry.
> 
> Remove the BUG() and update the comment.  While reviewing all other
> lookup sites for whether they properly deal with shadow entries of
> evicted pages, update all the comments and fix memcg file charge
> moving to not miss shmem/tmpfs swapcache pages.
> 
> Reported-by: Dave Jones <davej@...hat.com>
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")

Looks exactly right to me, thanks Hannes.  Good catch in memcontrol.c.

Acked-by: Hugh Dickins <hughd@...gle.com>

And I realize now that the tag races which led me to defer to you, are
actually just races we have lived with for years; but before they were
all handled invisibly at the "unlikely(!page)" stage, whereas now they
simply need active handling at the radix_tree_exception stage too.

There is, by the way, a separate cleanup that I noticed last night,
while puzzing over the filemap.c:202 bug.  In mm/truncate.c there
are several "We rely upon deletion not changing page->index" comments
(and in mm/filemap.c "Leave page->index set: truncation relies upon it").
I think your indices[] everywhere have ended that reliance?  Whether you
also remove the "WARN_ON(page->index != index)"s is a matter of taste:
it is reassuring to have that checked somewhere, but no longer so
particular to those loops.

> ---
>  mm/filemap.c    | 49 ++++++++++++++++++++++++++++---------------------
>  mm/memcontrol.c | 20 ++++++++++++--------
>  mm/truncate.c   |  8 --------
>  3 files changed, 40 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a82fbe4c9e8e..d92c437a79c4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -906,8 +906,8 @@ EXPORT_SYMBOL(page_cache_prev_hole);
>   * Looks up the page cache slot at @mapping & @offset.  If there is a
>   * page cache page, it is returned with an increased refcount.
>   *
> - * If the slot holds a shadow entry of a previously evicted page, it
> - * is returned.
> + * If the slot holds a shadow entry of a previously evicted page, or a
> + * swap entry from shmem/tmpfs, it is returned.
>   *
>   * Otherwise, %NULL is returned.
>   */
> @@ -928,9 +928,9 @@ repeat:
>  			if (radix_tree_deref_retry(page))
>  				goto repeat;
>  			/*
> -			 * Otherwise, shmem/tmpfs must be storing a swap entry
> -			 * here as an exceptional entry: so return it without
> -			 * attempting to raise page count.
> +			 * A shadow entry of a recently evicted page,
> +			 * or a swap entry from shmem/tmpfs.  Return
> +			 * it without attempting to raise page count.
>  			 */
>  			goto out;
>  		}
> @@ -983,8 +983,8 @@ EXPORT_SYMBOL(find_get_page);
>   * page cache page, it is returned locked and with an increased
>   * refcount.
>   *
> - * If the slot holds a shadow entry of a previously evicted page, it
> - * is returned.
> + * If the slot holds a shadow entry of a previously evicted page, or a
> + * swap entry from shmem/tmpfs, it is returned.
>   *
>   * Otherwise, %NULL is returned.
>   *
> @@ -1099,8 +1099,8 @@ EXPORT_SYMBOL(find_or_create_page);
>   * with ascending indexes.  There may be holes in the indices due to
>   * not-present pages.
>   *
> - * Any shadow entries of evicted pages are included in the returned
> - * array.
> + * Any shadow entries of evicted pages, or swap entries from
> + * shmem/tmpfs, are included in the returned array.
>   *
>   * find_get_entries() returns the number of pages and shadow entries
>   * which were found.
> @@ -1128,9 +1128,9 @@ repeat:
>  			if (radix_tree_deref_retry(page))
>  				goto restart;
>  			/*
> -			 * Otherwise, we must be storing a swap entry
> -			 * here as an exceptional entry: so return it
> -			 * without attempting to raise page count.
> +			 * A shadow entry of a recently evicted page,
> +			 * or a swap entry from shmem/tmpfs.  Return
> +			 * it without attempting to raise page count.
>  			 */
>  			goto export;
>  		}
> @@ -1198,9 +1198,9 @@ repeat:
>  				goto restart;
>  			}
>  			/*
> -			 * Otherwise, shmem/tmpfs must be storing a swap entry
> -			 * here as an exceptional entry: so skip over it -
> -			 * we only reach this from invalidate_mapping_pages().
> +			 * A shadow entry of a recently evicted page,
> +			 * or a swap entry from shmem/tmpfs.  Skip
> +			 * over it.
>  			 */
>  			continue;
>  		}
> @@ -1265,9 +1265,9 @@ repeat:
>  				goto restart;
>  			}
>  			/*
> -			 * Otherwise, shmem/tmpfs must be storing a swap entry
> -			 * here as an exceptional entry: so stop looking for
> -			 * contiguous pages.
> +			 * A shadow entry of a recently evicted page,
> +			 * or a swap entry from shmem/tmpfs.  Stop
> +			 * looking for contiguous pages.
>  			 */
>  			break;
>  		}
> @@ -1341,10 +1341,17 @@ repeat:
>  				goto restart;
>  			}
>  			/*
> -			 * This function is never used on a shmem/tmpfs
> -			 * mapping, so a swap entry won't be found here.
> +			 * A shadow entry of a recently evicted page.
> +			 *
> +			 * Those entries should never be tagged, but
> +			 * this tree walk is lockless and the tags are
> +			 * looked up in bulk, one radix tree node at a
> +			 * time, so there is a sizable window for page
> +			 * reclaim to evict a page we saw tagged.
> +			 *
> +			 * Skip over it.
>  			 */
> -			BUG();
> +			continue;
>  		}
>  
>  		if (!page_cache_get_speculative(page))
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 29501f040568..c47dffdcb246 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6686,16 +6686,20 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
>  		pgoff = pte_to_pgoff(ptent);
>  
>  	/* page is moved even if it's not RSS of this task(page-faulted). */
> -	page = find_get_page(mapping, pgoff);
> -
>  #ifdef CONFIG_SWAP
>  	/* shmem/tmpfs may report page out on swap: account for that too. */
> -	if (radix_tree_exceptional_entry(page)) {
> -		swp_entry_t swap = radix_to_swp_entry(page);
> -		if (do_swap_account)
> -			*entry = swap;
> -		page = find_get_page(swap_address_space(swap), swap.val);
> -	}
> +	if (shmem_mapping(mapping)) {
> +		page = find_get_entry(mapping, pgoff);
> +		if (radix_tree_exceptional_entry(page)) {
> +			swp_entry_t swp = radix_to_swp_entry(page);
> +			if (do_swap_account)
> +				*entry = swp;
> +			page = find_get_page(swap_address_space(swp), swp.val);
> +		}
> +	} else
> +		page = find_get_page(mapping, pgoff);
> +#else
> +	page = find_get_page(mapping, pgoff);
>  #endif
>  	return page;
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index e5cc39ab0751..6a78c814bebf 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -484,14 +484,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  	unsigned long count = 0;
>  	int i;
>  
> -	/*
> -	 * Note: this function may get called on a shmem/tmpfs mapping:
> -	 * pagevec_lookup() might then return 0 prematurely (because it
> -	 * got a gangful of swap entries); but it's hardly worth worrying
> -	 * about - it can rarely have anything to free from such a mapping
> -	 * (most pages are dirty), and already skips over any difficulties.
> -	 */
> -
>  	pagevec_init(&pvec, 0);
>  	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
>  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> -- 
> 1.9.2
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ