Date:	Wed, 14 Sep 2011 13:38:11 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Shaohua Li <shaohua.li@...el.com>
cc:	Eric Dumazet <eric.dumazet@...il.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Rik van Riel <riel@...hat.com>,
	Lin Ming <mlin@...pku.edu.cn>,
	Justin Piszcz <jpiszcz@...idpixels.com>,
	Pawel Sikora <pluto@...k.net>
Subject: Re: [BUG] infinite loop in find_get_pages()

On Wed, 14 Sep 2011, Shaohua Li wrote:
> On Wed, 2011-09-14 at 16:43 +0800, Eric Dumazet wrote:
> > On Wednesday 14 September 2011 at 16:20 +0800, Shaohua Li wrote:
> > > 2011/9/14 Shaohua Li <shli@...nel.org>:
> > > > it appears we didn't account skipped swap entry in find_get_pages().
> > > > does the attached patch help?
> > > I can easily reproduce the issue. Just cp files in tmpfs, trigger swap and
> > > drop caches. The debug patch fixes it on my side.
> > > Eric, please try it.
> > > 
> > 
> > Hello Shaohua
> > 
> > I tried it with added traces:
> > 
> > 
> > [  277.077855] mv used greatest stack depth: 3336 bytes left
> > [  310.558012] nr_found=2 nr_skip=2
> > [  310.558139] nr_found=14 nr_skip=14
> > [  332.195162] nr_found=2 nr_skip=2
> > [  332.195274] nr_found=14 nr_skip=14
> > [  352.315273] nr_found=14 nr_skip=14
> > [  372.673575] nr_found=14 nr_skip=14
> > [  397.115463] nr_found=14 nr_skip=14
> > [  403.391694] cc1 used greatest stack depth: 3184 bytes left
> > [  404.761194] cc1 used greatest stack depth: 2640 bytes left
> > [  417.306510] nr_found=14 nr_skip=14
> > [  440.198051] nr_found=14 nr_skip=14
> > 
> > I also used:
> > 
> > -	if (unlikely(!ret && nr_found))
> > +	if (unlikely(!ret && nr_found > nr_skip))
> >  		goto restart;
> nr_found > nr_skip is better
> 
> > It seems to fix the bug. I suspect it also aborts
> > invalidate_mapping_pages() if we skip 14 pages, but existing comment
> > states it's OK:
> > 
> >         /*
> >          * Note: this function may get called on a shmem/tmpfs mapping:
> >          * pagevec_lookup() might then return 0 prematurely (because it
> >          * got a gangful of swap entries); but it's hardly worth worrying
> >          * about - it can rarely have anything to free from such a mapping
> >          * (most pages are dirty), and already skips over any difficulties.
> >          */
> That might be a problem; let Hugh answer if it is.

Thanks to you all for suffering, reporting and investigating this.
Yes, in 3.1-rc I have converted an extremely rare try-again-once
into a too-easily stumbled-upon endless loop.
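
In outline, the trap looks like this (pared down from the 3.1-rc code in
mm/filemap.c, details elided, just to show the shape of the loop):

restart:
	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
				(void ***)pages, NULL, start, nr_pages);
	ret = 0;
	for (i = 0; i < nr_found; i++) {
		/* shmem swap entries are skipped, so ret never advances */
	}
	if (unlikely(!ret && nr_found))
		goto restart;	/* same swap entries found again: no progress */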

Would it be a problem to give up early on a shmem/tmpfs mapping in
invalidate_mapping_pages()?  No, not really: it's rare for it to find
anything it can throw away from tmpfs, because it cannot recognize the
clean swapcache pages (getting it to work on those would be nice, and
something I did look into once, but it's not a job for today), and
entirely clean pages (readonly mmap'ed zeroes never touched) are uncommon.

However, I did independently run across scan_mapping_unevictable_pages()
a few days ago: that uses pagevec_lookup() on shmem when doing SHM_UNLOCK,
and although the normal case would be that everything then is in memory,
I think it's not impossible for some to be swapped out (already swapped
out at SHM_LOCK time, and not touched since), which should not stop it
from doing its work on unswapped pages beyond.
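
To make the pattern concrete (a simplified sketch, not the exact kernel
code): both invalidate_mapping_pages() and SHM_UNLOCK's
scan_mapping_unevictable_pages() drive their scan with a loop like

	while (index <= end &&
	       pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];
			/* per-page work; index advances past page->index */
		}
		pagevec_release(&pvec);
	}

so a premature 0 from pagevec_lookup() - a gangful of swap entries
yielding no page pointers - ends the scan before anything further along
the mapping is reached.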

My preferred patch is below: but it does add a cond_resched() into
find_get_pages(), which is really below the level at which we usually
do cond_resched().  All callers appear fine with it, and in practice
it would be very^14 rare on anything other than shmem/tmpfs: so this
being rc6 I'm reluctant to make matters worse with a might_sleep().

But I'm not signing this off yet, because I'm still mystified by the
several reports of seemingly the same problem on 3.0.1 and 3.0.2,
which I fear the patch below (even if adjusted to apply) will do
nothing to help - there are no swap entries in radix_tree in 3.0.

My suspicion is that there's some path by which a page gets trapped
in the radix_tree with page count 0.  While it's easy to imagine that
THP's use of compaction and compaction's use of migration could have
made a bug there more common, I do not see it.

I'd like to think about that a little more before finalizing the
patch below - does it work, and does it look acceptable so far?
Of course, the mods to truncate.c and vmscan.c are not essential
parts of this fix, just things to tidy up while on the subject.
Right now I must attend to some other stuff, will return tomorrow.

Hugh

---

 mm/filemap.c  |   14 ++++++++++----
 mm/truncate.c |    8 --------
 mm/vmscan.c   |    2 +-
 3 files changed, 11 insertions(+), 13 deletions(-)

--- 3.1-rc6/mm/filemap.c	2011-08-07 23:44:41.231928061 -0700
+++ linux/mm/filemap.c	2011-09-14 12:24:26.431242155 -0700
@@ -829,8 +829,8 @@ unsigned find_get_pages(struct address_s
 	unsigned int ret;
 	unsigned int nr_found;
 
-	rcu_read_lock();
 restart:
+	rcu_read_lock();
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
 				(void ***)pages, NULL, start, nr_pages);
 	ret = 0;
@@ -849,12 +849,15 @@ repeat:
 				 * to root: none yet gotten, safe to restart.
 				 */
 				WARN_ON(start | i);
+				rcu_read_unlock();
 				goto restart;
 			}
 			/*
 			 * Otherwise, shmem/tmpfs must be storing a swap entry
 			 * here as an exceptional entry: so skip over it -
-			 * we only reach this from invalidate_mapping_pages().
+			 * we only reach this from invalidate_mapping_pages(),
+			 * or SHM_UNLOCK's scan_mapping_unevictable_pages() -
+			 * in each case it's correct to skip a swapped entry.
 			 */
 			continue;
 		}
@@ -871,14 +874,17 @@ repeat:
 		pages[ret] = page;
 		ret++;
 	}
+	rcu_read_unlock();
 
 	/*
 	 * If all entries were removed before we could secure them,
 	 * try again, because callers stop trying once 0 is returned.
 	 */
-	if (unlikely(!ret && nr_found))
+	if (unlikely(!ret && nr_found)) {
+		cond_resched();
+		start += nr_found;
 		goto restart;
-	rcu_read_unlock();
+	}
 	return ret;
 }
 
--- 3.1-rc6/mm/truncate.c	2011-08-07 23:44:41.299928402 -0700
+++ linux/mm/truncate.c	2011-09-14 11:23:19.513059010 -0700
@@ -336,14 +336,6 @@ unsigned long invalidate_mapping_pages(s
 	unsigned long count = 0;
 	int i;
 
-	/*
-	 * Note: this function may get called on a shmem/tmpfs mapping:
-	 * pagevec_lookup() might then return 0 prematurely (because it
-	 * got a gangful of swap entries); but it's hardly worth worrying
-	 * about - it can rarely have anything to free from such a mapping
-	 * (most pages are dirty), and already skips over any difficulties.
-	 */
-
 	pagevec_init(&pvec, 0);
 	while (index <= end && pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
--- 3.1-rc6/mm/vmscan.c	2011-08-28 22:10:26.516791859 -0700
+++ linux/mm/vmscan.c	2011-09-14 11:25:27.701694661 -0700
@@ -3375,8 +3375,8 @@ void scan_mapping_unevictable_pages(stru
 		pagevec_release(&pvec);
 
 		count_vm_events(UNEVICTABLE_PGSCANNED, pg_scanned);
+		cond_resched();
 	}
-
 }
 
 /**
