linux-kernel - Re: Deadlock possibly caused by too_many

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20101020055717.GA12752@localhost>
Date:	Wed, 20 Oct 2010 13:57:17 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Torsten Kaiser <just.for.lkml@...glemail.com>
Cc:	Neil Brown <neilb@...e.de>, Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"Li, Shaohua" <shaohua.li@...el.com>
Subject: Re: Deadlock possibly caused by too_many_isolated.

On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
> On Tue, Oct 19, 2010 at 10:43 AM, Torsten Kaiser
> <just.for.lkml@...glemail.com> wrote:
> > On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@...e.de> wrote:
> >> Yes, thanks for the report.
> >> This is a real bug exactly as you describe.
> >>
> >> This is how I think I will fix it, though it needs a bit of review and
> >> testing before I can be certain.
> >> Also I need to check raid10 etc to see if they can suffer too.
> >>
> >> If you can test it I would really appreciate it.
> >
> > I did test it, but while it seemed to fix the deadlock, the system
> > still got unusable.
> > The still running "vmstat 1" showed that the swapout was still
> > progressing, but at a rate of ~20k sized bursts every 5 to 20 seconds.
> >
> > I also tried to additionally add Wu's patch:
> > --- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
> > +++ linux-next/mm/vmscan.c      2010-10-19 00:13:04.000000000 +0800
> > @@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
> >               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> >       }
> >
> > +       /*
> > +        * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
> > +        * they won't get blocked by normal ones and form circular deadlock.
> > +        */
> > +       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> > +               inactive >>= 3;
> > +
> >       return isolated > inactive;
> >
> > Either it did help somewhat, or I was more lucky on my second try, but
> > this time I needed ~5 tries instead of only 2 to get the system mostly
> > stuck again. On the testrun with Wu's patch the writeout pattern was
> > more stable, a burst of ~80kb each 20 seconds. But I would suspect
> > that the size of the burst is rather random.
> >
> > I do have a complete SysRq+T dump from the first run, I can send that
> > to anyone how wants it.
> > (It's 190k so I don't want not spam it to the list)
> 
> Is this call trace from the SysRq+T violation the rule to only
> allocate one bio from bio_alloc() until its submitted?
> 
> [  549.700038] Call Trace:
> [  549.700038]  [<ffffffff81566b54>] schedule_timeout+0x144/0x200
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff81565e22>] io_schedule_timeout+0x42/0x60
> [  549.700038]  [<ffffffff81083123>] mempool_alloc+0x163/0x1b0
> [  549.700038]  [<ffffffff81053560>] ? autoremove_wake_function+0x0/0x40
> [  549.700038]  [<ffffffff810ea2b9>] bio_alloc_bioset+0x39/0xf0
> [  549.700038]  [<ffffffff810ea38d>] bio_clone+0x1d/0x50
> [  549.700038]  [<ffffffff814318ed>] make_request+0x23d/0x850
> [  549.700038]  [<ffffffff81082e20>] ? mempool_alloc_slab+0x10/0x20
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff81436e63>] md_make_request+0xc3/0x220
> [  549.700038]  [<ffffffff81083099>] ? mempool_alloc+0xd9/0x1b0
> [  549.700038]  [<ffffffff811ec153>] generic_make_request+0x1b3/0x370
> [  549.700038]  [<ffffffff810ea2d6>] ? bio_alloc_bioset+0x56/0xf0
> [  549.700038]  [<ffffffff811ec36a>] submit_bio+0x5a/0xd0
> [  549.700038]  [<ffffffff81080cf5>] ? unlock_page+0x25/0x30
> [  549.700038]  [<ffffffff810a871e>] swap_writepage+0x7e/0xc0
> [  549.700038]  [<ffffffff81090d99>] shmem_writepage+0x1c9/0x240
> [  549.700038]  [<ffffffff8108c9cb>] pageout+0x11b/0x270
> [  549.700038]  [<ffffffff8108cd78>] shrink_page_list+0x258/0x4d0
> [  549.700038]  [<ffffffff8108d9e7>] shrink_inactive_list+0x187/0x310
> [  549.700038]  [<ffffffff8102dcb1>] ? __wake_up_common+0x51/0x80
> [  549.700038]  [<ffffffff811fc8b2>] ? cpumask_next_and+0x22/0x40
> [  549.700038]  [<ffffffff8108e1c0>] shrink_zone+0x3e0/0x470
> [  549.700038]  [<ffffffff8108e797>] try_to_free_pages+0x157/0x410
> [  549.700038]  [<ffffffff81087c92>] __alloc_pages_nodemask+0x412/0x760
> [  549.700038]  [<ffffffff810b27d6>] alloc_pages_current+0x76/0xe0
> [  549.700038]  [<ffffffff810b6dad>] new_slab+0x1fd/0x2a0
> [  549.700038]  [<ffffffff81045cd0>] ? process_timeout+0x0/0x10
> [  549.700038]  [<ffffffff810b8721>] __slab_alloc+0x111/0x540
> [  549.700038]  [<ffffffff81059961>] ? prepare_creds+0x21/0xb0
> [  549.700038]  [<ffffffff810b92bb>] kmem_cache_alloc+0x9b/0xa0
> [  549.700038]  [<ffffffff81059961>] prepare_creds+0x21/0xb0
> [  549.700038]  [<ffffffff8104a919>] sys_setresgid+0x29/0x120
> [  549.700038]  [<ffffffff8100242b>] system_call_fastpath+0x16/0x1b
> [  549.700038]  ffff88011e125ea8 0000000000000046 ffff88011e125e08
> ffffffff81073c59
> [  549.700038]  0000000000012780 ffff88011ea905b0 ffff88011ea90808
> ffff88011e125fd8
> [  549.700038]  ffff88011ea90810 ffff88011e124010 0000000000012780
> ffff88011e125fd8
> 
> swap_writepage() uses get_swap_bio() which uses bio_alloc() to get one
> bio. That bio is the submitted, but the submit path seems to get into
> make_request from raid1.c and that allocates a second bio from
> bio_alloc() via bio_clone().
> 
> I am seeing this pattern (swap_writepage calling
> md_make_request/make_request and then getting stuck in mempool_alloc)
> more than 5 times in the SysRq+T output...

I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
inside mempool_alloc(), which can be fixed by this patch.

Thanks,
Fengguang
---

concurrent direct page reclaim problem

  __GFP_NORETRY page allocations may fail when there are many concurrent page
  allocating tasks, but not necessary in real short of memory. The root cause
  is, tasks will first run direct page reclaim to free some pages from the LRU
  lists and put them to the per-cpu page lists and the buddy system, and then
  try to get a free page from there.  However the free pages reclaimed by this
  task may be consumed by other tasks when the direct reclaim task is able to
  get the free page for itself.

  Let's retry it a bit harder.

--- linux-next.orig/mm/page_alloc.c	2010-10-20 13:44:50.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-10-20 13:50:54.000000000 +0800
@@ -1700,7 +1700,7 @@ should_alloc_retry(gfp_t gfp_mask, unsig
 				unsigned long pages_reclaimed)
 {
 	/* Do not loop if specifically requested */
-	if (gfp_mask & __GFP_NORETRY)
+	if (gfp_mask & __GFP_NORETRY && pages_reclaimed > (1 << (order + 12)))
 		return 0;
 
 	/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/