linux-kernel - Re: iotop: khugepaged at 99.99% (2.6.38.X)

Message-ID: <20110506011319.GH7838@random.random>
Date:	Fri, 6 May 2011 03:13:19 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Thomas Sattler <tsattler@....de>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Mel Gorman <mel@....ul.ie>
Subject: Re: iotop: khugepaged at 99.99% (2.6.38.X)

On Fri, May 06, 2011 at 12:04:14AM +0200, Thomas Sattler wrote:
> It happened again: This time with 2.6.38.4 after 13 days uptime.
> In fact it was "13 days after last boot", since this machine is
> hibernated quite often. I waited only two minutes before I run
> 'reboot' as root.
> 
> > Please next time can you run SYSRQ+t too in addition of SYSRQ+l?
> 
> See http://pastebin.com/raw.php?i=XnXXfC40 (It seems to me SYSRQ+l
> did not work at all? And does also not work on 2.6.38.5?)
> 
> see http://pastebin.com/raw.php?i=Zuv0VnUP for 'top/iotop'

Ok this time we're onto something.

The 3 tasks (khugepaged, thunderbird-bin, convert) are allocating
hugepages, and all 3 get stuck indefinitely in the congestion_wait loop
of shrink_zone controlled by too_many_isolated() while trying to free
memory (likely for compaction). kswapd is idle, rightfully so, because
allocating hugepages in the background is the job of the khugepaged
task.

So to me it looks like either too_many_isolated is wrong, or maybe it
could be the loop of compaction_suitable that is insisting too much.

Admittedly if SWAP_CLUSTER_MAX pages of 2M each are isolated, the
isolated memory will rocket up fast to 64M (while with 4k pages it'd go
up to 128k at most), but if all the tasks are stuck in the loop and
never return, nr_isolated_anon should have been zero. Maybe they do
return but compaction_suitable makes them loop again. I'm uncertain
what's going on yet.
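
A minimal sketch of the arithmetic above. It assumes SWAP_CLUSTER_MAX is
32 (the value implied by the 64M figure) and x86 page sizes of 4k and 2M;
the helper name isolated_bytes is made up for illustration.

```python
# Assumption: SWAP_CLUSTER_MAX = 32, as implied by "64M" for 2M pages.
SWAP_CLUSTER_MAX = 32
KIB = 1024
MIB = 1024 * KIB


def isolated_bytes(page_size, batch=SWAP_CLUSTER_MAX):
    """Bytes taken off the LRU when one batch of `batch` pages is isolated."""
    return batch * page_size


huge = isolated_bytes(2 * MIB)   # one batch of 2M hugepages
small = isolated_bytes(4 * KIB)  # one batch of 4k pages

print(huge // MIB, "MiB isolated per batch of hugepages")
print(small // KIB, "KiB isolated per batch of 4k pages")
print("ratio:", huge // small)
```

The ratio between the two cases is simply the number of 4k pages in one
2M hugepage.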

The threshold of the per-cpu vmstat should be well under 512 pages, so
likely the lack of synchronization for the stats isn't to blame for
this. For now we'll assume the per-cpu stats aren't the problem.

Now the thing I want to rule out first is an accounting error in the
isolated pages, so when it hangs again I'd like to see the output of:

     grep anon /proc/zoneinfo

So we can see immediately what are the values of nr_isolated_anon and
nr_inactive_anon (the hang should only happen when nr_isolated_anon >
nr_inactive_anon).
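
The comparison described above can be sketched as a small script. This
is only an illustration: the function names are hypothetical, SAMPLE is
made-up text in the zoneinfo layout (not output from the affected
machine), and the only rule applied is the one stated in this message,
nr_isolated_anon > nr_inactive_anon.

```python
import re


def parse_zoneinfo(text):
    """Return {"nodeN/ZONE": {counter_name: value}} for the nr_* counters."""
    zones = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if header:
            current = "node%s/%s" % header.groups()
            zones[current] = {}
            continue
        counter = re.match(r"\s+(nr_\w+)\s+(\d+)\s*$", line)
        if counter and current is not None:
            zones[current][counter.group(1)] = int(counter.group(2))
    return zones


def throttled_zones(zones):
    """Zones where nr_isolated_anon exceeds nr_inactive_anon."""
    return [name for name, c in zones.items()
            if c.get("nr_isolated_anon", 0) > c.get("nr_inactive_anon", 0)]


# Made-up sample, for illustration only.
SAMPLE = """\
Node 0, zone      DMA
    nr_inactive_anon 0
    nr_isolated_anon 0
Node 0, zone   Normal
    nr_inactive_anon 100
    nr_isolated_anon 512
"""

print(throttled_zones(parse_zoneinfo(SAMPLE)))
```

On a live system the same functions could be fed the contents of
/proc/zoneinfo instead of SAMPLE.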

You can already run "grep threshold /proc/zoneinfo" on the system
where you reproduced the hang the last time (the one running 2.6.38.4
with 1.5G of RAM). The thresholds should all be well below 512 (so in
theory the per-cpu stats aren't causing trouble, and with so few cpus
it shouldn't have been such a longstanding problem anyway).
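
The threshold check can be scripted the same way. A minimal sketch,
assuming the per-cpu lines read "vm stats threshold: N" as in the
zoneinfo layout; the function names and the SAMPLE values are invented
for illustration, and 512 is the bound mentioned above.

```python
import re


def stat_thresholds(text):
    """Collect every per-cpu 'vm stats threshold' value from zoneinfo text."""
    return [int(v) for v in re.findall(r"vm stats threshold:\s*(\d+)", text)]


def all_below(text, limit=512):
    """True if at least one threshold was found and all are below `limit`."""
    values = stat_thresholds(text)
    return bool(values) and max(values) < limit


# Made-up sample, for illustration only.
SAMPLE = """\
  pagesets
    cpu: 0
  vm stats threshold: 4
    cpu: 1
  vm stats threshold: 4
  pagesets
    cpu: 0
  vm stats threshold: 20
"""

print(stat_thresholds(SAMPLE), all_below(SAMPLE))
```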

If you didn't reboot that system after the last hang, you can already
run "grep anon /proc/zoneinfo" while the system is mostly idle, then
all nr_isolated_anon should be zero. If they're not zero and they stay
not zero on an idle system, we've an accounting bug to fix. If they're
all zero like they should be, then we're likely looping in
compaction_suitable.
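
The decision above, written out as a sketch. Each sample is one reading
of nr_isolated_anon per zone taken on an idle system. The function name
and the return strings are hypothetical, and the "transient" case is an
assumption added here for readings that are non-zero only some of the
time; the message itself only covers the other two cases.

```python
def diagnose(samples):
    """Classify repeated idle readings of nr_isolated_anon.

    `samples` is a list of readings; each reading is a list with one
    value per zone, in the same zone order every time.
    """
    if not samples:
        raise ValueError("need at least one reading")
    if all(value == 0 for reading in samples for value in reading):
        return "clean: suspect the compaction_suitable loop"
    zone_count = len(samples[0])
    stuck = [i for i in range(zone_count)
             if all(reading[i] != 0 for reading in samples)]
    if stuck:
        return "accounting bug suspected"
    # Assumption: non-zero only in some readings means pages were
    # briefly isolated and then put back, so keep sampling.
    return "transient: keep sampling"


print(diagnose([[0, 0, 0], [0, 0, 0]]))
print(diagnose([[0, 37, 0], [0, 37, 0]]))
```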

On my busy kernels:

grep nr_isolated_anon /proc/zoneinfo 
    nr_isolated_anon 0
    nr_isolated_anon 0
    nr_isolated_anon 0

grep nr_isolated_anon /proc/zoneinfo 
    nr_isolated_anon 0
    nr_isolated_anon 0

grep nr_isolated_anon /proc/zoneinfo 
    nr_isolated_anon 0
    nr_isolated_anon 0
    nr_isolated_anon 0

No apparent accounting problem here despite quite some load and
uptime.

I already have a patch to try for the compaction_suitable loop but I'll
wait for your feedback and I need to think a bit more about this.

This patch may help you to reproduce much quicker, I'll try that too
to see if I can reproduce... (ignore the sync_migration = true, it
won't hurt but it's unrelated to the debug patch; just apply the patch
if you have trouble reproducing the hang again. When compaction
succeeds, and it does 99% of the time even with the less reliable async
initial mode, it likely hides the problem very well.)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..c2f3646 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2105,8 +2105,9 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
+	sync_migration = true;

+#if 0
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
@@ -2115,6 +2116,7 @@ rebalance:
 					migratetype, &did_some_progress);
 	if (page)
 		goto got_pg;
+#endif

 	/*
 	 * If we failed to make any progress reclaiming, then we are

CC'ed Mel so he can check this too.

Thanks a lot for the help.
Andrea