Message-ID: <20160623091830.GA32535@sig21.net>
Date: Thu, 23 Jun 2016 11:18:30 +0200
From: Johannes Stezenbach <js@...21.net>
To: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Michal Hocko <mhocko@...nel.org>
Subject: Re: 4.6.2 frequent crashes under memory + IO pressure
On Tue, Jun 21, 2016 at 08:47:51PM +0900, Tetsuo Handa wrote:
> Johannes Stezenbach wrote:
> >
> > a man's got to have a hobby, so I'm running Android AOSP
> > builds on my home PC, which has 4GB of RAM and 4GB of swap.
> > Apparently that is not really adequate for the job, but it used to
> > work with a 4.4.10 kernel. Now I have upgraded to 4.6.2
> > and it usually crashes within 30 minutes during compilation.
>
> Such a reproducer is welcome.
> You might be hitting an OOM livelock with an innocent workload.
>
> > The crash is a hard hang, mouse doesn't move, no reaction
> > to keyboard, nothing in logs (systemd journal) after reboot.
>
> Yes, it seems to me that your system is OOM livelocked.
From my crash log I can see that X is hanging in
i915_gem_object_get_pages_gtt, and the network is dead
due to order-0 allocation failures causing a series of
"ath9k_htc: RX memory allocation error" messages, which is
what makes the issue so unpleasant.
The particular command which triggers it seems to be
Jill from the Android Java toolchain
(http://tools.android.com/tech-docs/jackandjill),
which runs as "java -Xmx3500m -jar $(JILL_JAR)", i.e.
potentially eating all my available RAM when linking
the Android framework.
Meanwhile I found some more RAM, and linux-4.6.2 runs stably
with 8GB for this workload. The build time (for the
partial AOSP rebuild that fairly reliably triggered the hangup)
dropped from ~20min to ~17min (so it wasn't thrashing too
badly), and swap usage dropped from ~50% (of 4GB) to <5%.
> It is sad that we haven't merged kmallocwd which will report
> which memory allocations are stalling
> ( http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).
Would you like me to try it? It wouldn't prevent the hang, though,
just print better debug output to the serial console, right?
Or would it OOM-kill some process?
> > Then I tried 4.5.7, it seems to be stable so far.
> >
> > I'm using dm-crypt + lvm + ext4 (swap also in lvm).
> >
> > Now I hooked up a laptop to the serial port and captured
> > some logs of the crash which seems to be repeating
> >
> > [ 2240.842567] swapper/3: page allocation failure: order:0, mode:0x2200020(GFP_NOWAIT|__GFP_HIGH|__GFP_NOTRACK)
> > or
> > [ 2241.167986] SLUB: Unable to allocate memory on node -1, gfp=0x2080020(GFP_ATOMIC)
> >
> > over and over. Based on the backtraces in the log I decided
> > to hot-unplug USB devices, and twice the kernel came
> > back to life, but on the 3rd crash it was dead for good.
>
> The values
>
> DMA free:12kB min:32kB
> DMA32 free:2268kB min:6724kB
> Normal free:84kB min:928kB
>
> suggest that memory reserves are being spent for a pointless purpose. Maybe your system is
> falling into a situation which was mitigated by commit 78ebc2f7146156f4 ("mm,writeback:
> don't use memory reserves for wb_start_writeback"). Thus, applying that commit to
> your 4.6.2 kernel might help avoid the flood of these allocation failure messages.
I could try. Could you let me know whether booting with mem=4G
is equivalent, or do I need to use memmap= or physically remove
the RAM (which is not so easy since the CPU fan is in the way)?
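By the way, the free vs. min numbers above make more sense to me now:
once a zone's free pages fall below its min watermark, even
GFP_ATOMIC/__GFP_HIGH callers fail, since they may only dip partway
into the reserve. A much simplified sketch of the check as I understand
it (the real __zone_watermark_ok() in mm/page_alloc.c does quite a bit
more):

  #include <linux/types.h>

  /*
   * Much simplified sketch of the zone watermark check; the real
   * __zone_watermark_ok() also accounts for lowmem_reserve and
   * higher-order free lists.
   */
  static bool watermark_ok_sketch(long free_pages, long min, bool alloc_high)
  {
          if (alloc_high)         /* __GFP_HIGH, e.g. GFP_ATOMIC callers */
                  min -= min / 2; /* may use up to half of the reserve */
          return free_pages > min;
  }

With Normal at free:84kB vs min:928kB that check cannot succeed for
anyone, which would explain the flood of order-0 failures.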
> > Before I pressed the reset button I used SysRq-W. At the bottom
> > is a "BUG: workqueue lockup"; it could be the result of
> > the log spew on the serial console taking so long, but it looks
> > like some IO is never completing.
>
> But even after you apply that commit, I guess you will still see a silent hang-up
> because the page allocator would think there is still reclaimable memory. So, is
> it possible to also try current linux.git kernels? I'd like to know whether the
> "OOM detection rework" (which went into 4.7) helps with giving up reclaiming and
> invoking the OOM killer with your workload.
>
> Maybe __GFP_FS allocations start invoking the OOM killer. But maybe __GFP_FS
> allocations still remain stuck waiting for !__GFP_FS allocations, whereas !__GFP_FS
> allocations give up without invoking the OOM killer (i.e. effectively no "give up").
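If I understand that right, the distinction boils down to the __GFP_FS
bit. A much simplified sketch of how I read it (made-up helper names,
not real kernel code):

  #include <linux/slab.h>

  /*
   * Much simplified sketch: a GFP_KERNEL allocation has __GFP_FS set and
   * may end up invoking the OOM killer, while a GFP_NOFS allocation must
   * not recurse into filesystem reclaim and gives up (or keeps retrying)
   * without OOM-killing anything.
   */
  static void *alloc_from_fs_path(size_t len)
  {
          return kmalloc(len, GFP_NOFS);          /* !__GFP_FS: no OOM kill */
  }

  static void *alloc_from_other_path(size_t len)
  {
          return kmalloc(len, GFP_KERNEL);        /* __GFP_FS: may OOM-kill */
  }

So the __GFP_FS callers could indeed end up waiting forever for progress
that only the !__GFP_FS side can make.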
Anyway, I could also try current linux.git; same question about mem= though.
What is your opinion about the older kernels (4.4, 4.5) working?
I think I've seen some OOM messages with the older kernels;
Jill was killed and I restarted the build to complete it.
A full bisect would take more than a day, and I don't think
I have the time for it.
Since I use dm-crypt + lvm, should we add more people to Cc, or do
you think it is an mm issue?
> > Below I'm pasting some log snippets, let me know if you like
> > it so much you want more of it ;-/ The total log is about 1.7MB.
>
> Yes, I'd like to browse it. Could you send it to me?
Did you get any additional insights from it?
Thanks,
Johannes