Date:	Wed, 7 Nov 2007 09:01:17 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Frank van Maarseveen <frankvm@...nkvm.com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)

On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> For quite some time I've been seeing occasional lockups spread over the 50
> different machines I maintain. Symptom: a page allocation failure with
> order:1, GFP_ATOMIC, while there seems to be plenty of memory (lots of
> free pages, almost no swap used), followed by a lockup (everything dead).
> I've collected all 12 crash cases which occurred in the last 10 weeks
> across the 50 machines (i.e. about one crash per machine every 41 weeks
> on average). The kernel messages are summarized to show the interesting
> part (IMO) they have in common. Over the years this has become crash
> cause #1 with stable kernels for me (fglrx doesn't count ;).
>
> One note: I suspect that reporting a GFP_ATOMIC allocation failure in a
> network driver via that same driver (netconsole) may not be the smartest
> thing to do, and it could be responsible for the lockup itself. However,
> the initial page allocation failure remains and I'm not sure how to
> address that problem.

It isn't unexpected. If an atomic allocation can't find enough free memory,
it kicks off kswapd to start freeing memory in the background. However, it
cannot wait for memory to become free (it's GFP_ATOMIC), so it has to return
failure. GFP_ATOMIC allocation paths are designed so that the kernel can
recover from this situation, and a subsequent allocation attempt should find
free memory.
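
To make that concrete, here is a minimal sketch of a typical GFP_ATOMIC
caller (a hypothetical driver receive path, not code from this report):
the allocation can return NULL at any time, and the caller is written to
drop the frame and carry on rather than wait for reclaim.

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical RX refill path: dev_alloc_skb() allocates with
 * GFP_ATOMIC, so it can fail whenever free pages of the needed
 * order are scarce. The driver must tolerate a NULL return. */
static int example_rx_refill(struct net_device *dev, unsigned int len)
{
        struct sk_buff *skb;

        skb = dev_alloc_skb(len + NET_IP_ALIGN);
        if (!skb) {
                /* Out of atomic memory right now: drop the frame and
                 * let a later refill succeed once kswapd has run. */
                return -ENOMEM;
        }

        skb_reserve(skb, NET_IP_ALIGN);
        /* ... map the buffer and hand it to the hardware ... */
        return 0;
}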

In production kernels we should probably default to reporting this only
when page reclaim is not making any progress.
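
Something along these lines (purely a hypothetical helper, not actual
mm/page_alloc.c code) is the idea: stay quiet while reclaim is still
making progress, and only complain when it is not.

#include <linux/types.h>
#include <linux/gfp.h>
#include <linux/kernel.h>

/* Hypothetical: warn about a failed atomic allocation only when
 * background reclaim freed nothing, since a transient shortfall
 * that is already being resolved is uninteresting in production. */
static void maybe_report_alloc_failure(gfp_t gfp_mask, int order,
                                       int did_some_progress)
{
        if ((gfp_mask & __GFP_NOWARN) || did_some_progress)
                return;

        printk(KERN_WARNING "page allocation failure. order:%d, mode:0x%x\n",
               order, gfp_mask);
}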


> I still think the issue is memory fragmentation, but if so, it looks
> a bit extreme to me: one system with 2GB of RAM crashed after a day,
> merely running a couple of TCP server programs. All systems have either
> 1 or 2GB of RAM and at least 1GB of (mostly unused) swap.

You can reduce the chances of it happening by increasing
/proc/sys/vm/min_free_kbytes.
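
If you want to script that, a minimal userspace sketch looks like the
following (the value 16384 is only an example; tune it to the machine):

#include <stdio.h>
#include <stdlib.h>

/* Raise /proc/sys/vm/min_free_kbytes so the VM keeps a larger pool
 * of free pages in reserve, giving GFP_ATOMIC callers more headroom.
 * Needs root; the value persists only until reboot. */
int main(void)
{
        FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "w");

        if (!f) {
                perror("min_free_kbytes");
                return EXIT_FAILURE;
        }
        fprintf(f, "16384\n");
        return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
}

The same effect can be had with sysctl or a plain echo, and made persistent
across reboots via vm.min_free_kbytes in /etc/sysctl.conf.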

