Message-Id: <200711081655.38371.nickpiggin@yahoo.com.au>
Date:	Thu, 8 Nov 2007 16:55:38 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Frank van Maarseveen <frankvm@...nkvm.com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)

On Thursday 08 November 2007 00:48, Frank van Maarseveen wrote:
> On Wed, Nov 07, 2007 at 09:01:17AM +1100, Nick Piggin wrote:
> > On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> > > For quite some time I've been seeing occasional lockups spread over
> > > the 50 machines I maintain. Symptom: a page allocation failure with
> > > order:1, GFP_ATOMIC, while there seems to be plenty of memory (lots
> > > of free pages, almost no swap used), followed by a lockup (everything
> > > dead). I've collected all (12) crash cases which occurred in the last
> > > 10 weeks on 50 machines total (i.e. 1 crash every 41 weeks per machine
> > > on average). The kernel messages are summarized to show the interesting
> > > part (IMO) they have in common. Over the years this has become the
> > > crash cause #1 for stable kernels for me (fglrx doesn't count ;).
> > >
> > > One note: I suspect that reporting a GFP_ATOMIC allocation failure in
> > > a network driver via that same driver (netconsole) may not be the
> > > smartest thing to do, and this could be responsible for the lockup
> > > itself. However, the initial page allocation failure remains and I'm
> > > not sure how to address that problem.
> >
> > It isn't unexpected. If an atomic allocation doesn't have enough memory,
> > it kicks off kswapd to start freeing memory for it. However, it cannot
> > wait for memory to become free (it's GFP_ATOMIC), so it has to return
> > failure. GFP_ATOMIC allocation paths are designed so that the kernel can
> > recover from this situation, and a subsequent allocation will have free
> > memory.
> >
> > Probably in production kernels we should default to only reporting this
> > when page reclaim is not making any progress.
> >
> > > I still think the issue is memory fragmentation but if so, it looks
> > > a bit extreme to me: One system with 2GB of ram crashed after a day,
> > > merely running a couple of TCP server programs. All systems have either
> > > 1 or 2GB ram and at least 1G of (merely unused) swap.
> >
> > You can reduce the chances of it happening by increasing
> > /proc/sys/vm/min_free_kbytes.
>
> It's 3807 everywhere by default here, which means roughly 950 pages if I
> understand correctly. However, the problem seems to occur with many more
> free pages than that. "grep '  free:' messages*" on the netconsole logging
> machine shows:

But it's an order-1 allocation, which may not be available due to
fragmentation. Although you might have large amounts of memory free
at a given point, the fragmentation can have been caused earlier, when
free memory got very low (because order-0 allocations may have taken up
all of the free order-1 pages).
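
To make the point concrete, here is a rough Python sketch that counts
how many free blocks could satisfy an order-1 request. It assumes the
usual /proc/buddyinfo layout (one line per zone, one column of free-block
counts per order); the SAMPLE line and the helper names are made up for
illustration and are not taken from any of the reported machines.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def total_free_pages(counts):
    """Total free pages: a free block of order n holds 2**n pages."""
    return sum(count << order for order, count in enumerate(counts))


def blocks_at_least(counts, order):
    """Free blocks large enough for a request of the given order.

    Larger blocks count too, because the allocator can split them.
    """
    return sum(counts[order:])


# Hypothetical zone: many free pages, but every one is an isolated
# order-0 page, so an order-1 request has nothing to use.
SAMPLE = "Node 0, zone   Normal   5000 0 0 0 0 0 0 0 0 0 0\n"

counts = parse_buddyinfo(SAMPLE)[(0, "Normal")]
print("free pages:", total_free_pages(counts))
print("blocks usable for order-1:", blocks_at_least(counts, 1))
```

On a live system the same functions could be fed the contents of
/proc/buddyinfo instead of SAMPLE.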

Increasing min_free_kbytes is known to help. Although you shouldn't
crash due to allocation failures... it would be nice if you could
connect a serial or VGA console and see what's happening...
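
For reference, a small Python sketch of the kB-to-pages arithmetic
behind the "roughly 950 pages" figure. The 4 KiB page size is an
assumption (it is the common default but depends on the architecture),
and the function names are only illustrative.

```python
def kbytes_to_pages(kbytes, page_size=4096):
    """Convert a size in kB to whole pages (page_size is assumed)."""
    return kbytes * 1024 // page_size


def read_min_free_kbytes(path="/proc/sys/vm/min_free_kbytes"):
    """Read the current setting; only meaningful on a Linux system."""
    with open(path) as f:
        return int(f.read().strip())


# The default quoted above: 3807 kB with 4 KiB pages.
print(kbytes_to_pages(3807))  # 951
```

The value can be raised by writing a larger number to the same file as
root; how much larger is a tuning decision this sketch does not try to
answer.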