linux-kernel - Re: Vanilla-Kernel 3 - page allocation failure

Message-ID: <4E9E71EC.6040701@profihost.ag>
Date:	Wed, 19 Oct 2011 08:45:00 +0200
From:	Philipp Herz - Profihost AG <p.herz@...fihost.ag>
To:	Thadeu Lima de Souza Cascardo <cascardo@...ux.vnet.ibm.com>
CC:	linux-kernel@...r.kernel.org
Subject: Re: Vanilla-Kernel 3 - page allocation failure

Hello Cascardo,

 > echo m>  /proc/sysrq-trigger
Thanks,
I have pasted another Call Trace including memory stats at

* http://pastebin.com/vjLHuqtk

Not sure if memory stats are close enough to the call trace event.

If not, do we have to recompile the kernel to get call traces and memory 
stats at the same time?

 > Does your workload work better on a previous version? I had problems
 > using something like 2.6.32.
Yes,
kernel version was 2.6.32.40 before and we never had those messages 
appearing.

Regards,
Philipp

On 18.10.2011 16:35, Thadeu Lima de Souza Cascardo wrote:
> On Tue, Oct 18, 2011 at 03:24:44PM +0200, Philipp Herz - Profihost AG wrote:
>> Hello Cascardo
>>
>>> Usually, after the stack dump, there are some
>>> statistics about memory.
>> Yes, I have seen this in other posts as well.
>>
>>> I have seen that these may be suppressed
>>> if you have a NUMA system with lots of nodes.
>> Yes, in our case it seems to be suppressed.
>>
>>> Check for NODES_SHIFT in your
>>> config. If it's greater than 8, that output may have been suppressed.
>> CONFIG_NODES_SHIFT=10 will be the answer.
>>
>> Is there any way to get those stats without recompiling the kernel?
>>
>>> But you may have just ignored the statistics because of the
>>> stack dump.
>> No, I was also wondering why others have these ;-)
>>
>> Regards,
>> Philipp
>>
>
> echo m>  /proc/sysrq-trigger
>
> will show you that same output, but not at the time the memory failure
> happens. It may still show you what is the condition of memory on your
> nodes.
>
> I am not that well versed in the VM. It just happens that I had very
> similar issues lately and was trying to understand it a little more. I
> still have to solve these issues myself.
>
> In my case, the workload is IO bound on extX filesystems and I see that
> other systems have these failures due to this memory pressure. Usually,
> after stopping the workload and unmounting the filesystems, I get most
> of the memory in the system freed.
>
> Most of the failures come from GFP_ATOMIC allocations, because those
> cannot reclaim memory, and they fail when the only free memory left is
> below the threshold. Setting this threshold to a lower value, as was
> suggested (min_free_kbytes), would have helped, but then it also allows
> whatever is putting pressure on your memory to allocate below the old
> threshold, and you end up in the same situation (or a worse
> one).
>
> Does your workload work better on a previous version? I had problems
> using something like 2.6.32.
>
> Regards,
> Cascardo.
>
>> On 18.10.2011 14:38, Thadeu Lima de Souza Cascardo wrote:
>>> On Tue, Oct 18, 2011 at 02:07:38PM +0200, Philipp Herz - Profihost AG wrote:
>>>> Hello Cascardo,
>>>>
>>>> thanks for your detailed answer!
>>>>
>>>> I have uploaded two call traces to pastebin for further investigation.
>>>>
>>>> Maybe this can help you.
>>>>
>>>> * http://pastebin.com/Psg2dGYC (kworker)
>>>> * http://pastebin.com/pPFjZqxL (php5)
>>>>
>>>> Regards,
>>>> Philipp
>>>>
>>>
>>> Hello, Philipp.
>>>
>>> That only tells us that you have a TCP workload in your system. This is
>>> the subsystem that is trying to allocate memory. However, we do not know
>>> why the allocation fails. Usually, after the stack dump, there are some
>>> statistics about memory. I have seen that these may be suppressed if you
>>> have a NUMA system with lots of nodes. Check for NODES_SHIFT in your
>>> config. If it's greater than 8, that output may have been suppressed.
>>> But you may have just ignored the statistics because of the stack dump.
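
The check described above can be sketched as a small script. This is only an illustration: the function names and the `limit=8` parameter are hypothetical, the threshold is taken from the message above, and the config text is assumed to come from wherever your distribution keeps it (commonly `/boot/config-<version>` or `/proc/config.gz`).

```python
import re

def nodes_shift(config_text):
    """Return the CONFIG_NODES_SHIFT value from kernel config text, or None."""
    match = re.search(r"^CONFIG_NODES_SHIFT=(\d+)$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def stats_may_be_suppressed(config_text, limit=8):
    """True if NODES_SHIFT exceeds the limit mentioned in the thread (8)."""
    shift = nodes_shift(config_text)
    return shift is not None and shift > limit

if __name__ == "__main__":
    sample = "CONFIG_NUMA=y\nCONFIG_NODES_SHIFT=10\n"
    shift = nodes_shift(sample)
    # NODES_SHIFT is a power-of-two exponent: the kernel supports up to
    # 2**shift NUMA nodes.
    print(shift, 1 << shift, stats_may_be_suppressed(sample))
```

With the value reported later in this thread (CONFIG_NODES_SHIFT=10), the check reports that the statistics may be suppressed.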
>>>
>>> Regards,
>>> Cascardo.
>>>
>>>>
>>>> On 18.10.2011 13:32, Thadeu Lima de Souza Cascardo wrote:
>>>>> On Tue, Oct 18, 2011 at 12:25:03PM +0200, Philipp Herz - Profihost AG wrote:
>>>>>> After updating kernel (x86_64) to stable version 3 there are a few
>>>>>> messages appearing in the kernel log such as
>>>>>>
>>>>>> kworker/0:1: page allocation failure: order:1, mode:0x20
>>>>>> mysql: page allocation failure: order:1, mode:0x20
>>>>>> php5: page allocation failure: order:1, mode:0x20
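
To count or group such messages, the three lines above can be picked apart with a regular expression. This is a minimal sketch: the pattern is inferred only from the lines quoted here, `parse_failure` is a hypothetical helper, and a timestamp prefix such as dmesg may add is not handled.

```python
import re

# Matches lines such as "php5: page allocation failure: order:1, mode:0x20".
# The pattern is an assumption based on the three log lines quoted above.
FAILURE_RE = re.compile(
    r"^(?P<task>.+?): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)

def parse_failure(line):
    """Return (task, order, mode) for a failure line, or None if it does not match."""
    match = FAILURE_RE.match(line.strip())
    if match is None:
        return None
    return (
        match.group("task"),
        int(match.group("order")),
        int(match.group("mode"), 16),
    )

if __name__ == "__main__":
    for line in (
        "kworker/0:1: page allocation failure: order:1, mode:0x20",
        "mysql: page allocation failure: order:1, mode:0x20",
    ):
        print(parse_failure(line))
```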
>>>>>>
>>>>>> Searching the net showed that these messages are known to occur since 2004.
>>>>>>
>>>>>> Some people were able to get rid of them by setting
>>>>>> /proc/sys/vm/min_free_kbytes to a high enough value. This does not
>>>>>> help in our case.
>>>>>>
>>>>>>
>>>>>> Is there a kernel command line argument to avoid these messages?
>>>>>>
>>>>>> According to mm/page_alloc.c, these messages are only warnings
>>>>>> and would not appear if 'gfp_mask' had __GFP_NOWARN set
>>>>>> in the function warn_alloc_failed.
>>>>>>
>>>>>> How does this mask get set? Is it set by the "external" process
>>>>>> knocking at the memory manager?
>>>>>>
>>>>>
>>>>> Hello, Philipp.
>>>>>
>>>>> This happens when the kernel tries to allocate memory, sometimes in
>>>>> response to a request from user space, but also in other contexts. For
>>>>> example, an interrupt handler in a network driver may try to allocate
>>>>> memory. In this context, it will use GFP_ATOMIC as the mask. The most
>>>>> common flags in the kernel are GFP_KERNEL and GFP_ATOMIC.
>>>>>
>>>>>> What is the magic behind the 'order' and 'mode'?
>>>>>>
>>>>>
>>>>> The order is the binary log of the number of pages requested. So, order 1
>>>>> allocations are 2 pages, order 4 would be 16 pages, for example.
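
The arithmetic above can be written out as follows. This is a sketch only: the helper names are hypothetical, and the 4096-byte page size is an assumption (the usual base page size on x86_64), not something stated in the thread.

```python
def pages_for_order(order):
    """Number of contiguous pages in an allocation of the given order (2**order)."""
    return 1 << order

def bytes_for_order(order, page_size=4096):
    """Size in bytes of an allocation of the given order.

    page_size=4096 is an assumption, not taken from the thread.
    """
    return pages_for_order(order) * page_size

if __name__ == "__main__":
    for order in (0, 1, 4):
        print(order, pages_for_order(order), bytes_for_order(order))
```

Under that assumption, the order:1 failures in the log are requests for two physically contiguous pages, i.e. 8 KiB.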
>>>>>
>>>>> The mode is, in fact, gfp_flags. 0x20 is GFP_ATOMIC. This kind of
>>>>> allocation cannot do IO or access the filesystem. Also, it cannot wait
>>>>> for memory to be reclaimed from cache.
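
A mode value can be decoded bit by bit, as in the sketch below. The bit values are an assumption recalled from include/linux/gfp.h of kernels around 2.6.32 to 3.0 and cover only a few flags; they differ in other kernel versions, so check them against your own tree. `decode_mode` is a hypothetical helper.

```python
# Assumed bit values for kernels of roughly the 2.6.32 to 3.0 era; verify
# against include/linux/gfp.h in your tree. Only a few flags are listed.
GFP_BITS = {
    0x10: "__GFP_WAIT",    # caller may sleep and wait for reclaim
    0x20: "__GFP_HIGH",    # high priority; GFP_ATOMIC in that era
    0x40: "__GFP_IO",      # may start physical IO
    0x80: "__GFP_FS",      # may call into the filesystem
    0x200: "__GFP_NOWARN", # do not print the allocation failure warning
}

def decode_mode(mode):
    """Return the names of known flags set in mode, plus any unknown bits as hex."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    known = sum(bit for bit in GFP_BITS if mode & bit)
    leftover = mode & ~known
    if leftover:
        names.append(hex(leftover))
    return names

if __name__ == "__main__":
    print(decode_mode(0x20))  # the mode from the log lines in this thread
    print(decode_mode(0xD0))  # GFP_KERNEL under the same assumed values
```

Under these assumed values, 0x20 has neither __GFP_WAIT, __GFP_IO nor __GFP_FS set, which matches the description above, and it lacks __GFP_NOWARN, which is why the warning is printed.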
>>>>>
>>>>> This warning is usually followed by some statistics about memory use
>>>>> in your system. Please post it to give more information about this
>>>>> situation.
>>>>>
>>>>> I have seen some of this happen when lots of cache is used by some
>>>>> filesystems. Perhaps some tweaking of the vm sysctl options may help,
>>>>> but I can't point to any magic tweak right now.
>>>>>
>>>>> Regards,
>>>>> Cascardo.
>>>>>
>>>>>> I'm not a subscriber, so please CC me a copy of messages related to
>>>>>> the subject. I'm not sure if I can help much by looking at the
>>>>>> inside of the kernel, but I will try my best to answer any questions
>>>>>> concerning this issue.
>>>>>>
>>>>>> Best regards, Philipp
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>>>> the body of a message to majordomo@...r.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>>>
>>>>
>>>
>>
>