[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53dd1e2c-f07e-ee5b-51a1-0ef8adb53926@linux.microsoft.com>
Date: Tue, 15 Sep 2020 08:48:08 -0700
From: Vijay Balakrishna <vijayb@...ux.microsoft.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Oleg Nesterov <oleg@...hat.com>,
Song Liu <songliubraving@...com>,
Andrea Arcangeli <aarcange@...hat.com>,
Pavel Tatashin <pasha.tatashin@...een.com>,
Allen Pais <apais@...rosoft.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after
memory hotplug as expected by khugepaged
On 9/15/2020 1:18 AM, Michal Hocko wrote:
> On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
>>
>>
>> On 9/14/2020 7:33 AM, Michal Hocko wrote:
>>> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>>>> When memory is hotplug added or removed the min_free_kbytes must be
>>>> recalculated based on what is expected by khugepaged. Currently
>>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>>> default set when THP enabled is lost. This leaves the system with small
>>>> min_free_kbytes which isn't suitable for systems especially with network
>>>> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
>>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>>> kills.
>>>
>>> Care to explain some more please? The whole point of increasing
>>> min_free_kbytes for THP is to get a larger free memory with a hope that
>>> huge pages will be more likely to appear. While this might help for
>>> other users that need a high order pages it is definitely not the
>>> primary reason behind it. Could you provide an example with some more
>>> data?
>>
>> Thanks Michal. I haven't looked into THP as part of my investigation, so I
>> cannot comment.
>>
>> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
>> during normal reboot/shutdown. This memory is hotplug hot-added as movable
>> type via systemd late service during start-of-day.
>>
>> In our stress test first we ran into HW WATCHDOG recovery, on enabling
>> kernel watchdog we started seeing soft lockup hung task notices, failure
>> symptons varied, where stack trace of hung tasks sometimes trying to
>> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
>> timeouts, OOM process kills etc., During investigation we reran stress test
>> without hotplug use case. Surprisingly this run didn't encounter the said
>> problems. This led to comparing what is different between the two runs,
>> while looking at various globals, studying hotplug code I uncovered the
>> issue of failing to restore min_free_kbytes. In particular on our 8GB SoC
>> min_free_kbytes went down to 8703 from 22528 after hotplug add.
>
> Did you try to increase min_free_kbytes manually after hot remove? Btw.
No, in our use case memory hot remove done during shutdown.
> I would consider oom killer invocation due to min_free_kbytes really
> weird behavior. If anything the higher value would cause more memory
> reclaim and potentially oom rather than smaller one.
Yes, we wondered about it too. One panic stack trace (after many OOM kills)
[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G
O 5.4.51-xxx #1
[330321.196900] Hardware name: Overlake (DT)
[330321.201038] Call trace:
[330321.203660] dump_backtrace+0x0/0x1d0
[330321.207533] show_stack+0x20/0x2c
[330321.211048] dump_stack+0xe8/0x150
[330321.214656] panic+0x18c/0x3b4
[330321.217901] out_of_memory+0x4c0/0x6e4
[330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722] alloc_pages_current+0x21c/0x2b0
[330321.231220] alloc_slab_page+0x1e0/0x7d8
[330321.235361] new_slab+0x2e8/0x2f8
[330321.238874] ___slab_alloc+0x45c/0x59c
[330321.242835] kmem_cache_alloc+0x2d4/0x360
[330321.247065] getname_flags+0x6c/0x2a8
[330321.250938] user_path_at_empty+0x3c/0x68
[330321.255168] do_readlinkat+0x7c/0x17c
[330321.259039] __arm64_sys_readlinkat+0x5c/0x70
[330321.263627] el0_svc_handler+0x1b8/0x32c
[330321.267767] el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!
Then while searching I came across documented warning below. In above
instance panic after OOM kills happened after 3+ days of stress run (a
mixure of ttcp, cpuloadgen and fio).
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
Warning
Extreme values can damage your system. Setting min_free_kbytes to an
extremely low value prevents the system from reclaiming memory, which
can result in system hangs and OOM-killing processes. However, setting
min_free_kbytes too high (for example, to 5–10% of total system memory)
causes the system to enter an out-of-memory state immediately, resulting
in the system spending too much time reclaiming memory.
Powered by blists - more mailing lists