Message-ID: <YCt+cVvWPbWvt2rG@dhcp22.suse.cz>
Date: Tue, 16 Feb 2021 09:12:33 +0100
From: Michal Hocko <mhocko@...e.com>
To: Eiichi Tsukata <eiichi.tsukata@...anix.com>
Cc: corbet@....net, mike.kravetz@...cle.com, mcgrof@...nel.org,
keescook@...omium.org, yzaikin@...gle.com,
akpm@...ux-foundation.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, felipe.franciosi@...anix.com
Subject: Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom
On Tue 16-02-21 03:07:13, Eiichi Tsukata wrote:
> Hugepages can be preallocated to avoid unpredictable allocation latency.
> If we run into 4k page shortage, the kernel can trigger OOM even though
> there were free hugepages. When OOM is triggered by user address page
> fault handler, we can use oom notifier to free hugepages in user space
> but if it's triggered by memory allocation for kernel, there is no way
> to synchronously handle it in user space.
Can you expand some more on what kind of problem you see?
Hugetlb pages are, by definition, a preallocated, unreclaimable and
admin controlled pool of pages. Under those conditions it is expected
and required that the sizing would be done very carefully. Why is that a
problem in your particular setup/scenario?
If the sizing really is done properly and a random process can then
trigger an OOM, this can lead to malfunctioning of the workloads which
depend on the hugetlb pool, right? So isn't this kind of a DoS
scenario?
> This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> enabled, it first tries to free a hugepage if available before invoking
> the oom-killer. The default value is disabled not to change the current
> behavior.
Why is this interface not hugepage size aware? It is quite different to
release a GB huge page or 2MB one. Or is it expected to release the
smallest one? To the implementation...
[...]
> +static int sacrifice_hugepage(void)
> +{
> + int ret;
> +
> + spin_lock(&hugetlb_lock);
> + ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
... no, it is going to release a page of the default huge page size. This
will be 2MB in most cases, but that is not a given.
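To illustrate the size point from user space: the kernel exposes one
pool per supported huge page size under /sys/kernel/mm/hugepages, in
directories named like hugepages-2048kB. A minimal sketch that
enumerates them follows. The helper names hugepage_dir_size_kb() and
list_hugepage_pools() are made up for this example and are not part of
the patch or of any kernel interface.

```c
#include <stdio.h>
#include <dirent.h>

/*
 * Parse a sysfs pool directory name such as "hugepages-2048kB".
 * Returns the page size in kB, or 0 if the name does not match.
 * (Hypothetical helper, for illustration only.)
 */
unsigned long hugepage_dir_size_kb(const char *name)
{
	unsigned long kb = 0;
	int consumed = 0;

	/* %n is only stored if the trailing "kB" literal matched too. */
	if (sscanf(name, "hugepages-%lukB%n", &kb, &consumed) != 1)
		return 0;
	if (consumed == 0 || name[consumed] != '\0')
		return 0;
	return kb;
}

/*
 * Print one line per hugetlb pool found under 'base'.
 * Returns the number of pools, or -1 if the directory cannot be opened.
 * (Hypothetical helper, for illustration only.)
 */
int list_hugepage_pools(const char *base)
{
	DIR *dir = opendir(base);
	struct dirent *de;
	int count = 0;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		unsigned long kb = hugepage_dir_size_kb(de->d_name);

		if (kb == 0)
			continue;	/* skips ".", ".." and anything else */
		printf("%lu kB pool: %s/%s\n", kb, base, de->d_name);
		count++;
	}
	closedir(dir);
	return count;
}
```

Calling list_hugepage_pools("/sys/kernel/mm/hugepages") on a machine
with both 2MB and 1GB pages configured would list two pools, which is
why a single on/off knob says nothing about which of them gets shrunk.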
Unless I am mistaken, this will also free up reserved hugetlb pages. This
would mean that a page fault would SIGBUS, which is very likely not
something we want to do, right? You also want to use the oom nodemask
rather than a full one.
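To make the reservation concern concrete: the free count of a pool
includes pages that are already promised to existing mappings, so only
free minus reserved pages can go away without breaking somebody's later
fault. A rough user-space sketch of that arithmetic, based on the
per-pool free_hugepages and resv_hugepages sysfs counters, follows. The
helper names are invented for the example and this is not what the
patch or the kernel code does.

```c
#include <stdio.h>

/*
 * Pages that are free and not promised to an existing reservation.
 * Clamped at 0 so an inconsistent snapshot cannot underflow.
 * (Hypothetical helper, for illustration only.)
 */
unsigned long unreserved_free_pages(unsigned long free_pages,
				    unsigned long resv_pages)
{
	return free_pages > resv_pages ? free_pages - resv_pages : 0;
}

/*
 * Read a single unsigned counter from a file such as
 * /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages.
 * Returns 0 on success, -1 on failure.
 * (Hypothetical helper, for illustration only.)
 */
int read_counter(const char *path, unsigned long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%lu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

With free_hugepages = 10 and resv_hugepages = 3, at most 7 pages could
be released safely. With both at 3, nothing can be released, even
though the pool looks like it has free pages.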
Overall, I am not really happy about this feature even when the above is
fixed, but let's hear more about the actual problem first.
--
Michal Hocko
SUSE Labs