Message-ID: <YCt+cVvWPbWvt2rG@dhcp22.suse.cz>
Date:   Tue, 16 Feb 2021 09:12:33 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Eiichi Tsukata <eiichi.tsukata@...anix.com>
Cc:     corbet@....net, mike.kravetz@...cle.com, mcgrof@...nel.org,
        keescook@...omium.org, yzaikin@...gle.com,
        akpm@...ux-foundation.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, felipe.franciosi@...anix.com
Subject: Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

On Tue 16-02-21 03:07:13, Eiichi Tsukata wrote:
> Hugepages can be preallocated to avoid unpredictable allocation latency.
> If we run into 4k page shortage, the kernel can trigger OOM even though
> there were free hugepages. When OOM is triggered by user address page
> fault handler, we can use oom notifier to free hugepages in user space
> but if it's triggered by memory allocation for kernel, there is no way
> to synchronously handle it in user space.

Can you expand some more on what kind of problem you see?
Hugetlb pages are, by definition, a preallocated, unreclaimable and
admin controlled pool of pages. Under those conditions it is expected
and required that the sizing would be done very carefully. Why is that a
problem in your particular setup/scenario?

If the sizing really is done properly and a random process can then
trigger OOM, this can lead to malfunctioning of the workloads which
depend on the hugetlb pool, right? So isn't this a kinda DoS scenario?

> This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> enabled, it first tries to free a hugepage if available before invoking
> the oom-killer. The default value is disabled not to change the current
> behavior.

Why is this interface not hugepage size aware? Releasing a GB huge page
is quite different from releasing a 2MB one. Or is it expected to
release the smallest one? To the implementation...

[...]
> +static int sacrifice_hugepage(void)
> +{
> +	int ret;
> +
> +	spin_lock(&hugetlb_lock);
> +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);

... no, it is going to release a page of the default huge page size.
This will be 2MB in most cases, but that is not a given.
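
To make the size question concrete: each configured huge page size has
its own pool, which sysfs exposes as a directory named
hugepages-<size>kB under /sys/kernel/mm/hugepages/. The following is
only a minimal userspace sketch of how such a name maps to a size. The
helper name parse_hugepage_dir is hypothetical and is not part of the
patch or of the kernel.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): parse a sysfs directory name
 * such as "hugepages-2048kB" into a size in kB.
 * Returns 0 on success, -1 if the name does not have that exact form.
 */
static int parse_hugepage_dir(const char *name, unsigned long *size_kb)
{
	unsigned long kb = 0;
	int consumed = 0;

	assert(name != NULL && size_kb != NULL);

	/* %n records how much input was consumed; it stays 0 if "kB" is missing. */
	if (sscanf(name, "hugepages-%lukB%n", &kb, &consumed) != 1)
		return -1;
	if (consumed == 0 || name[consumed] != '\0')
		return -1;

	*size_kb = kb;
	return 0;
}
```

A system with both 2MB and GB-sized pools shows one such directory per
size, so a knob that only ever touches the default pool ignores the
others.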

Unless I am mistaken, this will also free up reserved hugetlb pages.
That would mean a later page fault would SIGBUS, which is very likely
not something we want to do, right? You also want to use the oom
nodemask rather than a full one.
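
The two concerns above (respecting reservations and restricting the
search to the nodes the OOM is about) can be sketched with a toy
userspace model. This is not kernel code: struct toy_pool,
toy_sacrifice_page and the bitmask standing in for a nodemask are all
made up for illustration, and the real pool accounting is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 4

/* Toy stand-in for per-pool counters; NOT the kernel's struct hstate. */
struct toy_pool {
	unsigned long free_pages;	/* free huge pages in the pool */
	unsigned long resv_pages;	/* free pages already promised to mappings */
	unsigned long free_per_node[TOY_MAX_NODES];
};

/*
 * Release one free huge page, but only if it is not needed to back a
 * reservation and it sits on a node in allowed_nodes (bit n set means
 * node n is allowed). Returns true if a page was released.
 */
static bool toy_sacrifice_page(struct toy_pool *p, unsigned int allowed_nodes)
{
	assert(p != NULL);

	/* Every remaining free page is reserved: freeing one risks SIGBUS later. */
	if (p->free_pages <= p->resv_pages)
		return false;

	for (int node = 0; node < TOY_MAX_NODES; node++) {
		if (!(allowed_nodes & (1u << node)))
			continue;	/* node is outside the OOM's nodemask */
		if (p->free_per_node[node] == 0)
			continue;	/* nothing free on this node */
		p->free_per_node[node]--;
		p->free_pages--;
		return true;
	}
	return false;
}
```

With two free pages on node 0 and both reserved, the helper refuses to
free anything. With only one reserved, it frees exactly one page, and
only when node 0 is in the allowed mask.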

Overall, I am not really happy about this feature even when the above
is fixed, but let's hear more about the actual problem first.
-- 
Michal Hocko
SUSE Labs