linux-kernel - Re: [PATCH v4 0/4] Deterministic charging of shared memory

Message-ID: <CAHS8izM-DpKs1=-F04-8Qp55VFYC57zERpKkFBk4_QNsOmnOoQ@mail.gmail.com>
Date:   Fri, 19 Nov 2021 21:27:34 -0800
From:   Mina Almasry <almasrymina@...gle.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Jonathan Corbet <corbet@....net>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Hugh Dickins <hughd@...gle.com>, Shuah Khan <shuah@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Greg Thelen <gthelen@...gle.com>,
        Dave Chinner <david@...morbit.com>,
        Roman Gushchin <guro@...com>, "Theodore Ts'o" <tytso@....edu>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v4 0/4] Deterministic charging of shared memory

On Fri, Nov 19, 2021 at 9:01 PM Matthew Wilcox <willy@...radead.org> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. There are a number of considerations
> > in this case:
> >
> > 1. It's not great to kill the allocating process since the allocating process
> >    is not running in the memcg under oom, and killing it will not free memory
> >    in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> >    somehow. If not, the process will forever loop the pagefault in the upstream
> >    kernel.
> >
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause the process executing the remote
> > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > path.  This will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt-into the feature if they want.
> > - Opt-into the feature and accept the risk of receiving ENOSPC or SIGBUS and
> >   abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> >   operation without executing the remote charge if possible.
>
> Why is ENOSPC the right error instead of ENOMEM?

Returning ENOMEM from mem_cgroup_charge_mapping() will cause the
application to get ENOMEM from non-pagefault paths (which is perfectly
fine), and get stuck in a loop trying to resolve the pagefault in the
pagefault path (less fine). The logic is here:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1432

ENOMEM gets bubbled up here as VM_FAULT_OOM, and on remote charges the
behavior I see is that the kernel loops the pagefault forever until
memory is freed in the remote memcg, which may never happen.

ENOSPC gets bubbled up here as VM_FAULT_SIGBUS, which sends a
SIGBUS to the allocating process. The conjecture here is that it's
preferred to send a SIGBUS to the allocating process rather than have
it be stuck in a loop trying to resolve a pagefault.
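
To make the two outcomes concrete, here is a minimal userspace model of
the behavior described above. It is a sketch, not kernel code: the enum
names, charge_err_to_fault() and user_fault_action() are hypothetical
stand-ins, and the mapping (ENOMEM becomes an OOM fault result, any
other error becomes a SIGBUS fault result) is an assumption made for
illustration rather than a copy of the patch.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's fault result codes. */
enum fault_result {
    FAULT_OK = 0,
    FAULT_OOM,      /* models VM_FAULT_OOM */
    FAULT_SIGBUS    /* models VM_FAULT_SIGBUS */
};

/* What the faulting userspace task ends up experiencing. */
enum fault_action {
    ACTION_NONE,         /* fault resolved, task continues */
    ACTION_RETRY_FAULT,  /* task re-executes the access and faults again */
    ACTION_SEND_SIGBUS   /* task receives SIGBUS */
};

/*
 * Model of how a negative errno from the charge path is turned into a
 * fault result (assumed mapping: -ENOMEM -> OOM, anything else -> SIGBUS).
 */
enum fault_result charge_err_to_fault(int err)
{
    if (err == 0)
        return FAULT_OK;
    if (err == -ENOMEM)
        return FAULT_OOM;
    return FAULT_SIGBUS;
}

/*
 * Model of the user-mode fault path: an OOM result returns to userspace
 * and the access is retried; a SIGBUS result delivers a signal instead.
 */
enum fault_action user_fault_action(enum fault_result res)
{
    switch (res) {
    case FAULT_OOM:
        return ACTION_RETRY_FAULT;
    case FAULT_SIGBUS:
        return ACTION_SEND_SIGBUS;
    default:
        return ACTION_NONE;
    }
}
```

With this model, a charge failing with -ENOMEM leads to
ACTION_RETRY_FAULT on every attempt, which is the loop when nothing in
the remote memcg is ever freed, while -ENOSPC leads to
ACTION_SEND_SIGBUS exactly once.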