Message-Id: <201712091708.GHG60458.MHFOVSFOQtOFLJ@I-love.SAKURA.ne.jp>
Date:   Sat, 9 Dec 2017 17:08:42 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:     surenb@...gle.com
Cc:     mhocko@...nel.org, akpm@...ux-foundation.org, hannes@...xchg.org,
        hillf.zj@...baba-inc.com, minchan@...nel.org,
        mgorman@...hsingularity.net, ying.huang@...el.com,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        timmurray@...gle.com, tkjos@...gle.com
Subject: Re: [PATCH v2] mm: terminate shrink_slab loop if signal is pending

Suren Baghdasaryan wrote:
> On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
> <penguin-kernel@...ove.sakura.ne.jp> wrote:
> >> > >> This change checks for pending
> >> > >> fatal signals inside shrink_slab loop and if one is detected
> >> > >> terminates this loop early.
> >> > >
> >> > > This changelog doesn't really address my previous review feedback, I am
> >> > > afraid. You should mention more details about problems you are seeing
> >> > > and what causes them.
> 
> The problem I'm facing is that a SIGKILL sent from user space to kill
> the least important process is delayed enough for the OOM killer to get
> a chance to kill something else, possibly a more important process. Here
> "important" is from the user's point of view. So the delay in SIGKILL
> delivery effectively causes extra kills. Traces indicate that this
> delay happens when the process being killed is in direct reclaim, and
> shrinkers (before I fixed them) were the biggest cause of the delay.

So sending SIGKILL from userspace does not release memory fast enough to prevent
the OOM killer from being invoked? Yes, under memory pressure even the attempt to
send SIGKILL from userspace could be delayed, due to e.g. a page fault.

Unless it is a memcg OOM, you could try an OOM notifier callback which checks
whether any process has SIGKILL pending and, if so, waits for a timeout before
allowing the OOM killer to proceed. This situation resembles the timeout-based
OOM-killing discussion, where the OOM killer is enabled again after some timeout
from the previous OOM kill (based on the assumption that the OOM victim got stuck
due to an OOM lockup). But since we did not merge the timeout-based approach,
there is no timestamp field recording when the SIGKILL was delivered, so an
attempt to wait for a timeout would become messy.
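For illustration, a minimal sketch of such a notifier (the function and module
names are made up, and reporting a fake "freed" page only defers the OOM killer
for one out_of_memory() attempt; the timeout problem above would still need
solving):

        #include <linux/module.h>
        #include <linux/oom.h>
        #include <linux/rcupdate.h>
        #include <linux/sched/signal.h>

        /* Ask out_of_memory() to back off while a SIGKILLed process exists. */
        static int sigkill_pending_oom_notify(struct notifier_block *nb,
                                              unsigned long unused, void *parm)
        {
                unsigned long *freed = parm;
                struct task_struct *p;

                rcu_read_lock();
                for_each_process(p) {
                        if (fatal_signal_pending(p)) {
                                /*
                                 * Nonzero "freed" makes out_of_memory() return
                                 * without killing anything, giving the already
                                 * killed process a chance to exit and release
                                 * its memory.
                                 */
                                *freed += 1;
                                break;
                        }
                }
                rcu_read_unlock();
                return NOTIFY_OK;
        }

        static struct notifier_block sigkill_pending_oom_nb = {
                .notifier_call = sigkill_pending_oom_notify,
        };

        static int __init sigkill_oom_defer_init(void)
        {
                return register_oom_notifier(&sigkill_pending_oom_nb);
        }
        module_init(sigkill_oom_defer_init);
        MODULE_LICENSE("GPL");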

Regardless of whether you try an OOM notifier callback, I think that adding
__GFP_KILLABLE and allowing the allocator to bail out without calling
out_of_memory() and warn_alloc() would help killed processes terminate faster.
The majority of memory allocation requests can give up upon SIGKILL. Important
allocation requests which must not give up upon SIGKILL (e.g. those committing
filesystem metadata to storage) can be offloaded to kernel threads.

> 
> >> > > If we have a shrinker which takes considerable
> >> > > amount of time them we should be addressing that. If that is not
> >> > > possible then it should be documented at least.
> 
> I already submitted patches for a couple of shrinkers. The problem became
> less pronounced and less frequent, but the retry loop Tetsuo mentioned
> still visibly delays the delivery. The worst case I've seen after fixing
> the shrinkers is 43ms.

You meant "delays the termination (of the killed process)" rather than
"delays the delivery (of SIGKILL)", didn't you?

> > I agree that making waits/loops killable is generally good. But be sure to be
> > prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
> > basis (i.e. no guarantee that the allocating thread will leave the page allocator
> > slowpath immediately) and check for fatal_signal_pending() only if
> > __GFP_KILLABLE is set. That is,
> >
> > +               /*
> > +                * We are about to die and free our memory.
> > +                * Stop shrinking which might delay signal handling.
> > +                */
> > +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
> > +                       break;
> >
> > at shrink_slab() etc. and
> >
> > +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
> > +                       goto nopage;
> >
> > at __alloc_pages_slowpath().
> 
> I was thinking about something similar and will experiment to see if
> this solves the problem and if it has any side effects. Anyone sees
> any obvious problems with this approach?
> 

It is Michal who thinks that "killability for a particular allocation request sounds
like a hack" ( http://lkml.kernel.org/r/201606112335.HBG09891.OLFJOFtVMOQHSF@I-love.SAKURA.ne.jp ).
I'm willing to give up memory allocations from functions which are called by
system calls if SIGKILL is pending. Thus, I think it is time to try __GFP_KILLABLE.
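
For concreteness, a minimal sketch of the flag side (note that __GFP_KILLABLE
does not exist in mainline; the bit value below is arbitrary and only for
illustration, and a real patch would have to fit it into the existing ___GFP_*
namespace):

        /* Hypothetical flag -- not in mainline; bit value is made up. */
        #define ___GFP_KILLABLE         0x2000000u
        #define __GFP_KILLABLE          ((__force gfp_t)___GFP_KILLABLE)

        /*
         * A syscall-path allocation that is willing to give up once the
         * caller has been SIGKILLed; the checks quoted above in
         * shrink_slab() and __alloc_pages_slowpath() would honour it.
         */
        buf = kmalloc(size, GFP_KERNEL | __GFP_KILLABLE);
        if (!buf)
                return -ENOMEM;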
