linux-kernel - Re: [RFC -v2] panic_on_oom

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <201506200354.ABC87533.OFFMtSLOFHJVQO@I-love.SAKURA.ne.jp>
Date:	Sat, 20 Jun 2015 03:54:25 +0900
From:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:	mhocko@...e.cz
Cc:	linux-mm@...ck.org, rientjes@...gle.com, hannes@...xchg.org,
	tj@...nel.org, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC -v2] panic_on_oom_timeout

Michal Hocko wrote:
> Yes I was thinking about this as well because the primary assumption
> of the OOM killer is that the victim will release some memory. And it
> doesn't matter whether the OOM killer was constrained or the global
> one. So the above looks good at first sight, I am just afraid it is too
> relaxed for cases where many tasks are sharing mm.

Excuse me for again exceeding the scope of your patch (only trying to handle
sysctl_panic_on_oom == 1 case), but I think my patch (trying to also handle
sysctl_panic_on_oom == 0 case) will not be too relaxed for cases where many
tasks are sharing mm, for my approach can set task_struct->memdie_start
to jiffies for all tasks sharing mm.

> The primary problem that we have is that we do not have any reliable
> under_oom() check and we simply try to approximate it by heuristics
> which work well enough in most cases. I admit that oom_victims is not
> the ideal one and there might be better. As mentioned above we can check
> watermarks on all zones and cancel the timer if at least one allows for
> an allocation. But there are surely downsides of that approach as well
> because the OOM killer could have been triggered for a higher order
> allocation and we might be still under OOM for those requests.

In my approach, threads calling out_of_memory() is considered as somewhat
reliable under_oom() check. I think this is more reliable than checking
watermarks because the caller of out_of_memory() declared that watermark
is still low for that caller.

> > We need to keep track of per global OOM victim's timeout (e.g. "struct
> > task_struct"->memdie_start ) ?
> 
> I do not think this will help anything. It will just lead to a different
> set of corner cases. E.g.
> 
> 1) mark_oom_victim(T1) memdie_start = jiffies
> 2) fatal_signal_pending(T2) memdie_start = jiffies + delta
> 3) T2 releases memory - No OOM anymore
> 4) out_of_memory - check_memdie_timeout(T1) - KABOOM

Two possible corner cases for my approach are shown below.

One case is that the system can not panic of threads are unable to call
out_of_memory() for some reason.

The other case is that the system will panic if a sequence shown below
occurred.

  (1) First round of OOM state begins.
  (2) Somebody calls out_of_memory().
  (3) OOM victim threads assign jiffies to their memdie_start.
  (4) Somebody else releases memory before timeout expires.
  (5) There comes a moment where nobody needs to call out_of_memory()
      because watermark is no longer low.
  (6) First round of OOM state ends.
  (7) Some of OOM victim threads remain stuck for some reason.
  (8) There comes a moment where somebody needs to call out_of_memory()
      because watermark is again low.
  (9) Second round of OOM state begins.
  (10) Somebody calls out_of_memory().
  (11) The caller of out_of_memory() finds memdie_start which was
       assigned by first round of OOM state. But since the caller of
       out_of_memory() cannot tell whether memdie_start is assigned by
       first round of OOM state or not, the caller will use memdie_start
       as if assigned by second round of OOM state.
  (12) The timeout comes earlier than it should be.

If (7) does not last till (12), we will not hit this case.

If we can distinguish round number of OOM state (e.g. srcu_read_lock() at
memory allocation entry and srcu_read_unlock() at memory allocation return,
while synchronize_srcu() from kernel thread for OOM-killer), we will not hit
this case because (11) will not occur. Well, maybe just comparing
current->oom_start (assigned before calling out_of_memory() for the first
time of this memory allocation request) and victim->memdie_start can do it?

  if (time_after(current->oom_start, victim->memdie_start)) {
    if (time_after(jiffies, current->oom_start + timeout))
      panic();
  } else {
    if (time_after(jiffies, victim->memdie_start + timeout))
      panic();
  }

Well, if without analysis purpose,

  if (time_after(jiffies, oom_start + sysctl_panic_on_oom_timeout * HZ))
    panic();

(that is, pass the jiffies as of calling out_of_memory() for the first time
of this memory allocation request as an argument to out_of_memory(), and
compare at check_panic_on_oom()) is sufficient? Very simple implementation
because we do not use mod_timer()/del_timer().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/