Message-ID: <57be50b2-a97a-e559-e4bd-10d923895f83@i-love.sakura.ne.jp>
Date: Sat, 14 Sep 2019 15:15:22 +0900
From: Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
To: Michal Hocko <mhocko@...nel.org>
Cc: linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Petr Mladek <pmladek@...e.com>,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH (resend)] mm,oom: Defer dump_tasks() output.
On 2019/09/10 20:00, Tetsuo Handa wrote:
> On 2019/09/09 22:04, Michal Hocko wrote:
>> On Mon 09-09-19 21:40:24, Tetsuo Handa wrote:
>>> On 2019/09/09 20:36, Michal Hocko wrote:
>>>> This is not an improvement. It detaches the oom report from the
>>>> dump_tasks() output by an arbitrary amount of time, because the worker
>>>> context might be stalled for an arbitrarily long time, even long after
>>>> the oom is resolved.
>>>
>>> A new worker thread is created if all existing worker threads are busy,
>>> and even when a new worker thread cannot be created due to the OOM
>>> situation, this patch solves the OOM situation quickly so that the new
>>> worker thread can be created.
>>>
>>> Also, if a worker thread cannot run due to CPU starvation, the same
>>> applies to a synchronous dump_tasks(): it cannot complete due to CPU
>>> starvation either, which results in more costly and serious consequences.
>>> Being able to send SIGKILL and reclaim memory as soon as possible is
>>> an improvement.
>>
>> There might be a zillion workers waiting to make forward progress and
>> you cannot expect any timing guarantees here. Just remember your own
>> experiments with xfs and low memory conditions.
>
> Even if there were a zillion workers waiting to make forward progress, the
> worker processing the dump_tasks() output can still make forward progress.
> That's how workqueue works. (If you still don't trust workqueue, I can update
> my patch to use a kernel thread.) And if there were a zillion workers waiting
> to make forward progress, completing the OOM killer quickly would be more
> important than keeping a zillion workers blocked waiting for the OOM killer
> to solve the OOM situation. Preempting a thread calling out_of_memory() with
> a zillion workers is a nightmare. ;-)
>
>>
>>>> Not to mention that 1:1 (oom to tasks) information dumping is
>>>> fundamentally broken. Any task might be on the oom list of different
>>>> OOM contexts in different oom scopes (think of OOMs happening in
>>>> disjoint NUMA sets).
>>>
>>> I can't understand what you are talking about. This patch just defers the
>>> printk() output produced when /proc/sys/vm/oom_dump_tasks != 0. Please
>>> look at the patch carefully. If you are saying that it is bad that OOM
>>> victim candidates for OOM domains B, C, D ... cannot be printed while
>>> printing of OOM victim candidates for OOM domain A has not finished, I can
>>> update this patch to print them.
>>
>> You would have to track each ongoing oom context separately.
>
> I can update my patch to track each OOM context separately. But
>
>> And not only those from different oom scopes, because as a matter of fact
>> a new OOM might trigger before the previous dump_tasks output has been
>> handled.
>
> please be aware that we are already dropping OOM messages from different
> scopes due to __ratelimit(&oom_rs). Given that __ratelimit(&oom_rs) already
> drops messages, the only difference is _which_ OOM messages will be dropped
> when a cluster of OOM events from multiple different scopes happens.
>
> And "OOM events from multiple different scopes can trivially happen" is a
> violation for commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer
> locking") saying
>
> However, the OOM killer is a fairly cold error path, there is really no
> reason to optimize for highly performant and concurrent OOM kills.
>
> where we would need a "per OOM scope" locking mechanism in order to avoid
> deferring the OOM killer event in the current thread's OOM scope while
> processing OOM killer events in other threads' OOM scopes.
>
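As a side note, oom_rs is a single global ratelimit state, so OOM reports
from every scope draw from the same burst budget. Below is a minimal sketch
of the pattern around dump_header() (the wrapper helper is illustrative;
mm/oom_kill.c open-codes the check):

  #include <linux/ratelimit.h>

  /* One global state with the default interval/burst, shared by all scopes. */
  static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                DEFAULT_RATELIMIT_BURST);

  static void maybe_dump_header(struct oom_control *oc, struct task_struct *p)
  {
          /* __ratelimit() returns nonzero while the burst budget remains. */
          if (__ratelimit(&oom_rs))
                  dump_header(oc, p);
          /* otherwise the whole report for this OOM event is dropped */
  }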
Here is a delta patch.
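The core of the delta is that add_candidate_task() now uses the task's own
list node as a "still queued from a previous round" marker: copy_process()
initializes the node, list_empty(&oti->list) tests membership, and
dump_candidate_tasks() switches from list_del() to list_del_init() so that
the node becomes reusable. Below is a standalone sketch of that idiom, with
plain C stand-ins for the <linux/list.h> helpers (the kernel code of course
uses the real ones):

  #include <stdio.h>

  struct list_head { struct list_head *next, *prev; };

  static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
  static int list_empty(const struct list_head *h) { return h->next == h; }

  static void list_add_tail(struct list_head *n, struct list_head *h)
  {
          n->prev = h->prev; n->next = h;
          h->prev->next = n; h->prev = n;
  }

  /* Unlike list_del(), re-initialize the node so list_empty() keeps working. */
  static void list_del_init(struct list_head *n)
  {
          n->prev->next = n->next;
          n->next->prev = n->prev;
          INIT_LIST_HEAD(n);
  }

  int main(void)
  {
          struct list_head queue, node;

          INIT_LIST_HEAD(&queue);
          INIT_LIST_HEAD(&node);          /* what copy_process() now does */

          list_add_tail(&node, &queue);   /* add_candidate_task() */
          printf("queued: %d\n", !list_empty(&node)); /* 1: still pending */

          list_del_init(&node);           /* dump_candidate_tasks() */
          printf("queued: %d\n", !list_empty(&node)); /* 0: can be reused */
          return 0;
  }

This is also why add_candidate_task() now returns 1 (aborting the whole
round) when it finds a node that is still queued: that task's oom_task_info
still holds the snapshot from the previous OOM event.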
From d34ef26275635d14c746ed46e5478c1dc0319ca2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Date: Sat, 14 Sep 2019 15:11:02 +0900
Subject: [PATCH] mm,oom: Don't suppress dump_tasks() output from different OOM
scopes.
Michal Hocko is complaining that "mm,oom: Defer dump_tasks() output."
needlessly suppresses output from concurrent OOM killer invocations in
different OOM scopes. This patch changes dump_tasks() not to suppress
such output.
---
kernel/fork.c | 1 +
mm/oom_kill.c | 64 ++++++++++++++++++++++++++++++++++++++---------------------
2 files changed, 42 insertions(+), 23 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index f601168..c86a126d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1854,6 +1854,7 @@ static __latent_entropy struct task_struct *copy_process(
p = dup_task_struct(current, node);
if (!p)
goto fork_out;
+ INIT_LIST_HEAD(&p->oom_task_info.list);
/*
* This _must_ happen before we call free_task(), i.e. before we jump
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index db62f50..7f57cea 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -379,6 +379,7 @@ static void select_bad_process(struct oom_control *oc)
static unsigned int oom_killer_seq; /* Serialized by oom_lock. */
static LIST_HEAD(oom_candidate_list); /* Serialized by oom_lock. */
+static LIST_HEAD(oom_tmp_candidate_list); /* Serialized by oom_lock. */
static int add_candidate_task(struct task_struct *p, void *arg)
{
@@ -395,8 +396,13 @@ static int add_candidate_task(struct task_struct *p, void *arg)
p = find_lock_task_mm(p);
if (!p)
return 0;
- get_task_struct(p);
oti = &p->oom_task_info;
+	/* p might still be waiting for dump_candidate_tasks(). */
+ if (!list_empty(&oti->list)) {
+ task_unlock(p);
+ return 1;
+ }
+ get_task_struct(p);
mm = p->mm;
oti->seq = oom_killer_seq;
memcpy(oti->comm, p->comm, sizeof(oti->comm));
@@ -409,14 +415,14 @@ static int add_candidate_task(struct task_struct *p, void *arg)
oti->swapents = get_mm_counter(mm, MM_SWAPENTS);
oti->score_adj = p->signal->oom_score_adj;
task_unlock(p);
- list_add_tail(&oti->list, &oom_candidate_list);
+ list_add_tail(&oti->list, &oom_tmp_candidate_list);
return 0;
}
static void dump_candidate_tasks(struct work_struct *work)
{
bool first = true;
- unsigned int seq;
+ unsigned int seq = 0;
struct oom_task_info *oti;
if (work) /* Serialize only if asynchronous. */
@@ -424,7 +430,10 @@ static void dump_candidate_tasks(struct work_struct *work)
while (!list_empty(&oom_candidate_list)) {
oti = list_first_entry(&oom_candidate_list,
struct oom_task_info, list);
- seq = oti->seq;
+ if (seq != oti->seq) {
+ seq = oti->seq;
+ first = true;
+ }
if (first) {
pr_info("OOM[%u]: Tasks state (memory values in pages):\n",
seq);
@@ -436,7 +445,7 @@ static void dump_candidate_tasks(struct work_struct *work)
seq, oti->pid, oti->uid, oti->tgid, oti->total_vm,
oti->mm_rss, oti->pgtables_bytes, oti->swapents,
oti->score_adj, oti->comm);
- list_del(&oti->list);
+ list_del_init(&oti->list);
if (work)
mutex_unlock(&oom_lock);
put_task_struct(container_of(oti, struct task_struct,
@@ -464,32 +473,41 @@ static void dump_candidate_tasks(struct work_struct *work)
static void dump_tasks(struct oom_control *oc)
{
struct task_struct *p;
+ int ret = 0;
/*
- * Suppress as long as there is any OOM victim candidate from past
- * rounds of OOM killer invocations. We could change this to suppress
- * only if there is an OOM victim candidate in the same OOM domain if
- * we want to see OOM victim candidates from different OOM domains.
- * But since dump_header() is already ratelimited, I don't know whether
- * it makes difference to suppress OOM victim candidates from different
- * OOM domains...
+	 * Suppress output if OOM victim candidates in the same OOM scope from
+	 * past OOM killer invocations are still waiting for
+	 * dump_candidate_tasks(), because the OOM reaper or exit_mmap() might
+	 * set MMF_OOM_SKIP before dump_candidate_tasks() completes. Otherwise,
+	 * call dump_candidate_tasks() after SIGKILL is sent to OOM victims and
+	 * the OOM reaper has started reclaiming.
*/
- if (!list_empty(&oom_candidate_list))
- return;
if (is_memcg_oom(oc))
- mem_cgroup_scan_tasks(oc->memcg, add_candidate_task, oc);
+ ret = mem_cgroup_scan_tasks(oc->memcg, add_candidate_task, oc);
else {
rcu_read_lock();
- for_each_process(p)
- add_candidate_task(p, oc);
+ for_each_process(p) {
+ ret = add_candidate_task(p, oc);
+ if (ret)
+ break;
+ }
rcu_read_unlock();
}
- /*
- * Report OOM victim candidates after SIGKILL is sent to OOM victims
- * and the OOM reaper started reclaiming.
- */
- if (!list_empty(&oom_candidate_list))
- queue_work(system_long_wq, &oom_dump_candidates_work);
+ if (ret) {
+ while (!list_empty(&oom_tmp_candidate_list)) {
+ struct oom_task_info *oti =
+ list_first_entry(&oom_tmp_candidate_list,
+ struct oom_task_info, list);
+
+ list_del_init(&oti->list);
+ put_task_struct(container_of(oti, struct task_struct,
+ oom_task_info));
+ }
+ return;
+ }
+ list_splice_tail_init(&oom_tmp_candidate_list, &oom_candidate_list);
+ queue_work(system_unbound_wq, &oom_dump_candidates_work);
}
static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
--
1.8.3.1
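Note that the work item is also moved from system_long_wq to
system_unbound_wq, so the deferred dump is not subject to per-CPU
concurrency management and can run on any CPU. The deferral itself is the
usual workqueue pattern; a minimal sketch using the names from the patch
(the wrapper function is illustrative):

  #include <linux/workqueue.h>

  static void dump_candidate_tasks(struct work_struct *work);
  static DECLARE_WORK(oom_dump_candidates_work, dump_candidate_tasks);

  static void oom_defer_dump(void)
  {
          /*
           * queue_work() returns false and does nothing if the work item
           * is already pending, so concurrent OOM events do not queue
           * multiple executions; dump_candidate_tasks() simply drains
           * whatever is on oom_candidate_list when it runs.
           */
          queue_work(system_unbound_wq, &oom_dump_candidates_work);
  }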