linux-kernel - [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY

Message-Id: <20220512044634.63586-1-ligang.bdlg@bytedance.com>
Date:   Thu, 12 May 2022 12:46:29 +0800
From:   Gang Li <ligang.bdlg@...edance.com>
To:     akpm@...ux-foundation.org
Cc:     songmuchun@...edance.com, hca@...ux.ibm.com, gor@...ux.ibm.com,
        agordeev@...ux.ibm.com, borntraeger@...ux.ibm.com,
        svens@...ux.ibm.com, ebiederm@...ssion.com, keescook@...omium.org,
        viro@...iv.linux.org.uk, rostedt@...dmis.org, mingo@...hat.com,
        peterz@...radead.org, acme@...nel.org, mark.rutland@....com,
        alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
        namhyung@...nel.org, david@...hat.com, imbrenda@...ux.ibm.com,
        apopple@...dia.com, adobriyan@...il.com,
        stephen.s.brennan@...cle.com, ohoono.kwon@...sung.com,
        haolee.swjtu@...il.com, kaleshsingh@...gle.com,
        zhengqi.arch@...edance.com, peterx@...hat.com, shy828301@...il.com,
        surenb@...gle.com, ccross@...gle.com, vincent.whitchurch@...s.com,
        tglx@...utronix.de, bigeasy@...utronix.de, fenghua.yu@...el.com,
        linux-s390@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        linux-perf-users@...r.kernel.org,
        Gang Li <ligang.bdlg@...edance.com>
Subject: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY

TLDR:
If a mempolicy is in effect (oc->constraint == CONSTRAINT_MEMORY_POLICY), out_of_memory() will
select a victim to kill on the specific node, so that the kernel can avoid accidental kills on
NUMA systems.

Problem:
Before this patch series, the OOM killer only kills the process with the highest memory usage,
by selecting the process with the highest oom_badness on the entire system.

This works fine on UMA systems, but may cause accidental kills on NUMA systems.

As shown below, if process c.out is bound to Node1 and keeps allocating pages from Node1,
a.out will be killed first. But killing a.out doesn't free any memory on Node1, so c.out
will be killed next.

A lot of our AMD machines have 8 NUMA nodes. On these systems, there is a greater chance
of triggering this problem.

OOM before patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1      Total
----------- ---------- ------------- ----------
3095 a.out     3073.34          0.11    3073.45 (killed first: maximum memory consumption)
3199 b.out      501.35       1500.00    2001.35
3805 c.out        1.52 (grow)2248.00    2249.52 (killed next: Node1 is full)
----------- ---------- ------------- ----------
Total          3576.21       3748.11    7324.31
```
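
The system-wide selection described above can be illustrated with a small userspace
sketch. This is not code from the series: `struct proc_usage`, `total_mb()` and
`select_victim_global()` are hypothetical names, and summed per-node usage in MB is
only a stand-in for oom_badness, whose real calculation involves more than resident
memory. The numbers are copied from the table above.

```c
#include <stddef.h>

#define NR_NODES 2  /* two nodes, as in the table above */

/* Hypothetical per-process record for illustration; not a kernel structure. */
struct proc_usage {
    int pid;
    const char *comm;
    double node_mb[NR_NODES];  /* resident memory per node, in MB */
};

/* Values copied from the "OOM before patches" table. */
static const struct proc_usage procs[] = {
    { 3095, "a.out", { 3073.34,    0.11 } },
    { 3199, "b.out", {  501.35, 1500.00 } },
    { 3805, "c.out", {    1.52, 2248.00 } },
};

/* Sum usage over all nodes: a stand-in for a system-wide badness score. */
static double total_mb(const struct proc_usage *p)
{
    double sum = 0.0;

    for (int n = 0; n < NR_NODES; n++)
        sum += p->node_mb[n];
    return sum;
}

/* Return the process with the largest system-wide usage, or NULL if count is 0. */
static const struct proc_usage *
select_victim_global(const struct proc_usage *tbl, size_t count)
{
    const struct proc_usage *victim = NULL;

    for (size_t i = 0; i < count; i++)
        if (!victim || total_mb(&tbl[i]) > total_mb(victim))
            victim = &tbl[i];
    return victim;
}
```

With the table's numbers, `select_victim_global(procs, 3)` returns the entry for
a.out, which holds only 0.11 MB on Node1, so killing it does not relieve Node1.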

Solution:
We store per-node rss in mm_rss_stat for each process.

If a page allocation with a mempolicy in effect (oc->constraint == CONSTRAINT_MEMORY_POLICY)
triggers an OOM, we calculate oom_badness using the rss counter for the corresponding node,
then select the process with the highest oom_badness on that node to kill.
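
The selection rule can be sketched in userspace C as follows. This is an illustration
under simplifying assumptions, not the code in the patches: `struct proc_rss`,
`ALL_NODES`, `badness_mb()` and `select_victim()` are hypothetical names, per-node
usage in MB stands in for the per-node rss counters, and the node is passed in
directly rather than derived from the allocation context.

```c
#include <stddef.h>

#define NR_NODES  2
#define ALL_NODES (-1)  /* hypothetical marker: score across every node */

/* Hypothetical per-process record; the series keeps per-node rss per mm. */
struct proc_rss {
    int pid;
    double node_mb[NR_NODES];  /* resident memory per node, in MB */
};

/* Values copied from the per-node usage tables in this cover letter. */
static const struct proc_rss procs[] = {
    { 3095, { 3073.34,    0.11 } },  /* a.out */
    { 3199, {  501.35, 1500.00 } },  /* b.out */
    { 3805, {    1.52, 2248.00 } },  /* c.out */
};

/* Score one process: usage on a single node, or summed over all nodes. */
static double badness_mb(const struct proc_rss *p, int node)
{
    double sum = 0.0;

    if (node != ALL_NODES)
        return p->node_mb[node];
    for (int n = 0; n < NR_NODES; n++)
        sum += p->node_mb[n];
    return sum;
}

/* Return the highest-scoring process for the given node, or NULL if count is 0. */
static const struct proc_rss *
select_victim(const struct proc_rss *tbl, size_t count, int node)
{
    const struct proc_rss *victim = NULL;

    for (size_t i = 0; i < count; i++)
        if (!victim || badness_mb(&tbl[i], node) > badness_mb(victim, node))
            victim = &tbl[i];
    return victim;
}
```

In this sketch, `select_victim(procs, 3, 1)` returns PID 3805 (c.out), while
`select_victim(procs, 3, ALL_NODES)` returns PID 3095 (a.out).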

OOM after patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1     Total
----------- ---------- ------------- ----------
3095 a.out     3073.34          0.11    3073.45
3199 b.out      501.35       1500.00    2001.35
3805 c.out        1.52 (grow)2248.00    2249.52(killed)
----------- ---------- ------------- ----------
Total          3576.21       3748.11    7324.31
```

Gang Li (5):
  mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
  mm: add numa_count field for rss_stat
  mm: add numa fields for tracepoint rss_stat
  mm: enable per numa node rss_stat count
  mm, oom: enable per numa node oom for CONSTRAINT_MEMORY_POLICY

 arch/s390/mm/pgtable.c        |   4 +-
 fs/exec.c                     |   2 +-
 fs/proc/base.c                |   6 +-
 fs/proc/task_mmu.c            |  14 ++--
 include/linux/mm.h            |  59 ++++++++++++-----
 include/linux/mm_types_task.h |  16 +++++
 include/linux/oom.h           |   2 +-
 include/trace/events/kmem.h   |  27 ++++++--
 kernel/events/uprobes.c       |   6 +-
 kernel/fork.c                 |  70 +++++++++++++++++++-
 mm/huge_memory.c              |  13 ++--
 mm/khugepaged.c               |   4 +-
 mm/ksm.c                      |   2 +-
 mm/madvise.c                  |   2 +-
 mm/memory.c                   | 116 ++++++++++++++++++++++++----------
 mm/migrate.c                  |   2 +
 mm/migrate_device.c           |   2 +-
 mm/oom_kill.c                 |  59 ++++++++++++-----
 mm/rmap.c                     |  16 ++---
 mm/swapfile.c                 |   4 +-
 mm/userfaultfd.c              |   2 +-
 21 files changed, 317 insertions(+), 111 deletions(-)

-- 
2.20.1