[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <0e27425e-1fb6-bc7c-9845-71dc805897c3@bytedance.com>
Date: Wed, 15 Jun 2022 18:13:12 +0800
From: Gang Li <ligang.bdlg@...edance.com>
To: Michal Hocko <mhocko@...e.com>
Cc: akpm@...ux-foundation.org, songmuchun@...edance.com,
hca@...ux.ibm.com, gor@...ux.ibm.com, agordeev@...ux.ibm.com,
borntraeger@...ux.ibm.com, svens@...ux.ibm.com,
ebiederm@...ssion.com, keescook@...omium.org,
viro@...iv.linux.org.uk, rostedt@...dmis.org, mingo@...hat.com,
peterz@...radead.org, acme@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
namhyung@...nel.org, david@...hat.com, imbrenda@...ux.ibm.com,
apopple@...dia.com, adobriyan@...il.com,
stephen.s.brennan@...cle.com, ohoono.kwon@...sung.com,
haolee.swjtu@...il.com, kaleshsingh@...gle.com,
zhengqi.arch@...edance.com, peterx@...hat.com, shy828301@...il.com,
surenb@...gle.com, ccross@...gle.com, vincent.whitchurch@...s.com,
tglx@...utronix.de, bigeasy@...utronix.de, fenghua.yu@...el.com,
linux-s390@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
linux-perf-users@...r.kernel.org
Subject: Re: Re: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for
CONSTRAINT_MEMORY_POLICY
Hi, I've done some benchmarking in the last few days.
On 2022/5/17 00:44, Michal Hocko wrote:
> Sorry, I have only now found this email thread. The limitation of the
> NUMA constrained oom is well known and long standing. Basically the
> whole thing is a best effort as we are lacking per numa node memory
> stats. I can see that you are trying to fill up that gap but this is
> not really free. Have you measured the runtime overhead? Accounting is
> done in a very performance sensitive paths and it would be rather
> unfortunate to make everybody pay the overhead while binding to a
> specific node or sets of nodes is not the most common usecase.
## CPU consumption
According to the result of Unixbench. There is less than one percent
performance loss in most cases.
On 40c512g machine.
40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... | Pipe | Fork | syscall | total |
+----------+----------+-----+----------+---------+---------+---------+
| off | 2920.24 | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on | 2919.15 | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04% | ... | -0.39% | 2.12% | -3.95% | 0.28% |
+----------+----------+-----+----------+---------+---------+---------+
1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... | Pipe | Fork | syscall | total |
+----------+----------+-----+---------+--------+---------+---------+
| off | 1515.37 | ... | 1473.97 | 546.88 | 1152.37 | 1671.2 |
| on | 1508.09 | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48% | ... | 0.01% | 2.68% | 0.31% | 0.51% |
+----------+----------+-----+---------+--------+---------+---------+
## MEM consumption
per task_struct:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes
per mm_struct:
sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
typically 8 * 2 + 8 bytes
zap_pte_range:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes
> Also have you tried to have a look at cpusets? Those should be easier to
> make a proper selection as it should be possible to iterate over tasks
> belonging to a specific cpuset much more easier - essentialy something
> similar to memcg oom killer. We do not do that right now and by a very
> brief look at the CONSTRAINT_CPUSET it seems that this code is not
> really doing much these days. Maybe that would be a more appropriate way
> to deal with more precise node aware oom killing?
Looks like both CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET can
be uesd to deal with node aware oom killing.
I think we can calculate badness in this way:
If constraint=CONSTRAINT_MEMORY_POLICY, get badness by `nodemask`.
If constraint=CONSTRAINT_CPUSET, get badness by `mems_allowed`.
example code:
```
long oom_badness(struct task_struct *p, struct oom_control *oc)
long points;
...
if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY)) {
for_each_node_mask(nid, oc->nodemask)
points += get_mm_counter(p->mm, -1, nid)
} else if (unlikely(oc->constraint == CONSTRAINT_CPUSET)) {
for_each_node_mask(nid, cpuset_current_mems_allowed)
points += get_mm_counter(p->mm, -1, nid)
} else {
points = get_mm_rss(p->mm);
}
points += get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) \
+ mm_pgtables_bytes(p->mm) / PAGE_SIZE;
...
}
```
>
> [...]
>> 21 files changed, 317 insertions(+), 111 deletions(-)
>
> The code footprint is not free either. And more importantnly does this
> even work much more reliably? I can see quite some NUMA_NO_NODE
> accounting (e.g. copy_pte_range!).Is this somehow fixable?
> Also how do those numbers add up. Let's say you increase the counter as
> NUMA_NO_NODE but later on during the clean up you decrease based on the
> page node?
> Last but not least I am really not following MM_NO_TYPE concept. I can
> only see add_mm_counter users without any decrements. What is going on
> there?
There are two usage scenarios of NUMA_NO_NODE in this patch.
1. placeholder when swap pages in and out of swapfile.
```
// mem to swapfile
dec_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
inc_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);
// swapfile to mem
inc_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
dec_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);
```
In *_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE),
NUMA_NO_NODE is a placeholder. It means this page does not exist in any
node anymore.
2. placeholder in `add_mm_rss_vec` and `sync_mm_rss` for per process mm
counter synchronization with SPLIT_RSS_COUNTING enabled.
MM_NO_TYPE is also a placeholder in `*_mm_counter`, `add_mm_rss_vec` and
`sync_mm_rss`.
These placeholders are very strange. Maybe I should introduce a helper
function for mm->rss_stat.numa_count counting instead of using
placeholder.
Powered by blists - more mailing lists