[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1817008.tdWV9SEqCh@natalenko.name>
Date: Fri, 13 May 2022 22:53:16 +0200
From: Oleksandr Natalenko <oleksandr@...alenko.name>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: cgel.zte@...il.com, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, corbet@....net,
xu xin <xu.xin16@....com.cn>,
Yang Yang <yang.yang29@....com.cn>,
Ran Xiaokai <ran.xiaokai@....com.cn>,
wangyong <wang.yong12@....com.cn>,
Yunkai Zhang <zhang.yunkai@....com.cn>,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v6] mm/ksm: introduce ksm_force for each process
Hello.
On pátek 13. května 2022 22:32:10 CEST Andrew Morton wrote:
> On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@...alenko.name> wrote:
> > On pátek 13. května 2022 0:37:53 CEST Andrew Morton wrote:
> > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@...alenko.name> wrote:
> > >
> > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > > > of this mm to be involved in KSM scanning without explicitly calling
> > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > > > >
> > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > > > process (fallback to the default state) and unmerge those merged pages
> > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > > > but still leave MADV_MERGEABLE areas merged.
> > > >
> > > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> > > >
> > > > So, what changed since that discussion?
> > > >
> > > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
> > >
> > > Why are you patching the kernel with a new syscall rather than using
> > > process_madvise()?
> >
> > Because I'm not sure how to use `process_madvise()` to achieve $subj properly.
> >
> > The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.
> >
> > For that, all the eligible VMAs have to be traversed.
> >
> > Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is:
> >
> > a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
> > b) may induce more latency than needed because the target task has to be stopped to avoid races.
>
> OK. And what happens to new vmas that the target process creates after
> the process_madvise()?
Call `process_madvise()` on them again. And do that again. Regularly, with some intervals. Use some daemon for that (like [1]).
[1] https://gitlab.com/post-factum/uksmd
> > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`,
>
> Is it not?
It is not:
```
1158 static bool
1159 process_madvise_behavior_valid(int behavior)
1160 {
1161 switch (behavior) {
1162 case MADV_COLD:
1163 case MADV_PAGEOUT:
1164 case MADV_WILLNEED:
1165 return true;
1166 default:
1167 return false;
1168 }
1169 }
```
Initially, when `process_madvise()` stuff was being prepared for merging, I tried to enabled it but failed [2], and it was decided [3] to move forward without it.
[2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@suse.cz/
[3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.localdomain/
> > I cannot just call it like this:
> >
> > ```
> > iovec.iov_base = 0;
> > iovec.iov_len = ~0ULL;
> > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
> > ```
> >
> > to cover the whole address space because iovec expects total size to be under ssize_t.
> >
> > Or maybe there's no need to cover the whole address space, only the lower half of it?
>
> Call process_madvise() twice, once for each half?
And still get `-ENOMEM`?
```
1191 /*
1192 * If the interval [start,end) covers some unmapped address
1193 * ranges, just ignore them, but return -ENOMEM at the end.
1194 * - different from the way of handling in mlock etc.
1195 */
```
I mean, it probably will work, and probably having the error returned is fine, but generally speaking an error value should hint that something is not being done right.
> > Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..
> >
> > I'm more than happy to read your comments on this.
> >
>
> I see the problem. I do like the simplicity of the ksm_force concept.
> Are there alternative ideas?
I do like it too. I wonder what to do with older concerns [4] [5] regarding presenting such an API.
[4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
[5] https://lore.kernel.org/lkml/20190515145151.GG16651@dhcp22.suse.cz/
--
Oleksandr Natalenko (post-factum)
Powered by blists - more mailing lists