linux-kernel - Re: [RFC] Add mempressure cgroup

Date:	Sat, 1 Dec 2012 03:18:11 -0800
From:	Anton Vorontsov <anton.vorontsov@...aro.org>
To:	Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Luiz Capitulino <lcapitulino@...hat.com>
Cc:	David Rientjes <rientjes@...gle.com>,
	Pekka Enberg <penberg@...nel.org>,
	Glauber Costa <glommer@...allels.com>,
	Michal Hocko <mhocko@...e.cz>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Greg Thelen <gthelen@...gle.com>,
	Leonid Moiseichuk <leonid.moiseichuk@...ia.com>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Minchan Kim <minchan@...nel.org>,
	Bartlomiej Zolnierkiewicz <b.zolnierkie@...sung.com>,
	John Stultz <john.stultz@...aro.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, linaro-kernel@...ts.linaro.org,
	patches@...aro.org, kernel-team@...roid.com, aquini@...hat.com,
	riel@...hat.com, Robert Love <rlove@...gle.com>,
	Colin Cross <ccross@...roid.com>,
	Arve Hjønnevåg <arve@...roid.com>
Subject: Re: [RFC] Add mempressure cgroup

On Fri, Nov 30, 2012 at 03:47:25PM -0200, Luiz Capitulino wrote:
[...]
> > Query-and-control scheme looks very attractive, and that's actually
> > resembles my "balance" level idea, when userland tells the kernel how much
> > reclaimable memory it has. Except the your scheme works in the reverse
> > direction, i.e. the kernel becomes in charge.
> > 
> > But there is one, rather major issue: we're crossing kernel-userspace
> > boundary. And with the scheme we'll have to cross the boundary four times:
> > query / reply-available / control / reply-shrunk / (and repeat if
> > necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> > synchronously (all the four stages), and/or we have to make a "userspace
> > shrinker" thread working in parallel with the normal shrinker, and here,
> > I'm afraid, we'll see more strange interactions. :)
[...]
> Andrew's idea seems to give a lot more freedom to apps, IMHO.

OK, thinking about it some more...

===
=== Long explanations below, scroll to 'END' for the short version. :)
===

The typical query-control shrinker interaction would look like this:

   Kernel: "Can you please free <Y> pages?"
 Userland: "Here you go, <Z> pages freed."

Now let's assume that we are the Activity Manager, so we know that we have
<N> reclaimable pages in total (it's not always possible to know, but
let's pretend we do know). And assume that we are the only source of
reclaimable pages (this is important). OK, the kernel asks us to reclaim
<Y> pages.

Now, what if we divide <Y> (needed pages) by <N> (total reclaimable
pages)? :)

This will be the memory pressure factor, what a coincidence. E.g. if Y >=
N, the factor would be >= 1, which was our definition of OOM. If no pages
are needed, the factor is 0.
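
As a minimal sketch of the ratio just described (the function names are
hypothetical, and it assumes userland really does know <N>, which the text
above only pretends):

```c
#include <assert.h>

/* Hypothetical helper: pressure factor = <Y> / <N>. */
double pressure_factor(unsigned long needed, unsigned long reclaimable)
{
	/* <N> must be non-zero, otherwise the ratio is undefined. */
	assert(reclaimable > 0);
	return (double)needed / (double)reclaimable;
}

/* Per the definition above, a factor >= 1 means OOM. */
int is_oom(unsigned long needed, unsigned long reclaimable)
{
	return pressure_factor(needed, reclaimable) >= 1.0;
}
```

With 100 reclaimable pages, a request for 50 pages gives a factor of 0.5,
and a request for 100 or more pages counts as OOM.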

Okay, let's see how our current vmpressure notification works inside:

- The notification comes every 'window size' (<W>) pages scanned;

- Along with the notification itself we can also receive the pressure
  factor <F> (it is 1 - reclaimed/scanned). (We use levels nowadays, but
  internally it is still the factor.)

So, by computing <W> * <F> we can find out the amount of memory that the
kernel was missing this round (scanned - reclaimed), which has pretty much
the same meaning as "Please free <Y> pages" in the "userland-shrinker"
scheme above.

Except that in the notification case the "<Y>" is already in the past, so
we should read it as "the kernel had difficulty reclaiming <Y> pages";
userland has just received the notification about this past event.
The <Y> pages were probably reclaimed already.
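
To make the <W> * <F> arithmetic concrete, here is a small sketch. The
function names are hypothetical and are not the kernel's actual vmpressure
code; it only assumes <F> = 1 - reclaimed/scanned and <W> = scanned, as
stated above:

```c
#include <assert.h>

/* <F> = 1 - reclaimed/scanned, for one window of <W> = scanned pages. */
double vmpressure_factor(unsigned long scanned, unsigned long reclaimed)
{
	assert(scanned > 0 && reclaimed <= scanned);
	return 1.0 - (double)reclaimed / (double)scanned;
}

/*
 * <W> * <F> = scanned * (1 - reclaimed/scanned) = scanned - reclaimed,
 * so the same quantity can be had without any division.
 */
unsigned long pages_missed(unsigned long scanned, unsigned long reclaimed)
{
	assert(reclaimed <= scanned);
	return scanned - reclaimed;
}
```

For example, with a window of 512 scanned pages of which 384 were
reclaimed, <F> is 0.25 and <W> * <F> is 128 pages, which is exactly
512 - 384.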

Now, can we assume that in the next second the system will need the same
<Y> pages reclaimed? Well, if the window size is small enough, it's OK to
assume that the workload hasn't changed much. So, yes, we can assume this;
the only "bad" thing that can happen is that we free a little bit more
than was needed.

Let's look at how we'd use the raw factor in the imaginary userland shrinker:

	while (1) {
		/* blocking, triggers every "window size" pages, <W> */
		factor = get_pressure();

		/* Finds the smallest chunk(s) w/ size >= <W> * <F> */
		resource = get_resource(factor);

		free(resource);
	}

So, in each round we'd free at least <W> * <F> pages. Again, the product
just tells how much memory it is best to free at this time, which by
definition is 'scanned - reclaimed' (<F> = 1 - reclaimed/scanned; <W> =
scanned). That is, we don't need the factor itself; we need the difference
between scanned and reclaimed.
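
One way the imaginary get_resource() step could pick what to free is
sketched below. Everything here is an assumption for illustration: the
name pick_chunk(), the idea that the app tracks its reclaimable caches as
an array of sizes in pages, and the simplification to a single chunk (the
loop above allows several).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the index of the smallest chunk whose size is >= needed pages,
 * or -1 if no single chunk is large enough.
 */
int pick_chunk(const unsigned long *chunk_pages, size_t n, unsigned long needed)
{
	int best = -1;

	assert(chunk_pages != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (chunk_pages[i] < needed)
			continue;	/* too small to cover the request */
		if (best < 0 || chunk_pages[i] < chunk_pages[best])
			best = (int)i;	/* tighter fit than the previous one */
	}
	return best;
}
```

Picking the tightest fit keeps the over-freeing small, which matters
because the requested amount is itself only an estimate of what the next
window will need.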

In sum:

- Reporting the 'scanned - reclaimed' seems like an option for
  implementing the userland shrinker;

- By using a small 'window size' we can mitigate the effect of the async
  nature of our shrinker.

However, the shrinker is not a substitute for the pressure factor (or
levels). The plain "I need <Y> pages" still does not tell how bad things
are in the system, or how much scanning is going on. So, the
reclaimed/scanned ratio is important, too.

===
=== END
===

The lengthy text above boils down to this:

Yes, I tend to agree that Andrew's idea gives some freedom to the apps,
and that with the three levels it is not possible to implement a good,
predictable "userland shrinker", even though we don't need one just now.

Based on the above, I think I have a solution for this. For the next RFC,
I'd like to keep the pressure levels, but I will also add a file that will
report the 'scanned - reclaimed' difference. I'll call it something like
nr_to_reclaim. Since 'scanned - reclaimed' is still an approximation
(although I believe a good one), we may want to tune it without breaking
things.

And with nr_to_reclaim, implementing a predictable userland shrinker
will be a piece of cake: apps will blindly free the given number of pages,
nothing more.

Thanks,
Anton.