linux-kernel - Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

Message-ID: <20121117012114.GA22910@lizard.sbx05663.mountca.wayport.net>
Date:	Fri, 16 Nov 2012 17:21:15 -0800
From:	Anton Vorontsov <anton.vorontsov@...aro.org>
To:	David Rientjes <rientjes@...gle.com>
Cc:	Glauber Costa <glommer@...allels.com>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Pekka Enberg <penberg@...nel.org>,
	Mel Gorman <mgorman@...e.de>,
	Leonid Moiseichuk <leonid.moiseichuk@...ia.com>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Minchan Kim <minchan@...nel.org>,
	Bartlomiej Zolnierkiewicz <b.zolnierkie@...sung.com>,
	John Stultz <john.stultz@...aro.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, linaro-kernel@...ts.linaro.org,
	patches@...aro.org, kernel-team@...roid.com,
	linux-man@...r.kernel.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Michal Hocko <mhocko@...e.cz>,
	Johannes Weiner <hannes@...xchg.org>, Tejun Heo <tj@...nel.org>
Subject: Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

On Fri, Nov 16, 2012 at 01:57:09PM -0800, David Rientjes wrote:
> > > I'm wondering if we should have more than three different levels.
> > > 
> > 
> > In the case I outlined below, for backwards compatibility. What I
> > actually mean is that memcg *currently* allows arbitrary notifications.
> > One way to merge those, while moving to a saner 3-point notification, is
> > to still allow the old writes and fit them in the closest bucket.
> 
> Yeah, but I'm wondering why three is the right answer.

You were not Cc'ed, so let me repeat why I ended up w/ the levels (not
necessarily three levels), instead of relying on the 0..100 scale:

 The main change is that I decided to go with discrete levels of the
 pressure.

 When I started writing the man page, I had to describe the 'reclaimer
 inefficiency index', and while doing this I realized that I'm describing
 how the kernel is doing the memory management, which we try to avoid in
 the vmevent. And applications don't really care about these details:
 reclaimers, their inefficiency indexes, scanning window sizes, priority
 levels, etc. -- it's all "not interesting", and purely kernel's stuff. So
 I guess Mel Gorman was right, we need some sort of levels.

 What applications (well, activity managers) are really interested in is
 this:

 1. Do we sacrifice resources for new memory allocations (e.g. the file
    cache)?
 2. Does the cost of new memory allocations become too high, so that the
    system hurts because of it?
 3. Are we about to OOM soon?

 And here are the answers:

 1. VMEVENT_PRESSURE_LOW
 2. VMEVENT_PRESSURE_MED
 3. VMEVENT_PRESSURE_OOM

 There is no "high" pressure, since I really don't see any definition of
 it, but it's possible to introduce new levels without breaking ABI.
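
To make the discrete levels concrete, here is a rough sketch in C. It is
not taken from the patches: the enum names, their numeric values, and the
50/90 cut-offs are all assumptions made up for illustration. It also shows
how an old 0..100 value could be fitted into the closest bucket, as
suggested earlier in the thread.

```c
#include <stddef.h>

/*
 * Hypothetical level identifiers mirroring the three questions above.
 * The numeric values are assumptions, not the ABI from the patches.
 */
enum pressure_level {
	PRESSURE_LOW = 0,	/* caches are sacrificed for new allocations */
	PRESSURE_MED = 1,	/* allocation cost is high enough to hurt */
	PRESSURE_OOM = 2,	/* the system is about to run out of memory */
};

/*
 * Fit a legacy 0..100 value into a bucket.
 * The 50 and 90 cut-offs are invented for this example.
 */
static enum pressure_level level_from_scale(unsigned int scale)
{
	if (scale >= 90)
		return PRESSURE_OOM;
	if (scale >= 50)
		return PRESSURE_MED;
	return PRESSURE_LOW;
}

/* Human-readable name, e.g. for an activity manager's log output. */
static const char *level_name(enum pressure_level level)
{
	switch (level) {
	case PRESSURE_LOW:
		return "low";
	case PRESSURE_MED:
		return "medium";
	case PRESSURE_OOM:
		return "oom";
	}
	return "unknown";
}
```

An application only ever compares against the enum, so a new level can be
added later without changing what the existing ones mean.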

Later I came up with the fourth level:

 Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE
 with an additional nr_pages threshold, which basically hints the kernel
 about how many easily reclaimable pages userland has (that would be a
 part of our definition for the mild/balance pressure level).

I.e. the fourth level can serve as a two-way communication w/ the kernel.
But again, this would be just an extension, I don't want to introduce this
now.

> > > Umm, why do users of cpusets not want to be able to trigger memory 
> > > pressure notifications?
> > > 
> > Because cpusets only deal with memory placement, not memory usage.
> 
> The set of nodes that a thread is allowed to allocate from may face memory 
> pressure up to and including oom while the rest of the system may have a 
> ton of free memory.  Your solution is to compile and mount memcg if you 
> want notifications of memory pressure on those nodes.  Others in this 
> thread have already said they don't want to rely on memcg for any of this 
> and, as Anton showed, this can be tied directly into the VM without any 
> help from memcg as it sits today.  So why implement a simple and clean 

You meant 'why not'?

> mempressure cgroup that can be used alone or co-existing with either memcg 
> or cpusets?
> 
> > And it is not that moving a task to cpuset disallows you to do any of
> > this: you could, as long as the same set of tasks are mounted in a
> > corresponding memcg.
> > 
> 
> Same thing with a separate mempressure cgroup.  The point is that there 
> will be users of this cgroup that do not want the overhead imposed by 
> memcg (which is why it's disabled in defconfig) and there's no direct 
> dependency that causes it to be a part of memcg.

There's also an API "inconvenience issue" with memcg's usage_in_bytes
stuff: applications have a hard time resetting the threshold to 'emulate'
the pressure notifications, and they also have to count bytes (like 'total
- used = free') to set the threshold. A separate 'pressure' notification,
by contrast, shows exactly what apps actually want to know: the pressure.
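
To illustrate the byte counting, this is roughly the arithmetic an
application ends up doing to emulate "tell me when free memory drops below
N bytes" with a usage threshold. The helper name and its parameters are
hypothetical; how the value is then registered with memcg is left out.

```c
#include <stdint.h>

/*
 * Usage threshold (in bytes) that fires once free memory would drop
 * below min_free. Hypothetical helper: the application has to redo this
 * calculation and re-arm the threshold whenever total or min_free changes.
 */
static uint64_t usage_threshold(uint64_t total, uint64_t min_free)
{
	/* Asking for more free memory than exists: fire immediately. */
	if (min_free >= total)
		return 0;
	return total - min_free;
}
```

With pressure levels none of this is needed: the application just waits
for the level it cares about.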

Thanks,
Anton.