linux-kernel - Re: [PATCH -mm v9 0/8] idle memory tracking

Message-Id: <20150721163402.43ad2527d9b8caa476a1c9e1@linux-foundation.org>
Date:	Tue, 21 Jul 2015 16:34:02 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Vladimir Davydov <vdavydov@...allels.com>
Cc:	Andres Lagar-Cavilla <andreslc@...gle.com>,
	Minchan Kim <minchan@...nel.org>,
	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.cz>,
	Greg Thelen <gthelen@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	David Rientjes <rientjes@...gle.com>,
	Pavel Emelyanov <xemul@...allels.com>,
	Cyrill Gorcunov <gorcunov@...nvz.org>,
	Jonathan Corbet <corbet@....net>, <linux-api@...r.kernel.org>,
	<linux-doc@...r.kernel.org>, <linux-mm@...ck.org>,
	<cgroups@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	Kees Cook <keescook@...omium.org>
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking

On Sun, 19 Jul 2015 15:31:09 +0300 Vladimir Davydov <vdavydov@...allels.com> wrote:

> Hi,
> 
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
> 
> It is based on top of v4.2-rc2-mmotm-2015-07-15-16-46
> It applies without conflicts to v4.2-rc2-mmotm-2015-07-17-16-04 as well
> 
> ---- USE CASES ----
> 
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change in time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
> 
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> take a more optimal decision when considering migrating the unit to
> another node within the cluster.
> 
> Also, as noted by Minchan, this would be useful for per-process reclaim
> (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
> pages only by smart user memory manager.
> 
> ---- USER API ----
> 
> The user API consists of two new proc files:
> 
>  * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
>    to a page, indexed by PFN.

What are the bit mappings?  If I read the first byte of /proc/kpageidle
I get PFN #0 in bit zero of that byte?  And the second byte of
/proc/kpageidle contains PFN #8 in its LSB, etc?

Maybe this is covered in the documentation file.
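
For illustration only, the mapping asked about above (PFN #0 in bit zero of
the first byte, PFN #8 in the LSB of the second byte) could be sketched as
below.  The byte-wise, LSB-first layout is an assumption taken from the
question, not something the cover letter confirms, and the helper names are
hypothetical.

```python
def pfn_to_bit(pfn):
    """Map a PFN to (byte_offset, bit_index).

    ASSUMPTION: byte-wise, LSB-first layout as guessed in the question above;
    the real file layout may differ (e.g. wider words, other bit order).
    """
    return pfn // 8, pfn % 8


def is_idle(bitmap, pfn):
    """Return True if the bit for `pfn` is set in an in-memory bitmap copy."""
    byte_offset, bit = pfn_to_bit(pfn)
    return bool((bitmap[byte_offset] >> bit) & 1)
```

If the documentation specifies a different unit or bit order, only
pfn_to_bit() would need to change.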

> When the bit is set, the corresponding page is
>    idle. A page is considered idle if it has not been accessed since it was
>    marked idle.

Perhaps we can spell out in some detail what "accessed" means?  I see
you've hooked into mark_page_accessed(), so a read from disk is an
access.  What about a write to disk?  And what about a page being
accessed from some random device (perhaps by hooking into
get_user_pages())?  Is getting written to swap an access?  What about a
dirty pagecache page being written out by kswapd or direct reclaim?

This also should be in the permanent documentation.

> To mark a page idle one should set the bit corresponding to the
>    page by writing to the file. A value written to the file is OR-ed with the
>    current bitmap value. Only user memory pages can be marked idle, for other
>    page types input is silently ignored. Writing to this file beyond max PFN
>    results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
>    set.
> 
>    This file can be used to estimate the amount of pages that are not
>    used by a particular workload as follows:
> 
>    1. mark all pages of interest idle by setting corresponding bits in the
>       /proc/kpageidle bitmap
>    2. wait until the workload accesses its working set
>    3. read /proc/kpageidle and count the number of bits set
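
The three steps above can be sketched on an in-memory copy of the bitmap.
This is a minimal sketch, assuming one bit per PFN stored byte-wise and
LSB-first (an assumption, not confirmed by the cover letter).  A real tool
would read and write /proc/kpageidle itself, and the helper names here are
hypothetical.

```python
def mark_idle(bitmap, pfns):
    """Step 1: set the bit of every PFN of interest.

    Mirrors the described write semantics, where the written value is
    OR-ed with the current bitmap value.
    ASSUMPTION: byte-wise, LSB-first bit layout.
    """
    for pfn in pfns:
        bitmap[pfn // 8] |= 1 << (pfn % 8)


def count_idle(bitmap):
    """Step 3: count the bits that are still set.

    Each set bit stands for a page that was not accessed since it was
    marked idle.
    """
    return sum(bin(byte).count("1") for byte in bitmap)
```

Step 2 is just waiting: the bits of pages the workload touches in the
meantime get cleared, so the count taken in step 3 estimates the number of
unused pages.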

Security implications.  This interface could be used to learn about a
sensitive application by poking data at it and then observing its
memory access patterns.  Perhaps this is why the proc files are
root-only (which I assume is sufficient).  Some words here about the
security side of things and the reasoning behind the chosen permissions
would be good to have.

>  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
>    memory cgroup each page is charged to, indexed by PFN.

Actually "closest online ancestor".  This also should be in the
interface documentation.

> Only available when CONFIG_MEMCG is set.

CONFIG_MEMCG and CONFIG_IDLE_PAGE_TRACKING I assume?

> 
>    This file can be used to find all pages (including unmapped file
>    pages) accounted to a particular cgroup. Using /proc/kpageidle, one
>    can then estimate the cgroup working set size.
> 
> For an example of using these files for estimating the amount of unused
> memory pages per each memory cgroup, please see the script attached
> below.
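
As a rough illustration of the lookup described above, the sketch below
finds the PFNs charged to one cgroup in a buffer holding /proc/kpagecgroup
contents.  It assumes one 64-bit entry per PFN in native byte order (the
byte order is an assumption), and the helper names and the base_pfn
parameter are hypothetical.

```python
import struct

ENTRY_SIZE = 8  # one 64-bit memory cgroup inode number per PFN


def cgroup_inode(data, pfn, base_pfn=0):
    """Return the inode number stored for `pfn`.

    `data` holds the entries starting at `base_pfn`.
    ASSUMPTION: entries are unsigned 64-bit values in native byte order.
    """
    offset = (pfn - base_pfn) * ENTRY_SIZE
    (ino,) = struct.unpack_from("=Q", data, offset)
    return ino


def pfns_in_cgroup(data, ino, base_pfn=0):
    """List every PFN in `data` whose entry equals the cgroup inode `ino`."""
    count = len(data) // ENTRY_SIZE
    return [base_pfn + i for i in range(count)
            if cgroup_inode(data, base_pfn + i, base_pfn) == ino]
```

The resulting PFN list could then be checked against the idle bitmap to
estimate how many of the cgroup's pages are unused.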

Why were these put in /proc anyway?  Rather than under /sys/fs/cgroup
somewhere?  Presumably because /proc/kpageidle is useful in non-memcg
setups.

> ---- PERFORMANCE EVALUATION ----

"^---" means "end of changelog".  Perhaps that should have been
"^---\n" - unclear.

> Documentation/vm/pagemap.txt           |  22 ++-

I think we'll need quite a lot more than this to fully describe the
interface?