linux-kernel - Re: [ckrm-tech] RFC: Memory Controller

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4545FDCD.3080107@in.ibm.com>
Date:	Mon, 30 Oct 2006 18:57:41 +0530
From:	Balbir Singh <balbir@...ibm.com>
To:	Paul Menage <menage@...gle.com>
CC:	dev@...nvz.org, vatsa@...ibm.com, sekharan@...ibm.com,
	ckrm-tech@...ts.sourceforge.net, haveblue@...ibm.com,
	linux-kernel@...r.kernel.org, pj@....com, matthltc@...ibm.com,
	dipankar@...ibm.com, rohitseth@...gle.com
Subject: Re: [ckrm-tech] RFC: Memory Controller

Paul Menage wrote:
> On 10/30/06, Balbir Singh <balbir@...ibm.com> wrote:
>> +----+---------+------+---------+------------+----------------+-----------+
>> |ii  |  No     | Yes  | configfs| Memory,    | Plans to       |   Yes     |
>> |    |         |      |         | task limit.| provide a      |           |
>> |    |         |      |         | Plans to   | framework      |           |
>> |    |         |      |         | allow      | to write new   |           |
>> |    |         |      |         | CPU and I/O| controllers    |           |
> 
> I have a port of Rohit's memory controller to run over my generic containers.

Cool!

> 
>> d. Fake NUMA Nodes
>>
>> This approach was suggested while discussing the memory controller
>>
>> Advantages
>>
>> (i)   Accounting for zones is already present
>> (ii)  Reclaim code can directly deal with zones
>>
>> Disadvantages
>>
>> (i)   The approach leads to hard partitioning of memory.
>> (ii)  It's complex to
>>       resize the node. Resizing is required to allow change of limits for
>>       resource management.
>> (ii)  Addition/Deletion of a resource group would require memory hotplug
>>       support for add/delete a node. On deletion of node, its memory is
>>       not utilized until a new node of a same or lesser size is created.
>>       Addition of node, requires reserving memory for it upfront.
> 
> A much simpler way of adding/deleting/resizing resource groups is to
> partition the system at boot time into a large number of fake numa
> nodes (say one node per 64MB in the system) and then use cpusets to
> assign the appropriate number of nodes each group. We're finding a few
> ineffiencies in the current code when using such a large number of
> small nodes (e.g. slab alien node caches), but we're confident that we
> can iron those out.
> 

You'll also end up with per zone page cache pools for each zone. A list of
active/inactive pages per zone (which will split up the global LRU list).
What about the hard-partitioning. If a container/cpuset is not using its full
64MB of a fake node, can some other node use it? Also, won't you end up
with a big zonelist?

>> (iv)   How do we account for shared pages? Should it be charged to the first
>>        container which touches the page or should it be charged equally among
>>        all containers sharing the page?
> 
> A third option is to allow inodes to be associated with containers in
> their own right, and charge all pages for those inodes to the
> associated container. So if several different containers are sharing a
> large data file, you can put that file in its own container, and you
> then have an exact count of how many pages are in use in that shared
> file.
> 
> This is cheaper than having to keep track of multiple users of a page,
> and is also useful when you're trying to do scheduling, to decide who
> to evict. Suppose you have two jobs each allocating 100M of anonymous
> memory and each accessing all of a 1G shared file, and you need to
> free up 500M of memory in order to run a higher-priority job.
> 
> If you charge the first user, then it will appear that the first job
> is using 1.1G of memory and the second is using 100M of memory. So you
> might evict the first job, thinking it would free up 1.1G of memory -
> but it would actually only free up 100M of memory, since the shared
> pages would still be in use by the second job.
> 
> If you share the charge between both users, then it would appear that
> each job is using 600M of memory - but it's still the case that
> evicting either one would only free up 100M of memory.
> 
> If you can see that the shared file that they're both using is
> accounting for 1G of the memory total, and that they're each using
> 100M of anon memory, then it's easier to see that you'd need to evict
> *both* jobs in order to free up 500M of memory.

Consider the other side of the story. lets say we have a shared lib shared
among quite a few containers. We limit the usage of the inode containing
the shared library to 50M. Tasks A and B use some part of the library
and cause the container "C" to reach the limit. Container C is charged
for all usage of the shared library. Now no other task, irrespective of which
container it belongs to, can touch any new pages of the shared library.

We might also be interested in limiting the page cache usage of a container.
In such cases, this solution might not work out to be the best.

What you are suggesting is to virtually group the inodes by container rather
than task. It might make sense in some cases, but not all.

We could consider implementing the controllers in phases

1. RSS control (anon + mapped pages)
2. Page Cache control
3. Kernel accounting and control




-- 
	Cheers,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/