linux-kernel - Re: [PATCH v3 0/4] Per-container dcache limitation

Message-ID: <4E524090.1080708@parallels.com>
Date:	Mon, 22 Aug 2011 08:42:08 -0300
From:	Glauber Costa <glommer@...allels.com>
To:	Dave Chinner <david@...morbit.com>
CC:	<linux-kernel@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
	<containers@...ts.linux-foundation.org>,
	Pavel Emelyanov <xemul@...allels.com>,
	Al Viro <viro@...iv.linux.org.uk>,
	Hugh Dickins <hughd@...gle.com>,
	Nick Piggin <npiggin@...nel.dk>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	James Bottomley <JBottomley@...allels.com>
Subject: Re: [PATCH v3 0/4] Per-container dcache limitation

On 08/17/2011 10:27 PM, Dave Chinner wrote:
> On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote:
>> On 08/16/2011 10:43 PM, Dave Chinner wrote:
>>> On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
>>>> Hello,
>>>>
>>>> This series is just like v2, except it addresses
>>>> Eric's comments regarding percpu variables.
>>>>
>>>> Let me know if there are further comments, and
>>>> I'll promptly address them as well. Otherwise,
>>>> I feel this is ready for inclusion.
>>
>> Hi David,
>>
>> I am not answering everything now, since I'm travelling, but let me
>> get to this one:
>>
>>> Just out of curiosity, one thing I've noticed about dentries is
>>> that in general at any given point in time most dentries are unused.
>>> Under the workloads I'm testing, even when I have a million cached
>>> dentries, I only have roughly 7,000 accounted as used.  That is, most
>>> of the dentries in the system are on a LRU and accounted in
>>> sb->s_nr_dentry_unused of their owner superblock.
>>>
>>> So rather than introduce a bunch of new infrastructure to track the
>>> number of dentries allocated, why not simply limit the number of
>>> dentries allowed on the LRU? We already track that, and the shrinker
>>> already operates on the LRU, so we don't really need any new
>>> infrastructure.
>> Because this only works well for cooperative workloads. And we can't
>> really assume that in the virtualization world. One container can
>> come up with a bogus workload - not even hard to write - that has
>> the sole purpose of punishing everyone it shares resources with.
>
> Sure, but as I've said before you can prevent the container from
> consuming too many dentries (via a hard limit) simply by adding an
> inode quota per container.  This is exactly the sort of
> uncooperative behaviour filesystem quotas were invented to
> prevent.
>
> Perhaps we should separate the DOS case from the normal
> (co-operative) use case.
>
> As I mentioned previously, your inode allocation based DOS (while
> (1); mkdir x; cd x; done type cases) example is trivial to prevent
> with quotas. It was claimed that it was not possible to prevent with
> filesystem quotas. I left proving that as an exercise for the
> reader, but I feel I need to re-iterate my point with an example.
>
> That is, if you can't create a million inodes in the container, you
> can't instantiate a million dentries in the container.  For example,
> use project quotas on XFS to create directory tree containers with
> hard limits on the number of inodes:

David,

The dentry -> inode relationship is an N:1 relationship. Therefore, it is
hard to believe that your example below would still work if we were
trying to fill the cache through link operations instead of operations
like mkdir, which enforce a 1:1 relationship.

Capping the number of dentries, OTOH, caps the number of inodes as well.
Although we *can* have inodes lying around in the caches without an
associated dentry at some point in time, we cannot have inodes *pinned*
in the cache without an associated dentry. So they will go away soon
enough.

So maybe it is the other way around here, and it is the people who want
inode capping who should model it on dentry cache capping.
>
> $ cat /etc/projects
> 12345:/mnt/scratch/projects/foo
> $ cat /etc/projid
> foo:12345
> $ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
> $ mkdir -p /mnt/scratch/projects/foo
> $ sudo xfs_quota -x -c "project -s foo" /mnt/scratch
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Setting up project foo (path /mnt/scratch/projects/foo)...
> Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1).
> $ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
> $ sudo xfs_quota -c "quota -p foo" /mnt/scratch
> $ cd /mnt/scratch/projects/foo/
> $ ~/src/fs_mark-3.3/dir-depth
> count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/
x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/
x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x
> $
>
> It stopped at 1435 directories because the container
> (/mnt/scratch/projects/foo) ran out of inodes in its quota. No DOS
> there.

As said above, mkdir enforces a 1:1 relationship (because directories 
can't be hard linked) that can't be guaranteed in the general case. For 
the general case, one can have a dentry cache bigger than Linus' ego 
while instantiating only one inode in the process.
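
To make the N:1 point concrete, here is a minimal userspace sketch. It is
not part of the patch series, and the function name and file names are
made up for illustration. It creates one file plus many hard links to it
in a temporary directory, then counts directory entries versus distinct
inode numbers. It does not measure the dcache itself; it only shows that
the number of names (each of which needs its own dentry once looked up)
can grow while the inode count, which is what an inode quota sees, stays
at one.

```python
import os
import tempfile


def count_names_and_inodes(n_links):
    """Create one file plus n_links hard links to it.

    Returns (number of names, number of distinct inodes).
    Hypothetical helper, for illustration only.
    """
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        open(target, "w").close()
        for i in range(n_links):
            # Each link adds a new name for the same inode.
            os.link(target, os.path.join(d, f"link{i}"))
        names = os.listdir(d)
        inodes = {os.stat(os.path.join(d, n)).st_ino for n in names}
        return len(names), len(inodes)


names, inodes = count_names_and_inodes(1000)
print(f"{names} names, {inodes} inode(s)")
```

This assumes the temporary directory lives on a filesystem that supports
hard links.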

> And rather than a ENOMEM error (could be caused by
> anything) the error is EDQUOT which is a clear indication that a
> resource limit has been hit. That's a far better failure from a user
> perspective because they know -exactly- why their application
> failed - the container resource limits are too low....
Well, if this is the problem, I am happy returning EDQUOT if we fail to 
find room for more dentries, or anything else we can agree upon instead 
of ENOMEM.

>
> IOWs, you don't need to touch the dentry cache -at all- to provide
> the per-subtree hard resource limiting you are trying to achieve -
> filesystem quotas can already achieve that for you. Project quotas
> used in this manner (as directory tree quotas) provide exactly the
> "per-subtree" hard resource limiting that you were trying to achieve
> with your original dentry mobs proposal.
Well, I myself am done for now with the per-subtree proposal. I have
been completely fine with per-sb for a while now.

>
>>> The limiting can be lazy - we don't need to limit the growth of
>>> dentries until we start to run out of memory. If the superblock
>>> shrinker is aware of the limits, then when it gets called by memory
>>> reclaim it can do all the work of reducing the number of items on
>>> the LRU down to the threshold at that time.
>>
>> Well, this idea itself can be considered, independent of which path
>> we're taking. We can, if we want, allow the dentry cache to grow
>> indefinitely if we're not under memory pressure. But it kinda
>> defies the purpose of a hard limit...
>
> See my comments above about filesystem quotas providing hard limits.
>
>>> IOWs, the limit has no impact on performance until memory is scarce,
>>> at which time memory reclaim enforces the limits on LRU size and
>>> clean up happens automatically.
>>>
>>> This also avoids all the problems of setting a limit lower than the
>>> number of active dentries required for the workload (i.e. avoids
>>> spurious ENOMEM errors trying to allocate dentries), allows
>>> overcommitment when memory is plentiful (which will benefit
>>> performance) but it brings the caches back to defined limits when
>>> memory is not plentiful (which solves the problem you are having).
>> No, this is not really the problem we're having.
>> See above.
>>
>> About ENOMEM, I don't really see what's wrong with them here.
>
> Your backup program runs inside the container. Filesystem traversal
> balloons the dentry cache footprint, and so it is likely to trigger
> spurious ENOMEM when trying to read files in the container because
> it can't allocate a dentry for random files as it traverses. That'll
> be fun when it comes to restoring backups and discovering they
> aren't complete....
>
> There's also the "WTF caused the ENOMEM error" problem I mentioned
> earlier....
>
>> For a
>> container, running out of its assigned kernel memory should be
>> exactly the same as running out of real physical memory. I do agree
>> that it changes the feeling of the system a little bit, because it
>> then happens more often. But it is still right in principle.
>
> The difference is the degree - when the system runs out of memory,
> it tries -really hard- before failing the allocation. Indeed, it'll
> swap, it'll free memory in other subsystems, it'll back off on disk
> congestion, it will try multiple times to free memory, escalating
> priority each time it retries. IOWs, it jumps through all sorts of
> hoops to free memory before it finally fails. And then the memory,
> more often than not, comes from some subsystem other than the dentry
> cache, so it is rare that a dentry allocation actually relies on the
> dentry cache (and only the dentry cache) being shrunk to provide
> memory for the new dentry.

This is an apples-to-oranges comparison. If, instead of using the
mechanism I proposed, we go for a quota-based mechanism like the one you
mentioned, we'll fail just as often. Just with EDQUOT instead of ENOMEM.

Hmm.. while writing that I just looked back at the code, and it seems
it will be hard to return anything but ENOMEM, since it is part of the
interface contract. OTOH, the inode allocation function follows the
same kind of contract - returning NULL in case of error - meaning that
everything of this kind that does not involve the filesystem will end up
the same way, be it in the icache or the dcache.
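
A toy illustration of that contract, with made-up names and no relation to
the actual kernel code: when the allocator's only failure signal is "no
object", the caller has a single error it can report, whatever the real
reason was.

```python
import errno


def toy_d_alloc(cache_full):
    """Hypothetical stand-in for an allocator that can only say 'no object'."""
    return None if cache_full else {"name": "x"}


def toy_lookup(cache_full):
    """Hypothetical caller that translates a missing object into an errno."""
    dentry = toy_d_alloc(cache_full)
    if dentry is None:
        # "Over the limit" and "out of memory" look identical from here,
        # so the only error the contract allows is ENOMEM.
        return -errno.ENOMEM
    return 0
```

Returning something like EDQUOT would need a richer return value from the
allocator, which is why I say it would be hard.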

> Your dentry hard limit has no fallback or other mechanisms to try
> - if the VFS caches cannot be shrunk immediately, then ENOMEM will
> occur.  There's no retries, there's no waiting for disk congestion
> to clear, there's no backoff, there's no increase in reclaim
> desperation as previous attempts to free dentries fail. This greatly
> increases the chances of ENOMEM from _d_alloc() way above what a
> normal machine would see because it doesn't have any of the
> functionality that memory reclaim has. And, fundamentally, that sort
> of complexity doesn't belong in the dentry cache...

I don't see how/why a user application should care. An error means "Hey 
Mr. Userspace, something wrong happened", not "Hey Mr. Userspace, sorry, 
we tried really hard, but yet could not do it".

As far as a container is concerned, the only way to mimic the behavior
you described would be to allow a single container to use up at most X
bytes of general kernel memory. So when the dcache hits the wall, it
can borrow from somewhere else. Not that I am considering this...

>
> Another interesting case to consider is internally fragmented dentry
> cache slabs, where the active population of the pages is sparse.
> This sort of population density is quite common on machines with
> sustained long term multiple workload usage (exactly what you'd
> expect on a containerised system). Hence dentry allocation can be
> done without increasing memory footprint at all. Likewise, freeing
> dentries won't free any memory at all. In this case, what has your
> hard limit bought you? An ENOMEM error in a situation where memory
> allocation is actually free from a resource consumption perspective.

Again, I don't think those hard limits are about used memory. So if 
freeing a dentry may not free memory, I'm still fine with that. If the 
kernel as a whole needs memory later, it can do something to reclaim it.
*Unless* I hold a dentry reference. So the solution to me seems to be 
not allowing more than X to be held in the first place.

Again: I couldn't care less about how much *memory* it is actually using 
at a certain point in time.
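
A rough model of the difference, under loud assumptions: this is a
counting toy, not the patch and not the real shrinker, and every name in
it is hypothetical. It tracks only two numbers, dentries with a held
reference ("pinned") and dentries on the LRU ("unused"). A lazy limit can
only trim the LRU, so a workload that pins everything is untouched by it.
A hard limit refuses the allocation once nothing is left to evict.

```python
import errno


class ToyDcache:
    """Counting model only; no real dentries. All names are hypothetical."""

    def __init__(self, hard_limit=None):
        self.pinned = 0   # dentries with a held reference
        self.unused = 0   # dentries sitting on the LRU
        self.hard_limit = hard_limit

    def total(self):
        return self.pinned + self.unused

    def alloc_pinned(self):
        """Allocate a dentry and keep a reference to it."""
        if self.hard_limit is not None and self.total() >= self.hard_limit:
            if self.unused > 0:
                self.unused -= 1      # evict one LRU entry to make room
            else:
                return -errno.ENOMEM  # everything is pinned: refuse
        self.pinned += 1
        return 0

    def release(self):
        """Drop a reference; the dentry moves to the LRU."""
        self.pinned -= 1
        self.unused += 1

    def shrink_lru(self, lru_limit):
        """Lazy limit: trim the LRU down to lru_limit, leave pinned alone."""
        freed = max(0, self.unused - lru_limit)
        self.unused -= freed
        return freed


def hog(cache, attempts):
    """Pin as many dentries as the cache allows; return how many succeeded."""
    return sum(1 for _ in range(attempts) if cache.alloc_pinned() == 0)


lazy = ToyDcache()
hog(lazy, 10_000)
lazy.shrink_lru(lru_limit=100)
print("lazy LRU limit, total dentries:", lazy.total())

hard = ToyDcache(hard_limit=100)
hog(hard, 10_000)
print("hard limit, total dentries:", hard.total())
```

For a cooperative workload that releases its references, both variants
end up bounded in this model; they differ only when references are held.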
>
> These are the sorts of corner case problems that hard limits on
> cache sizes have. That's the problem I see with the hard limit
> approach: it looks simple, but it is full of corner cases when you
> look more deeply. Users are going to hit these corner cases and
> want to fix those "problems". We'll have to try to fix them without
> really even being able to reproduce them reliably. We'll end up
> growing heuristics to try to detect when problems are about to
> happen, and complexity to try to avoid those corner case problems.
> We'll muddle along with something that sort of works for the cases
> we can reproduce, but ultimately is untestable and unverifiable. In
> contrast, a lazy lru limiting solution is simple to implement and
> verify and has none of the warts that hard limiting exposes to user
> applications.

What you described does not seem like a corner case to me.
"By using this option you can't use more than X entries; if you do,
you'll fail" sounds pretty precise to me.

>
> Hence I'd prefer to avoid all the warts of hard limiting by ignoring
> the DOS case that leads to requiring a hard limit as it can be
> solved by other existing means. Limiting the size of the inactive
> cache (generally dominates cache usage) seems like a much lower
> impact manner of achieving the same thing.

Again, I understand you, but I don't think we're really solving the same 
thing.

> Like I said previously - I've had people asking me whether limiting
> the size of the inode cache is possible for the past 5 years, and
> all their use cases are solved by the lazy mechanism I described. I
> think that most of the OpenVZ dcache size problems will go away
> with the lazy solution as well, as most workloads with a large
> dentry cache footprint don't actively reference (and therefore pin)
> the entire working set at the same time....

Except for the malicious ones, of course.

> Cheers,

Thank you very much for your time, Dave!