Message-Id: <20061203124752.15e35357.akpm@osdl.org>
Date: Sun, 3 Dec 2006 12:47:52 -0800
From: Andrew Morton <akpm@...l.org>
To: wcheng@...hat.com
Cc: linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] prune_icache_sb
On Sun, 03 Dec 2006 12:49:42 -0500
Wendy Cheng <wcheng@...hat.com> wrote:
> Andrew Morton wrote:
>
> >On Thu, 30 Nov 2006 11:05:32 -0500
> >Wendy Cheng <wcheng@...hat.com> wrote:
> >
> >
> >
> >>
> >>The idea is, instead of unconditionally dropping every buffer associated
> >>with the particular mount point (which defeats the purpose of page
> >>caching), the base kernel exports the "drop_pagecache_sb()" call that
> >>allows the page cache to be trimmed. More importantly, it is changed to
> >>purge only the buffers that appear to be unused (i_state is NULL and
> >>i_count is zero), rather than purging buffers at random. This will
> >>encourage filesystems to respond proactively to VM memory shortage if
> >>they choose to do so.
> >>
> >>
> >
> >argh.
> >
> >
> I read this as: "It is ok to give system admins commands (which is what
> this drop_pagecache_sb() call is all about) to drop the page cache. It is,
> however, not ok to give filesystem developers this very same function
> to trim their own page cache if their filesystems choose to do so"?
If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for
administrators - it's a convenience thing for developers, to get repeatable
benchmarks. Attempts to make it a per-numa-node control for admin purposes have
been rejected.
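For reference, the sort of per-sb trimming being proposed would presumably
look something like the sketch below (illustrative only, not the actual
patch; it follows the existing fs/drop_caches.c pattern and assumes the
current inode_lock/s_inodes locking, skipping inodes that appear in use):

        /* sketch: trim only inodes that look unused for this sb */
        void drop_pagecache_sb(struct super_block *sb)
        {
                struct inode *inode;

                spin_lock(&inode_lock);
                list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                        /* skip inodes that are busy or being torn down */
                        if (inode->i_state || atomic_read(&inode->i_count))
                                continue;
                        invalidate_inode_pages(inode->i_mapping);
                }
                spin_unlock(&inode_lock);
        }
        EXPORT_SYMBOL(drop_pagecache_sb);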
> >In Linux a filesystem is a dumb layer which sits between the VFS and the
> >I/O layer and provides dumb services such as reading/writing inodes,
> >reading/writing directory entries, mapping pagecache offsets to disk
> >blocks, etc. (This model is to varying degrees incorrect for every
> >post-ext2 filesystem, but that's the way it is).
> >
> >
> The Linux kernel, particularly the VFS layer, is starting to show signs of
> inadequacy as the software components built upon it keep growing. I have
> doubts that it can keep up with this complexity under a development
> policy like the one you just described (the filesystem is a dumb layer?).
> Aren't the DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect
> example of why trying to do too many things inside the VFS layer for so
> many filesystems is a bad idea?
That's not a very well-chosen example, but yes, the old ext2-based model has
needed to be extended as new filesystems come along.
> By the way, since we're on this subject,
> could we discuss the VFS rename call a little bit (or I can start
> a new discussion thread)?
>
> Note that Linux's do_rename() starts with the usual lookup logic, followed
> by "lock_rename", then a final round of dentry lookup, and finally reaches
> the filesystem's i_op->rename call. Since lock_rename() only takes
> VFS-layer locks that are local to this particular machine, for a cluster
> filesystem there exists a huge window between the final lookup and the
> filesystem's i_op->rename call in which the file could get deleted
> from another node before the fs can do anything about it. Is it possible
> that we could get a new function pointer (lock_rename) in the
> inode_operations structure so a cluster filesystem can do proper locking?
That would need a new thread, and probably (at least pseudo-) code, and
cc's to the appropriate maintainers (although that part of the kernel isn't
really maintained any more - it has fallen into the patch-and-run model).
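Something along these lines, presumably (pseudo-code only - no such hook
exists today, and the hook names and calling convention are made up):

        /* hypothetical additions to inode_operations */
        int  (*lock_rename)(struct inode *old_dir, struct inode *new_dir);
        void (*unlock_rename)(struct inode *old_dir, struct inode *new_dir);

        /* in do_rename(), after the VFS-local lock_rename(): */
        if (old_dir->d_inode->i_op->lock_rename) {
                error = old_dir->d_inode->i_op->lock_rename(old_dir->d_inode,
                                                            new_dir->d_inode);
                if (error)
                        goto exit;
        }
        /* final dentry lookups and ->rename() then run under the cluster
           lock, with ->unlock_rename() called on the way out */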
> >>From our end (cluster locks are expensive - that's why we cache them),
> >>one of our kernel daemons will invoke this newly exported call based on
> >>a set of pre-defined tunables. It is then followed by lock reclaim
> >>logic that trims the locks by checking the page cache associated with the
> >>inode (that this cluster lock was created for). If nothing is attached to
> >>the inode (based on the i_mapping->nrpages count), we know it is a good
> >>candidate for trimming and will subsequently drop the lock (instead of
> >>waiting until the end of the VFS inode life cycle).
> >>
> >>
> >
> >Again, I don't understand why you're tying the lifetime of these locks to
> >the VFS inode reclaim mechanisms. Seems odd.
> >
> >
> Cluster locks are expensive because:
>
> 1. Every node in the cluster has to agree to it before the request can be
> granted (communication overhead).
> 2. It involves disk flushing if the lock bounces between nodes. Say one
> node requests a read lock after another node's write... before the read lock
> can be granted, the writing node needs to flush the data to disk (disk
> I/O overhead).
>
> For optimization purposes, we want to hold off the disk flush after writes
> and hope (and encourage) that the next requester of the lock is on
> the very same node (to take advantage of the OS write-back logic).
> That's why the locks are cached on the very same node; they will not get
> removed unless necessary.
> What could be better than building the lock caching on top of the existing
> inode cache logic - since these are the objects that the cluster locks
> are created for in the first place?
hmm, I suppose that makes sense.
Are there dentries associated with these locks?
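For concreteness, the reclaim pass you describe would presumably look
something like this (sketch only - the glock structure and helper names
here are placeholders, not the real GFS code):

        struct gfs_glock {                      /* placeholder, not real */
                struct list_head        gl_list;
                struct inode            *gl_inode;
        };

        /* after drop_pagecache_sb(), drop cached cluster locks whose
           inode no longer has any pages attached */
        static void reclaim_idle_glocks(struct list_head *glock_list)
        {
                struct gfs_glock *gl, *next;

                list_for_each_entry_safe(gl, next, glock_list, gl_list) {
                        struct inode *inode = gl->gl_inode;

                        if (inode && inode->i_mapping->nrpages == 0)
                                release_cluster_lock(gl);  /* placeholder */
                }
        }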
> >If you want to put an upper bound on the number of in-core locks, why not
> >string them on a list and throw away the old ones when the upper bound is
> >reached?
> >
> >
> Don't get me wrong. DLM *has* a tunable to set the max lock count. We
> do drop locks, but to drop the right locks we need a little bit of help
> from the VFS layer. The latency requirement is difficult to manage.
>
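The simple bound I had in mind is just an LRU (sketch only - the names
are placeholders, locking is omitted, and nr_cached_locks is assumed to
be incremented wherever locks are created):

        struct cached_lock {
                struct list_head        lru;
                /* per-lock state */
        };

        static LIST_HEAD(lock_lru);
        static unsigned long nr_cached_locks;
        static unsigned long max_cached_locks = 10000; /* tunable */

        /* move a used lock to the tail; evict from the head when over
           the limit */
        static void lock_used(struct cached_lock *cl)
        {
                list_move_tail(&cl->lru, &lock_lru);

                while (nr_cached_locks > max_cached_locks) {
                        struct cached_lock *victim =
                                list_entry(lock_lru.next,
                                           struct cached_lock, lru);

                        list_del_init(&victim->lru);
                        nr_cached_locks--;
                        drop_cluster_lock(victim);      /* placeholder */
                }
        }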
> >Did you look at improving that lock-lookup algorithm, btw? Core kernel has
> >no problem maintaining millions of cached VFS objects - is there any reason
> >why your lock lookup cannot be similarly efficient?
> >
> >
> Don't be so confident. I did see some complaints from ext3-based mail
> servers in the past - when the storage size was large enough, people had
> to explicitly umount the filesystem from time to time to restore their
> performance. I don't recall the details at this moment though.
People have had plenty of problems with oversized inode-caches in the past,
but I think they were due to memory consumption, not to lookup inefficiency.
My question _still_ remains unanswered. Third time: is it possible to
speed up this lock-lookup code?
Perhaps others can take a look at it - where is it?