[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LFD.1.10.0808121708160.3462@nehalem.linux-foundation.org>
Date: Tue, 12 Aug 2008 17:28:28 -0700 (PDT)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Al Viro <viro@...IV.linux.org.uk>
cc: OGAWA Hirofumi <hirofumi@...l.parknet.co.jp>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC] readdir mess
On Wed, 13 Aug 2008, Al Viro wrote:
>
> As for whether we want to bother... I've looked through about half of the
> ->readdir instances. We _do_ want to touch them, with cattle prod if nothing
> else.
The really sad part is that readdir() really is also the thing that should
make us change locking. That i_mutex thing is fine and dandy for
everything else, but for readdir() we really would be much better off with
a rwsem - and only take it for reading.
Right now, readdir() is one of the most serialized parts of the whole
kernel. Sad. And while it's a per-directory lock, there are directories
that get much more reading than others, and this has been a small
scalability issue (for samba and apache) for years.
> 9p: touching belief that f_pos can't be changed under us.
> adfs: ditto.
The thing is, generic_file_llseek() takes i_mutex, exactly because of
issues like this. Of course, you have to ask for it (the _default_ llseek
does not do it), and you're right that 9p does not.
Strangely enough, at least 9p _does_ use it for regular files. I'm not
sure how come it decided to do that, but whatever.
> ext3: take a look at comments around filldir call. Yes, they are _that_
> ancient, and so's the logics around revalidate. ext2 is sane, but
> that hadn't propagated. Refuses to go through more than one block,
> BTW. Revalidation loop is buggered if we have corrupt data, while
> we are at it.
> ext4: ditto
The reason ext2 is ok is that you long long ago fixed it to use the page
cache. That got rid of a _lot_ of the crap, and made all the IO look like
regular files (including read-ahead etc). Ext2 _used_ to be the same crap
that ext3 is.
I so wish that ext3 could do the same thing, but no. I still think it
should be possible, but the whole journalling is designed for buffer
heads.
> freevxfs: AFAICS simply bogus (grep for nblocks there).
> hfs: at least missing checks for hfs_bnode_read() failure. And I'm not
> at all sure that hfs_mac2asc() use is safe there. BTW, open_dir_list
> handling appears to be odd - how the hell does decrementing ->f_pos
> help anything? And hfs_dir_release() removes from list without any
> locks, so that's almost certainly racy as well.
> hfsplus: ditto
I don't dispute at all that the readdir() thing is one of the weakest
points of the whole VFS layer. And part of it is that there is no good
caching helper for it at the VFS level, so we always end up having to do
everything at the low-level filesystem level, and that invariably ends up
being sh*t compared to the shared VFS routines.
I'm convinced that the reason we do well on most other filesystem accesses
is exactly the fact that a filesystem basically has to be crazy to try to
do their own version, and in many cases cannot really do it at all (eg you
can't really even avoid using the dcache or the page cache and actually
get any valid semantics).
But readdir() is the _one_ operation where the low-level filesystem still
basically does it all itself. Which is why we can't fix locking, and why
even simple changes are hard because it's not just complex code, it's
complex code in 50+ filesystems with almost zero shared code!
Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists