lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 29 Dec 2006 15:59:27 -0800 (PST)
From:	Linus Torvalds <>
To:	Theodore Tso <>
cc:	Andrew Morton <>,
	Segher Boessenkool <>,
	David Miller <>,,,,,
	Linux Kernel Mailing List <>,,,,,,,
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

On Fri, 29 Dec 2006, Theodore Tso wrote:
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_ 
later), but the real problem with it isn't actually the VM behaviour, but 
simply the fact that cached performance absoluely _sucks_ with the buffer 

With the physically indexed buffer cache thing, you end up always having 
to do these complicated translations into block numbers for every single 
access, and at some point when I benchmarked it, it was a huge overhead 
for doing simple things like readdir.

It's also a major pain for read-ahead, exactly partly due to the high cost 
of translation - because you can't cheaply check whether the next block is 
there, the cost of even asking the question "should I try to read ahead?" 
is much much higher. As a result, read-ahead is seriously limited, because 
it's so expensive for the cached case (which is still hopefully the 
_common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having 
ext3_readdir() actually do _readahead_ though the page cache, even though 
it does everything else through the buffer cache. And that just happens to 
work because we hopefully have physically contiguous blocks, but when that 
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any 
problems for the VM, since directories cannot be mmap'ed anyway. But it's 
really pitiful, and it really doesn't work very well. Of course, other 
filesystems _also_ suck at this, and other operating systems haev even 
MORE problems, so people don't always seem to realize how horribly 
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache 
directories well. Open some horrible file manager on /usr/bin with cold 
caches, and weep. The biggest problem is the inode indirection, but at 
some point when I looked at why it sucked, it was doing basically 
synchronous single-buffer reads on the directory too, because readahead 
didn't work properly.

I was hoping that something like SpadFS would actually take off, because 
it seemed to do a lot of good design choices (having inodes in-line in the 
directory for when there are no hardlinks is probably a requirement for a 
good filesystem these days. The separate inode table had its uses, but 
indirection in a filesystem really does suck, and stat information is too 
important to be indirect unless it absolutely has to).

But I suspect it needs more than somebody who just wants to get his thesis 
written ;)

To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to
More majordomo info at

Powered by blists - more mailing lists