[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0701071028450.3661@woody.osdl.org>
Date: Sun, 7 Jan 2007 11:13:06 -0800 (PST)
From: Linus Torvalds <torvalds@...l.org>
To: Christoph Hellwig <hch@...radead.org>
cc: Willy Tarreau <w@....eu>, "H. Peter Anvin" <hpa@...or.com>,
git@...r.kernel.org, nigel@...el.suspend2.net,
"J.H." <warthog9@...nel.org>,
Randy Dunlap <randy.dunlap@...cle.com>,
Andrew Morton <akpm@...l.org>, Pavel Machek <pavel@....cz>,
kernel list <linux-kernel@...r.kernel.org>,
webmaster@...nel.org
Subject: Re: How git affects kernel.org performance
On Sun, 7 Jan 2007, Linus Torvalds wrote:
>
> A year or two ago I did a totally half-assed code for the non-hashed
> readdir that improved performance by an order of magnitude for ext3 for a
> test-case of mine, but it was subtly buggy and didn't do the hashed case
> AT ALL.
Btw, this isn't the test-case, but it's a half-way re-creation of
something like it. It's _really_ stupid, but here's what you can do:
- compile and run this idiotic program. It creates a directory called
"throwaway" that is ~44kB in size, and if I did things right, it should
not be totally contiguous on disk with the current ext3 allocation
logic.
- as root, do "echo 3 > /proc/sys/vm/drop_caches" to get a cache-cold
schenario.
- do "time ls throwaway > /dev/null".
I don't know what people consider to be reasonable performance, but for
me, it takes about half a second to do a simple "ls". NOTE! This is _not_
reading inode stat information or anything like that. It literally takes
0.3-0.4 seconds to read ~44kB off the disk. That's a whopping 125kB/s
throughput on a reasonably fast modern disk.
That's what we in the industry call "sad".
And that's on a totally unloaded machine. There was _nothing_ else going
on. No IO congestion, no nothing. Just the cost of synchronously doing
ten or eleven disk reads.
The fix?
- proper read-ahead. Right now, even if the directory is totally
contiguous on disk (just remove the thing that writes data to the
files, so that you'll have empty files instead of 8kB files), I think
we do those reads totally synchronously if the filesystem was mounted
with directory hashing enabled.
Without hashing, the directory will be much smaller too, so readdir()
will have less data to read. And it _should_ do some readahead,
although in my testing, the best I could do was still 0.185s for a (now
shrunken) 28kB directory.
- better directory block allocation patterns would likely help a lot,
rather than single blocks. That's true even without any read-ahead (at
least the disk wouldn't need to seek, and any on-disk track buffers etc
would work better), but with read-ahead and contiguous blocks it should
be just a couple of IO's (the indirect stuff means that it's more than
one), and so you should see much better IO patterns because the
elevator can try to help too.
Maybe I just have unrealistic expectations, but I really don't like how a
fairly small 50kB directory takes an appreciable fraction of a second to
read.
Once it's cached, it still takes too long, but at least at that point the
individual getdents calls take just tens of microseconds.
Here's cold-cache numbers (notice: 34 msec for the first one, and 17 msec
in the middle.. The 5-6ms range indicates a single IO for the intermediate
ones, which basically says that each call does roughly one IO, except the
first one that does ~5 (probably the indirect index blocks), and two in
the middle who are able to fill up the buffer from the IO done by the
previous one (4kB buffers, so if the previous getdents() happened to just
read the beginning of a block, the next one might be able to fill
everything from that block without having to do IO).
getdents(3, /* 103 entries */, 4096) = 4088 <0.034830>
getdents(3, /* 102 entries */, 4096) = 4080 <0.006703>
getdents(3, /* 102 entries */, 4096) = 4080 <0.006719>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000354>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000017>
getdents(3, /* 102 entries */, 4096) = 4080 <0.005302>
getdents(3, /* 102 entries */, 4096) = 4080 <0.016957>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000017>
getdents(3, /* 102 entries */, 4096) = 4080 <0.003530>
getdents(3, /* 83 entries */, 4096) = 3320 <0.000296>
getdents(3, /* 0 entries */, 4096) = 0 <0.000006>
Here's the pure CPU overhead: still pretty high (200 usec! For a single
system call! That's disgusting! In contrast, a 4kB read() call takes 7
usec on this machine, so the overhead of doing things one dentry at a
time, and calling down to several layers of filesystem is quite high):
getdents(3, /* 103 entries */, 4096) = 4088 <0.000204>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000122>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000112>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000153>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000018>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000103>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000217>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000018>
getdents(3, /* 102 entries */, 4096) = 4080 <0.000095>
getdents(3, /* 83 entries */, 4096) = 3320 <0.000089>
getdents(3, /* 0 entries */, 4096) = 0 <0.000006>
but you can see the difference.. The real cost is obviously the IO.
Linus
----
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
static char buffer[8192];
static int create_file(const char *name)
{
int fd = open(name, O_RDWR | O_CREAT | O_TRUNC, 0666);
if (fd < 0)
return fd;
write(fd, buffer, sizeof(buffer));
close(fd);
return 0;
}
int main(int argc, char **argv)
{
int i;
char name[256];
/* Fill up the buffer with some random garbage */
for (i = 0; i < sizeof(buffer); i++)
buffer[i] = "abcdefghijklmnopqrstuvwxyz\n"[i % 27];
if (mkdir("throwaway", 0777) < 0 || chdir("throwaway") < 0) {
perror("throwaway");
exit(1);
}
/*
* Create a reasonably big directory by having a number
* of files with non-trivial filenames, and with some
* real content to fragment the directory blocks..
*/
for (i = 0; i < 1000; i++) {
snprintf(name, sizeof(name),
"file-name-%d-%d-%d-%d",
i / 1000,
(i / 100) % 10,
(i / 10) % 10,
(i / 1) % 10);
create_file(name);
}
return 0;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists