Message-ID: <Pine.LNX.4.64.0701071028450.3661@woody.osdl.org>
Date:	Sun, 7 Jan 2007 11:13:06 -0800 (PST)
From:	Linus Torvalds <torvalds@...l.org>
To:	Christoph Hellwig <hch@...radead.org>
cc:	Willy Tarreau <w@....eu>, "H. Peter Anvin" <hpa@...or.com>,
	git@...r.kernel.org, nigel@...el.suspend2.net,
	"J.H." <warthog9@...nel.org>,
	Randy Dunlap <randy.dunlap@...cle.com>,
	Andrew Morton <akpm@...l.org>, Pavel Machek <pavel@....cz>,
	kernel list <linux-kernel@...r.kernel.org>,
	webmaster@...nel.org
Subject: Re: How git affects kernel.org performance



On Sun, 7 Jan 2007, Linus Torvalds wrote:
> 
> A year or two ago I did a totally half-assed code for the non-hashed 
> readdir that improved performance by an order of magnitude for ext3 for a 
> test-case of mine, but it was subtly buggy and didn't do the hashed case 
> AT ALL.

Btw, this isn't the test-case, but it's a half-way re-creation of 
something like it. It's _really_ stupid, but here's what you can do:

 - compile and run this idiotic program. It creates a directory called 
   "throwaway" that is ~44kB in size, and if I did things right, it should 
   not be totally contiguous on disk with the current ext3 allocation 
   logic.

 - as root, do "echo 3 > /proc/sys/vm/drop_caches" to get a cache-cold 
   scenario.

 - do "time ls throwaway > /dev/null".

I don't know what people consider to be reasonable performance, but for 
me, it takes about half a second to do a simple "ls". NOTE! This is _not_ 
reading inode stat information or anything like that. It literally takes 
0.3-0.4 seconds to read ~44kB off the disk. That's a whopping 125kB/s 
throughput on a reasonably fast modern disk.

That's what we in the industry call "sad".

And that's on a totally unloaded machine. There was _nothing_ else going 
on. No IO congestion, no nothing. Just the cost of synchronously doing
ten or eleven disk reads (the ~44kB directory is roughly eleven 4kB blocks).

The fix?

 - proper read-ahead. Right now, even if the directory is totally 
   contiguous on disk (just remove the thing that writes data to the 
   files, so that you'll have empty files instead of 8kB files), I think 
   we do those reads totally synchronously if the filesystem was mounted 
   with directory hashing enabled.

   Without hashing, the directory will be much smaller too, so readdir() 
   will have less data to read. And it _should_ do some readahead, 
   although in my testing, the best I could do was still 0.185s for a (now 
   shrunken) 28kB directory. 

 - better directory block allocation patterns would likely help a lot, 
   rather than single blocks. That's true even without any read-ahead (at 
   least the disk wouldn't need to seek, and any on-disk track buffers etc 
   would work better), but with read-ahead and contiguous blocks it should 
   be just a couple of IO's (the indirect stuff means that it's more than 
   one), and so you should see much better IO patterns because the 
   elevator can try to help too.

Maybe I just have unrealistic expectations, but I really don't like how a 
fairly small 50kB directory takes an appreciable fraction of a second to 
read.

Once it's cached, it still takes too long, but at least at that point the 
individual getdents calls take just tens of microseconds.

Here's cold-cache numbers (notice: 34 msec for the first one, and 17 msec 
in the middle.. The 5-6ms range indicates a single IO for the intermediate 
ones, which basically says that each call does roughly one IO, except the 
first one that does ~5 (probably the indirect index blocks), and two in 
the middle that are able to fill up the buffer from the IO done by the
previous one (4kB buffers, so if the previous getdents() happened to just 
read the beginning of a block, the next one might be able to fill 
everything from that block without having to do IO).

	getdents(3, /* 103 entries */, 4096)    = 4088 <0.034830>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.006703>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.006719>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000354>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000017>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.005302>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.016957>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000017>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.003530>
	getdents(3, /* 83 entries */, 4096)     = 3320 <0.000296>
	getdents(3, /* 0 entries */, 4096)      = 0 <0.000006>

Here's the pure CPU overhead: still pretty high (200 usec! For a single 
system call! That's disgusting! In contrast, a 4kB read() call takes 7 
usec on this machine, so the overhead of doing things one dentry at a 
time, and calling down to several layers of filesystem is quite high):

	getdents(3, /* 103 entries */, 4096)    = 4088 <0.000204>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000122>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000112>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000153>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000018>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000103>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000217>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000018>
	getdents(3, /* 102 entries */, 4096)    = 4080 <0.000095>
	getdents(3, /* 83 entries */, 4096)     = 3320 <0.000089>
	getdents(3, /* 0 entries */, 4096)      = 0 <0.000006>

but you can see the difference.. The real cost is obviously the IO.
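
(If you want to reproduce those per-call numbers without strace in the way,
here's a rough sketch that issues the raw getdents64 calls itself with the
same 4kB buffer and times each one. The linux_dirent64 layout is just what
the getdents64(2) man page describes - only d_reclen is used, to walk the
buffer - and again older glibc may want -lrt for clock_gettime():)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <sys/syscall.h>

/* entry layout as described in getdents64(2) */
struct linux_dirent64 {
	uint64_t	d_ino;
	int64_t		d_off;
	unsigned short	d_reclen;
	unsigned char	d_type;
	char		d_name[];
};

int main(void)
{
	static char buf[4096];
	struct timespec t0, t1;
	long n, pos;
	int entries;
	int fd = open("throwaway", O_RDONLY | O_DIRECTORY);

	if (fd < 0) {
		perror("throwaway");
		exit(1);
	}

	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
		clock_gettime(CLOCK_MONOTONIC, &t1);

		if (n < 0) {
			perror("getdents64");
			exit(1);
		}

		/* count the entries in this buffer by walking d_reclen */
		entries = 0;
		for (pos = 0; pos < n; entries++)
			pos += ((struct linux_dirent64 *)(buf + pos))->d_reclen;

		printf("getdents64: %3d entries, %4ld bytes <%.6f>\n",
			entries, n,
			(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

		if (n == 0)
			break;
	}
	close(fd);
	return 0;
}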

		Linus

----
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

static char buffer[8192];

static int create_file(const char *name)
{
	int fd = open(name, O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return fd;

	write(fd, buffer, sizeof(buffer));
	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	int i;
	char name[256];

	/* Fill up the buffer with some random garbage */
	for (i = 0; i < sizeof(buffer); i++)
		buffer[i] = "abcdefghijklmnopqrstuvwxyz\n"[i % 27];

	if (mkdir("throwaway", 0777) < 0 || chdir("throwaway") < 0) {
		perror("throwaway");
		exit(1);
	}

	/*
	 * Create a reasonably big directory by having a number
	 * of files with non-trivial filenames, and with some
	 * real content to fragment the directory blocks..
	 */
	for (i = 0; i < 1000; i++) {
		snprintf(name, sizeof(name),
			"file-name-%d-%d-%d-%d",
			i / 1000,
			(i / 100) % 10,
			(i / 10) % 10,
			(i / 1) % 10);
		create_file(name);
	}
	return 0;
}