linux-ext4 - Re: fsck performance.

Message-ID: <20110224085956.GF16661@bitwizard.nl>
Date:	Thu, 24 Feb 2011 09:59:56 +0100
From:	Rogier Wolff <R.E.Wolff@...Wizard.nl>
To:	Andreas Dilger <adilger@...ger.ca>
Cc:	Ted Ts'o <tytso@....edu>, Rogier Wolff <R.E.Wolff@...wizard.nl>,
	linux-ext4@...r.kernel.org
Subject: Re: fsck performance.

On Wed, Feb 23, 2011 at 05:41:31PM -0700, Andreas Dilger wrote:
> On 2011-02-23, at 4:17 PM, Ted Ts'o wrote:
> > On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
> >> 
> >> If you have the opportunity, I wonder whether the entire need for
> >> tdb can be avoided in your case by using swap and the icount
> >> optimization patches previously posted?  
> > 
> > Unfortunately, there are people who are still using 32-bit CPU's, so
> > no, swap is not a solution here.
> 

> I agree it isn't a solution in all cases, but avoiding GB-sized
> realloc() in the code was certainly enough to fix problems for the
> original people who hit them.  It likely also avoids a lot of
> memcpy() (depending on how realloc is implemented).

So, assume that the biggest allocation is 1GB.
Assume also that we realloc (I haven't seen the code) to twice
the size every time: we'll allocate 1M, then 2M, then 4M, etc., up to 1G.

In the last step we'll realloc the 512M buffer to a 1G region. Note
that this requires a contiguous 1G range of free address space
within the 3G of total available address space. But let's ignore that
problem for now.

So for the 1G allocation we'll have to memcpy 512MB of existing data.
The previous one required a memcpy of 256MB, and so on. The total is
just under 1G.
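A minimal Python sketch of that sum. It assumes, as the message does without having checked the e2fsck code, a buffer that starts at 1 MiB, doubles on every realloc(), and has its whole old contents copied each time. That is the worst case, since realloc() can sometimes grow a block in place. `doubling_copy_cost` is an illustrative name.

```python
MIB = 1024 * 1024

def doubling_copy_cost(start, final):
    """Total bytes copied while a buffer grows by doubling from `start` to `final`,
    assuming every realloc() moves the block and copies all of the old contents."""
    copied = 0
    size = start
    while size < final:
        copied += size   # old contents copied into the new, twice-as-large block
        size *= 2
    return copied

total = doubling_copy_cost(1 * MIB, 1024 * MIB)
print(total // MIB, "MiB copied")   # prints "1023 MiB copied": just under 1 GiB
```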

So you're proposing to optimize out a memcpy of 1G of my main memory.

When it boots, my system says: pIII_sse  :  4884.000 MB/sec

So it can handle xor at almost 5G/second. It should be able to do
memcpy (xor with a bunch of zeroes) at that speed. But let's assume
that the libc guys are stupid and managed to make it 10 times slower.

So you're proposing to optimize out 1G of memcpy at 0.5G/second, or
two seconds of CPU time, on an fsck that takes over 24
hours. Congratulations! You've made e2fsck about 0.0023 percent
faster!
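The same back-of-the-envelope arithmetic as a small Python sketch. The inputs are the message's own figures (about 1 GB copied, a pessimistic 0.5 GB/s, a 24-hour run). Because the fsck takes *over* 24 hours, the percentage is an upper bound on the saving. The helper name is illustrative.

```python
def fraction_saved(copy_bytes, copy_rate, fsck_seconds):
    """Return (seconds saved, percent of total runtime) for removing copy_bytes of memcpy."""
    seconds = copy_bytes / copy_rate
    return seconds, 100 * seconds / fsck_seconds

secs, pct = fraction_saved(copy_bytes=1e9,          # ~1 GB copied by the realloc chain
                           copy_rate=0.5e9,         # 10x slower than the ~4.9 GB/s xor figure
                           fsck_seconds=24 * 3600)  # "over 24 hours", so a lower bound
print(f"{secs:.1f} s saved, {pct:.4f}% of the run")  # prints "2.0 s saved, 0.0023% of the run"
```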

Andreas, I really value your efforts to improve e2fsck. But if you
optimize code by looking at it and saying "this looks inefficient,
let's fix it up", you quickly end up spending time on optimizations
that don't really matter.

(My second computer was a DOS 3.x machine. DOS came with a utility
called "sort". It does what you expect from a DOS program: it refuses
to sort data files larger than 64k. So I rewrote it. It turned out my
implementation was 100 times slower at reading in the dataset than the
original version, but it sorted 100 times faster than the original.
End result? Mine was 10 times faster than the original. They had
optimized something that didn't matter. I just read some decades-old
literature on sorting and implemented that.)
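The anecdote's numbers also show how lopsided the original program must have been. A small sketch, assuming the run time is just a read phase plus a sort phase: with reads 100x slower, sorting 100x faster and the total 10x faster, the original's sort-to-read time ratio k satisfies (1 + k) / (100 + k/100) = 10. The helper name is illustrative.

```python
from fractions import Fraction

def implied_sort_to_read_ratio(read_slowdown, sort_speedup, overall_speedup):
    """Solve (1 + k) / (read_slowdown + k / sort_speedup) == overall_speedup for k,
    where k = original sort time / original read time (read time normalized to 1)."""
    a, b, c = map(Fraction, (read_slowdown, sort_speedup, overall_speedup))
    # 1 + k = c*a + c*k/b   =>   k * (1 - c/b) = c*a - 1
    return (c * a - 1) / (1 - c / b)

k = implied_sort_to_read_ratio(100, 100, 10)
print(k)   # prints 1110
```

So in the original, sorting took roughly 1100 times as long as reading. Speeding up the read phase could never have bought more than about 0.1%.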

I firmly believe that a factor-of-ten performance improvement can be
achieved for fsck on my filesystem. It should be possible to fsck the
filesystem in 3.3 hours.

There are a total of 342M inodes. That's 87GB. Reading that at a
leisurely 50M/second gives us about 1700 seconds, or half an hour. (It
should be possible to do better: I have 4 drives, each doing 90M/sec,
allowing a total of over 300M/sec.)
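The inode-table estimate as a small sketch. The message doesn't give the inode size. 256 bytes is an assumption that matches 87 GB / 342M inodes (about 254 bytes each). The sketch also assumes purely sequential reads and decimal units (1 MB = 10^6 bytes), and the function name is illustrative.

```python
INODE_SIZE = 256   # bytes per on-disk inode: an assumption, see above

def inode_scan_seconds(inode_count, bytes_per_sec, inode_size=INODE_SIZE):
    """Seconds to stream every inode table once at a sustained sequential rate."""
    return inode_count * inode_size / bytes_per_sec

inodes = 342e6
print(f"{inodes * INODE_SIZE / 1e9:.1f} GB of inode tables")
for rate in (50e6, 300e6):  # one disk at a leisurely pace vs. all four drives in parallel
    print(f"{rate / 1e6:.0f} MB/s: {inode_scan_seconds(inodes, rate) / 60:.1f} min")
```

At 50 MB/s this comes to roughly 29 minutes, matching the half-hour figure. With all four drives streaming it drops to about 5 minutes.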

Then I have 2.7T of data. With old ext2/ext3 that requires 2.7G
worth of indirect blocks. Reading those at 10M/sec (they will be
scattered across the disk) takes 270 seconds, or about 5 minutes.
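Here is where the 2.7G figure comes from. With 4 KiB blocks and 32-bit block numbers, one indirect block maps 1024 data blocks, so indirect blocks are about 1/1024 of the data. This sketch works under those assumptions (the message doesn't state the block size or pointer width). It treats the data as one big file and ignores the 12 direct pointers in each inode and the double- and triple-indirect levels, which are comparatively few. Names are illustrative.

```python
GIB = 2**30
TIB = 2**40
BLOCK_SIZE = 4096   # assumed ext2/ext3 block size, bytes
POINTER_SIZE = 4    # a 32-bit block number per entry in an indirect block

def indirect_block_bytes(data_bytes, block_size=BLOCK_SIZE):
    """Bytes of single-indirect blocks needed to map data_bytes of file data."""
    pointers_per_block = block_size // POINTER_SIZE          # 1024 for 4 KiB blocks
    data_blocks = data_bytes // block_size
    indirect_blocks = -(-data_blocks // pointers_per_block)  # ceiling division
    return indirect_blocks * block_size

meta = indirect_block_bytes(int(2.7 * TIB))
print(f"{meta / GIB:.2f} GiB of indirect blocks")
print(f"{meta / (10 * 2**20):.0f} s at 10 MiB/s of scattered reads")
```

That prints about 2.70 GiB and about 276 seconds, in line with the message's rounder 270-second estimate.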

I have quite a lot of directories, so those might take some time. It
should be possible to overlap the CPU time of actually doing the
checks with the I/O.
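Overlapping checks with I/O is commonly done as a bounded producer/consumer pipeline. One thread keeps reading ahead while another checks what has already arrived. Below is a minimal Python sketch of that pattern, not a description of how e2fsck is organized. `pipelined_check`, `read_chunk`, `check_chunk` and the chunking itself are hypothetical stand-ins (e.g. for reading and checking groups of inode-table blocks). It relies on the fact that a CPython thread blocked in I/O or `time.sleep()` releases the GIL, so the reader can wait on the disk while the checker runs.

```python
import queue
import threading
import time

def pipelined_check(read_chunk, check_chunk, n_chunks, depth=4):
    """Check chunks 0..n_chunks-1, overlapping reads with checking.

    A reader thread calls read_chunk(i) and queues the result, staying at
    most `depth` chunks ahead; the calling thread runs check_chunk() on
    each chunk as it arrives. Results are returned in chunk order.
    """
    q = queue.Queue(maxsize=depth)   # bounded, so read-ahead can't eat all memory
    done = object()                  # sentinel marking the end of the stream
    errors = []

    def reader():
        try:
            for i in range(n_chunks):
                q.put(read_chunk(i))     # blocks while the queue is full
        except BaseException as exc:     # hand read errors over to the checking thread
            errors.append(exc)
        finally:
            q.put(done)

    t = threading.Thread(target=reader, daemon=True)
    t.start()
    results = []
    while (item := q.get()) is not done:
        results.append(check_chunk(item))
    t.join()
    if errors:
        raise errors[0]
    return results

# Usage with a stand-in "disk": each read sleeps 10 ms to mimic I/O latency.
def fake_read(i):
    time.sleep(0.01)
    return list(range(i * 10, i * 10 + 10))

print(sum(pipelined_check(fake_read, sum, 5)))  # prints 1225
```

The bounded queue is the key design choice. It keeps the disk busy without letting read-ahead grow without limit, which matters on the 32-bit machines mentioned earlier in the thread.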

Anyway, although in theory 10x should be possible, I expect that 5x is
a more realistic goal.

	Roger.

-- 
** R.E.Wolff@...Wizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
**    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. 
Does it sit on the couch all day? Is it unemployed? Please be specific! 
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ