Message-ID: <00f408c1-215e-39e2-dec4-8f05eb604f97@uls.co.za>
Date: Tue, 22 Aug 2017 15:47:25 +0200
From: Jaco Kroon <jaco@....co.za>
To: Theodore Ts'o <tytso@....edu>
Cc: Doug Porter <dsp@...com>, linux-ext4@...r.kernel.org,
Omar Sandoval <osandov@...com>
Subject: Re: [PATCH] e2fsck: improve performance of region_allocate()
Hi Ted,
On 22/08/2017 14:45, Theodore Ts'o wrote:
>> The attached improves CPU performance from O(e*h) to O(e) and memory from
>> O(h) to O(1). The patch below does similar for CPU but nothing for memory
>> (In my case it took fsck down by a significant margin).
> I thought you were saying you had some false positives (where it was
> causing e2fsck to complain about some valid extent trees in your file
> system)? That's why I've not acted on your proposed patch until I had
> time to study the code more closely to understand what was going on
> with the errors I thought you had reported.
I did in fact manage to clear those false positives, but raised them so
that you could confirm. With all three patches I sent you directly applied:
338 tests succeeded 0 tests failed
> As far as the other (icount) optimization is concerned, that's on my
> todo list but I'm trying to understand how much of a win it's going to
> be on a densely hard linked file system, and whether the complexity of
> handling the potential increased memory usage is worth it. If we
> leave it to be something that has to be manually enabled using
> /etc/e2fsck.conf, very few people will use it. If we need to somehow
> figure out how it's safe / necessary to activate the icount
> optimization, especially if there are potentially multiple fsck's
> running in parallel, this starts to get really complicated. So
> perhaps all we can do is leave it as a manually enabled option, but I
> still need to think about that a bit.
My (purely theoretical) assessment of that is below (patch attached -
proof-of-concept quality at best), using the following definitions:
n = the number of files with a link count > 1;
t = the total number of possible inodes;
g = sum of link counts for all inodes with link count >1; and
c = the average link-count for inodes with link count >1 (g/n).
In my case n = ~100 million and t = ~2.8 billion. I don't have an
estimate of g for my system, but reckon it could quite easily be in the
range of 2 billion.
Pass1 assessment: insertions always happen at the end of the list
(inodes are traversed sequentially), so both CPU and memory reduce to
O(n) (memory = 96/8 * n = 12n bytes, 1.2GB in my case). I don't think
there are improvements to be had during pass1.
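For illustration only, here is a minimal sketch of why pass1 stays O(n).
This is not the actual icount.c code, just a hypothetical simplified
structure with roughly the 12-bytes-per-entry footprint mentioned above;
because pass1 visits inodes in increasing order, every insertion lands
at the tail of the sorted array:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplified icount entry: 64-bit inode number plus
 * 32-bit link count, i.e. roughly the 12n bytes estimated above. */
struct icount_el {
        uint64_t ino;
        uint32_t count;
};

struct icount {
        struct icount_el *list;
        size_t count;           /* entries in use */
        size_t size;            /* entries allocated */
};

/* Pass1 walks inodes in increasing order, so every insert is an
 * append: amortized O(1) per inode, O(n) for the whole pass. */
static int icount_append(struct icount *ic, uint64_t ino, uint32_t count)
{
        if (ic->count == ic->size) {
                size_t newsize = ic->size ? ic->size * 2 : 1024;
                struct icount_el *nl;

                nl = realloc(ic->list, newsize * sizeof(*nl));
                if (!nl)
                        return -1;
                ic->list = nl;
                ic->size = newsize;
        }
        ic->list[ic->count].ino = ino;
        ic->list[ic->count].count = count;
        ic->count++;
        return 0;
}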
Pass2, current code: we clone the list of inodes with link count > 1,
avoiding expensive mid-array insertions; this reduces to a really fast
O(n). "Increments" are expensive, however: each increment results in a
binary search for the correct entry in the array, costing O(log(n)) -
about 27 comparison ops for my 40TB filesystem. We perform this g
times, so overall CPU is O(g.log(n)). Memory remains identical to the
above.
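Roughly what that increment path looks like (hypothetical again, reusing
the simplified structure from the sketch above): one binary search over
the n sorted entries per increment, i.e. about 27 comparisons at
n = ~100 million.

/* One call per directory entry referencing a link-count>1 inode,
 * i.e. g calls of O(log(n)) each. */
static int icount_increment(struct icount *ic, uint64_t ino)
{
        size_t low = 0, high = ic->count;

        while (low < high) {
                size_t mid = low + (high - low) / 2;

                if (ic->list[mid].ino == ino) {
                        ic->list[mid].count++;
                        return 0;
                }
                if (ic->list[mid].ino < ino)
                        low = mid + 1;
                else
                        high = mid;
        }
        return -1;      /* inode not in the link-count>1 set */
}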
Pass2, full map (possible "optimization"): we allocate an array of
__u16 with one entry per possible inode, i.e. t entries. In my use-case
this requires a 2.8 billion entry index, or 5.6GB of RAM. Memory
*normally* goes up to O(t); it only decreases when n > 16.7% of t (12n
bytes vs 2t bytes). Given that my filesystem has an average file size
of 394KB and <1TB remaining on a 40TB file system at 107m files, n/t in
my case is 3.5%, so I find that crossover extremely unlikely. We can
thus argue that we essentially ALWAYS sacrifice memory, going from O(n)
to O(t).
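Sketched in the same hypothetical style, the full map is just one __u16
slot per possible inode - 2t bytes of memory - and the increment becomes
a single bounds-checked array index:

#include <stdint.h>
#include <stdlib.h>

struct icount_fullmap {
        uint16_t *counts;       /* one slot per possible inode: 2t bytes */
        uint64_t num_inodes;    /* t */
};

static struct icount_fullmap *fullmap_create(uint64_t num_inodes)
{
        struct icount_fullmap *ic = malloc(sizeof(*ic));

        if (!ic)
                return NULL;
        ic->num_inodes = num_inodes;
        ic->counts = calloc(num_inodes, sizeof(uint16_t));
        if (!ic->counts) {
                free(ic);
                return NULL;    /* caller falls back to the sorted array */
        }
        return ic;
}

/* O(1) increment: no search at all, just an index and an add. */
static inline void fullmap_increment(struct icount_fullmap *ic, uint64_t ino)
{
        if (ino < ic->num_inodes && ic->counts[ino] < UINT16_MAX)
                ic->counts[ino]++;
}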
In terms of CPU, however, the increment operation becomes O(1), down
from O(log(n)). Depending on the value of g this can be a huge gain,
and since the O(1) here is a VERY FAST O(1) I suspect the 27x factor in
my use-case underestimates the actual gain (even with only one entry in
the list, going through a single search iteration is probably 3-5x more
expensive in terms of the constant). I therefore *suspect* that a
sensible CPU speedup for this check - comparing link counts in inodes
against the actual directory structures - is around 100x during pass2.
I could very well be wrong.
That said, assuming we need to go out to disk to retrieve the directory
structures, we're probably going to be IO bound at this point anyway,
but you're the better person to comment on how disk/CPU bound pass2 is
in general at this stage.
Performing CPU benchmarks that don't require disk should be relatively
trivial: iterate through a set of values for n and g and generate
random data. At no point does the CPU utilization depend on the value
of t, so we can measure the effective difference for even super-crazy
values of n and g (constrained by g <= n * maximum link count). If this
would be interesting I'll happily see if I can generate something along
the lines of the sketch below.
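Something along these lines is what I have in mind (a rough sketch only;
it reuses the hypothetical icount_append()/icount_increment()/fullmap_*
sketches above, and the n and g values are arbitrary placeholders one
would sweep over):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        const uint64_t n = 10000000ULL;         /* inodes with links > 1 */
        const uint64_t g = 100000000ULL;        /* total increments */
        struct icount ic = { NULL, 0, 0 };
        /* CPU cost never depends on t, so size the map at just n here. */
        struct icount_fullmap *fm = fullmap_create(n);
        double t0;
        uint64_t i;

        if (!fm)
                return 1;
        for (i = 0; i < n; i++)         /* simulate sequential pass1 */
                icount_append(&ic, i, 1);

        srand(42);
        t0 = now_sec();
        for (i = 0; i < g; i++)
                icount_increment(&ic, (uint64_t)(rand() % (int)n));
        printf("binary search: %.2fs\n", now_sec() - t0);

        t0 = now_sec();
        for (i = 0; i < g; i++)
                fullmap_increment(fm, (uint64_t)(rand() % (int)n));
        printf("full map:      %.2fs\n", now_sec() - t0);
        return 0;
}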
My opinion: if pass2 is currently generally CPU bound for such
use-cases we should look at this further; if it is disk bound then
there is no point. We can easily calculate the values of n, t and g
during pass1 (or even as setup code for pass2 - everything we need is
available to icount.c at construction time of the pass2 data
structure). Perhaps it makes sense to switch to the full_map
implementation heuristically based on those values, but we'd need to
understand the potential impact. We can trivially fall back to the
current implementation if the memory allocation fails.
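Purely as an illustration of that heuristic (the names, the 8GiB cap
and the g > 16n cut-off below are made up for the example, not anything
in e2fsprogs), the decision could look something like this, with the
allocation-failure case handled by simply constructing the current
sorted-array icount instead:

#include <stdint.h>

#define FULLMAP_MAX_BYTES       (8ULL << 30)    /* example cap: 8 GiB */

enum icount_impl { ICOUNT_SORTED_ARRAY, ICOUNT_FULL_MAP };

/* full_map always wins on CPU; on memory it only wins when
 * 2t < 12n, i.e. n > ~16.7% of t, so the extra memory is a
 * trade-off gated by an arbitrary cap and by how many increments
 * (g) there are to amortize it over. */
static enum icount_impl choose_icount_impl(uint64_t n, uint64_t t, uint64_t g)
{
        uint64_t fullmap_bytes = t * sizeof(uint16_t);
        uint64_t array_bytes = n * 12;  /* ~12 bytes per tracked inode */

        /* The map is actually smaller: use it unconditionally. */
        if (fullmap_bytes <= array_bytes)
                return ICOUNT_FULL_MAP;

        /* Otherwise only spend the memory when there are enough
         * increments for the O(1) lookups to matter and the map
         * still fits under the configured cap. */
        if (g > 16 * n && fullmap_bytes <= FULLMAP_MAX_BYTES)
                return ICOUNT_FULL_MAP;

        return ICOUNT_SORTED_ARRAY;
}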
Kind Regards,
Jaco
View attachment "0002-e2fsck-add-optimization-for-heavy-hard-linked-pass2.patch" of type "text/x-patch" (7008 bytes)