Date:   Tue, 22 Aug 2017 15:47:25 +0200
From:   Jaco Kroon <jaco@....co.za>
To:     Theodore Ts'o <tytso@....edu>
Cc:     Doug Porter <dsp@...com>, linux-ext4@...r.kernel.org,
        Omar Sandoval <osandov@...com>
Subject: Re: *** SPAM *** Re: [PATCH] e2fsck: improve performance of
 region_allocate()

Hi Ted,

On 22/08/2017 14:45, Theodore Ts'o wrote:
>> The attached improves CPU performance from O(e*h) to O(e) and memory from
>> O(h) to O(1).  The patch below does similar for CPU but nothing for memory
>> (In my case it took fsck down by a significant margin).
> I thought you were saying you had some false positives (where it was
> causing e2fsck to complain about some valid extent trees in your file
> system)?  That's why I've not acted on your proposed patch until I had
> time to study the code more closely to understand what was going on
> with the errors I thought you had reported.
I did in fact manage to clear that up, but raised it so that you could 
confirm.  With all three patches (which I sent you directly) applied:

338 tests succeeded     0 tests failed

> As far as the other (icount) optimization is concerned, that's on my
> todo list but I'm trying to understand how much of a win it's going to
> be on a densly hard linked file system, and whether the complexity due
> to the handling the potential increased memory usage is worth it.  If
> we leave to be something that has to be manually enabled using
> /etc/e2fsck.conf, very few people will use it.  If we need to somehow
> figure out how it's safe / necessary to activate the icount
> optimization, especially if there are potentially multiple fsck's
> running in parallel, this starts to get really complicated.  So
> perhaps all we can do is leave it as a manually enabled option, but I
> still need to think about that a bit.
My (purely theoretical) assessment of that is below (the attached patch 
is proof-of-concept quality at best).

n = the number of files with a link count > 1;
t = the total number of possible inodes;
g = sum of link counts for all inodes with link count >1; and
c = the average link-count for inodes with link count >1 (g/n).

In my case n = ~100 million and t = ~2.8 billion.  I don't have an 
estimate of g for my system, but I reckon it could quite easily be in 
the range of 2 billion.

Pass1 assessment:  insertions are always onto the end of the list 
(inodes are traversed sequentially), so both CPU and memory reduce to 
O(n) (memory = 96/8 * n = 12n bytes, 1.2GB in my case).  I don't think 
there are improvements to be had during pass1.
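
Roughly the shape of what I mean (a hypothetical layout, not the actual 
icount.c structures, just sized to match the ~96 bits per entry above):

struct lc_entry {
	unsigned long long	ino;	/* 64-bit inode number */
	unsigned int		count;	/* 32-bit link count   */
};	/* ~12 bytes of payload per inode with nlink > 1 */

/* Pass1 visits inodes in increasing order, so every insert is an
 * append at the tail of an already-sorted array:  O(1) per inode,
 * O(n) overall, with no mid-array shuffling. */
static void lc_append(struct lc_entry *list, unsigned long long *used,
		      unsigned long long ino, unsigned int count)
{
	list[*used].ino = ino;
	list[*used].count = count;
	(*used)++;
}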

Pass2, current code:  we clone the list of inodes with link count > 1, 
avoiding expensive mid-array insertions, so this reduces to a really 
fast O(n).  The "increments" are the expensive part:  each increment 
results in a binary search for the correct entry in the array, costing 
O(log(n)) - 27 comparison ops for my 40TB filesystem.  We perform this 
g times, so overall CPU is O(g*log(n)).  Memory remains identical to 
the above.
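
To spell out where those ~27 comparisons go (again just a sketch of the 
idea, reusing the hypothetical lc_entry layout from the pass1 sketch 
above, not the actual icount.c code):

/* Each pass2 "increment" binary-searches the sorted pass1 array:
 * O(log(n)) comparisons before the O(1) count bump itself. */
static int lc_increment(struct lc_entry *list, unsigned long long n,
			unsigned long long ino)
{
	unsigned long long lo = 0, hi = n;

	while (lo < hi) {
		unsigned long long mid = lo + (hi - lo) / 2;

		if (list[mid].ino == ino) {
			list[mid].count++;
			return 0;
		}
		if (list[mid].ino < ino)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;	/* not in the list, i.e. the link count was 1 */
}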

Pass2, full map (possible "optimization"):

We allocate an array of __u16 with one entry per possible inode, i.e. t 
entries.  In my use-case that means 2.8 billion entries, or 5.6GB of 
RAM.  Memory *normally* goes up, to O(t).

Memory usage only decreases if n > 16.7% of t (the point where 2t bytes 
for the map drops below 12n bytes for the list).  Given that my 
filesystem has an average file size of 394KB and has <1TB remaining on 
a 40TB filesystem at 107m files, n/t in my case is only 3.5%, so I find 
that scenario extremely unlikely.  We can thus argue that we ALWAYS 
sacrifice memory, going from O(n) to O(t).
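
To be concrete about what I mean by the full map (a rough sketch only, 
with made-up names - the attached patch is the actual proposal):

#include <stdlib.h>

typedef unsigned short __u16;	/* normally from the e2fsprogs/kernel headers */

/* 2 bytes per *possible* inode:  2*t bytes, ~5.6GB for t = 2.8bn.
 * 16 bits is enough, since the on-disk i_links_count field is itself
 * only 16 bits wide. */
static __u16 *full_map_create(unsigned long long t)
{
	return calloc(t, sizeof(__u16));	/* NULL on failure */
}

/* The pass2 "increment" collapses to one indexed add:  O(1), no search. */
static void full_map_inc(__u16 *map, unsigned long long ino)
{
	map[ino - 1]++;		/* inode numbers start at 1 */
}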

In terms of CPU, however, the increment operation becomes O(1) instead 
of O(log(n)).  Depending on the value of g this can be a huge gain, and 
since the O(1) here is a VERY FAST O(1) I suspect the 27x factor in my 
use-case underestimates the actual gain (even with only one entry in 
the list, having to go through a single search iteration is probably 
3-5x more expensive in constant terms).  Thus I *suspect* that a 
sensible CPU speedup for this check - comparing link counts in inodes 
against the actual directory structures - is probably around 100x 
during pass2.  I could very well be wrong.

That said, assuming we need to go out to disk to retrieve the directory 
structures we're probably going to be IO-bound at that point anyway, 
but you're the better person to comment on how disk/CPU-bound pass2 
generally is at this stage.

Performing CPU benchmarks that don't require disk should be relatively 
trivial:  iterate through a set of values for n and g and generate 
random data.  At no point does the CPU utilization depend on the value 
of t, so we can measure the effective difference even for super-crazy 
values of n and g (constrained by g <= n*Max_Link_Count).  If this is 
something that would be interesting I'll happily see if I can generate 
something.
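
Something along these lines is what I have in mind - a quick standalone 
sketch with made-up defaults, not measured against the real e2fsck code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct el {
	unsigned long long	ino;
	unsigned int		count;
};

/* Binary search over a sorted array, as in the current scheme. */
static struct el *el_find(struct el *list, unsigned long long n,
			  unsigned long long ino)
{
	unsigned long long lo = 0, hi = n;

	while (lo < hi) {
		unsigned long long mid = lo + (hi - lo) / 2;

		if (list[mid].ino == ino)
			return &list[mid];
		if (list[mid].ino < ino)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

int main(void)
{
	/* Deliberately modest defaults; scale n and g up as desired.
	 * t never appears here:  it only affects memory, not CPU. */
	unsigned long long n = 10000000ULL;	/* inodes with nlink > 1 */
	unsigned long long g = 100000000ULL;	/* total increments      */
	struct el *list = malloc(n * sizeof(*list));
	unsigned short *map = calloc(n, sizeof(*map));
	unsigned long long i;
	clock_t t0, t1, t2;

	if (!list || !map)
		return 1;
	for (i = 0; i < n; i++) {		/* sorted inode numbers 1..n */
		list[i].ino = i + 1;
		list[i].count = 0;
	}

	srand(42);
	t0 = clock();
	for (i = 0; i < g; i++)			/* current:  O(log(n)) each */
		el_find(list, n, (unsigned long long)rand() % n + 1)->count++;
	srand(42);				/* same sequence for both runs */
	t1 = clock();
	for (i = 0; i < g; i++)			/* full map:  O(1) each */
		map[rand() % n]++;
	t2 = clock();

	printf("binary search: %.1fs  direct index: %.1fs\n",
	       (double)(t1 - t0) / CLOCKS_PER_SEC,
	       (double)(t2 - t1) / CLOCKS_PER_SEC);
	free(list);
	free(map);
	return 0;
}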

My opinion:  if pass2 is currently generally CPU-bound for such 
use-cases we should look at this further; if it is disk-bound then 
there is no point.  We can easily calculate the values of n, t and g 
during pass1 (or even in setup code for pass2; everything we need is 
available to icount.c when the pass2 data structure is constructed).  
Perhaps it makes sense to switch to the full_map implementation 
heuristically based on those values, but we'd need to understand the 
potential impact.  We can trivially fall back to the current 
implementation if the memory allocation fails.
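
For illustration, the heuristic could be as simple as the following 
(invented name and threshold, nothing measured; __u16 as before):

#include <stdlib.h>

typedef unsigned short __u16;

/* Hypothetical helper for the pass2 icount setup:  n, t and g are all
 * known (or cheaply countable) by then.  Return a full map only when
 * the workload looks heavy enough to justify the 2*t bytes; NULL means
 * "keep the current sorted-list code", including when the allocation
 * itself fails. */
static __u16 *icount_try_full_map(unsigned long long n,
				  unsigned long long t,
				  unsigned long long g)
{
	/* Arbitrary illustrative threshold:  only bother when there are
	 * plenty of increments per hard-linked inode. */
	if (n == 0 || g / n < 8)
		return NULL;

	return calloc(t, sizeof(__u16));	/* NULL => fall back */
}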

Kind Regards,
Jaco

View attachment "0002-e2fsck-add-optimization-for-heavy-hard-linked-pass2.patch" of type "text/x-patch" (7008 bytes)
