linux-ext4 - Re: [RFC] improve malloc for large filesystems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <BCFC8274-0A4E-42E7-9D11-647D47316BD2@whamcloud.com>
Date:   Thu, 21 Nov 2019 14:41:49 +0000
From:   Alex Zhuravlev <azhuravlev@...mcloud.com>
To:     "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: [RFC] improve malloc for large filesystems

> On 21 Nov 2019, at 12:18, Artem Blagodarenko <artem.blagodarenko@...il.com> wrote:
> 
> cr=0 and cr=1 are also important. They allow to accelerate allocator, by increasing allocation window.
> Step by step from smallest to the biggest value in preallocation table.
> Your optimisation can reset this allocation window grow.

Does it?
If allocated chunk is smaller than goal one (to predicted filesize), then next allocation will try to allocate the remained, AFAIR.
But prediction doesn’t depend on the previous allocation, essentially it just align current file size + allocation to the next 2^n
(or from the table in other versions).

cr=0 is an optimisation on its own as it checks only 2^n chunks which is cheap in terms of cycles.
cr=1 is a more expensive, we skip groups likely having no good extents (using average), but we still search for the goal size.

> Assume we have one fragmented part of disk and all other parts are quite free.
> Allocator will spend a lot of time to go through this fragmented part, because  will brake cr0 and cr1 and get
> range that satisfy c3. 

Even at cr=3 we still search for the goal size.

Thus we shouldn’t really allocate bad chunks because we break cr=0 and cr=1, we just stop to look for nicely looking groups
and fallback to regular (more expensive) search for free extents.

> 
> c3 requirement is quite simple “get first group that have enough free blocks to allocate requested range”.

This is only group selection, then we try to find that extent within that group, can fail and move to the next group.
EXT4_MB_HINT_FIRST is set outside of the main cr=0..3 loop.

> With hight probability allocator find such group at the start of c3 loop, so goal (allocator starts its searching from goal) will not significantly changed.
> Thus allocator go through this fragmented range using small steps.
> 
> Without suggested optimisation, allocator skips this fragmented range at moment and continue to allocate blocks.

1000 groups * 5ms avg.time = 5 seconds to skip 1000 bad uninitialized groups. This is the real problem.
You mentioned 4M groups...

Thanks, Alex