linux-ext4 - Ext4 mballoc behavior with mb_optimize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220727105123.ckwrhbilzrxqpt24@quack3>
Date:   Wed, 27 Jul 2022 12:51:23 +0200
From:   Jan Kara <jack@...e.cz>
To:     linux-ext4@...r.kernel.org
Cc:     Ted Tso <tytso@....edu>,
        Harshad Shirwadkar <harshadshirwadkar@...il.com>
Subject: Ext4 mballoc behavior with mb_optimize_scan=1

Hello,

before going on vacation I was tracking down why reaim benchmark regresses
(10-20%) with larger number of processes with the new mb_optimize_scan
strategy of mballoc. After a while I have reproduced the regression with a
simple benchmark that just creates, fsyncs, and deletes lots of small files
(22k) from 16 processes, each process has its own directory. The immediate
reason for the slow down is that with mb_optimize_scan=1 the file blocks
are spread among more block groups and thus we have more bitmaps to update
in each transaction. 

So the question is why mballoc with mb_optimize_scan=1 spreads allocations
more among block groups. The situation is somewhat obscured by group
preallocation feature of mballoc where each *CPU* holds a preallocation and
small (below 64k) allocations on that CPU are allocated from this
preallocation. If I trace creating of these group preallocations I can see
that the block groups they are taken from look like:

mb_optimize_scan=0:
49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
33 33 97 81 49 81 17 49

mb_optimize_scan=1:
127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
116 115 116 114 113 111 110 109 108 107 106 105 104 104

So we can see that while with mb_optimize_scan=0 the preallocation is
always take from one of a few groups (among which we jump mostly randomly)
which mb_optimize_scan=1 we consistently drift from higher block groups to
lower block groups.

The mb_optimize_scan=0 behavior is given by the fact that search for free
space always starts in the same block group where the inode is allocated
and the inode is always allocated in the same block group as its parent
directory. So the create-delete benchmark generally keeps all inodes for
one process in the same block group and thus allocations are always
starting in that block group. Because files are small, we always succeed in
finding free space in the starting block group and thus allocations are
generally restricted to the several block groups where parent directories
were originally allocated.

With mb_optimize_scan=1 the block group to allocate from is selected by
ext4_mb_choose_next_group_cr0() so in this mode we completely ignore the
"pack inode with data in the same group" rule. The reason why we keep
drifting among block groups is that whenever free space in a block group is
updated (blocks allocated / freed) we recalculate largest free order (see
mb_mark_used() and mb_free_blocks()) and as a side effect that removes
group from the bb_largest_free_order_node list and reinserts the group at
the tail.

I have two questions about the mb_optimize_scan=1 strategy:

1) Shouldn't we respect the initial goal group and try to allocate from it
in ext4_mb_regular_allocator() before calling ext4_mb_choose_next_group()?

2) The rotation of groups in mb_set_largest_free_order() seems a bit
undesirable to me. In particular it seems pointless if the largest free
order does not change. Was there some rationale behind it?

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR