[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <11AF8D3C-411F-436C-AC8D-B1C057D02091@dilger.ca>
Date: Tue, 12 Nov 2024 11:44:11 -0700
From: Andreas Dilger <adilger@...ger.ca>
To: Theodore Ts'o <tytso@....edu>
Cc: Jan Kara <jack@...e.cz>,
Li Dongyang <dongyangli@....com>,
linux-ext4@...r.kernel.org,
Alex Zhuravlev <bzzz@...mcloud.com>
Subject: Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@....edu> wrote:
>
> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
>>> 1048576 records - 95 seconds
>>> 2097152 records - 580 seconds
>>
>> These are really high numbers of revoke records. Deleting couple GB of
>> metadata doesn't happen so easily. Are they from a real workload or just
>> a stress test?
>
> For context, the background of this is that this has been an
> out-of-tree that's been around for a very long time, for use with
> Lustre servers where apparently, this very large number of revoke
> records is a real thing.
Yes, we've seen this in production if there was a crash after deleting
many millions of log records. This causes remount to take potentially
several hours before completing (and this was made worse by HA causing
failovers due to mount being "stuck" doing the journal replay).
>> If my interpretation is correct, then rhashtable is unnecessarily
>> huge hammer for this. Firstly, as the big hash is needed only during
>> replay, there's no concurrent access to the data
>> structure. Secondly, we just fill the data structure in the
>> PASS_REVOKE scan and then use it. Thirdly, we know the number of
>> elements we need to store in the table in advance (well, currently
>> we don't but it's trivial to modify PASS_SCAN to get that number).
>>
>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
>> up number of revoke records we're going to process and then prepare
>> a static hash of appropriate size for replay (we can just use the
>> standard hashing fs/jbd2/revoke.c uses, just with differently sized
>> hash table allocated for replay and point journal->j_revoke to
>> it). And once recovery completes jbd2_journal_clear_revoke() can
>> free the table and point journal->j_revoke back to the original
>> table. What do you think?
>
> Hmm, that's a really nice idea; Andreas, what do you think?
Implementing code to manually count and resize the recovery hashtable
will also have its own complexity, including possible allocation size
limits for a huge hash table. That could be worked around by kvmalloc(),
but IMHO this essentially starts "open coding" something rhashtable was
exactly designed to avoid.
Since the rhashtable is only used during the journal recovery phase,
IMHO it isn't adding any complexity to the normal operational code, but
if there was ongoing overhead then this would be a different discussion.
Cheers, Andreas
Download attachment "signature.asc" of type "application/pgp-signature" (874 bytes)
Powered by blists - more mailing lists