[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <acaa68b3-9884-415e-9808-e426068fac53@huaweicloud.com>
Date: Sat, 9 Nov 2024 11:12:30 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>, Li Dongyang <dongyangli@....com>
Cc: linux-ext4@...r.kernel.org, Andreas Dilger <adilger@...ger.ca>,
Alex Zhuravlev <bzzz@...mcloud.com>
Subject: Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
On 2024/11/8 18:33, Jan Kara wrote:
> On Tue 05-11-24 14:44:28, Li Dongyang wrote:
>> Resizable hashtable should improve journal replay time when
>> we have million of revoke records.
>> Notice that rhashtable is used during replay only,
>> as removal with list_del() is less expensive and it's still used
>> during regular processing.
>>
>> before:
>> 1048576 records - 95 seconds
>> 2097152 records - 580 seconds
>
> These are really high numbers of revoke records. Deleting couple GB of
> metadata doesn't happen so easily. Are they from a real workload or just
> a stress test?
>
>> after:
>> 1048576 records - 2 seconds
>> 2097152 records - 3 seconds
>> 4194304 records - 7 seconds
>
> The gains are very nice :).
>
>> Signed-off-by: Alex Zhuravlev <bzzz@...mcloud.com>
>> Signed-off-by: Li Dongyang <dongyangli@....com>
>
>> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
>> index 667f67342c52..d9287439171c 100644
>> --- a/fs/jbd2/recovery.c
>> +++ b/fs/jbd2/recovery.c
>> @@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal)
>> memset(&info, 0, sizeof(info));
>> sb = journal->j_superblock;
>>
>> + err = jbd2_journal_init_recovery_revoke(journal);
>> + if (err)
>> + return err;
>> +
>> /*
>> * The journal superblock's s_start field (the current log head)
>> * is always zero if, and only if, the journal was cleanly
>> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
>> index 4556e4689024..d6e96099e9c9 100644
>> --- a/fs/jbd2/revoke.c
>> +++ b/fs/jbd2/revoke.c
>> @@ -90,6 +90,7 @@
>> #include <linux/bio.h>
>> #include <linux/log2.h>
>> #include <linux/hash.h>
>> +#include <linux/rhashtable.h>
>> #endif
>>
>> static struct kmem_cache *jbd2_revoke_record_cache;
>> @@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache;
>>
>> struct jbd2_revoke_record_s
>> {
>> - struct list_head hash;
>> + union {
>> + struct list_head hash;
>> + struct rhash_head linkage;
>> + };
>> tid_t sequence; /* Used for recovery only */
>> unsigned long long blocknr;
>> };
>> @@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal,
>> * single block.
>> */
>>
>> +static const struct rhashtable_params revoke_rhashtable_params = {
>> + .key_len = sizeof(unsigned long long),
>> + .key_offset = offsetof(struct jbd2_revoke_record_s, blocknr),
>> + .head_offset = offsetof(struct jbd2_revoke_record_s, linkage),
>> +};
>> +
>
> I'd probably view your performance results as: "JOURNAL_REVOKE_DEFAULT_HASH
> is just too small for replays of a journal with huge numbers of revoked
> blocks". Or did you observe that JOURNAL_REVOKE_DEFAULT_HASH is causing
> performance issues also during normal operation when we track there revokes
> for the current transaction?
>
> If my interpretation is correct, then rhashtable is unnecessarily huge
> hammer for this. Firstly, as the big hash is needed only during replay,
> there's no concurrent access to the data structure. Secondly, we just fill
> the data structure in the PASS_REVOKE scan and then use it. Thirdly, we
> know the number of elements we need to store in the table in advance (well,
> currently we don't but it's trivial to modify PASS_SCAN to get that
> number).
>
> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum up
> number of revoke records we're going to process and then prepare a static
> hash of appropriate size for replay (we can just use the standard hashing
> fs/jbd2/revoke.c uses, just with differently sized hash table allocated for
> replay and point journal->j_revoke to it). And once recovery completes
> jbd2_journal_clear_revoke() can free the table and point journal->j_revoke
> back to the original table. What do you think?
>
Sounds reasonable to me. I'd vote for this solution, this is a really simple
and clear solution, and I believe it can achieve similar gains as rhashtable.
Thanks,
Yi.
Powered by blists - more mailing lists