Message-ID: <zdoxlcp3dl7ab2d3svzcr6lu226cq6lw7qzhbw3ac4wd7ftrrd@pmu4f3cpp2mh>
Date: Thu, 16 Jan 2025 19:04:08 +0100
From: Jan Kara <jack@...e.cz>
To: Andreas Dilger <adilger@...ger.ca>
Cc: Jan Kara <jack@...e.cz>, Theodore Ts'o <tytso@....edu>, 
	Li Dongyang <dongyangli@....com>, Ext4 Developers List <linux-ext4@...r.kernel.org>, 
	Alex Zhuravlev <bzzz@...mcloud.com>
Subject: Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay

On Wed 15-01-25 17:08:03, Andreas Dilger wrote:
> On Nov 13, 2024, at 7:47 AM, Jan Kara <jack@...e.cz> wrote:
> > 
> > On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
> >> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@....edu> wrote:
> >>> 
> >>> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> >>>>> 1048576 records - 95 seconds
> >>>>> 2097152 records - 580 seconds
> >>>> 
> >>>> These are really high numbers of revoke records. Deleting a couple
> >>>> GB of metadata doesn't happen that easily. Are they from a real
> >>>> workload or just a stress test?
> >>> 
> >>> For context, this has been an out-of-tree patch that's been around
> >>> for a very long time, for use with Lustre servers, where this very
> >>> large number of revoke records apparently is a real thing.
> >> 
> >> Yes, we've seen this in production if there was a crash after deleting
> >> many millions of log records.  This causes remount to take potentially
> >> several hours before completing (and this was made worse by HA causing
> >> failovers due to mount being "stuck" doing the journal replay).
> > 
> > Thanks for clarification!
> > 
> >>>> If my interpretation is correct, then rhashtable is an unnecessarily
> >>>> huge hammer for this. Firstly, as the big hash is needed only during
> >>>> replay, there's no concurrent access to the data
> >>>> structure. Secondly, we just fill the data structure in the
> >>>> PASS_REVOKE scan and then use it. Thirdly, we know the number of
> >>>> elements we need to store in the table in advance (well, currently
> >>>> we don't but it's trivial to modify PASS_SCAN to get that number).
> >>>> 
> >>>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> >>>> up the number of revoke records we're going to process and then prepare
> >>>> a static hash of appropriate size for replay (we can just use the
> >>>> standard hashing fs/jbd2/revoke.c uses, just with differently sized
> >>>> hash table allocated for replay and point journal->j_revoke to
> >>>> it). And once recovery completes jbd2_journal_clear_revoke() can
> >>>> free the table and point journal->j_revoke back to the original
> >>>> table. What do you think?
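
[As a rough outline of the suggested flow; the helper names below are
hypothetical, not the actual fs/jbd2 API -- only journal->j_revoke and
jbd2_journal_clear_revoke() appear in the real code:]

	/* PASS_SCAN: tally the revoke records in each revoke block. */
	nr_records += revoke_block_record_count(journal, bh);

	/* Before PASS_REVOKE: swap in a table sized for this replay. */
	saved = journal->j_revoke;
	journal->j_revoke =
		replay_revoke_table_alloc(roundup_pow_of_two(nr_records));

	/* PASS_REVOKE fills the table and PASS_REPLAY queries it, both
	 * through journal->j_revoke exactly as in normal operation. */

	/* Once recovery completes: free the records and the big table,
	 * then restore the original runtime table. */
	jbd2_journal_clear_revoke(journal);
	replay_revoke_table_free(journal->j_revoke);
	journal->j_revoke = saved;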
> >>> 
> >>> Hmm, that's a really nice idea; Andreas, what do you think?
> >> 
> >> Implementing code to manually count and resize the recovery hashtable
> >> will also have its own complexity, including possible allocation size
> >> limits for a huge hash table.  That could be worked around by kvmalloc(),
> >> but IMHO this essentially starts "open coding" something rhashtable
> >> was designed exactly to avoid.
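
[Concretely, kvmalloc_array() gives that workaround in a single call: it
tries kmalloc() and transparently falls back to vmalloc() when the bucket
array is too large for the slab allocator. A sketch, assuming hash_size is
the power-of-two bucket count derived from the scan -- this is not code
from the posted patch:]

	#include <linux/mm.h>
	#include <linux/list.h>

	struct list_head *hash;
	int i;

	/* One allocation even for millions of buckets; kvmalloc_array()
	 * falls back to vmalloc() for sizes kmalloc() cannot serve. */
	hash = kvmalloc_array(hash_size, sizeof(struct list_head),
			      GFP_KERNEL);
	if (!hash)
		return -ENOMEM;
	for (i = 0; i < hash_size; i++)
		INIT_LIST_HEAD(&hash[i]);
	/* ... used during replay; released with kvfree(hash) afterwards. */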
> > 
> > Well, I'd say the result is much simpler than rhashtable code since
> > you don't need all that dynamic reallocation and complex locking.
> >
> > Attached is a patch that implements my suggestion. I'd say it is
> > simpler than having two types of revoke block hashing depending on
> > whether we are doing recovery or running the journal.
> > 
> > I've tested it and it seems to work fine (including replay of a
> > journal with sufficiently many revoke blocks) but I'm not sure
> > I can do meaningful performance testing (I cannot quite reproduce
> > the slow replay times even when shutting down the filesystem after
> > deleting 1000000 directories). So can you please give it a spin?
> 
> Alex posted test results on the other rhashtable revoke thread,
> which show both the rhashtable and Jan's dynamically-allocated hash
> table perform much better than the original fixed-size hash table.

Yes, thanks for the testing! Patch submitted.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
