[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <4238c09a-c28d-43b6-8d1a-a5debcb2d8c5@suse.com>
Date: Wed, 1 Nov 2023 10:42:53 +0800
From: Heming Zhao <heming.zhao@...e.com>
To: Joseph Qi <joseph.qi@...ux.alibaba.com>,
Srivathsa Dara <srivathsa.d.dara@...cle.com>, mark@...heh.com,
jlbec@...lplan.org
Cc: rajesh.sivaramasubramaniom@...cle.com, junxiao.bi@...cle.com,
ocfs2-devel@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] ocfs2: call ocfs2_abort when journal abort
On 11/1/23 09:13, Joseph Qi wrote:
>
>
> On 10/30/23 8:00 PM, Srivathsa Dara wrote:
>> From: Ryan Ding <ryan.ding@...cle.com>
>>
>>
>> journal can not recover from abort state, so we should take following
>> action to prevent file system from corruption:
>>
>> 1. change to readonly filesystem when local mount. We can not afford
>> further write, so change to RO state is reasonable.
>>
>> 2. panic when cluster mount. Because we can not release lock resource in
>> this state, other node will hung when it require a lock owned by this
>> node. So panic and remaster is a reasonable choise.
>>
>> ocfs2_abort() will do all the above work.
>>
>> Signed-off-by: Ryan Ding <ryan.ding@...cle.com>
>> Signed-off-by: Srivathsa Dara <srivathsa.d.dara@...cle.com>
>> ---
>> fs/ocfs2/journal.c | 27 +++++++++++++++------------
>> 1 file changed, 15 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index ce215565d061..6dace475f019 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -14,7 +14,6 @@
>> #include <linux/kthread.h>
>> #include <linux/time.h>
>> #include <linux/random.h>
>> -#include <linux/delay.h>
>> #include <linux/writeback.h>
>>
>> #include <cluster/masklog.h>
>> @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct ocfs2_super *osb, int quota)
>>
>> static int ocfs2_commit_thread(void *arg)
>> {
>> - int status;
>> + int status = 0;
>> struct ocfs2_super *osb = arg;
>> struct ocfs2_journal *journal = osb->journal;
>>
>> @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
>> wait_event_interruptible(osb->checkpoint_event,
>> atomic_read(&journal->j_num_trans)
>> || kthread_should_stop());
>> + if (status < 0) {
>> + /* As we can not terminate by ourself, just enter an
>> + * empty loop to wait for stop.
>> + */
>> + continue;
>> + }
>>
>> status = ocfs2_commit_cache(osb);
>> if (status < 0) {
>> - static unsigned long abort_warn_time;
>> -
>> - /* Warn about this once per minute */
>> - if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
>> - mlog(ML_ERROR, "status = %d, journal is "
>> - "already aborted.\n", status);
>> /*
>> - * After ocfs2_commit_cache() fails, j_num_trans has a
>> - * non-zero value. Sleep here to avoid a busy-wait
>> - * loop.
>> + * journal can not recover from abort state, there is
>> + * no need to keep commit cache. So we should either
>> + * change to readonly(local mount) or just panic
>> + * (cluster mount).
>> + * We should also clear j_num_trans to prevent further
>> + * commit.
>> */
>> - msleep_interruptible(1000);
>> + atomic_set(&journal->j_num_trans, 0);
>
> Unconditionally clear 'j_num_trans' here seems buggy.
> This may cause other nodes corrupt filesystem.
Yes, you described this buggy condition during review my patch:
https://lore.kernel.org/ocfs2-devel/2b90546e-d23e-7d45-22dd-a887a73f3e61@linux.alibaba.com/
btw, I ever send a v2 patch [1] about handling journal abort:
[1] https://lore.kernel.org/ocfs2-devel/20230626112453.22571-1-heming.zhao@suse.com/
This patch never get any feedback.
And I also send another mail [2] related with above patch:
[2] https://lore.kernel.org/ocfs2-devel/20230626150916.je3egonzb3crvbl5@p15/
please note, [2] is the problems from fsck.ocfs2, which is completely unrelated to [1].
Heming
>
> Thanks,
> Joseph
>
>> + ocfs2_abort(osb->sb, "Detected aborted journal");
>> }
>>
>> if (kthread_should_stop() && atomic_read(&journal->j_num_trans)){
>
Powered by blists - more mailing lists