[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <878qddoz8v.wl-me@linux.beauty>
Date: Sun, 01 Feb 2026 12:37:36 +0800
From: Li Chen <me@...ux.beauty>
To: Matthew Wilcox <willy@...radead.org>
Cc: Mark Fasheh <mark@...heh.com>,
Joel Becker <jlbec@...lplan.org>,
Joseph Qi <joseph.qi@...ux.alibaba.com>,
ocfs2-devel <ocfs2-devel@...ts.linux.dev>,
linux-kernel <linux-kernel@...r.kernel.org>,
Jan Kara <jack@...e.com>
Subject: Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads
Hi Matthew,
Thank you very much for the detailed explanation and for your patience.
On Sat, 31 Jan 2026 00:36:28 +0800,
Matthew Wilcox wrote:
>
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> >
> > > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> > > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> > > > holding journal->j_list_lock.
> > > >
> > > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
> > >
> > > I don't think this is the right solution to the problem. If it is,
> > > there needs to be much better argumentation in the commit message.
> > >
> > > As I understand it, jbd2_journal_file_inode() initialises jinode,
> > > then adds it to the t_inode_list, then drops the j_list_lock. So the
> > > actual problem we need to address is that there's no memory barrier
> > > between the store to i_dirty_start and the list_add(). Once that's
> > > added, there's no need for a READ_ONCE here.
> > >
> > > Or have I misunderstood the problem?
> >
> > Thanks for the review.
> >
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the prior
> > i_dirty_* stores. Is that the issue you had in mind?
>
> I think that's the only issue that exists ...
Understood.
>
> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> > necessary ordering, since both the i_dirty_* updates and the list_add()
> > happen inside the same critical section.
>
> I don't think that's true. I think what you're asserting is that:
>
> int *pi;
> int **ppi;
>
> spin_lock(&lock);
> *pi = 1;
> *ppi = pi;
> spin_unlock(&lock);
>
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.
Yes, agreed ― thank you. I was implicitly assuming the reader had taken the same lock
at some point, which is not a valid assumption for a lockless reader.
> > The ocfs2 case I was aiming at is different: the filesystem callback is
> > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> > j_list_lock but it still reads jinode->i_dirty_start/end while other
> > threads update these fields under the lock. Adding a barrier between the
> > stores and list_add() would not address that concurrent update window.
>
> I don't think that race exists. If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).
>
> > "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> > (callback may sleep); these fields are updated under j_list_lock in jbd2.
> > Use READ_ONCE() so the callback takes a single snapshot via actual loads
> > from the variable (i.e. don't let the compiler reuse a value kept in a register
> > or fold multiple reads)."
>
> I think the prevention of this race occurs at a higher level than
> "it's updated under a lock". That is, jbd2_journal_file_inode()
> is never called for a jinode which is currently being operated on by
> j_submit_inode_data_buffers(). Now, I'm not an expert on the jbd code,
> so I may be wrong here.
Thanks. I tried to sanity-check whether that “never called” invariant holds
in practice.
I added a small local-only tracepoint (not for upstream) which fires from
jbd2_journal_file_inode() when it observes JI_COMMIT_RUNNING already set
on the same jinode:
/* fs/jbd2/transaction.c */
if (unlikely(jinode->i_flags & JI_COMMIT_RUNNING))
trace_jbd2_file_inode_commit_running(...);
The trace event prints dev, ino, current tid, jinode flags, and the
i_transaction / i_next_transaction tids.
With an ext4 test (ordered mode) I do see repeated hits. Trace output:
... jbd2_submit_inode_data: dev 7,0 ino 20
... jbd2_file_inode_commit_running: dev 7,0 ino 20 tid 3 op 0x6 i_flags 0x7
j_tid 2 j_next 3 ... comm python3
So it looks like jbd2_journal_file_inode() can run while JI_COMMIT_RUNNING
is set for that inode, i.e. during the window where the commit thread drops
j_list_lock around ->j_submit_inode_data_buffers() / ->j_finish_inode_data_buffers().
Given this, would you prefer the series to move towards something like:
1. taking a snapshot of i_dirty_start/end under j_list_lock in the commit path and passing the snapshot
to the filesystem callback (so callbacks never read jinode->i_dirty_* locklessly), or
2. introducing a real synchronization mechanism for the dirty range itself (seqcount/atomic64/etc)?
3. something else.
I’d be very grateful for guidance on what you consider the most appropriate direction or point out something I'm wrong.
Thanks again.
Regards,
Li
Powered by blists - more mailing lists