linux-kernel - Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <vivemjs4vjvnfjapgnesrsk664rpk43l6w4sydfyrwryxrphkb@bduhqm5fnjou>
Date: Tue, 3 Feb 2026 13:37:12 +0100
From: Jan Kara <jack@...e.cz>
To: Li Chen <me@...ux.beauty>
Cc: Jan Kara <jack@...e.cz>, Matthew Wilcox <willy@...radead.org>, 
	Mark Fasheh <mark@...heh.com>, Joel Becker <jlbec@...lplan.org>, 
	Joseph Qi <joseph.qi@...ux.alibaba.com>, ocfs2-devel <ocfs2-devel@...ts.linux.dev>, 
	linux-kernel <linux-kernel@...r.kernel.org>, Jan Kara <jack@...e.com>
Subject: Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

On Tue 03-02-26 20:10:42, Li Chen wrote:
> Hi Jan,
> 
>  ---- On Tue, 03 Feb 2026 01:17:49 +0800  Jan Kara <jack@...e.cz> wrote --- 
>  > On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
>  > > On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
>  > > > Hi Matthew,
>  > > > 
>  > > >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
>  > > >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
>  > > >  > > holding journal->j_list_lock.
>  > > >  > > 
>  > > >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
>  > > >  > 
>  > > >  > I don't think this is the right solution to the problem.  If it is,
>  > > >  > there needs to be much better argumentation in the commit message.
>  > > >  > 
>  > > >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
>  > > >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
>  > > >  > actual problem we need to address is that there's no memory barrier
>  > > >  > between the store to i_dirty_start and the list_add().  Once that's
>  > > >  > added, there's no need for a READ_ONCE here.
>  > > >  > 
>  > > >  > Or have I misunderstood the problem?
>  > > > 
>  > > > Thanks for the review.
>  > > > 
>  > > > My understanding of your point is that you're worried about a missing
>  > > > "publish" ordering in jbd2_journal_file_inode(): we store
>  > > > jinode->i_dirty_start/end and then list_add() the jinode to
>  > > > t_inode_list, and a core which observes the list entry might miss the prior
>  > > > i_dirty_* stores. Is that the issue you had in mind?
>  > > 
>  > > I think that's the only issue that exists ...
>  > > 
>  > > > If so, for the normal commit path where the list is walked under
>  > > > journal->j_list_lock (e.g. journal_submit_data_buffers() in
>  > > > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
>  > > > necessary ordering, since both the i_dirty_* updates and the list_add()
>  > > > happen inside the same critical section.
>  > > 
>  > > I don't think that's true.  I think what you're asserting is that:
>  > > 
>  > >     int *pi;
>  > >     int **ppi;
>  > > 
>  > >     spin_lock(&lock);
>  > >     *pi = 1;
>  > >     *ppi = pi;
>  > >     spin_unlock(&lock);
>  > > 
>  > > that the store to *pi must be observed before the store to *ppi, and
>  > > that's not true for a reader which doesn't read the value of lock.
>  > > The store to *ppi needs a store barrier before it.
>  > 
>  > Well, the above reasonably accurately describes the code making jinode
>  > visible. The reader code is like:
>  > 
>  >     spin_lock(&lock);
>  >         pi = *ppi;
>  >     spin_unlock(&lock);
>  > 
>  >     work with pi
>  > 
>  > so it is guaranteed to see pi properly initialized. The problem is that
>  > "work with pi" can race with other code updating the content of pi which is
>  > what this patch is trying to deal with.
>  > 
>  > > > The ocfs2 case I was aiming at is different: the filesystem callback is
>  > > > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
>  > > > j_list_lock but it still reads jinode->i_dirty_start/end while other
>  > > > threads update these fields under the lock. Adding a barrier between the
>  > > > stores and list_add() would not address that concurrent update window.
>  > > 
>  > > I don't think that race exists.  If it does exist, the READ_ONCE will
>  > > not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
>  > > platforms do not, in general, have a way to do an atomic 64-bit load
>  > > (look at the implementation of i_size_read() for the gyrations we go
>  > > through to assure a non-torn read of that value).
>  > 
>  > Sadly the race does exist - journal_submit_data_buffers() on the committing
>  > transaction can run in parallel with jbd2_journal_file_inode() in the
>  > running transaction. There's nothing preventing that. The problems arising
>  > out of that are mostly theoretical but they do exist. In particular you're
>  > correct that on 32-bit platforms this will be racy even with READ_ONCE /
>  > WRITE_ONCE which I didn't realize.
>  > 
>  > Li, the best way to address this concern would be to modify jbd2_inode to
>  > switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
>  > bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
>  > the passed ranges properly...
> 
>   Thank you, Jan. I will update the series accordingly.
> 
>   It seems this won’t break large files on 32-bit either: MAX_LFS_FILESIZE there is
>   ULONG_MAX << PAGE_SHIFT (i.e. ULONG_MAX pages, ~16TB with 4K pages), and pgoff_t
>   (unsigned long) can represent the same ULONG_MAX-page range.

Exactly. That's the reason why I've proposed this but I should have
probably stated that explicitely :).

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR