linux-kernel - Re: [PATCH v4] fs: Fix data race in inode_set_ctime_to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241124183447.GP1926309@frogsfrogsfrogs>
Date: Sun, 24 Nov 2024 10:34:47 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Hao-ran Zheng <zhenghaoran@...a.edu.cn>, viro@...iv.linux.org.uk,
	brauner@...nel.org, jack@...e.cz, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, baijiaju1990@...il.com,
	21371365@...a.edu.cn
Subject: Re: [PATCH v4] fs: Fix data race in inode_set_ctime_to_ts

On Sun, Nov 24, 2024 at 06:56:57PM +0100, Mateusz Guzik wrote:
> On Sun, Nov 24, 2024 at 09:44:35AM -0800, Darrick J. Wong wrote:
> > On Sun, Nov 24, 2024 at 05:42:53PM +0800, Hao-ran Zheng wrote:
> > > A data race may occur when the function `inode_set_ctime_to_ts()` and
> > > the function `inode_get_ctime_sec()` are executed concurrently. When
> > > two threads call `aio_read` and `aio_write` respectively, they will
> > > be distributed to the read and write functions of the corresponding
> > > file system respectively. Taking the btrfs file system as an example,
> > > the `btrfs_file_read_iter` and `btrfs_file_write_iter` functions are
> > > finally called. These two functions created a data race when they
> > > finally called `inode_get_ctime_sec()` and `inode_set_ctime_to_ns()`.
> > > The specific call stack that appears during testing is as follows:
> > > 
> > > ============DATA_RACE============
> > > btrfs_delayed_update_inode+0x1f61/0x7ce0 [btrfs]
> > > btrfs_update_inode+0x45e/0xbb0 [btrfs]
> > > btrfs_dirty_inode+0x2b8/0x530 [btrfs]
> > > btrfs_update_time+0x1ad/0x230 [btrfs]
> > > touch_atime+0x211/0x440
> > > filemap_read+0x90f/0xa20
> > > btrfs_file_read_iter+0xeb/0x580 [btrfs]
> > > aio_read+0x275/0x3a0
> > > io_submit_one+0xd22/0x1ce0
> > > __se_sys_io_submit+0xb3/0x250
> > > do_syscall_64+0xc1/0x190
> > > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > ============OTHER_INFO============
> > > btrfs_write_check+0xa15/0x1390 [btrfs]
> > > btrfs_buffered_write+0x52f/0x29d0 [btrfs]
> > > btrfs_do_write_iter+0x53d/0x1590 [btrfs]
> > > btrfs_file_write_iter+0x41/0x60 [btrfs]
> > > aio_write+0x41e/0x5f0
> > > io_submit_one+0xd42/0x1ce0
> > > __se_sys_io_submit+0xb3/0x250
> > > do_syscall_64+0xc1/0x190
> > > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > 
> > > To address this issue, it is recommended to add WRITE_ONCE
> > > and READ_ONCE when reading or writing the `inode->i_ctime_sec`
> > > and `inode->i_ctime_nsec` variable.
> > 
> > Excuse my ignorance, but how exactly does this annotation fix the race?
> > Does it prevent reordering of the memory accesses or something?  "it is
> > recommended" is not enough for dunces such as myself to figure out why
> > this fixes the problem. :(
> > 
> 
> It prevents the compiler from getting ideas. One hypothetical is
> splitting the load from one asm op to several, possibly resulting a
> corrupted value when racing against an update
> 
> A not hypothethical concerns some like this:
> 	time64_t sec = inode_get_ctime_sec(inode);
> 	/* do stuff with sec */
> 	/* do other stuff */
> 	/* do more stuff sec */
> 
> The compiler might have decided to throw away the read sec value and do
> the load again later. Thus if an update happened in the meantime then
> the code ends up operating on 2 different values, even though the
> programmer clearly intended it to be stable.

<nod> I figured as much, but I wanted the commit message to spell that
out for everybody, e.g. "Use READ_ONCE to force compilers to sample the
ctime values one time only, and WRITE_ONCE to prevent the compiler from
turning one line of code into multiple stores to the same address."

> However, since both sec and nsec are updated separately and there is no
> synchro, reading *both* can still result in values from 2 different
> updates which is a bug not addressed by any of the above. To my
> underestanding of the vfs folk take on it this is considered tolerable.

<nod> This is the other thing -- the commit message refers to preventing
a data race, but there is indeed nothing about access ordering that
prevents /that/ race.  Someone not an fs developer could interpret that
as "the patch does not fully fix all possible problems" whereas the
community has decided there's an acceptable tradeoff with not taking
i_rwsem.

--D