linux-kernel - [PATCH 2/2] vfs: use &file->f_pos directly on files that have position

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20190413165449.11168-2-kirr@nexedi.com>
Date:   Sat, 13 Apr 2019 16:55:07 +0000
From:   Kirill Smelkov <kirr@...edi.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Al Viro <viro@...iv.linux.org.uk>, Arnd Bergmann <arnd@...db.de>,
        Christoph Hellwig <hch@....de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        <linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        Kirill Smelkov <kirr@...edi.com>
Subject: [PATCH 2/2] vfs: use &file->f_pos directly on files that have position

Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
	itself to the low-level read/write routines. We then changed it to do
	that indirection through a local copy of pos (and
	file_pos_read/file_pos_write) because we didn't do the proper locking,
	so different read/write versions could mess with each other (and with
	lseek).

	But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
	accesses as per POSIX") did was to add the proper locking at least for
	the cases that we care about deeply, so we *could* say that we have
	three cases:

	 - FMODE_ATOMIC_POS: properly locked,
	 - FMODE_STREAM: no pos at all
	 - otherwise a "mostly don't care - don't mix!"

	and so we could go back to not copying the pos at all, and instead do
	something like

	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
	        ret = vfs_write(f.file, buf, count, ppos);

	and perhaps have a long-term plan to try to get rid of the "don't mix"
	case entirely (ie "if you use f_pos, then we'll do the proper
	locking")

	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
	implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@...ux-foundation.org>
Signed-off-by: Kirill Smelkov <kirr@...edi.com>
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1