linux-kernel - Re: [PATCH] fs/read_write.c: Fix a broken signed integer overflow check.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <51e4f5f6-957b-6df5-0b73-e25bcbc08bb8@tuxera.com>
Date:   Fri, 18 Feb 2022 14:52:18 +0200
From:   Ari Sundholm <ari@...era.com>
To:     Al Viro <viro@...iv.linux.org.uk>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        <linux-kernel@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
        <stable@...r.kernel.org>, Anton Altaparmakov <anton@...era.com>
Subject: Re: [PATCH] fs/read_write.c: Fix a broken signed integer overflow
 check.

On 2/7/22 7:07 PM, Al Viro wrote:
> On Mon, Feb 07, 2022 at 06:44:55PM +0200, Ari Sundholm wrote:
>> Hello, Al,
>>
>> On 2/7/22 16:58, Al Viro wrote:
>>> On Mon, Feb 07, 2022 at 02:07:11PM +0200, Ari Sundholm wrote:
>>>> The function generic_copy_file_checks() checks that the ends of the
>>>> input and output file ranges do not overflow. Unfortunately, there is
>>>> an issue with the check itself.
>>>>
>>>> Due to the integer promotion rules in C, the expressions
>>>> (pos_in + count) and (pos_out + count) have an unsigned type because
>>>> the count variable has the type uint64_t. Thus, in many cases where we
>>>> should detect signed integer overflow to have occurred (and thus one or
>>>> more of the ranges being invalid), the expressions will instead be
>>>> interpreted as large unsigned integers. This means the check is broken.
>>>
>>> I must be slow this morning, but... which values of pos_in and count are
>>> caught by your check, but not by the original?
>>>
>>
>> Thank you for your response and questions.
>>
>> Assuming an x86-64 target platform, please consider:
>>
>> loff_t pos_out = 0x7FFFFFFFFFFEFFFFLL;
>> and
>> uint64_t count = 65537;
>>
>> The type of the expression (pos_out + count) is a 64-bit unsigned type, by
>> C's integer promotion rules. Its value is 0x8000000000000000ULL, that is,
>> bit 63 is set.
>>
>> The comparison (pos_out + count) < pos_out, again due to C's integer
>> promotion rules, is unsigned. Thus, the comparison, in this case, is
>> equivalent to:
>>
>> 0x8000000000000000ULL < 0x7FFFFFFFFFFEFFFFULL,
>>
>> which is false. Please note that the LHS is not expressible as a positive
>> integer of type loff_t. With larger values for count, the problem should
>> become quite obvious, as some the offsets within the file would not be
>> expressible as positive integers of type loff_t. But I digress. As we can
>> see above, the overflow is missed.
>>
>> With the LHS explicitly cast to loff_t, the comparison is equivalent to:
>>
>> 0x8000000000000000LL < 0x7FFFFFFFFFFEFFFFLL,
>>
>> which is true, as the LHS is negative.
>>
>> This has also been verified in practice, and was detected when running tests
>> on special cases of the copy_file_range syscall on different filesystems.
> 
> Er...  I still don't see the problem here.  If the destination filesystem
> explicitly allows offsets in excess of 2^63, what's the point in that
> -EOVERFLOW?  And if it doesn't, you'll get count truncated by
> generic_write_check_limits(), down to the amount remaining until the
> fs limit...
> 
> Same on the input side - if your source file is at least 2^63, what's the
> problem?  And if not, you'll get count capped by file size - pos_in, right
> under that check...
> 
Oops. You are of course correct - the patch is broken for files with 
unsigned offsets. This stems from the mistaken assumption that the 
comparison was supposed to be a signed one all along. Regardless, it 
seems an appropriate amount of flagellation is in order for disregarding 
the existence of files with unsigned offsets.

However, my concerns with the behavior of generic_copy_file_checks() are 
not limited to just this line, although just improving the check would 
be a considerable step forward. If I may, I would like to make some 
points to clarify what I find the problems with generic_copy_file_checks().

At the surface level, the comparison in question, be it signed or 
unsigned, should consider and match the signedness of each of the files. 
*Both* the original code and the patch seem wrong to me here. It is also 
remarkable that the far more common case of signed offsets is the one 
not considered here.

As for more fundamental issues, first, the behavior is inconsistent with 
pread64 and pwrite64, for instance. Using the same offset and length as 
those of the out file in the example given, both pread64 and pwrite64 
immediately return -EINVAL, as rw_verify_area(), which performs these 
checks correctly, is called in vfs_read() and vfs_write() *before* 
tampering with the length of the read or write request.

The second fundamental issue is, indeed, the very tampering with the 
length of the copy before properly checking the ranges. Especially so in 
the case of signed offsets. The operation, as a whole, must fail, as the 
requested range within the output file exceeds what can be expressed 
using loff_t. I think it would be wise to follow the lead of 
p{read,write}64 here, and fail early, as the range(s) being invalid is 
independent of any filesystem-specific considerations (apart from the 
offsets being signed). Returning partial success is not very useful, as 
userspace cannot tell anything is wrong, and will call copy_file_range() 
again in an attempt to complete the copy, inevitably forcing an error to 
be returned.

(Basically, what I am trying to argue here that this is *categorically* 
a different case from errors arising from the state of a specific 
volume, such as a write failing due to ENOSPC, or exceeding some limit 
dictated by the environment or filesystem specification. The range(s) 
involved in the operation are simply inexpressible with the datatypes used.)


 > Which filesystems had been involved and what was the test?
 >

To demonstrate the variation in behavior both between copy_file_range 
and p{read,write}64 and different filesystem implementations, I carried 
out a small experiment, where the equivalent of the following sequence 
was run from userspace (pseudocode, most error handling omitted):

	const loff_t orig_pos_in = 0;
	const loff_t orig_pos_out = 0x7FFFFFFFFFFEFFFFLL;
	const size_t orig_length = 65537;
	loff_t pos_in = orig_pos_in;
	loff_t pos_out = orig_pos_out;
	size_t length = orig_length;
	ssize_t ret, written = 0;
	int fd_in = mkstemp(...);
	int fd_out = mkstemp(...);
	const char buf[128k] = { whatever };

	write(fd_in, buf, 128k); /* Magically always succeeds. */

	while (written < orig_length &&
		(ret = copy_file_range(fd_in, &pos_in, fd_out, &pos_out,
		length, 0)) > 0)
	{
		written += ret;
		length -= ret;
	}

	pos_out = orig_pos_out;
	length = orig_length;
	written = 0;
	while (written < orig_length &&
		(ret = pwrite(fd_out, buf, length, pos_out) > 0)
	{
		written += ret;
		pos_out += ret;
		length -= ret;
	}

(The range parameters are shamelessly stolen from xfstests test case 
generic/564.)

Then, the behavior for various filesystems was observed:

btrfs, xfs:
  - copy_file_range is called twice.
  - First copy_file_range copies 65536 bytes. Partial success.
  - Second copy_file_range returns -EFBIG.
  - pwrite is called once, and returns -EINVAL, having written nothing.

exfat, ext{2,3,4}, f2fs, fuse2fs, ntfs3, reiserfs, vfat:
  - copy_file_range is called once, and returns -EFBIG.
  - pwrite is called once, and returns -EINVAL, having written nothing.

hfsplus:
  - copy_file_range is called once, and returns -EINVAL.
  - pwrite is called once, and returns -EINVAL, having written nothing.

Reading the source code of vfs_read() and a few manual test runs show 
that pread64 has similar behavior to pwrite64 when invoked with similar 
parameters.

Unlike in the case of pread64 and pwrite64, the copy range operation 
(after the length was reduced) is allowed to continue on to 
filesystem-specific code, where there is some variation in how the 
operation continues. Remarkably, with the sole exception of hfsplus, the 
end result is the same regardless of whether the filesystem 
implementation and specification support sparse files. Sure, it is 
perfectly POSIX'y to allow the filesystem implementation perform a 
partial write and only return an error on the second call, but the 
difference in behavior with pread64 and pwrite64 is very odd, and seems 
unnecessary.

Please note that, had the experiment been run on files with unsigned 
offsets, and the copy parameters been set in an analogous manner so that 
they cause unsigned overflow, generic_copy_file_checks() would reject 
the operation with -EOVERFLOW. Why is the same not done for signed 
offsets? Why is there a difference with p{read,write}64?

Best regards,
Ari Sundholm
ari@...era.com