lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <BYAPR04MB58160E26E9A84B307C8E3E38E70F0@BYAPR04MB5816.namprd04.prod.outlook.com>
Date:   Thu, 23 Jan 2020 13:21:39 +0000
From:   Damien Le Moal <Damien.LeMoal@....com>
To:     Dave Chinner <david@...morbit.com>
CC:     "hare@...e.de" <hare@...e.de>, Naohiro Aota <Naohiro.Aota@....com>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "darrick.wong@...cle.com" <darrick.wong@...cle.com>,
        "jth@...nel.org" <jth@...nel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>
Subject: Re: [PATCH v8 1/2] fs: New zonefs file system

On 2020/01/23 8:11, Dave Chinner wrote:
[...]
>>> This also seems tailored around the call from zonefs_map_blocks()
>>> which tries to map the entire zone (length = zi->i_max_size) for
>>> writeback mappings. Hence the length in this case always requires
>>> clamping to zi->i_max_size - offset. Again, there's an issue here:
>>>
>>>> +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
>>>> +			     struct inode *inode, loff_t offset)
>>>> +{
>>>> +	if (offset >= wpc->iomap.offset &&
>>>> +	    offset < wpc->iomap.offset + wpc->iomap.length)
>>>> +		return 0;
>>>> +
>>>> +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
>>>> +	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
>>>> +				  0, &wpc->iomap, NULL);
>>>
>>> Where we pass flags = 0 into zonefs_iomap_begin(), and so there is
>>> no checking that this writeback code path is only executing against
>>> a conventional zone. I.e. the comments and checks in
>>> zonefs_iomap_begin() relate only to user IO call paths, but don't
>>> validate or comment on the writeback path callers, and there's no
>>> comments or checks here that the inode points at a conventional
>>> zone, either....
>>
>> I do not understand your point here. Since all blocks are always
>> allocated for both conventional and sequential files, I think that
>> using i_max_size for calling zonefs_iomap_begin is OK:
> 
> Yes, it is, but that wasn't really the point I was trying to make.
> My comments around passing in the max size here means that
> zonefs_iomap_begin() has to do clamping (i.e. the length = min(length,
> max_isize - offset) calls) just for this caller, as all the other
> callers from the user IO path already have their offset/lengths
> clamped to isize/max_isize. Hence if zonefs_map_blocks() clamped
> the length it passed to (i_max_size - offset) like all other callers
> do, then code in zonefs_iomap_begin() would be simpler.

Aaah, I see it now. Indeed, that would be cleaner.

>> for conventional
>> zone files, any of these blocks can be written, both user direct or
>> through the page cache. No distinction is I think necessary. For
>> sequential zone files, only the blocks at "offset" can be written, and
>> that value must be equal to zi->i_wpoffset (which account for in-
>> flights writes). In both cases, exceeding the max file size is not
>> allowed so this check is common in zonefs_iomap_begin() to cover all
>> users and not just zonefs_map_blocks(). Did I get something wrong with
>> iomap workings ?
> 
> No, the point I was trying to make (unsucessfully!) is that
> zonefs_map_blocks() is only called from buffered writeback, and
> so it only called on conventional zone writes. Neither
> zonefs_map_blocks() or zonefs_iomap_begin() check this, and
> zonefs_iomap_begin() can't because it doesn't have any flags passed
> into it to tell it that a mapping for a write is being done.
> 
> i.e. somewhere in this zonefs_map_blocks() codepath there needs to
> be a check like:
> 
> 	WARN_ON_ONCE(zi->zi_type != ZONEFS_ZTYPE_CNV);
> 
> because we should never get here for sequential zones.
> 
> And that then raises the question - if we can only get here for
> conventional zones, then wouldn't the code read better using
> the inode->i_size rather that the zi->i_max_size as all IO to
> conventional zones must be within the inode size?

Got it. It is obvious now :) Will clean that.

>>>> +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
>>>> +	zi->i_wpoffset = pos;
>>>> +	if (i_size_read(inode) != pos) {
>>>> +		zonefs_update_stats(inode, pos);
>>>> +		i_size_write(inode, pos);
>>>> +	}
>>>
>>> What happens if this decreases the size of the zone? don't we need
>>> to invalidate the page cache beyond the new EOF in this case (i.e.
>>> it's a truncate operation)?
>>
>> This is called only for direct write errors into sequential zones.
>> Since for that case we only deal with append direct writes, there is no
>> possibility of having any of the written data cached already. So even
>> if we get a short write or complete failure, no invalidation is needed.
> 
> Ah, there's a undocumented assumption that a write error never
> resets the zone write pointer completely, but only remains unchanged
> from where it was prior to the write that failed. My concern is what
> happens if the device decides that the error has caused the zone to
> be completely lost and so resets the write pointer back to zero?

Such behavior is forbidden by the ZBC/ZAC specifications. If it happens, we
would be fighting with a drive that has serious firmware bugs. So I am not
sure if it is worth adding code for that. But at the very least, adding a
warning would be nice.

> And the other concern here is what if the hardware write pointer
> still moves forward and exposes stale data because the write failed?

The write pointer can indeed move forward even in the case of a failed
write. This typically happens if a bad sector is hit in the middle of a
write. In this case the write pointer is updated to point right after the
last written sector, so no stale data is exposed. This is mandated by the
ZBC/ZAC specifications. If the drive firmware is bad and the WP goes beyond
the last written sector, we would have no way of detecting the stale data.
But on SMR, those sectors will most likely be unreadable anyway. Movement
of the write pointer beyond the IO end though can be tested for.
I will improve the checks on error.

>> Compared to errors for read operations in any zone, or conventional
>> zone files read/write errors, this error handling adds processing of
>> zone condition changes (error due to a zone going offline or read-
>> only). I could add the same treatment for all IO errors. I did not
>> since if we start seeing these zone conditions, it is likely that the
>> drive is about to die.
> 
> Ok, so it's not expected, but it sounds like in extreme
> circumstances it can still occur, and hence we still should try to
> handle such errors in a sane manner.

Yes. We can improve the checks. And if an unexpected (non standard) change
is detected, at the very least throw a warning, and possibly remount
read-only. For zones going offline or read-only, that is handled already
and the file attributes changed accordingly.

>> So conventional zone writes and all read errors
>> are treated like on any other FS: only return the error to the user
>> without any drive-specific forensic done.
> 
> Sure, but they don't go through this new error path :)

No they don't, but looking some more at the code, it is not too hard to
include it for all IOs to all zone types. I will do the change.

>>>> +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
>>>> +				     unsigned int flags)
>>>> +{
>>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>>>> +
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Conventional zone file size is fixed to the zone size so there
>>>> +	 * is no need to do anything.
>>>> +	 */
>>>> +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
>>>> +		return 0;
>>>> +
>>>> +	mutex_lock(&zi->i_truncate_mutex);
>>>> +
>>>> +	if (size < 0) {
>>>> +		ret = zonefs_seq_file_write_failed(inode, size);
>>>
>>> Ok, so I see it is being called from IO completion context, whcih
>>> means we'd want memalloc_noio_save() because the underlying bio
>>> doesn't get freed until this whole completion runs, right?
>>
>> Yes, the failed BIO is freed only after the report zone is done. But
>> more than GFP_NOIO, we want GFP_NOFS for the reason stated above.
> 
> Yes, I can see that GFP_NOFS is needed to avoid truncate lock
> recursion. However, we are in an IO completion routine here holding
> a bio, so what I'm asking is whether reclaim recursion back into the
> block layer and allocating more bios (e.g.  for swap to a
> conventional zone within the same zoned block device) is safe to do
> while we hold a bio from the same bioset that swap bios will be
> allocated from...
> 
> i.e. doesn't this violate the forwards progress guarantee we need
> for bioset mempools? i.e. we now can't free a bio if nested
> allocation of a bio blocks waiting for a bio to be freed...

Swap does not work on zoned block devices :) But I see your point. I think
it is indeed safer to use GFP_NOIO. Will change to that.

>>>> +	/*
>>>> +	 * Direct writes must be aligned to the block size, that is, the device
>>>> +	 * physical sector size, to avoid errors when writing sequential zones
>>>> +	 * on 512e devices (512B logical sector, 4KB physical sectors).
>>>> +	 */
>>>> +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Enforce sequential writes (append only) in sequential zones.
>>>> +	 */
>>>> +	mutex_lock(&zi->i_truncate_mutex);
>>>> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>>>> +	    iocb->ki_pos != zi->i_wpoffset) {
>>>> +		zonefs_err(inode->i_sb,
>>>> +			   "Unaligned write at %llu + %zu (wp %llu)\n",
>>>> +			   iocb->ki_pos, count,
>>>> +			   zi->i_wpoffset);
>>>> +		mutex_unlock(&zi->i_truncate_mutex);
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>> +	}
>>>> +	mutex_unlock(&zi->i_truncate_mutex);
>>>> +
>>>> +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
>>>> +			   is_sync_kiocb(iocb));
>>>> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>>>> +	    (ret > 0 || ret == -EIOCBQUEUED)) {
>>>> +		if (ret > 0)
>>>> +			count = ret;
>>>> +		mutex_lock(&zi->i_truncate_mutex);
>>>> +		zi->i_wpoffset += count;
>>>> +		mutex_unlock(&zi->i_truncate_mutex);
>>>
>>> Hmmmm. This looks problematic w.r.t. AIO. If we get -EIOCBQUEUED it
>>> means the IO has been queued but not necessarily submitted, but
>>> we update zi->i_wpoffset as though the entire AIO has laready
>>> completed. ANd then we drop the inode_lock() and return, allowing
>>> another AIO+DIO to be started.
>>>
>>> Hence another concurrent sequential AIO+DIO write could now be
>>> submitted and pass the above iocb->ki_pos != zi->i_wpoffset check.
>>> Now we have two independent IOs in flight - one that is at the
>>> current hardware write pointer offset, and one that is beyond it.
>>>
>>> What happens if the block layer now re-orders these two IOs?
>>
>> If the correct block scheduler is used, that is mq-deadline, there is
>> no possibility of write reordering.
> 
> Oh, my.
> 
> That needs a great big warning in the code. This assumes the block
> layer functions in a specific manner, and there is no way to
> guarantee that it does at the filesystem layer. Hence if the block
> layer is subtly broken (which has happened far too many times in the
> past couple of years for me to just trust it anymore) then this code
> can result in spurious write failures for applications that use
> AIO+DIO...

Yes. Absolutely true. I will add comments explaining this point. Of note is
that the same is true not only for AIOs but also for large sync write DIOs
too as these can get split into multiple write requests that can get
reordered without the correct system setup.

>> mq-deadline is now the default IO
>> scheduler for zoned block devices and the only one that is allowed
>> (beside "none"). mq-deadline uses a zone write locking mechanism to
>> ensure that there is no reordering of write requests, either by the
>> block layer itself or by bad hardware (SATA AHCI adapters are
>> notoriously bad and silently reorder requests all the time, even for
>> SMR disks).
>>
>> With this mechanism, the user can safely use io_submit() beyond a
>> single IO and zonefs check that the set of AIOs being submitted are all
>> sequential starting from the zi->i_wpoffset "soft" write pointer that
>> reflects the already in-flight AIOs. Multiple io_submit() of multiple
>> AIOs can be executed in sequence without needing to limit to a single
>> AIO at a time.
> 
> I can see lots of potential problems with AIO on a filesystem that
> assumes sequential, ordered AIO submission. e.g. RWF_NOWAIT and
> submitting multiple sequential IOs at a time. First IO gets EAGAIN
> because a lock is held by something else, second IO gets the lock
> and now returns -EINVAL because it's offset no longer matches the
> write pointer because the first IO got -EAGAIN and punted back to
> userspace.

Ah, yes indeed. I overlooked this case. I guess we have several solutions:
(1) return an error if RWF_NOWAIT is specified, (2) Ignore it and proceed
as if it is not specified, or (3) keep as is and let the user handle the
error pattern.

This error case is still safe with regard to the file size/zone WP: the
file size is kept unchanged from the value inspected for the first -EAGAIN
failed AIO and the user can restart writing from there. Not too bad. But I
am leaning toward solution (2) to reduce these potential spurious errors.

> Or worse, it's io_uring, and it punts that IO to a worker thread to
> resubmit. Now that IO will be issued out of order to all the others,
> and so userspace will see that it succeeds, but all the other IOs in
> the sequential batch get -EINVAL because of IO reordering long
> before the IO even gets to the block layer....

Yes, same here with the difference that we would have requests being
uselessly sent to the disk and failed with EIO instead of EINVAL. But
again, this is still safe with regard to the inode size/write pointer as no
change happen.

>> If a disk error occurs along the way, the seq file size and zi-
>>> i_wpoffset are updated using the report zone result. All in-flight or
>> submitted AIOs after the failed one will be failed by the disk itself
>> due to the their now unaligned position. These failures will not change
>> again the file size or zi->i_wpoffset since the zone information will
>> be the same after all failures. The user only has to look at the file
>> size again to know were to restart writing from without even needing to
>> wait for all in-flight AIO to complete with an error (but that would of
>> course be the recommended practice).
> 
> So I'm guessing that the same failure condition will return
> different errors based on where the failure is detected. e.g. EINVAL
> if it's at the write submission layer and EIO if it is reported by
> the hardware?

Yes, correct. With some additional work in the block layer, we could also
return -EINVAL for all unaligned writes failed by the disk. This would
indicate a soft recoverable error rather than the generally more serious
EIO. At the drive level, ZBC and ZAC define sense codes for unaligned
writes which could be mapped to a new blk status code, which in turn can be
converted to a -EINVAL for failed unaligned write bios. That can be done
separately from zonefs though. Will look at it.

>> In other word, we assume here that all write succeed and allow high-
>> queue depth submission using zi->i_wpoffset as a "soft" write pointer.
> 
> I'm starting to wonder whether it is a good idea to even support AIO
> on sequential zones because there are some really messy spurious
> failure cases that userspace will not be able to distinguish from
> real write errors. That's not a very nice API for applications to
> have to deal with...

Yes, but I would argue that zonefs does not make this error processing
harder than with the raw zoned block device use case. If anything, zonefs
is already an improvement over the complete lack of zone device specific
error handling of the block device file IO path. Unless I get a lot of
push-back, I would rather keep AIOs and work on suppressing bad patterns
due to RWF_NOWAIT and other corner cases.

> At minimum, this needs extensive documentation both for users and
> for kernel filesystem people that need to maintain this code forever
> more...

Yes, OK. I will add more comments and improve the documentation file
mentioning these points.

Thank you for your comments.

Best regards.

-- 
Damien Le Moal
Western Digital Research

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ