linux-kernel - Re: [PATCH v8 1/2] fs: New zonefs file system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <63dbc880d4748c5f7f9dc91f80525ec01933370f.camel@wdc.com>
Date:   Wed, 22 Jan 2020 10:07:07 +0000
From:   Damien Le Moal <Damien.LeMoal@....com>
To:     "david@...morbit.com" <david@...morbit.com>
CC:     "hare@...e.de" <hare@...e.de>, Naohiro Aota <Naohiro.Aota@....com>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "darrick.wong@...cle.com" <darrick.wong@...cle.com>,
        "jth@...nel.org" <jth@...nel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>
Subject: Re: [PATCH v8 1/2] fs: New zonefs file system

Dave,

On Wed, 2020-01-22 at 12:57 +1100, Dave Chinner wrote:
> [...]
> > +static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > +			      unsigned int flags, struct iomap *iomap,
> > +			      struct iomap *srcmap)
> > +{
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	loff_t max_isize = zi->i_max_size;
> > +	loff_t isize;
> > +
> > +	/*
> > +	 * For sequential zones, enforce direct IO writes. This is already
> > +	 * checked when writes are issued, so warn about this here if we
> > +	 * get buffered write to a sequential file inode.
> > +	 */
> > +	if (WARN_ON_ONCE(zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > +			 (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT)))
> > +		return -EIO;
> > +
> > +	/*
> > +	 * For all zones, all blocks are always mapped. For sequential zones,
> > +	 * all blocks after the write pointer (inode size) are always unwritten.
> > +	 */
> > +	mutex_lock(&zi->i_truncate_mutex);
> > +	isize = i_size_read(inode);
> > +	if (offset >= isize) {
> > +		length = min(length, max_isize - offset);
> > +		if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> > +			iomap->type = IOMAP_MAPPED;
> > +		else
> > +			iomap->type = IOMAP_UNWRITTEN;
> > +	} else {
> > +		length = min(length, isize - offset);
> > +		iomap->type = IOMAP_MAPPED;
> > +	}
> 
> Something was bugging me about this, and reading the rest of the
> patch it finally triggered. For conventional zones, inode->i_size =
> zi->i_max_size, and so if offset >= isize for a conventional
> zone then this:
> 
> 	length = min(length, max_isize - offset);
> 
> is going to result in length <= 0 and we return a negative length
> iomap.
> 
> IOWs, this case should only trigger for IO into sequential zones,
> as it appears to be prevented at higher layers for conventional
> zones by explicit checks against i_max_size and/or
> iov_iter_truncate() calls to ensure user IOs are limited to within
> i_max_size.
> 
> Hence it looks to me that triggering the (offset >= isize) case here
> for conventional zones is a WARN_ON_ONCE() and return -EIO
> situation...
> 
> SO, perhaps:
> 
> 	isize = i_size_read(inode);
> 	if (offset >= isize) {
> 		if (WARN_ON_ONCE(i->i_ztype == ZONEFS_ZTYPE_CNV)) {
> 			/* drop locks */
> 			return -EIO;
> 		}
> 		length = min(length, max_isize - offset);
> 		iomap->type = IOMAP_UNWRITTEN;
> 	} else {
> 		length = min(length, isize - offset);
> 		iomap->type = IOMAP_MAPPED;
> 	}

Yes, that is much better indeed. I will change this.

> This also seems tailored around the call from zonefs_map_blocks()
> which tries to map the entire zone (length = zi->i_max_size) for
> writeback mappings. Hence the length in this case always requires
> clamping to zi->i_max_size - offset. Again, there's an issue here:
> 
> > +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
> > +			     struct inode *inode, loff_t offset)
> > +{
> > +	if (offset >= wpc->iomap.offset &&
> > +	    offset < wpc->iomap.offset + wpc->iomap.length)
> > +		return 0;
> > +
> > +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
> > +	return zonefs_iomap_begin(inode, offset, ZONEFS_I(inode)->i_max_size,
> > +				  0, &wpc->iomap, NULL);
> 
> Where we pass flags = 0 into zonefs_iomap_begin(), and so there is
> no checking that this writeback code path is only executing against
> a conventional zone. I.e. the comments and checks in
> zonefs_iomap_begin() relate only to user IO call paths, but don't
> validate or comment on the writeback path callers, and there's no
> comments or checks here that the inode points at a conventional
> zone, either....

I do not understand your point here. Since all blocks are always
allocated for both conventional and sequential files, I think that
using i_max_size for calling zonefs_iomap_begin is OK: for conventional
zone files, any of these blocks can be written, both user direct or
through the page cache. No distinction is I think necessary. For
sequential zone files, only the blocks at "offset" can be written, and
that value must be equal to zi->i_wpoffset (which account for in-
flights writes). In both cases, exceeding the max file size is not
allowed so this check is common in zonefs_iomap_begin() to cover all
users and not just zonefs_map_blocks(). Did I get something wrong with
iomap workings ?

> 
> > +static vm_fault_t zonefs_filemap_fault(struct vm_fault *vmf)
> > +{
> > +	struct zonefs_inode_info *zi = ZONEFS_I(file_inode(vmf->vma->vm_file));
> > +	vm_fault_t ret;
> > +
> > +	down_read(&zi->i_mmap_sem);
> > +	ret = filemap_fault(vmf);
> > +	up_read(&zi->i_mmap_sem);
> > +
> > +	return ret;
> > +}
> > +
> > +static vm_fault_t zonefs_filemap_page_mkwrite(struct vm_fault *vmf)
> > +{
> > +	struct inode *inode = file_inode(vmf->vma->vm_file);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	vm_fault_t ret;
> > +
> > +	sb_start_pagefault(inode->i_sb);
> > +	file_update_time(vmf->vma->vm_file);
> > +
> > +	/* Serialize against truncates */
> > +	down_read(&zi->i_mmap_sem);
> > +	ret = iomap_page_mkwrite(vmf, &zonefs_iomap_ops);
> > +	up_read(&zi->i_mmap_sem);
> > +
> > +	sb_end_pagefault(inode->i_sb);
> > +	return ret;
> > +}
> 
> Should there be a WARN_ON_ONCE(zi->zi_type != ZONEFS_ZTYPE_CNV) in
> here?

Yes, that would be useful. I will add that.

> 
> > +static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	loff_t max_pos;
> > +	size_t count;
> > +	ssize_t ret;
> > +
> > +	if (iocb->ki_pos >= zi->i_max_size)
> > +		return 0;
> > +
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!inode_trylock_shared(inode))
> > +			return -EAGAIN;
> > +	} else {
> > +		inode_lock_shared(inode);
> > +	}
> 
> We should really turn that into a generic helper. This pattern is
> being replicated all over the place. Not in this patchset, though...

Yes, I saw that too. Will work on something later.

> > +static int zonefs_report_zones_err_cb(struct blk_zone *zone, unsigned int idx,
> > +				      void *data)
> > +{
> > +	struct inode *inode = data;
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	loff_t pos;
> > +
> > +	/*
> > +	 * The condition of the zone may have change. Check it and adjust the
> > +	 * inode information as needed, similarly to zonefs_init_file_inode().
> > +	 */
> > +	if (zone->cond == BLK_ZONE_COND_OFFLINE) {
> > +		inode->i_flags |= S_IMMUTABLE;
> > +		inode->i_mode &= ~0777;
> > +		zone->wp = zone->start;
> > +	} else if (zone->cond == BLK_ZONE_COND_READONLY) {
> > +		inode->i_flags |= S_IMMUTABLE;
> > +		inode->i_mode &= ~0222;
> > +	}
> 
> This exact code is repeated in zonefs_init_file_inode(). Maybe it
> should be a helper function?

Yes, good idea.

> 
> > +
> > +	pos = (zone->wp - zone->start) << SECTOR_SHIFT;
> > +	zi->i_wpoffset = pos;
> > +	if (i_size_read(inode) != pos) {
> > +		zonefs_update_stats(inode, pos);
> > +		i_size_write(inode, pos);
> > +	}
> 
> What happens if this decreases the size of the zone? don't we need
> to invalidate the page cache beyond the new EOF in this case (i.e.
> it's a truncate operation)?

This is called only for direct write errors into sequential zones.
Since for that case we only deal with append direct writes, there is no
possibility of having any of the written data cached already. So even
if we get a short write or complete failure, no invalidation is needed.

Compared to errors for read operations in any zone, or conventional
zone files read/write errors, this error handling adds processing of
zone condition changes (error due to a zone going offline or read-
only). I could add the same treatment for all IO errors. I did not
since if we start seeing these zone conditions, it is likely that the
drive is about to die. So conventional zone writes and all read errors
are treated like on any other FS: only return the error to the user
without any drive-specific forensic done.

> 
> > +static int zonefs_seq_file_write_failed(struct inode *inode, int error)
> > +{
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	struct super_block *sb = inode->i_sb;
> > +	sector_t sector = zi->i_zsector;
> > +	unsigned int nofs_flag;
> > +	int ret;
> > +
> > +	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
> > +
> > +	/*
> > +	 * blkdev_report_zones() uses GFP_KERNEL by default. Force execution as
> > +	 * if GFP_NOFS was specified so that it will not end up recursing into
> > +	 * the FS on memory allocation.
> > +	 */
> > +	nofs_flag = memalloc_nofs_save();
> > +	ret = blkdev_report_zones(sb->s_bdev, sector, 1,
> > +				  zonefs_report_zones_err_cb, inode);
> > +	memalloc_nofs_restore(nofs_flag);
> 
> The comment is kinda redundant - it's explaining exactly what the
> code does rather than why it needs this protection. i.e. the comment
> should explain the recursion vector/deadlock that we are avoiding
> here...

Yes. Changed it to:

/*
 * Report zones memory allocation could trigger a recursion into zonefs
 * due to memory reclaim. Since this is always called with the inode
 * truncate mutex lock being held, avoid the potential recursion
 * deadlock using a GFP_NOFS allocation.
 */

> 
> > +static int zonefs_file_dio_write_end(struct kiocb *iocb, ssize_t size, int ret,
> > +				     unsigned int flags)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * Conventional zone file size is fixed to the zone size so there
> > +	 * is no need to do anything.
> > +	 */
> > +	if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
> > +		return 0;
> > +
> > +	mutex_lock(&zi->i_truncate_mutex);
> > +
> > +	if (size < 0) {
> > +		ret = zonefs_seq_file_write_failed(inode, size);
> 
> Ok, so I see it is being called from IO completion context, whcih
> means we'd want memalloc_noio_save() because the underlying bio
> doesn't get freed until this whole completion runs, right?

Yes, the failed BIO is freed only after the report zone is done. But
more than GFP_NOIO, we want GFP_NOFS for the reason stated above.

> 
> > +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
> > +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > +	size_t count;
> > +	ssize_t ret;
> > +
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!inode_trylock(inode))
> > +			return -EAGAIN;
> > +	} else {
> > +		inode_lock(inode);
> > +	}
> > +
> > +	ret = generic_write_checks(iocb, from);
> > +	if (ret <= 0)
> > +		goto out;
> > +
> > +	iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
> > +	count = iov_iter_count(from);
> 
> So count is initialised to the entire IO length....

Well, yes, count reflects the truncated iov_iter size. This is
necessary for the AIO case when iomap_dio_rw() returns -EIOCBQUEUED so
that we can account for the inflight AIOs for an eventual subsequent
AIO submission by the user (see next comment below). For sync writes
(or AIOs that completed very quickly), the final value for count is
updated using iomap_dio_rw() return value.

> > +
> > +	/*
> > +	 * Direct writes must be aligned to the block size, that is, the device
> > +	 * physical sector size, to avoid errors when writing sequential zones
> > +	 * on 512e devices (512B logical sector, 4KB physical sectors).
> > +	 */
> > +	if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * Enforce sequential writes (append only) in sequential zones.
> > +	 */
> > +	mutex_lock(&zi->i_truncate_mutex);
> > +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > +	    iocb->ki_pos != zi->i_wpoffset) {
> > +		zonefs_err(inode->i_sb,
> > +			   "Unaligned write at %llu + %zu (wp %llu)\n",
> > +			   iocb->ki_pos, count,
> > +			   zi->i_wpoffset);
> > +		mutex_unlock(&zi->i_truncate_mutex);
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +	mutex_unlock(&zi->i_truncate_mutex);
> > +
> > +	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, &zonefs_dio_ops,
> > +			   is_sync_kiocb(iocb));
> > +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > +	    (ret > 0 || ret == -EIOCBQUEUED)) {
> > +		if (ret > 0)
> > +			count = ret;
> > +		mutex_lock(&zi->i_truncate_mutex);
> > +		zi->i_wpoffset += count;
> > +		mutex_unlock(&zi->i_truncate_mutex);
> 
> Hmmmm. This looks problematic w.r.t. AIO. If we get -EIOCBQUEUED it
> means the IO has been queued but not necessarily submitted, but
> we update zi->i_wpoffset as though the entire AIO has laready
> completed. ANd then we drop the inode_lock() and return, allowing
> another AIO+DIO to be started.
> 
> Hence another concurrent sequential AIO+DIO write could now be
> submitted and pass the above iocb->ki_pos != zi->i_wpoffset check.
> Now we have two independent IOs in flight - one that is at the
> current hardware write pointer offset, and one that is beyond it.
> 
> What happens if the block layer now re-orders these two IOs?

If the correct block scheduler is used, that is mq-deadline, there is
no possibility of write reordering. mq-deadline is now the default IO
scheduler for zoned block devices and the only one that is allowed
(beside "none"). mq-deadline uses a zone write locking mechanism to
ensure that there is no reordering of write requests, either by the
block layer itself or by bad hardware (SATA AHCI adapters are
notoriously bad and silently reorder requests all the time, even for
SMR disks).

With this mechanism, the user can safely use io_submit() beyond a
single IO and zonefs check that the set of AIOs being submitted are all
sequential starting from the zi->i_wpoffset "soft" write pointer that
reflects the already in-flight AIOs. Multiple io_submit() of multiple
AIOs can be executed in sequence without needing to limit to a single
AIO at a time.

If a disk error occurs along the way, the seq file size and zi-
>i_wpoffset are updated using the report zone result. All in-flight or
submitted AIOs after the failed one will be failed by the disk itself
due to the their now unaligned position. These failures will not change
again the file size or zi->i_wpoffset since the zone information will
be the same after all failures. The user only has to look at the file
size again to know were to restart writing from without even needing to
wait for all in-flight AIO to complete with an error (but that would of
course be the recommended practice).

In other word, we assume here that all write succeed and allow high-
queue depth submission using zi->i_wpoffset as a "soft" write pointer.

> > +static struct dentry *zonefs_create_inode(struct dentry *parent,
> > +					const char *name, struct blk_zone *zone)
> > +{
> > +	struct inode *dir = d_inode(parent);
> > +	struct dentry *dentry;
> > +	struct inode *inode;
> > +
> > +	dentry = d_alloc_name(parent, name);
> > +	if (!dentry)
> > +		return NULL;
> > +
> > +	inode = new_inode(parent->d_sb);
> > +	if (!inode)
> > +		goto out;
> > +
> > +	inode->i_ino = get_next_ino();
> 
> get_next_ino() doesn't guarantee inode number uniqueness (it's 32
> bit and global across all filesystems so it can overflow). Are
> duplicate inode numbers on this superblock an issue?

Haa. Indeed. I missed that point. It would be nicer to have unique and
mount persistent per-volume inode numbers, especially considering the
amount of inodes that a large server with hundreds of drives would
generate. It is trivial to create unique inode numbers using each file
zone number on the disk (ordered by start sector) and using numbers
after that for the root directory and per zone type sub-directories.
That makes the inode numbers persistent across remounts. I will change
the code to do that. Having the file inode numbers linked to the on-
disk zones will also be a useful information for debugging any problem.

> 
> > +/*
> > + * Read super block information from the device.
> > + */
> > +static int zonefs_read_super(struct super_block *sb)
> > +{
> > +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
> > +	struct zonefs_super *super;
> > +	u32 crc, stored_crc;
> > +	struct page *page;
> > +	struct bio_vec bio_vec;
> > +	struct bio bio;
> > +	int ret;
> > +
> > +	page = alloc_page(GFP_KERNEL);
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	bio_init(&bio, &bio_vec, 1);
> > +	bio.bi_iter.bi_sector = 0;
> > +	bio_set_dev(&bio, sb->s_bdev);
> > +	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
> > +	bio_add_page(&bio, page, PAGE_SIZE, 0);
> > +
> > +	ret = submit_bio_wait(&bio);
> > +	if (ret)
> > +		goto out;
> > +
> > +	super = page_address(page);
> > +
> > +	stored_crc = le32_to_cpu(super->s_crc);
> > +	super->s_crc = 0;
> > +	crc = crc32(~0U, (unsigned char *)super, sizeof(struct zonefs_super));
> > +	if (crc != stored_crc) {
> > +		zonefs_err(sb, "Invalid checksum (Expected 0x%08x, got 0x%08x)",
> > +			   crc, stored_crc);
> > +		ret = -EIO;
> > +		goto out;
> > +	}
> 
> Does this mean if mount or the kernel tries to autoprobe the
> filesystem type on a device it will get -EIO and an "Invalid
> checksum" error message rather than just silently returning -EINVAL
> because....
> 
> > +	ret = -EINVAL;
> > +	if (le32_to_cpu(super->s_magic) != ZONEFS_MAGIC)
> > +		goto out;
> 
> ... it isn't actually a zonefs filesystem?
> 
> i.e. shouldn't these checks be the other way around?

Good catch. Yes, the other way around definitely makes more sense.

Thank you for all your comments. Posting an update asap.

Best regards.

-- 
Damien Le Moal
Western Digital Research