[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <4e1600a7-24d4-454a-87d6-6528b1c4c88d@pengutronix.de>
Date: Wed, 7 May 2025 18:24:34 +0200
From: Ahmad Fatoum <a.fatoum@...gutronix.de>
To: "Darrick J. Wong" <djwong@...nel.org>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Ulf Hansson <ulf.hansson@...aro.org>
Cc: Jan Lübbe <jlu@...gutronix.de>,
"kernel@...gutronix.de" <kernel@...gutronix.de>,
linux-block@...r.kernel.org, linux-mmc@...r.kernel.org,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: [RFC] block: Make MMC respect REQ_NOUNMAP?
Hi,
We would like to tell storage (mostly eMMC, but not exclusively though) to
discard a block range, so it reads back zeroes and is preferably unmapped
to give the storage device the most flexibility. For eMMCs, this is possible
right now with blkdiscard -z (BLKZEROOUT), but digging through the code with Jan,
I am starting to question whether eMMC is correctly implementing REQ_OP_WRITE_ZEROES
(granted, the expected semantics seem to be spelt out nowhere).
Read along for what MMC does, what semantics BLKZEROOUT seems to should have,
and what I think might need to be done to address this.
eMMC supports a number of different commands for erasing/discarding data[1].
Relevant to my question are two commands:
DISCARD: Host no longer needs the data and doesn't care about value
read from it. Card may remove or unmap portions or all of it.
TRIM: implies DISCARD and additionally guarantees to
return all-zero or all-ones on read.
These are made available to the block layer as follows:
REQ_OP_DISCARD -> DISCARD
REQ_OP_WRITE_ZEROES -> TRIM (if erased_byte == 0, otherwise
if all-ones, blk_queue_max_write_zeroes_sectors() is not called)
Looking at it from the ioctl side:
blkdev_common_ioctl(..., BLKZEROOUT, ...)
blk_ioctl_zeroout
blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
__blkdev_issue_write_zeroes(..., BLKDEV_ZERO_NOUNMAP)
submit_bio_wait(bio->bi_opf = REQ_NOUNMAP)
__REQ_NOUNMAP has comment saying 'do not free blocks when zeroing',
but as shown above, TRIM allows the card to unmap the indicated region.
REQ_NOUNMAP has no other documentation, but virtio inverts it and translates
it to VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, which is documented as follows[2]:
"For write zeroes commands, if the unmap is set, the device MAY deallocate
the specified range of sectors in the device backend storage, as if the
discard command had been sent.".
I.e. REQ_NOUNMAP -> device MAY NOT deallocate
This is at odds with the MMC implementation, which ignores REQ_NOUNMAP
completely it seems. I don't believe there is a MMC command for write zeroes
without discard short of actually writing zeroes, so it sounds like the
correct implementation for MMC would be:
if (req->cmd_flags & REQ_NOUNMAP)
// or w/e causes fallback to __blkdev_issue_zero_pages
return -EOPNOTSUPP;
// do TRIM as before
Of course, this will change user visible behavior: blkdiscard -z will start
taking much longer for most users. These users will have to migrate to
using fallocate instead:
blkdev_fallocate(mode = FALLOC_FL_ZERO_RANGE):
blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
blkdev_fallocate(mode = FALLOC_FL_PUNCH_HOLE)
blkdev_issue_zeroout(..., BLKDEV_ZERO_NOFALLBACK)
So, it's not a drop-in replacement. I guess user code can punch hole
with fallback to BLKZEROOUT if it fails in order to get back the old
behavior.
I must admit I don't even know why one would write zeroes and care about
them remaining mapped on the storage device, but that seems to be
what's expected with BLKZEROOUT.
Thoughts? What did I miss?
Thanks,
Ahmad
[1]: For a short and incomplete summary, see:
https://github.com/barebox/barebox/commit/91a11c7d50df91
[1]: https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html
5.2.6.2 Device Requirements: Device Operation
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Powered by blists - more mailing lists