[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e2lcp3n5gpf7zmlpyn4nj7wsr36sffn23z5bmzlsghu6oapi5u@sdkcbpimi5is>
Date: Thu, 28 Mar 2024 17:16:54 -0500
From: Eric Blake <eblake@...hat.com>
To: Stefan Hajnoczi <stefanha@...hat.com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
Alasdair Kergon <agk@...hat.com>, Mikulas Patocka <mpatocka@...hat.com>, dm-devel@...ts.linux.dev,
David Teigland <teigland@...hat.com>, Mike Snitzer <snitzer@...nel.org>, Jens Axboe <axboe@...nel.dk>,
Christoph Hellwig <hch@....de>, Joe Thornber <ejt@...hat.com>
Subject: Re: [RFC 0/9] block: add llseek(SEEK_HOLE/SEEK_DATA) support
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.
As a minor point of clarity (perhaps as much for my own records for
documenting research I've done over the years, and not necessarily
something you need to change in the commit message):
Userspace apps generally use lseek(2) from glibc or similar (perhaps
via its alias lseek64(), depending on whether userspace is using large
file offsets), rather than directly calling the _llseek() syscall.
But it all boils down to the same notion of seeking information about
various special offsets.
Also, in past history, coreutils cp(1) and dd(1) did experiment with
using FS_IOC_FIEMAP ioctls when SEEK_HOLE was not available, but it
proved to cause more problems than it solved, so that is not currently
in favor. Yes, we could teach more and more block devices to expose
specific ioctls for querying sparseness boundaries, and then teach
userspace apps a list of ioctls to try; but as cp(1) already learned,
having one common interface is much easier than an ever-growing ioctl
ladder to be copied across every client that would benefit from
knowing where the unallocated portions are.
> This can speed up the process by reducing the amount of data read and it
> preserves sparseness when writing to the output file.
>
> This patch series is an initial attempt at implementing
> llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
> approach and suggestions for resolving the open issues.
One of your open issues was whether adjusting the offset of the block
device itself should also adjust the file offset of the underlying
file (at least in the case of loopback and dm-linear). What would the
community think of adding two additional constants to the entire
family of *seek() functions, that have the effect of returning the
same offset as their SEEK_HOLE/SEEK_DATA counterparts but without
moving the file offset?
Explaining the idea by example, although I'm not stuck on these names:
suppose you have an fd visiting a file description of 2MiB in size,
with the first 1MiB being a hole and the second being data.
#define MiB (1024*1024)
lseek64(fd, MiB, SEEK_SET); // returns MiB, file offset changed to MiB
lseek64(fd, 0, SEEK_HOLE); // returns 0, file offset changed to 0
lseek64(fd, 0, SEEK_DATA); // returns MiB, file offset changed to MiB
lseek64(fd, 0, SEEK_PEEK_HOLE); // returns 0, but file offset left at MiB
lseek64(fd, 0, SEEK_SET); // returns 0, file offset changed to MiB
lseek64(fd, 0, SEEK_PEEK_DATA); // returns MiB, but file offset left at MiB
With semantics like that, it might be easier to implement just
SEEK_PEEK* in devices (don't worry about changing offsets, just about
reporting where the requested offset is), and then have a common layer
do the translation from llseek(...,offs,SEEK_HOLE) into a 2-step
llseek(...,llseek(...,offs,SEEK_PEEK_HOLE),SEEK_SET) if that makes life
easier under the hood.
>
> In the block device world there are similar concepts to holes:
> - SCSI has Logical Block Provisioning where the "mapped" state would be
> considered data and other states would be considered holes.
BIG caveat here: the SCSI spec does not necessarily guarantee that
unmapped regions read as all zeroes; compare the difference between
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE)
on a regular file guarantees that future read() in that hole will see
NUL bytes, I'm not sure whether we want to make that guarantee for
block devices. This may be yet another case where we might want to
add new SEEK_* constants to the *seek() family of functions that lets
the caller indicate whether they want offsets that are guaranteed to
read as zero, vs. merely offsets that are not allocated but may or may
not read as zero. Skipping unallocated portions, even when you don't
know if the contents reliably read as zero, is still a useful goal in
some userspace programs.
> - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
However, utilizing it in nbd.ko would require teaching the kernel to
handle structured or extended headers (right now, that is an extension
only supported in user-space implementations of the NBD protocol). I
can see why you did not tackle that in this RFC series, even though
you mention it in the cover letter.
> - Linux loop block devices and dm-linear targets can pass through queries to
> the backing file.
> - dm-thin targets can query metadata to find holes.
> - ...and you may be able to think of more examples.
>
> Therefore it is possible to offer this functionality in block drivers.
>
> In my use case a QEMU process in userspace copies the contents of a dm-thin
> target. QEMU already uses SEEK_HOLE but that doesn't work on dm-thin targets
> without this patch series.
>
> Others have also wished for block device support for SEEK_HOLE. Here is an open
> issue from the BorgBackup project:
> https://github.com/borgbackup/borg/issues/5609
>
> With these patches userspace can identify holes in loop, dm-linear, and dm-thin
> devices. This is done by adding a seek_hole_data() callback to struct
> block_device_operations. When the callback is NULL the entire device is
> considered data. Device-mapper is extended along the same lines so that targets
> can provide seek_hole_data() callbacks.
>
> I'm unfamiliar with much of this code and have probably missed locking
> requirements. Since llseek() executes synchronously like ioctl() and is not an
> asynchronous I/O request it's possible that my changes to the loop block driver
> and dm-thin are broken (e.g. what if the loop device fd is changed during
> llseek()?).
>
> To run the tests:
>
> # make TARGETS=block_seek_hole -C tools/testing/selftests run_tests
>
> The code is also available here:
> https://gitlab.com/stefanha/linux/-/tree/block-seek-hole
>
> Please take a look and let me know your thoughts. Thanks!
>
> Stefan Hajnoczi (9):
> block: add llseek(SEEK_HOLE/SEEK_DATA) support
> loop: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add loop block driver tests
> dm: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add dm-zero test
> dm-linear: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add dm-linear test
> dm thin: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add dm-thin test
>
> tools/testing/selftests/Makefile | 1 +
> .../selftests/block_seek_hole/Makefile | 17 +++
> include/linux/blkdev.h | 7 ++
> include/linux/device-mapper.h | 5 +
> block/fops.c | 43 ++++++-
> drivers/block/loop.c | 36 +++++-
> drivers/md/dm-linear.c | 25 ++++
> drivers/md/dm-thin.c | 77 ++++++++++++
> drivers/md/dm.c | 68 ++++++++++
> .../testing/selftests/block_seek_hole/config | 3 +
> .../selftests/block_seek_hole/dm_thin.sh | 80 ++++++++++++
> .../selftests/block_seek_hole/dm_zero.sh | 31 +++++
> .../selftests/block_seek_hole/map_holes.py | 37 ++++++
> .../testing/selftests/block_seek_hole/test.py | 117 ++++++++++++++++++
> 14 files changed, 540 insertions(+), 7 deletions(-)
> create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
> create mode 100644 tools/testing/selftests/block_seek_hole/config
> create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh
> create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
> create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
> create mode 100755 tools/testing/selftests/block_seek_hole/test.py
>
> --
> 2.44.0
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
Powered by blists - more mailing lists