Message-ID: <20201230062819.yinrrp6uwfegsqo3@alap3.anarazel.de>
Date: Tue, 29 Dec 2020 22:28:19 -0800
From: Andres Freund <andres@...razel.de>
To: linux-fsdevel@...r.kernel.org
Cc: linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-block@...r.kernel.org
Subject: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten
extents?
Hi,
For things like database journals, a plain fallocate() (mode 0) is not
sufficient, as writing into the pre-allocated space with O_DIRECT | O_DSYNC
writes requires the unwritten extents to be converted, which in turn
requires filesystem journal operations.
The performance difference in a journalling workload (lots of
sequential, low-iodepth, often small, writes) is quite remarkable. Even
on quite fast devices:
andres@...rk3:/mnt/t3$ grep /mnt/t3 /proc/mounts
/dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
andres@...rk3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
andres@...rk3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
andres@...rk3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
andres@...rk3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
andres@...rk3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
andres@...rk3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
The way around that, from a database's perspective, is obviously to just
overwrite the file "manually" after fallocate()ing it, utilizing larger
writes, and then to recycle the file.
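A minimal sketch of that userspace workaround, in Python for brevity. The
function name preallocate_and_zero and the 1 MiB chunk size are illustrative
assumptions, not taken from any particular database. It uses ordinary
buffered writes followed by fsync() rather than O_DIRECT, to sidestep buffer
alignment requirements:

```python
import os


def preallocate_and_zero(path, size, chunk_size=1024 * 1024):
    """Allocate `size` bytes for `path`, then overwrite them with zeroes.

    Hypothetical helper: the overwrite is what forces the filesystem to
    convert unwritten extents up front, so that later small O_DSYNC
    writes into the file do not have to.
    """
    zeroes = bytes(chunk_size)  # one reusable buffer of zero bytes
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        # Reserve the space first (the equivalent of fallocate with mode 0).
        os.posix_fallocate(fd, 0, size)
        offset = 0
        while offset < size:
            n = min(chunk_size, size - offset)
            # pwrite may write fewer bytes than requested; advance by the
            # number actually written.
            offset += os.pwrite(fd, zeroes[:n], offset)
        # Make sure the zeroes (and the extent conversion) reach the device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return offset
```

Something like preallocate_and_zero("test_file", 1024 * 1024 * 1024) would
stand in for the first, slow dd run above, just with far larger writes.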
But that's a fair bit of unnecessary IO from userspace, and it's IO that
the kernel can do more efficiently on a number of types of block
devices, e.g. by utilizing write-zeroes.
Which brings me to $subject:
Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
doesn't convert extents into unwritten extents, but instead uses
blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
myself, but ...
Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
sense, as that'd work reasonably efficiently to initialize newly
allocated space as well as for zeroing out previously used file space.
As blkdev_issue_zeroout() already has a fallback path, it seems this
should be doable without too much concern for which devices support
write-zeroes and which do not?
Greetings,
Andres Freund