Message-ID: <20250508202424.GA30222@mit.edu>
Date: Thu, 8 May 2025 16:24:24 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Christoph Hellwig <hch@....de>, "Darrick J. Wong" <djwong@...nel.org>,
linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-block@...r.kernel.org, dm-devel@...ts.linux.dev,
linux-nvme@...ts.infradead.org, linux-scsi@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
john.g.garry@...cle.com, bmarzins@...hat.com, chaitanyak@...dia.com,
shinichiro.kawasaki@....com, brauner@...nel.org, yi.zhang@...wei.com,
chengzhihao1@...wei.com, yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
On Thu, May 08, 2025 at 08:17:14PM +0800, Zhang Yi wrote:
> On 2025/5/8 13:01, Christoph Hellwig wrote:
> >>
> >> My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
> >> only bdev or files where bdev_unmap_write_zeroes() returns true. In
> >> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
> >> are not consistent, they are two independent features. Even if some
> >> devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be
> >> allowed to call fallocate(FALLOC_FL_WRITE_ZEROES). This is because some
> >> devices and drivers currently cannot reliably ascertain whether they
> >> support the unmap write zero command; however, certain devices, such as
> >> specific cloud storage devices, do support it. Users of these devices
> >> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing
> >> process.
> >
> > What are those "cloud storage devices" where you set it reliably,
> > i.e. what drivers?
>
> I don't have these 'cloud storage devices' now, but Ted had mentioned
> those cloud-emulated block devices such as Google's Persistent Disk or
> Amazon's Elastic Block Device in. I'm not sure if they can accurately
> report the BLK_FEAT_WRITE_ZEROES_UNMAP feature, maybe Ted can give more
> details.
>
> https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/

There's nothing really exotic about what I was referring to in terms
of "cloud storage devices". Perhaps a better way of describing them
is to consider devices such as dm-thin, or a Ceph Block Device, which
is being exposed as a SCSI or NVMe device.

The distinction I was trying to make is performance-related. Suppose
you call WRITE_ZEROS on a 14TB region. After the WRITE_ZEROS
completes, a read anywhere on that 14TB region will return zeros.
That's easy. But the question is, when you call WRITE_ZEROS, will the
storage device (a) go away for a day or more before it completes (which
would be the case if it is a traditional spinning rust platter), or
(b) will it be basically instantaneous, because all dm-thin or a Ceph
Block Device needs to do is to delete one or more entries in its
mapping table?

The problem is two-fold. First, there's no way for the kernel to know
whether a storage device will behave as (a) or (b), because SCSI and
other storage specifications say that performance is out of scope.
They only talk about the functional results (afterwards, if you try
to read from the region, you will get zeros), and are utterly silent
about how long it might take. The second problem is that if you are an
application program, there is no way you will be willing to call
fallocate(WRITE_ZEROS, 14TB) if you don't know whether the disk will
go away for a day or whether it will be instantaneous.

But because there is no way for the kernel to know whether WRITE_ZEROS
will be fast or not, how would you expect the kernel to expose
STATX_ATTR_WRITE_ZEROES_UNMAP? Christoph's formulation "breaking the
abstraction" perfectly encapsulates the SCSI specification's position
on the matter, and I agree it's a valid position. It's just not
terribly useful for the application programmer.

Things which some programs/users might want to know or rely upon, but
which are normally quite impossible to find out, are:

* Will the write zero / discard operation take a "reasonable" amount
  of time? (Yes, not necessarily well defined, but we know it when
  we see it, and hours or days is generally not reasonable.)

* Is the operation reliable, i.e., is the device allowed to
  randomly decide that it won't actually zero the requested blocks (as
  is the case of discard) whenever it feels like it?

* Is the operation guaranteed to make the data irretrievable even in
  the face of an attacker with low-level access to the device? (And
  this is also not necessarily well defined; does the attacker have
  access to a scanning electron microscope, or can they do a liquid
  nitrogen destructive access of the flash device?)

The UFS (Universal Flash Storage) spec comes the closest to providing
commands that distinguish between these various cases, but for most
storage specifications, like SCSI, it absolutely requires peeking
behind the abstraction barrier defined by the specification, and so
ultimately, the kernel can't know.

About the best you can do is to require manual configuration; perhaps a
config file at the database or userspace cluster file system level,
because the system administrator knows (maybe because the hyperscale
cloud provider has leaned on the storage vendor to tell them under
NDA, storage specs be damned, or they won't spend $$$ millions with
that storage vendor), or because the database administrator discovers
that using fallocate(WRITE_ZEROS) causes performance to tank, so they
manually disable the use of WRITE_ZEROS.
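
A userspace knob of that kind could be sketched roughly as follows.
Everything here is an assumption made for illustration: the
APP_WRITE_ZEROES environment variable stands in for whatever config
file the database uses, and write_zeroes_allowed(), zero_by_writing()
and zero_range() are hypothetical names. FALLOC_FL_WRITE_ZEROES is the
flag proposed in this patch series, so it is only used when the
installed headers define it; otherwise (or when the administrator has
opted out, or the call fails) the code zeroes the range the slow way.

```c
#define _GNU_SOURCE		/* for fallocate() and the FALLOC_FL_* flags */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical knob: APP_WRITE_ZEROES=0 means "never use WRITE_ZEROES" */
static int write_zeroes_allowed(void)
{
	const char *v = getenv("APP_WRITE_ZEROES");

	return !(v && strcmp(v, "0") == 0);
}

/* Slow path: zero [off, off + len) by writing a zero-filled buffer */
static int zero_by_writing(int fd, off_t off, off_t len)
{
	static const char zeros[65536];

	assert(off >= 0 && len >= 0);
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			(size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, off);

		if (n < 0 && errno == EINTR)
			continue;	/* interrupted, retry the same chunk */
		if (n <= 0)
			return -1;
		off += n;
		len -= n;
	}
	return 0;
}

/* Try the offload if permitted and available, else fall back */
static int zero_range(int fd, off_t off, off_t len)
{
	if (write_zeroes_allowed()) {
#ifdef FALLOC_FL_WRITE_ZEROES
		if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, off, len) == 0)
			return 0;
#endif
	}
	return zero_by_writing(fd, off, len);
}
```

Note that the knob only records what the administrator already
believes about the device; nothing in this code can find out whether
the offload will be fast.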

Could this be done in the kernel? Sure. We could have a file, say,
/sys/block/sdXX/queue/write_zeros where the write_zeros file is
writeable, and so the administrator can force-disable WRITE_ZEROS by
writing 0 into the file. And could this be queried via a STATX
attribute? I suppose, although to be honest, I'm used to doing this
by looking at the sysfs files. For example, just recently I coded up
the following:

static int is_rotational(const char *device_name EXT2FS_ATTR((unused)))
{
	int		rotational = -1;	/* -1 means "unknown" */
#ifdef __linux__
	char		path[1024];
	struct stat	st;
	FILE		*f;

	if ((stat(device_name, &st) < 0) || !S_ISBLK(st.st_mode))
		return -1;

	/* A whole-disk device has its own queue/ directory */
	snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/queue/rotational",
		 major(st.st_rdev), minor(st.st_rdev));
	f = fopen(path, "r");
	if (!f) {
		/* For a partition, queue/ lives under the parent disk */
		snprintf(path, sizeof(path),
			 "/sys/dev/block/%d:%d/../queue/rotational",
			 major(st.st_rdev), minor(st.st_rdev));
		f = fopen(path, "r");
	}
	if (f) {
		if (fscanf(f, "%d", &rotational) != 1)
			rotational = -1;
		fclose(f);
	}
#endif
	return rotational;
}

Easy-peasy! Who needs statx? :-)
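
The same approach carries over to the write zeroes side. As a sketch
under stated assumptions: read_ull_file() and read_queue_attr() are
hypothetical helper names, and the attribute read in the usage note is
queue/write_zeroes_max_bytes, an existing read-only sysfs file that is
0 when the device does not support the write zeroes offload.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Read one unsigned integer from a sysfs-style file; -1 on failure */
static int read_ull_file(const char *path, unsigned long long *val)
{
	FILE *f;
	int ok;

	assert(path != NULL && val != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Look up queue/<attr> for a block device node.  Partitions have no
 * queue/ directory of their own, so retry via the parent disk.
 */
static int read_queue_attr(const char *device_name, const char *attr,
			   unsigned long long *val)
{
	char path[1024];
	struct stat st;

	assert(device_name != NULL && attr != NULL);
	if (stat(device_name, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;
	snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/queue/%s",
		 major(st.st_rdev), minor(st.st_rdev), attr);
	if (read_ull_file(path, val) == 0)
		return 0;
	snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/../queue/%s",
		 major(st.st_rdev), minor(st.st_rdev), attr);
	return read_ull_file(path, val);
}
```

A caller would use something like
read_queue_attr("/dev/sda", "write_zeroes_max_bytes", &max), where
/dev/sda is only an example device. A non-zero result says the offload
is supported, not that it is fast, which is the distinction this whole
message is about.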
- Ted