Message-ID: <20250508202424.GA30222@mit.edu>
Date: Thu, 8 May 2025 16:24:24 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Christoph Hellwig <hch@....de>, "Darrick J. Wong" <djwong@...nel.org>,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	linux-block@...r.kernel.org, dm-devel@...ts.linux.dev,
	linux-nvme@...ts.infradead.org, linux-scsi@...r.kernel.org,
	linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
	john.g.garry@...cle.com, bmarzins@...hat.com, chaitanyak@...dia.com,
	shinichiro.kawasaki@....com, brauner@...nel.org, yi.zhang@...wei.com,
	chengzhihao1@...wei.com, yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute

On Thu, May 08, 2025 at 08:17:14PM +0800, Zhang Yi wrote:
> On 2025/5/8 13:01, Christoph Hellwig wrote:
> >>
> >> My idea is not to strictly limiting the use of FALLOC_FL_WRITE_ZEROES to
> >> only bdev or files where bdev_unmap_write_zeroes() returns true. In
> >> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
> >> are not consistent, they are two independent features. Even if some
> >> devices STATX_ATTR_WRITE_ZEROES_UNMAP are not set, users should still be
> >> allowed to call fallcoate(FALLOC_FL_WRITE_ZEROES). This is because some
> >> devices and drivers currently cannot reliably ascertain whether they
> >> support the unmap write zero command; however, certain devices, such as
> >> specific cloud storage devices, do support it. Users of these devices
> >> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing
> >> process.
> >
> > What are those "cloud storage devices" where you set it reliably,
> > i.e.g what drivers?
>
> I don't have these 'cloud storage devices' now, but Ted had mentioned
> those cloud-emulated block devices such as Google's Persistent Desk or
> Amazon's Elastic Block Device in.
> I'm not sure if they can accurately
> report the BLK_FEAT_WRITE_ZEROES_UNMAP feature, maybe Ted can give more
> details.
>
> https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/

There's nothing really exotic about what I was referring to in terms
of "cloud storage devices".  Perhaps a better way of describing them
is to consider devices such as dm-thin, or a Ceph Block Device, which
is being exposed as a SCSI or NVMe device.

The distinction I was trying to make is performance-related.  Suppose
you call WRITE_ZEROS on a 14TB region.  After the WRITE_ZEROS
completes, a read anywhere on that 14TB region will return zeros.
That's easy.  But the question is when you call WRITE_ZEROS, will the
storage device (a) go away for a day or more before it completes
(which would be the case if it is a traditional spinning rust
platter), or (b) will it be basically instantaneous, because all
dm-thin or a Ceph Block Device needs to do is to delete one or more
entries in its mapping table.

The problem is two-fold.  First, there's no way for the kernel to know
whether a storage device will behave as (a) or (b), because SCSI and
other storage specifications say that performance is out of scope.
They only talk about the functional results (afterwards, if you try
to read from the region, you will get zeros), and are utterly silent
about how long it might take.

The second problem is that if you are an application program, there is
no way you will be willing to call fallocate(WRITE_ZEROS, 14TB) if you
don't know whether the disk will go away for a day or whether it will
be instantaneous.

But because there is no way for the kernel to know whether WRITE_ZEROS
will be fast or not, how would you expect the kernel to expose
STATX_ATTR_WRITE_ZEROES_UNMAP?  Christoph's formulation "breaking the
abstraction" perfectly encapsulates the SCSI specification's position
on the matter, and I agree it's a valid position.  It's just not
terribly useful for the application programmer.
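
To make the (a) versus (b) distinction concrete, here is a minimal
sketch, not taken from the patch series.  It uses the existing
FALLOC_FL_ZERO_RANGE mode as a stand-in for the proposed
FALLOC_FL_WRITE_ZEROES flag, and the names zero_region(),
zero_by_writing(), try_fast and used_fast are hypothetical.  Both
paths leave the region reading back as zeros; the only difference is
how long the call might take, which is exactly what the caller cannot
learn in advance.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>	/* FALLOC_FL_ZERO_RANGE */
#include <sys/types.h>
#include <unistd.h>

/* Slow path: write explicit zero buffers; cost grows with len. */
int zero_by_writing(int fd, off_t off, off_t len)
{
	static const char zeros[64 * 1024];	/* zero-initialized */

	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			(size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, off);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;
		off += n;
		len -= n;
	}
	return 0;
}

/*
 * Zero [off, off + len).  If try_fast is nonzero, ask the kernel first
 * and fall back to plain writes when the mode is unsupported.  Returns
 * 0 on success, -1 on error; *used_fast reports which path was taken.
 * Nothing here says how long the fast path will actually take.
 */
int zero_region(int fd, off_t off, off_t len, int try_fast, int *used_fast)
{
	*used_fast = 0;
	if (try_fast) {
		if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0) {
			*used_fast = 1;
			return 0;
		}
		if (errno != EOPNOTSUPP && errno != ENOSYS)
			return -1;
	}
	return zero_by_writing(fd, off, len);
}
```

A caller would pass try_fast = 0 on storage where the fast path is
known (by whatever out-of-band means) to be slow.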

Things which some programs/users might want to know or rely upon, but
which are normally quite impossible to find out, are:

* Will the write zero / discard operation take a "reasonable" amount
  of time?  (Yes, not necessarily well defined, but we know it when
  we see it, and hours or days is generally not reasonable.)

* Is the operation reliable --- i.e., is the device allowed to
  randomly decide that it won't actually zero the requested blocks (as
  is the case of discard) whenever it feels like it?

* Is the operation guaranteed to make the data irretrievable even in
  the face of an attacker with low-level access to the device?  (And
  this is also not necessarily well defined; does the attacker have
  access to a scanning electron microscope, or can they do a liquid
  nitrogen destructive access of the flash device?)

The UFS (Universal Flash Storage) spec comes the closest to providing
commands that distinguish between these various cases, but for most
storage specifications, like SCSI, answering these questions
absolutely requires peeking behind the abstraction barrier defined by
the specification, and so ultimately, the kernel can't know.

About the best you can do is to require manual configuration; perhaps
a config file at the database or userspace cluster file system level
because the system administrator knows --- maybe because the
hyperscale cloud provider has leaned on the storage vendor to tell
them under NDA, storage specs be damned or they won't spend $$$
millions with that storage vendor --- or because the database
administrator discovers that using fallocate(WRITE_ZEROS) causes
performance to tank, so they manually disable the use of WRITE_ZEROS.

Could this be done in the kernel?  Sure.  We could have a file, say,
/sys/block/sdXX/queue/write_zeros where the write_zeros file is
writeable, and so the administrator can force-disable WRITE_ZEROS by
writing 0 into the file.  And could this be queried via a STATX
attribute?  I suppose, although to be honest, I'm used to doing this
by looking at the sysfs files.

For example, just recently I coded up the following:

static int is_rotational (const char *device_name EXT2FS_ATTR((unused)))
{
	int		rotational = -1;
#ifdef __linux__
	char		path[1024];
	struct stat	st;
	FILE		*f;

	if ((stat(device_name, &st) < 0) || !S_ISBLK(st.st_mode))
		return -1;

	/* Whole-disk devices have a queue/ directory of their own. */
	snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/queue/rotational",
		major(st.st_rdev), minor(st.st_rdev));
	f = fopen(path, "r");
	if (!f) {
		/* Partitions don't; look in the parent disk's queue/. */
		snprintf(path, sizeof(path),
			"/sys/dev/block/%d:%d/../queue/rotational",
			major(st.st_rdev), minor(st.st_rdev));
		f = fopen(path, "r");
	}
	if (f) {
		if (fscanf(f, "%d", &rotational) != 1)
			rotational = -1;
		fclose(f);
	}
#endif
	return rotational;
}

Easy-peasy!  Who needs statx?  :-)

					- Ted
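
The same lookup generalizes to any per-queue attribute, including the
write_zeros file floated above.  The following is only a sketch under
stated assumptions: read_queue_attr(), queue_attr_for_device() and the
sysfs_root parameter are hypothetical names, and a write_zeros
attribute does not exist in sysfs today (rotational does).  The root
directory is a parameter only so that the helper can be exercised
against a scratch directory instead of the real /sys.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* major(), minor() */

/*
 * Read an integer from <sysfs_root>/dev/block/<maj>:<min>/queue/<attr>.
 * If that fails, retry one directory up, which is where the queue/
 * directory lives when <maj>:<min> is a partition.  Returns the value
 * read, or -1 if the attribute is missing or unparsable.
 */
int read_queue_attr(const char *sysfs_root, unsigned int maj,
		    unsigned int min, const char *attr)
{
	const char *up[] = { "", "../" };
	char path[1024];
	size_t i;

	for (i = 0; i < sizeof(up) / sizeof(up[0]); i++) {
		int value = -1;
		FILE *f;
		int n = snprintf(path, sizeof(path),
				 "%s/dev/block/%u:%u/%squeue/%s",
				 sysfs_root, maj, min, up[i], attr);

		if (n < 0 || (size_t)n >= sizeof(path))
			return -1;	/* path would be truncated */
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%d", &value) != 1)
			value = -1;
		fclose(f);
		return value;
	}
	return -1;
}

/* Convenience wrapper: resolve a block device node, then look in /sys. */
int queue_attr_for_device(const char *device_name, const char *attr)
{
	struct stat st;

	if (stat(device_name, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;
	return read_queue_attr("/sys", major(st.st_rdev),
			       minor(st.st_rdev), attr);
}
```

With this, queue_attr_for_device(dev, "rotational") returns the same
answer as is_rotational(dev), and a userspace config check for a
future knob would just pass a different attribute name.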