[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250711154012.GB4040@mit.edu>
Date: Fri, 11 Jul 2025 11:40:12 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Jiany Wu <wujianyue000@...il.com>
Cc: yi.zhang@...wei.com, jack@...e.cz, linux-ext4@...r.kernel.org
Subject: Re: Issue with ext4 filesystem corruption when writing to a file
after disk exhaustion
On Fri, Jul 11, 2025 at 05:56:18PM +0800, Jiany Wu wrote:
> Hello, Ted,
>
> Thanks indeed for the help, really appreciated!
> BTW, is it proper to fallocate whole disk space to exhaust disk?
I'm not sure what do you mean by "proper". It depends on what you are
trying to do, I suppose.
The other thing here is I think you are seriously confusing yourself
(and others) by using a loopback file image which si mounted. That's
because now you need worry about failures at two levels; at the level
of the storage device containing the image (e.g., /tmp for the image
/tmp/mydisk) and the loopback file system (e.g., /mnt/tmp when
/tmp/mydisk is mounted on top of /mnt/tmp). You could potentially
have ENOSPC errors at either level.
> I see even fallocate full disk size, seems file size equal to avail
> size still can be allocated.
> i.e. When /tmp availability space is 26G, but fallocate requests 32G
> (total disk space), we see it finally allocated a 26G file, but exit
> code is 1.
You're being ambiguous here. When you say "full disk sice", which
level are you talking about? /tmp or /mnt/test? And when you
fallocate, which are you fallocating.
What I would recommend is to fallocate *first* at the /mnt/mydisk
level. So do this:
# fallocate -l 32G /tmp/mydisk
# mkfs.ext4 /tmp/mydisk
If /tmp only has 26GB of free space, then the fallocate will fail ---
but that's fine. That tells you that you don't have enough free space
to fully allocate the file system image. So *stop*, and do this
somewhere you have enough free space:
# fallocate -l 32G /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk
# mkfs.ext4 /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk
Now you know that no matter what, when you mount mydisk, you don't
need to worry about I/O errors when writing to mydisk. And you can
proceed with your experimentation.
Now, what if you don't have that huge 10TB disk. Can you use
/tmp/mydisk to create a 32TB file system even though /tmp only has
26GB of free space. You *can*. but you need to be careful, because
eventually when you start writing to the mounted file system, you will
eventually run out of space in /tmp.
For example:
% cp /dev/null /tmp/test.img
% ls -lsh /tmp/test.img
0 -rw-r--r-- 1 tytso tytso 0 Jul 11 11:19 /tmp/test.img
% mkfs.ext4 -q /tmp/test.img 32G
% ls -lsh /tmp/test.img
6.4M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:19 /tmp/test.img
So you can see here that we have created a test file system which is
32 GiB in size, but so far, the actual amount of *space* consumed in
/tmp is 6.4 MiB. The i_size of the file is 32 GiB, but it is a sparse
file, which means not all of the blocks between logical offset 0 and
32 GiB have been allocated.
Now, if we mount the file system, as we start writing into the file,
we will allocate space in /tmp. Now, the way fallocate works in the
mounted file system is that it guarantees space in the file system,
but it won't write the data blocks, so space confusmed in /tmp by
/tmp/mydisk will grow only by the space needed when we updated the
metadata blocks in the file system contained in /tmp/mydisk.
% sudo mount /tmp/test.img /mnt/test
% df -h /mnt/test
Filesystem Size Used Avail Use% Mounted on
/dev/loop1 32G 2.1M 30G 1% /mnt/test
% sudo fallocate -l 16G /mnt/test/testfile
1093% df -h /mnt/test
Filesystem Size Used Avail Use% Mounted on
/dev/loop1 32G 17G 14G 55% /mnt/test
% ls -lsh /mnt/test/testfile
17G -rw-r--r-- 1 root root 16G Jul 11 11:25 /mnt/test/testfile
% ls -lsh /tmp/test.img
7.5M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:25 /tmp/test.img
So here, we fallocated 16GB in the file system in /tmp/test.img. You
can see that it created a file which is 16GB in size, but which is a
bit more than 16GB once you include the metadata blocks for
/tmp/test/testfile. That's why the space used is 17GB (the ls program
rounded up) but the i_size is 16GB.
*But* the space consumed by the file /tmp/test.img only went up from
6.4 MiB to 7.5 MiB. That's because although we reserved space in the
file system /tmp/test.img, we didn't reserve any space in /tmp.
This is working as intended; and if what you are doing is "thin
provisioning", this is a feature, but a bug.
But what it means is that if /tmp only has 26GB of space, eventually
if you keep writing to /tmp/test.img, there will be block level errors
in the loop device when /tmp runs out of space. That was what you saw
in your original example:
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024.
As soon as there are block I/O errors in the underlying file system,
all bets are off. There could be data loss (if you had been writing
to a data block when /tmp ran out of space) or file system corruption
(if the kernel had been trying to write a metadata block when /tmp ran
out of space), or possibly both. So as soon as you see block I/O
errors, don't assume that file system is unscathed, because you
probably *will* have lost data or have a corrupted file sytem.
> Is it legal usage or will it trigger some unknown issue? I'm a newbie
> on fallocate:)
So it's *legal* to do thin provisioning; if you are trying to test a
very large file system, and you don't have enough space, then you
might not have a choice. Or if you are trying to be more efficient,
it mgiht allow you to allow users to *think* they have more space than
you actually have purchased, since very often, users don't artually
use; they just want to feel good that they have the space.
And if you have a large number of users, thin provisioning might make
sense because it saves money. But it's much like a bank which has
lent out money that depositors have on deposit, relying on the fact
that it is very rare that all of the depositors will suddenly show up
and withdraw all of their money all at the same time. If that
happens, then you have a run on the bank, and there could be civil
unrest, and things get ugly. Which is why after bank runs, government
regulartors will demand that banks keep more money on reserve, which
lowers their profits and makes the bank's shareholders sad --- but
better that than angry bank customers. :-)
So if you know what you are doing, it *can* work. But it might
trigger an issue which is unknown/unexpected for you, even though for
soemone who understands how things work it makes perfect sense and is
the system working as designed.
If you have lots of disk space, then just use fallocate to allocate
space for /tmp/mydisk, and then you can use fallocate to allocate
space for the file system contained in /tmp/mydisk. And it will all
work, but it will require more disk space to be available in /tmp.
Cheers,
- Ted
Powered by blists - more mailing lists