Message-Id: <20220901221845.1753773-1-tytso@mit.edu>
Date: Thu, 1 Sep 2022 18:18:45 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Ext4 Developers List <linux-ext4@...r.kernel.org>
Cc: "Theodore Ts'o" <tytso@....edu>
Subject: [RFC PATCH] ext4: limit the number of retries after discarding preallocation blocks
This patch avoids threads live-locking for hours when a large number
of threads are competing over the last few free extents as blocks get
added to and removed from preallocation pools. From our bug
reporter:
A reliable way to trigger this is to have multiple writers
continuously write() to files while the filesystem is full and
small amounts of space are freed (e.g. by truncating a large file
1MiB at a time). On a local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPC) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way, since the Linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE RPCs which result in ENOSPC -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.
When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.
The real-world scenario here is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safekeeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiBs of data have
been written and read.
The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)
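To make the reported workload concrete, the writer side of such a
reproducer can be as small as the following sketch (my illustration,
not code from the bug report); run many copies at once while a
separate process frees small amounts of space, e.g. with truncate(2):

/*
 * Hypothetical reproducer sketch for the scenario described above.
 * Each instance appends to its own file and deliberately ignores the
 * return value of write(), so it keeps issuing writes after ENOSPC.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	static char buf[1 << 20];	/* 1 MiB per write() call */
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	for (;;)
		(void)write(fd, buf, sizeof(buf));	/* ENOSPC ignored on purpose */
}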
[ DO NOT APPLY: This patch has only been lightly tested, although it
does appear to keep NFS write threads from getting tied up for
*hours* before the system recovers. Sending to the linux-ext4
mailing list for comments only. -- Ted ]
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
fs/ext4/mballoc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..8ab8ab8b0f42 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5565,6 +5565,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ext4_fsblk_t block = 0;
unsigned int inquota = 0;
unsigned int reserv_clstrs = 0;
+ int retries = 0;
u64 seq;
might_sleep();
@@ -5667,7 +5668,8 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ar->len = ac->ac_b_ex.fe_len;
}
} else {
- if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
+ if (++retries < 3 &&
+ ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
goto repeat;
/*
* If block allocation fails then the pa allocated above
--
2.31.0
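
For reviewers who don't have mballoc.c in front of them, the effect of
the change can be seen in this standalone userspace sketch (my
paraphrase of the control flow, not the literal kernel code, with mock
functions standing in for the real allocation path and for
ext4_mb_discard_preallocations_should_retry()):

/*
 * Simplified model of the retry loop in ext4_mb_new_blocks().  The
 * mocks simulate the pathological case on a full filesystem:
 * allocation always fails, and discarding preallocations always
 * appears to make progress.  Before the patch the loop below would
 * spin indefinitely; with the retry cap it gives up after a bounded
 * number of passes.
 */
#include <stdbool.h>
#include <stdio.h>

static bool try_alloc(void)	/* mock: allocation always fails */
{
	return false;
}

static bool should_retry(void)	/* mock: discard always "makes progress" */
{
	return true;
}

int main(void)
{
	int retries = 0;

repeat:
	if (try_alloc()) {
		puts("allocation succeeded");
		return 0;
	}
	if (++retries < 3 && should_retry()) {
		printf("retry %d\n", retries);
		goto repeat;
	}
	puts("giving up: ENOSPC");
	return 1;
}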