Message-ID: <4DFF5C93.4000704@redhat.com>
Date: Mon, 20 Jun 2011 09:43:31 -0500
From: Eric Sandeen <sandeen@...hat.com>
To: stufever@...il.com
CC: linux-ext4@...r.kernel.org,
Wang Shaoyan <wangshaoyan.pt@...bao.com>,
Ted Tso <tytso@....edu>, Jan Kara <jack@...e.cz>,
Lukas Czerner <lczerner@...hat.com>
Subject: Re: [PATCH v2] ext4: Set file system to read-only by I/O error threshold

On 6/20/11 7:26 AM, stufever@...il.com wrote:
> From: Wang Shaoyan <wangshaoyan.pt@...bao.com>
>
> changes from v1 -> v2 :
> When counter is greater than threshold, don't call ext4_abort(), but check the mount errors_* option.
>
> Some versions of Hadoop use access(2) to check whether a data chunk
> hard disk is online; if access(2) returns a "Read-only file system"
> error, Hadoop marks the disk it probed with access(2) as offline.
Ugh, that already sounds ext3-specific. "Some versions of Hadoop" - which
versions? All? Custom? Old? Upstream?
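For context, the probe described above presumably amounts to something like the following sketch. The helper name is hypothetical, and the use of W_OK is an assumption; the actual Hadoop code is not shown in this thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from Hadoop: returns true only if a
 * write-permission probe on `path` fails with EROFS, i.e. the
 * filesystem holding `path` is currently mounted read-only.
 * Any other failure (ENOENT, EACCES, ...) is not treated as "offline".
 */
static bool disk_looks_readonly(const char *path)
{
	if (access(path, W_OK) == 0)
		return false;
	return errno == EROFS;
}
```

A check like this only notices a failing disk once something in the kernel has actually flipped the filesystem to read-only, which is why the error-handling path matters here.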
> This method works for Ext3/4 with journal, because when jbd/jbd2
> encounters I/O error, the file system will be set as read-only. For
> Ext4 no-journal mode, there is no jbd2 to set the file system as
> read-only when I/O error happens, the access(2) from Hadoop is not
But there are other paths to ext4_handle_error ... your changes don't
seem specific to whether or not journaling is active?
> able to reliably detect hard disk offline condition.
> This patch tries to fix the above problem from kernel side.
I really don't like this very much.
We already have:
errors=remount-ro
errors=continue
errors=panic
data_err=ignore
data_err=abort
Now you propose yet another error handling behavior for the nonstandard
no-journal operation mode, with still more tunables and complexity...
Well, or maybe it's only a change to existing behavior...
From a quick read, I think that all your patch really does is change existing
ext4_handle_error() behavior so that it only triggers after a certain error rate
threshold, right? And we don't necessarily get there only for I/O errors, in
fact I think many (most?) callchains which end here come from metadata corruption
detection.
So for that case (metadata corruption) a threshold makes no sense to me;
either the fs is corrupt, or it's not. In the case of a corrupted directory,
whether your code fires basically depends on how often (or how rapidly) an
application tries to look up a file in a corrupted dir, for example.
That doesn't make sense to me. Your patch simply makes the filesystem
more tolerant of corruption, as far as I can tell, and I don't see the point
in that. Am I missing something?
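
To illustrate the rate dependence, here is a rough user-space model of the counter logic in the patch. It is a sketch, not the kernel code: the struct and function names are made up, plain seconds stand in for jiffies, and the last-error timestamp is assumed to start at zero.

```c
#include <stdbool.h>

/* User-space model of the patch's per-superblock counter state. */
struct eio_model {
	unsigned int threshold;   /* models s_eio_threshold */
	unsigned int interval;    /* models s_eio_interval, in seconds */
	unsigned int counter;     /* models s_eio_counter */
	unsigned long last;       /* models s_eio_last_jiffies, in seconds */
};

/*
 * Record one error at time `now` (seconds). The counter grows only if
 * the previous error was less than `interval` seconds ago; otherwise
 * it restarts at 1. Returns true if the patched ext4_handle_error()
 * would go on to apply the errors= mount policy.
 */
static bool eio_model_error(struct eio_model *m, unsigned long now)
{
	if (now < m->last + m->interval)
		m->counter++;
	else
		m->counter = 1;
	m->last = now;
	return m->counter > m->threshold;
}
```

With threshold 3 and interval 5, lookups that hit the same corrupted directory every 10 seconds never get past a count of 1, while four lookups one second apart do trigger the policy. The corruption is identical in both cases.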
Thanks,
-Eric
> 1. Mount the file system with errors=remount-ro:
> mount -t ext4 -o errors=remount-ro /dev/sd[?] some_dir
> 2. The counter reaches the threshold when either:
> 1) within one sampling interval, more I/O errors occur than the pre-set threshold, or
> 2) I/O errors keep occurring in consecutive sampling intervals and their sum exceeds the pre-set threshold
>
> Then the application can find the file system is set as read-only, and call its own failure tolerance procedures.
>
> There are 2 interfaces exported to user space via sysfs:
> /sys/fs/ext4/sd[?]/eio_threshold --- I/O error threshold to check mount errors options
> /sys/fs/ext4/sd[?]/eio_interval --- sampling interval in seconds
>
> The default value of eio_threshold is 0, in which case everything behaves as before.
>
> Cc: Ted Tso <tytso@....edu>
> Cc: Jan Kara <jack@...e.cz>
> Cc: Lukas Czerner <lczerner@...hat.com>
> Reviewed-by: Coly Li <bosong.ly@...bao.com>
> Reviewed-by: Liu Yuan <tailai.ly@...bao.com>
> Signed-off-by: Wang Shaoyan <wangshaoyan.pt@...bao.com>
> ---
> fs/ext4/ext4.h | 7 +++++++
> fs/ext4/super.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 58 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1921392..b08348e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1214,6 +1214,13 @@ struct ext4_sb_info {
>
> /* Kernel thread for multiple mount protection */
> struct task_struct *s_mmp_tsk;
> +
> + /* IO error count */
> + spinlock_t s_eio_lock;
> + unsigned int s_eio_threshold;
> + unsigned int s_eio_interval;
> + unsigned int s_eio_counter;
> + unsigned long s_eio_last_jiffies;
> };
>
> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index cc5c157..1b3fc81 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -384,6 +384,19 @@ static void save_error_info(struct super_block *sb, const char *func,
> ext4_commit_super(sb, 1);
> }
>
> +static inline void inc_sb_error_count(struct super_block *sb)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> + if (time_after(sbi->s_eio_last_jiffies + sbi->s_eio_interval * HZ, jiffies)) {
> + sbi->s_eio_counter++;
> + } else {
> + sbi->s_eio_counter = 1;
> + }
> +
> + sbi->s_eio_last_jiffies = jiffies;
> + ext4_msg(sb, KERN_CRIT, "IO error count total: %d", sbi->s_eio_counter);
> +}
>
> /* Deal with the reporting of failure conditions on a filesystem such as
> * inconsistencies detected or read IO failures.
> @@ -402,9 +415,21 @@ static void save_error_info(struct super_block *sb, const char *func,
>
> static void ext4_handle_error(struct super_block *sb)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> if (sb->s_flags & MS_RDONLY)
> return;
>
> + spin_lock(&sbi->s_eio_lock);
> + inc_sb_error_count(sb);
> + spin_unlock(&sbi->s_eio_lock);
> +
> + /* When the counter is greater than threshold(default 0),
> + * like usual, we check the ERRORS_*. Otherwise, we just return.
> + */
> + if (sbi->s_eio_counter <= sbi->s_eio_threshold)
> + return;
> +
> if (!test_opt(sb, ERRORS_CONT)) {
> journal_t *journal = EXT4_SB(sb)->s_journal;
>
> @@ -2471,6 +2496,22 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a,
> return count;
> }
>
> +static ssize_t eio_interval_store(struct ext4_attr *a,
> + struct ext4_sb_info *sbi,
> + const char *buf, size_t count)
> +{
> + unsigned long t;
> +
> + if (parse_strtoul(buf, 0xffffffff, &t))
> + return -EINVAL;
> +
> + if (t <= 0)
> + return -EINVAL;
> +
> + sbi->s_eio_interval = t;
> + return count;
> +}
> +
> static ssize_t sbi_ui_show(struct ext4_attr *a,
> struct ext4_sb_info *sbi, char *buf)
> {
> @@ -2524,6 +2565,9 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump);
> +EXT4_RW_ATTR_SBI_UI(eio_threshold, s_eio_threshold);
> +EXT4_ATTR_OFFSET(eio_interval, 0644, sbi_ui_show,
> + eio_interval_store, s_eio_interval);
>
> static struct attribute *ext4_attrs[] = {
> ATTR_LIST(delayed_allocation_blocks),
> @@ -2540,6 +2584,8 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(mb_stream_req),
> ATTR_LIST(mb_group_prealloc),
> ATTR_LIST(max_writeback_mb_bump),
> + ATTR_LIST(eio_threshold),
> + ATTR_LIST(eio_interval),
> NULL,
> };
>
> @@ -3464,6 +3510,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> sbi->s_stripe = ext4_get_stripe_size(sbi);
> sbi->s_max_writeback_mb_bump = 128;
>
> + spin_lock_init(&sbi->s_eio_lock);
> + sbi->s_eio_threshold = 0;
> + sbi->s_eio_interval = 5;
> + sbi->s_eio_counter = 0;
> +
> /*
> * set up enough so that it can read an inode
> */