linux-ext4 - Re: [PATCH v2] ext4: Set file system to read-only by I/O error threshold

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LFD.2.00.1106211111170.3618@dhcp-27-109.brq.redhat.com>
Date:	Tue, 21 Jun 2011 11:29:17 +0200 (CEST)
From:	Lukas Czerner <lczerner@...hat.com>
To:	Eric Sandeen <sandeen@...hat.com>
cc:	stufever@...il.com, linux-ext4@...r.kernel.org,
	Wang Shaoyan <wangshaoyan.pt@...bao.com>,
	Ted Tso <tytso@....edu>, Jan Kara <jack@...e.cz>,
	Lukas Czerner <lczerner@...hat.com>
Subject: Re: [PATCH v2] ext4: Set file system to read-only by I/O error
 threshold

On Mon, 20 Jun 2011, Eric Sandeen wrote:

> On 6/20/11 7:26 AM, stufever@...il.com wrote:
> > From: Wang Shaoyan <wangshaoyan.pt@...bao.com>
> > 
> > changes from v1 -> v2 :
> > When counter is greater than threshold, don't call ext4_abort(), but check the mount errors_* option.
> > 
> > Some version of Hadoop uses access(2) to check whether the data chunk
> > harddisk is online, if access(2) returns "Read-only file system"
> > error, hadoop marks the disk which it called access(2) to as offline.
> 
> Ugh, that already sounds ext3-specific.  "Some versions of Hadoop" - which
> versions?  All?  Custom?  Old?  Upstream?
> 
> > This method works for Ext3/4 with journal, because when jbd/jbd2
> > encounters I/O error, the file system will be set as read-only. For
> > Ext4 no-journal mode, there is no jdb2 to set the file system as
> > read-only when I/O error happens, the access(2) from Hadoop is not
> 
> But there are other paths to ext4_handle_error ... your changes don't
> seem specific to whether or not journaling is active?
> 
> > able to reliably detect hard disk offline condition.
> 
> > This patch tries to fix the above problem from kernel side. 
> 
> I don't really don't like this very much.

Hi Wang,

as I already said to you, I do not like this either and I agree with
Eric. You still did not explained how this change helps your case
without journal, since ext4_handle_error() is callen when IO error is
encountered and it is not journal specific at all.

But, I am trying to understand the case you are trying to solve. Is it
that you have bunch of data sources and access(2) helps you to determine
whether specific data source is accessible, e.g. writable ? Well if this
is the case then maybe you would have to change the method to not only
check if the device is writable, but also if the disk is still alive.
But generally you'll hit the IO error once you try to sync write to the
disk, is then already too late ?

Why not just detect whether the filesystem is mounted as read write with
access(2) and ignore the case when disk is not alive and then detect this
and mark the data source as offline once you really hit the IO error ? Or
it is not sufficient ?

Also I do not think that this kind of heuristic you are adding into ext4
belongs to kernel, you should rather solve this in user-space.

Thanks!
-Lukas

> 
> We already have:
> 
> errors=remount-ro
> errors=continue
> errors=panic
> data_err=ignore
> data_err=abort
> 
> Now you propose yet another error handling behavior for the nonstandard
> no-journal operation mode, with still more tunables and complexity...
> 
> Well, or maybe it's only a change to existing behavior...
> 
> From a quick read, I think that all your patch really does is change existing
> ext4_handle_error() behavior so that it only triggers after a certain error rate
> threshold, right?  And we don't necessarily get there only for I/O errors, in
> fact I think many (most?) callchains which end here come from metadata corruption
> detection.
> 
> So for that case (metadata corruption) a threshold makes no sense to me;
> either the fs is corrupt, or it's not.  In the case of a corrupted directory,
> whether your code fires basically depends on how often (or how rapidly) an
> application tries to look up a file in a corrupted dir, for example.
> That doesn't make sense to me.  Your patch simply makes the filesystem
> more tolerant of corruption, as far as I can tell, and I don't see the point
> in that.  Am I missing something?
> 
> Thanks,
> -Eric
> 
> 
> > 1.Mount file system with errors=remount-ro
> > 	mount -t -o errors=remount-ro /dev/sd[?] some_dir 
> > 2.The counter reach the threshold:
> > 	1) inside the sampling interval, I/O errors come more than pre-set threshold happens
> > 	2) I/O errors always happen in continous sampling intervals, the sum of errors exceeds pre-set threshold
> > 
> > Then the application can find the file system is set as read-only, and call its own failure tolerance procedures.
> > 
> > There are 2 interface exported to user space via sysfs:
> > /sys/fs/ext4/sd[?]/eio_threshold --- I/O error threshold to check mount errors options
> > /sys/fs/ext4/sd[?]/eio_interval  --- sampling interval in second
> > 
> > If default value of eio_threshold is 0, everything happens as before.
> > 
> > Cc: Ted Tso <tytso@....edu>
> > Cc: Jan Kara <jack@...e.cz>
> > Cc: Lukas Czerner <lczerner@...hat.com>
> > Reviewed-by: Coly Li <bosong.ly@...bao.com>
> > Reviewed-by: Liu Yuan <tailai.ly@...bao.com>
> > Signed-off-by: Wang Shaoyan <wangshaoyan.pt@...bao.com>
> > ---
> >  fs/ext4/ext4.h  |    7 +++++++
> >  fs/ext4/super.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 58 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 1921392..b08348e 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1214,6 +1214,13 @@ struct ext4_sb_info {
> >  
> >  	/* Kernel thread for multiple mount protection */
> >  	struct task_struct *s_mmp_tsk;
> > +
> > +	/* IO error count */
> > +	spinlock_t s_eio_lock;
> > +	unsigned int s_eio_threshold;
> > +	unsigned int s_eio_interval;
> > +	unsigned int s_eio_counter;
> > +	unsigned long s_eio_last_jiffies;
> >  };
> >  
> >  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index cc5c157..1b3fc81 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -384,6 +384,19 @@ static void save_error_info(struct super_block *sb, const char *func,
> >  	ext4_commit_super(sb, 1);
> >  }
> >  
> > +static inline void inc_sb_error_count(struct super_block *sb)
> > +{
> > +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> > +	if (time_after(sbi->s_eio_last_jiffies + sbi->s_eio_interval * HZ, jiffies)) {
> > +		sbi->s_eio_counter++;
> > +	} else {
> > +		sbi->s_eio_counter = 1;
> > +	}
> > +
> > +	sbi->s_eio_last_jiffies = jiffies;
> > +	ext4_msg(sb, KERN_CRIT, "IO error count total: %d", sbi->s_eio_counter);
> > +}
> >  
> >  /* Deal with the reporting of failure conditions on a filesystem such as
> >   * inconsistencies detected or read IO failures.
> > @@ -402,9 +415,21 @@ static void save_error_info(struct super_block *sb, const char *func,
> >  
> >  static void ext4_handle_error(struct super_block *sb)
> >  {
> > +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> >  	if (sb->s_flags & MS_RDONLY)
> >  		return;
> >  
> > +	spin_lock(&sbi->s_eio_lock);
> > +	inc_sb_error_count(sb);
> > +	spin_unlock(&sbi->s_eio_lock);
> > +
> > +	/* When the counter is greater than threshold(default 0), 
> > +	 * like usual, we check the ERRORS_*. Otherwise, we just return.
> > +	 */
> > +	if (sbi->s_eio_counter <= sbi->s_eio_threshold)
> > +		return;
> > +
> >  	if (!test_opt(sb, ERRORS_CONT)) {
> >  		journal_t *journal = EXT4_SB(sb)->s_journal;
> >  
> > @@ -2471,6 +2496,22 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a,
> >  	return count;
> >  }
> >  
> > +static ssize_t eio_interval_store(struct ext4_attr *a,
> > +					  struct ext4_sb_info *sbi,
> > +					  const char *buf, size_t count)
> > +{
> > +	unsigned long t;
> > +
> > +	if (parse_strtoul(buf, 0xffffffff, &t))
> > +		return -EINVAL;
> > +
> > +	if (t <= 0)
> > +		return -EINVAL;
> > +
> > +	sbi->s_eio_interval = t;
> > +	return count;
> > +}
> > +
> >  static ssize_t sbi_ui_show(struct ext4_attr *a,
> >  			   struct ext4_sb_info *sbi, char *buf)
> >  {
> > @@ -2524,6 +2565,9 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> >  EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> >  EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> >  EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump);
> > +EXT4_RW_ATTR_SBI_UI(eio_threshold, s_eio_threshold);
> > +EXT4_ATTR_OFFSET(eio_interval, 0644, sbi_ui_show,
> > +		 eio_interval_store, s_eio_interval);
> >  
> >  static struct attribute *ext4_attrs[] = {
> >  	ATTR_LIST(delayed_allocation_blocks),
> > @@ -2540,6 +2584,8 @@ static struct attribute *ext4_attrs[] = {
> >  	ATTR_LIST(mb_stream_req),
> >  	ATTR_LIST(mb_group_prealloc),
> >  	ATTR_LIST(max_writeback_mb_bump),
> > +	ATTR_LIST(eio_threshold),
> > +	ATTR_LIST(eio_interval),
> >  	NULL,
> >  };
> >  
> > @@ -3464,6 +3510,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> >  	sbi->s_stripe = ext4_get_stripe_size(sbi);
> >  	sbi->s_max_writeback_mb_bump = 128;
> >  
> > +	spin_lock_init(&sbi->s_eio_lock);
> > +	sbi->s_eio_threshold = 0;
> > +	sbi->s_eio_interval = 5;
> > +	sbi->s_eio_counter = 0;
> > +
> >  	/*
> >  	 * set up enough so that it can read an inode
> >  	 */
> 
> 

-- 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html