Message-ID: <1377286615.3872.25.camel@localhost.localdomain>
Date:	Fri, 23 Aug 2013 15:36:55 -0400
From:	Ewan Milne <emilne@...hat.com>
To:	James Bottomley <James.Bottomley@...senPartnership.com>
Cc:	Eiichi Tsukata <eiichi.tsukata.xh@...achi.com>,
	linux-kernel@...r.kernel.org, linux-scsi@...r.kernel.org
Subject: Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry
 loop

On Fri, 2013-08-23 at 06:19 -0700, James Bottomley wrote:
> On Fri, 2013-08-23 at 18:10 +0900, Eiichi Tsukata wrote:
> > Yes, basically the device should be offlined on error detection.
> > Just offlining the disk is enough when an error occurs on "not" os-installed
> > system disk. Panic is going too far on such case.
> > 
> > However, in a clustered environment where computers use each its own disk
> > and do not share the same disk, calling panic() will be suitable when an
> > error occurs in system disk.
> 
> However, when not in a clustered environment, it won't be.  Decisions
> about whether to panic the system or not are user space policy, and
> should not be embedded into subsystems.  What we need to do is to come
> up with a way of detecting the condition, reporting it and possibly
> taking some action.
> 
> >  Because even on such disk error, cluster monitoring
> > tool may not be able to detect the system failure while heartbeat can
> > continue working.
> > So, I think basically offlining is enough and also, panic is necessary
> > on some cases.

The way I have seen this done in such a clustered environment is to have
the heartbeat agent on each system periodically attempt to access the
disk.  If that I/O hangs, other systems will see loss of heartbeat.
You really don't want to panic the kernel.  Among other things, it may
make it difficult to get the system up again later for long enough to
figure out what is wrong.
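
For illustration, here is a minimal sketch of that kind of heartbeat agent
in Python. It is not taken from any existing cluster tool: probe_disk,
heartbeat_loop and the send_heartbeat callback are hypothetical names, and
the path to probe is assumed to be a file on the disk being watched.  The
probe runs inline before each heartbeat, so if the I/O hangs, the loop
stalls and the other systems simply stop seeing heartbeats.

```python
import os
import time


def probe_disk(path, nbytes=4096):
    """Read a few bytes from `path`; blocks for as long as the I/O blocks.

    Caveat: a plain buffered read may be served from the page cache, so a
    real agent would more likely use O_DIRECT or a write followed by fsync.
    This sketch keeps the simple read for clarity.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return len(os.read(fd, nbytes)) > 0
    finally:
        os.close(fd)


def heartbeat_loop(probe, send_heartbeat, interval=1.0, rounds=None):
    """Send a heartbeat only after the disk probe returns.

    `probe` and `send_heartbeat` are caller-supplied callables (assumed).
    `rounds=None` runs forever; a number limits the loop, which is mainly
    useful for testing.  Returns the number of heartbeats sent.
    """
    sent = 0
    while rounds is None or sent < rounds:
        probe()            # a hung or failing I/O stops the loop right here
        send_heartbeat()   # reached only if the disk answered
        sent += 1
        time.sleep(interval)
    return sent
```

Because the probe and the heartbeat share one thread, no kernel-side panic
is needed: a hung disk shows up to the peers as a missing heartbeat, and the
local system stays up for later diagnosis.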

> 
> Offline seems a bit drastic ... what happens if you send it a target
> reset?
> 
> James
> 
> 
> 
> 

