Message-ID: <20091121203645.60d68d11@mjolnir.ossman.eu>
Date:	Sat, 21 Nov 2009 20:36:45 +0100
From:	Pierre Ossman <pierre-list@...man.eu>
To:	Dan Williams <dan.j.williams@...el.com>
Cc:	neilb@...e.de, LKML <linux-kernel@...r.kernel.org>
Subject: Re: Raid not shutting down when disks are lost?

On Sat, 21 Nov 2009 12:21:58 -0700
Dan Williams <dan.j.williams@...el.com> wrote:

> On Sat, Nov 21, 2009 at 9:03 AM, Pierre Ossman <pierre-list@...man.eu> wrote:
> > Neil?
> >
> > On Thu, 8 Oct 2009 16:39:52 +0200
> > Pierre Ossman <pierre-list@...man.eu> wrote:
> >
> >> Today one RAID6 array I manage decided to lose four out of eight disks.
> >> Oddly enough, the array did not shut down but instead I got
> >> intermittent read and write errors from the filesystem.
> 
> This is expected.
> 
> The array can't shut down when there is a mounted filesystem.  Reads
> may still be serviced from the survivors; all writes should be aborted
> with an error.
> 

It could "shut down" in the sense that it refuses to touch the
underlying hardware and just report errors to upper layers. I.e. don't
update the md superblock marking more disks as failed.
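
For reference, what the superblock on a member actually records about
the other devices can be inspected with mdadm --examine (the device
name below is only an example):

  mdadm --examine /dev/sda1   # dumps that member's md superblock,
                              # including which devices it considers
                              # active or faulty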

> >>
> >> It's been some time since I had a failure of this magnitude, but I seem
> >> to recall that once the array lost too many disks, it would shut down
> >> and refuse to write a single byte. The nice effect of this was that if
> >> it was a temporary error, you could just reboot and the array would
> >> start nicely (albeit in degraded mode).
> >>
> >> Has something changed? Is this perhaps an effect of using RAID6 (I used
> >> to run RAID5 arrays)? Or was I simply lucky the previous instances I've
> >> had?
> 
> It should not come back up nicely in this scenario.  You need
> "--force" to attempt to reassemble a failed array.
> 

If the last disk is thrown out either because of a read error or
because it failed on the first write of a stripe (i.e. what's on the
platters is still in sync), then a force would not be needed. This
requires the md code not to mark that last disk as failed in the
superblocks of the remaining disks, though.
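
Just to make sure we mean the same thing: with the current behaviour I
assume recovery looks something like this (device names are only an
example from my box):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sd[a-h]1

whereas in the case above (platters still in sync, last disk not marked
failed in the remaining superblocks) a plain --assemble at boot ought
to be enough.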

> >>
> >> Related, it would be nice if you could control how it handles lost
> >> disks. E.g. I'd like it to go read-only when it goes in to fully
> >> degraded mode. In case the last disk lost was only a temporary glitch,
> >> the array could be made to recover without a lengthy resync.
> >>
> 
> When you say "fully-degraded" do you mean "failed"?  In general the
> bitmap mechanism provides fast resync after temporary disk outages.
> 

Fully degraded means still working, but without any redundancy, i.e.
one lost disk with RAID 5 or two with RAID 6. The bitmap mechanism
seems to be broken in that case, since I always experience full,
multi-hour resyncs whenever a disk is lost.
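
For reference, checking for and enabling an internal write-intent
bitmap should look roughly like this (md0 is only a placeholder for my
arrays):

  cat /proc/mdstat                          # a "bitmap:" line appears when
                                            # one is active
  mdadm --grow /dev/md0 --bitmap=internal   # add an internal write-intent
                                            # bitmap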

Or is there perhaps some magic mdadm command to add a lost disk and get
it up to speed without a complete sync?
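
If such a command exists, I would guess it is something like the
re-add below; the device name is only an example and I may well be
misreading the man page:

  mdadm /dev/md0 --re-add /dev/sdh1   # with a write-intent bitmap this is
                                      # documented to resync only the blocks
                                      # changed while the disk was gone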

Rgds
-- 
     -- Pierre Ossman

  WARNING: This correspondence is being monitored by FRA, a
  Swedish intelligence agency. Make sure your server uses
  encryption for SMTP traffic and consider using PGP for
  end-to-end encryption.
