linux-kernel - mdadm unable to stop RAID device after disk failure

Message-ID: <20150913114858.42226a80@korath.teln.shikadi.net>
Date:	Sun, 13 Sep 2015 11:48:58 +1000
From:	Adam Nielsen <a.nielsen@...kadi.net>
To:	linux-kernel@...r.kernel.org
Subject: mdadm unable to stop RAID device after disk failure

Hi all,

I'm having some problems trying to work out how to get mdadm to restart
a RAID array after a disk failure.  It refuses to stop the array,
saying it's in use, and it refuses to let me start the array again,
saying the devices are already part of another array:

  $ mdadm --manage /dev/md10 --stop
  mdadm: Cannot get exclusive access to /dev/md10:Perhaps a running
    process, mounted filesystem or active volume group?

  $ mdadm --manage /dev/md10 --fail

  $ mdadm --manage /dev/md10 --stop
  mdadm: Cannot get exclusive access to /dev/md10:Perhaps a running
    process, mounted filesystem or active volume group?

  $ cat /proc/mdstat
  Personalities : [raid0] 
  md10 : active raid0 sde1[0] sdd1[1]
        5860268032 blocks super 1.2 512k chunks
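
For reference, the member names the kernel still has on record can be
pulled out of that md10 line mechanically.  This is only a sketch: the
line is hard-coded as a sample (the variable names are made up for
illustration), and on a live system it would be read from /proc/mdstat
instead.

```shell
#!/bin/sh
# Sample line copied from the /proc/mdstat output above (hard-coded here
# so the sketch runs anywhere; normally this comes from /proc/mdstat).
mdstat_line='md10 : active raid0 sde1[0] sdd1[1]'

# Fields 5 onwards are the members, each followed by "[slot]" and possibly
# status flags; strip everything from the first "[" to keep the bare name.
members=$(printf '%s\n' "$mdstat_line" | awk '$1 == "md10" {
    for (i = 5; i <= NF; i++) {
        sub(/\[.*$/, "", $i)
        printf "%s%s", sep, $i
        sep = " "
    }
    print ""
}')
echo "members: $members"

# Report members whose device node is no longer present under /dev.
for m in $members; do
    [ -e "/dev/$m" ] || echo "/dev/$m: no such device node"
done
```

Any member reported as missing is one the array still refers to by its
old name.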

Why is it still telling me the array is active after I have tried to
mark it failed?  If I try to specifically list one of the devices that
make up the array, that doesn't work either:

  $ mdadm --manage /dev/md10 --fail /dev/sdd1
  mdadm: Cannot find /dev/sdd1: No such file or directory

This is because /dev/sdd no longer exists: it's an external drive, and
when I replugged it, it came back as /dev/sdf.  The manpage says you
can use the special word "detached" for this situation, but that doesn't
work either:

  $ mdadm --manage /dev/md10 --fail detached
  mdadm: set device faulty failed for 8:65:  Device or resource busy

8:65 corresponds to /dev/sde1, so it appears to get the right device
but why is it busy?  Isn't the point of --fail to simulate a drive
failure, which could occur at any time, even if a drive is busy?
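
To show where the 8:65 -> sde1 mapping comes from, here is a minimal
sketch of the arithmetic.  It assumes the conventional sd numbering
(major 8, 16 minors per disk, so only the first 16 disks are covered);
the variable names are made up for illustration.

```shell
#!/bin/sh
# Decode a major:minor pair under the conventional sd numbering
# (assumption: major 8, 16 minors per disk, disks sda..sdp only).
devno="8:65"
major=${devno%%:*}
minor=${devno##*:}

[ "$major" -eq 8 ] || { echo "not an sd device under this scheme" >&2; exit 1; }

disk_index=$((minor / 16))   # 65 / 16 = 4, i.e. the fifth disk
partition=$((minor % 16))    # 65 % 16 = 1, i.e. the first partition

# Pick the disk letter: index 0 is "a", index 4 is "e".
letter=$(printf 'abcdefghijklmnop' | cut -c "$((disk_index + 1))")

# A partition number of 0 means the whole disk, so no suffix is added.
if [ "$partition" -eq 0 ]; then
    name="sd${letter}"
else
    name="sd${letter}${partition}"
fi
echo "$devno -> /dev/$name"
```

On a live system the same lookup can be done without arithmetic, since
/sys/dev/block/8:65 is a symlink to the device's sysfs directory.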

The two disks (sdd and sde) reappeared as sdf and sdg after replugging,
so I thought I could just assemble them as a new array and ignore the
old failed one:

  $ mdadm --assemble /dev/md11 /dev/sdf1 /dev/sdg1
  mdadm: Found some drive for an array that is already
    active: /dev/md/10
  mdadm: giving up.

I'm not sure why it considers the drives part of an active array when
they now have different device names.  I guess it's matching serial
numbers or something, which gives the wrong answer in this case.  It
wouldn't be a problem, though, if there were some way to remove the old
array that is refusing to die!

Is there any way to solve this problem, or do you just have to reboot a
machine after a disk failure?

Thanks,
Adam.
