Message-ID: <5627AA7E.5010003@gmail.com>
Date: Wed, 21 Oct 2015 11:08:46 -0400
From: Austin S Hemmelgarn <ahferroin7@...il.com>
To: Heinz Mauelshagen <heinzm@...hat.com>,
device-mapper development <dm-devel@...hat.com>,
Linux-Kernel mailing list <linux-kernel@...r.kernel.org>,
linux-raid@...r.kernel.org
Subject: Re: [dm-devel] Possible bug in DM-RAID.
Thanks for the quick response. I've cloned Linux's master branch (which
has the commit), built it, tested it, and everything works, so it looks
like this was indeed the bug I was seeing (that, or something else
between 4.2.3 and what I tested fixed things).
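
For anyone else who hits this before the fix lands in stable: as far as I
can tell, the core of the failure is just the -1 cluster slot of a
non-clustered array being stored into unsigned page-index arithmetic. A
minimal userspace sketch of that single step (simplified, with my own
variable names; not the actual kernel code):

#include <stdio.h>

int main(void)
{
	/* On a non-clustered array, the cluster slot is -1. */
	int cluster_slot = -1;

	/* page->index in the kernel is a pgoff_t (an unsigned long), so
	 * storing -1 in it wraps to a huge value, and any device offset
	 * derived from it points far past the end of the device -- hence
	 * the "attempt to access beyond end of device" errors quoted
	 * below. */
	unsigned long index = (unsigned long)cluster_slot;

	printf("page index = %lu\n", index);	/* 18446744073709551615 */
	return 0;
}
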
On 2015-10-21 10:11, Heinz Mauelshagen wrote:
>
> Neil,
>
> this looks like an incarnation of the md bitmap flaw (the one with the bogus
> slot number) leading to an invalid bitmap header page index.
>
>
> Austin,
> this is the upstream commit you need to fix your problem:
>
> commit da6fb7a9e5bd6f04f7e15070f630bdf1ea502841
> Author: NeilBrown <neilb@...e.com>
> Date: Thu Oct 1 16:03:38 2015 +1000
>
> md/bitmap: don't pass -1 to bitmap_storage_alloc.
>
> Passing -1 to bitmap_storage_alloc() causes page->index to be set to
> -1, which is quite problematic.
>
> So only pass ->cluster_slot if mddev_is_clustered().
>
> Fixes: b97e92574c0b ("Use separate bitmaps for each nodes in the cluster")
> Cc: stable@...r.kernel.org (v4.1+)
> Signed-off-by: NeilBrown <neilb@...e.com>
>
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index e51de52..48b5890 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -1997,7 +1997,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
>  	if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
>  		ret = bitmap_storage_alloc(&store, chunks,
>  					   !bitmap->mddev->bitmap_info.external,
> -					   bitmap->cluster_slot);
> +					   mddev_is_clustered(bitmap->mddev)
> +					   ? bitmap->cluster_slot : 0);
>  	if (ret)
>  		goto err;
>
>
> On 10/21/2015 03:39 AM, Neil Brown wrote:
>> Added dm-devel, which is probably the more appropriate list for dm
>> things.
>>
>> NeilBrown
>>
>> Austin S Hemmelgarn <ahferroin7@...il.com> writes:
>>
>>> I think I've stumbled upon a bug in DM-RAID. The primary symptom is that when
>>> creating a new DM-RAID-based device (using either LVM or dmsetup) in a RAID1
>>> configuration, it very quickly claims, one by one, that every disk except the
>>> first has failed, and the array goes degraded. When this happens on a given
>>> system, the disks always 'fail' in the reverse of the order of their mirror
>>> numbers. All of the other RAID profiles work just fine. Curiously, it also
>>> only seems to happen for 'big' devices: I haven't been able to determine the
>>> exact minimum size, but I see it 100% of the time with 32G devices, never
>>> with 16G ones, and only intermittently with 24G.
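>>>
>>> For reference, a minimal reproducer here is just creating a RAID1 LV of
>>> that size, along the lines of the following (the VG and LV names are of
>>> course specific to my setup):
>>>
>>>     # triggers ~100% of the time at 32G, never at 16G
>>>     lvcreate --type raid1 -m 1 -L 32G -n testlv vg0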
>>>
>>> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
>>> this issue:
>>> [66318.401295] device-mapper: raid: Superblocks created for new array
>>> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
>>> [66318.450467] Choosing daemon_sleep default (5 sec)
>>> [66318.450482] created bitmap (32 pages) for device mdX
>>> [66318.450495] attempt to access beyond end of device
>>> [66318.450501] dm-91: rw=13329, want=0, limit=8192
>>> [66318.450506] md: super_written gets error=-5, uptodate=0
>>> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>>> md/raid1:mdX: Operation continuing on 1 devices.
>>> [66318.459815] attempt to access beyond end of device
>>> [66318.459819] dm-89: rw=13329, want=0, limit=8192
>>> [66318.459822] md: super_written gets error=-5, uptodate=0
>>> [66318.492852] attempt to access beyond end of device
>>> [66318.492862] dm-89: rw=13329, want=0, limit=8192
>>> [66318.492868] md: super_written gets error=-5, uptodate=0
>>> [66318.627183] mdX: bitmap file is out of date, doing full recovery
>>> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 65536 bits
>>> [66318.782045] RAID1 conf printout:
>>> [66318.782054] --- wd:1 rd:2
>>> [66318.782061] disk 0, wo:0, o:1, dev:dm-90
>>> [66318.782068] disk 1, wo:1, o:0, dev:dm-92
>>> [66318.836598] RAID1 conf printout:
>>> [66318.836607] --- wd:1 rd:2
>>> [66318.836614] disk 0, wo:0, o:1, dev:dm-90
>>>
>>> And here's output for a 24G LVM volume that didn't display the issue.
>>> [66343.407954] device-mapper: raid: Superblocks created for new array
>>> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
>>> [66343.479078] Choosing daemon_sleep default (5 sec)
>>> [66343.479101] created bitmap (24 pages) for device mdX
>>> [66343.629329] mdX: bitmap file is out of date, doing full recovery
>>> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 49152 bits
>>>
>>> I'm using a lightly patched version of 4.2.3
>>> (the source can be found at https://github.com/ferroin/linux),
>>> but none of the patches I'm using come anywhere near the block layer,
>>> let alone the DM/MD code.
>>>
>>> I've attempted to bisect this, although it got kind of complicated. So far
>>> I've determined that the first commit on which I see this issue is d3b178a
>>> ("md: Skip cluster setup for dm-raid"). Prior to that commit, I can't
>>> initialize any dm-raid devices at all, due to the bug it fixes. I have not
>>> tested anything prior to d51e4fe (the merge commit that pulled in the
>>> md-cluster code), but I do distinctly remember that I did not see this
>>> issue in 3.19.
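>>>
>>> (For anyone wanting to retrace it, the bisect was roughly the usual
>>> sequence, modulo the broken stretch noted above:
>>>
>>>     git bisect start
>>>     git bisect bad v4.2         # first series where I see the failure
>>>     git bisect good v3.19       # last release I remember working
>>>
>>> testing each step by creating a 32G RAID1 LV.)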
>>>
>>> I'll be happy to provide more info if needed.
>>>
>>>
>>> --
>>> dm-devel mailing list
>>> dm-devel@...hat.com
>>> https://www.redhat.com/mailman/listinfo/dm-devel