Message-ID: <f936f158-a256-252f-02ed-ce23f053715f@gmail.com>
Date: Sat, 13 May 2017 02:06:31 +0200
From: MasterPrenium <masterprenium.lkml@...il.com>
To: Shaohua Li <shli@...nel.org>
Cc: linux-kernel@...r.kernel.org, xen-users@...ts.xen.org,
linux-raid@...r.kernel.org,
"MasterPrenium@...il.com" <MasterPrenium@...il.com>,
xen-devel@...ts.xenproject.org
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
Hi guys,
My issue still occurs with newer kernels, at least with the latest revision
of the 4.10.x branch.
But I found something that may be useful for the investigation. I have
attached another .config file for building the kernel. With this
configuration I am not able to reproduce the kernel panic: I get no crash
at all with exactly the same procedure.
Tested on kernels 4.9.16 and 4.10.13:
- config_Crash.txt: results in a crash within less than 2 minutes of
running fio
- config_NoCrash.txt: even after hours of fio, rebuilding arrays, etc.,
no crash at all, and no warning or anything else in dmesg.
Note: config_NoCrash comes from another server on which I had set up a
similar system and which was not crashing. I built a kernel with this
config and tested it on my crashing system, and it no longer crashes.
I don't understand how a different config can make a kernel BUG go away...
If anyone has any idea...
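
To narrow this down, one option is to list the options that differ between the two configs and then bisect that list. Below is a minimal sketch in Python, assuming both files use the usual kernel .config syntax ("CONFIG_FOO=value" and "# CONFIG_FOO is not set"). The function names parse_config and diff_configs are only illustrative. The kernel tree also ships scripts/diffconfig, which does a similar job.

```python
import re
import sys

# Line formats found in a kernel .config file.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name to its value; explicitly unset options map to "n"."""
    opts = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_configs(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    None means the option does not appear in that file at all.
    """
    names = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in names if a.get(k) != b.get(k)}


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        changes = diff_configs(parse_config(fa.read()), parse_config(fb.read()))
    for name, (old, new) in changes.items():
        print(f"{name}: {old} -> {new}")
```

Saved as, say, cfgdiff.py (a hypothetical name), it would be run as "python3 cfgdiff.py config_Crash.txt config_NoCrash.txt". Options related to md/raid5, Xen, preemption and debugging would probably be the first candidates to flip one at a time, but that is only a guess.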
Bests,
On 09/01/2017 at 23:44, Shaohua Li wrote:
> On Sun, Jan 08, 2017 at 02:31:15PM +0100, MasterPrenium wrote:
>> Hello,
>>
>> Replies below + :
>> - I don't know if this can help but after the crash, when the system
>> reboots, the Raid 5 stack is re-synchronizing
>> [ 37.028239] md10: Warning: Device sdc1 is misaligned
>> [ 37.028541] created bitmap (15 pages) for device md10
>> [ 37.030433] md10: bitmap initialized from disk: read 1 pages, set 59 of
>> 29807 bits
>>
>> - Sometimes the kernel crashes completely (serial + network connection lost),
>> sometimes I only get the "BUG" dump and still have network access (but a
>> reboot is impossible, I need to reset the system).
>>
>> - You can find blktrace here (while running fio), I hope it's complete since
>> the end of the file is when the kernel crashed : https://goo.gl/X9jZ50
> Looks most are normal full stripe writes.
>
>>> I'm trying to reproduce, but no success. So
>>> ext4->btrfs->raid5, crash
>>> btrfs->raid5, no crash
>>> right? does subvolume matter? When you create the raid5 array, does adding
>>> '--assume-clean' option change the behavior? I'd like to narrow down the issue.
>>> If you can capture the blktrace to the raid5 array, it would be great to hint
>>> us what kind of IO it is.
>> Yes Correct.
>> The subvolume doesn't matter.
>> --assume-clean doesn't change the behaviour.
> so it's not a resync issue.
>
>> Don't forget that the system needs to be running on xen to crash, without
>> (on native kernel) it doesn't crash (or at least, I was not able to make it
>> crash).
>>>> Regarding your patch, I can't find it. Is it the one sent by Konstantin
>>>> Khlebnikov ?
>>> Right.
>> It doesn't help :(. Maybe the crash is happening a little bit later.
> ok, the patch is unlikely helpful, since the IO size isn't very big.
>
> Don't have good idea yet. My best guess so far is virtual machine introduces
> extra delay, which might trigger some race conditions which aren't seen in
> native. I'll check if I could find something locally.
>
> Thanks,
> Shaohua
View attachment "Config_Crash.txt" of type "text/plain" (110513 bytes)
View attachment "Config_NoCrash.txt" of type "text/plain" (121929 bytes)