linux-kernel - Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD


Message-ID: <f936f158-a256-252f-02ed-ce23f053715f@gmail.com>
Date:   Sat, 13 May 2017 02:06:31 +0200
From:   MasterPrenium <masterprenium.lkml@...il.com>
To:     Shaohua Li <shli@...nel.org>
Cc:     linux-kernel@...r.kernel.org, xen-users@...ts.xen.org,
        linux-raid@...r.kernel.org,
        "MasterPrenium@...il.com" <MasterPrenium@...il.com>,
        xen-devel@...ts.xenproject.org
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

Hi guys,

My issue is still present with newer kernels, at least with the latest 
revision of the 4.10.x branch.

However, I found something that may be useful for the investigation. I 
have attached a second .config file for the kernel build. With this 
configuration I am unable to reproduce the kernel panic: I get no crash 
at all with exactly the same procedure.

Tested on kernels 4.9.16 and 4.10.13:
- config_Crash.txt: results in a crash within 2 minutes of running fio
- config_NoCrash.txt: even after hours of fio, rebuilding arrays, etc., 
no crash at all, and no warning or anything else in dmesg.

Note: config_NoCrash comes from another server on which I had set up a 
similar system and which was not crashing. I tested this kernel on my 
crashing system, and it no longer crashes.

I can't understand how a different config can make a kernel BUG go away...
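One way to narrow this down would be to list exactly which options differ between the two configs and then bisect over that list. The sketch below is only an illustration of that idea, not something I have run against the attached files: the function names are made up, and it assumes the usual .config layout ("CONFIG_FOO=value" and "# CONFIG_FOO is not set"). An option missing from one file is treated as "n", which is an approximation. The kernel tree also ships scripts/diffconfig for the same purpose.

```python
import re

# Assumed .config line shapes: "CONFIG_FOO=y" and "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option_name: value} for one kernel .config given as a string."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            # Explicitly disabled options are recorded as "n".
            opts[m.group(1)] = "n"
    return opts


def diff_configs(a, b):
    """Return {option_name: (value_in_a, value_in_b)} for options that differ.

    An option absent from one side is treated as "n" (an approximation:
    it may simply be hidden by an unmet dependency).
    """
    changes = {}
    for name in sorted(set(a) | set(b)):
        va = a.get(name, "n")
        vb = b.get(name, "n")
        if va != vb:
            changes[name] = (va, vb)
    return changes
```

To use it, read both attachments into strings, pass each through parse_config(), and print the result of diff_configs(); the differing options could then be flipped in halves, starting from the crashing config, until the crash disappears.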

If someone has any idea...

Best regards,


On 09/01/2017 at 23:44, Shaohua Li wrote:
> On Sun, Jan 08, 2017 at 02:31:15PM +0100, MasterPrenium wrote:
>> Hello,
>>
>> Replies below + :
>> - I don't know if this helps, but after the crash, when the system
>> reboots, the RAID 5 array re-synchronizes
>> [   37.028239] md10: Warning: Device sdc1 is misaligned
>> [   37.028541] created bitmap (15 pages) for device md10
>> [   37.030433] md10: bitmap initialized from disk: read 1 pages, set 59 of
>> 29807 bits
>>
>> - Sometimes the kernel crashes completely (serial and network connections
>> lost); sometimes I only get the "BUG" dump and still have network access
>> (but a reboot is impossible; the system needs a hard reset).
>>
>> - You can find blktrace here (while running fio), I hope it's complete since
>> the end of the file is when the kernel crashed : https://goo.gl/X9jZ50
> Looks like most are normal full stripe writes.
>   
>>> I'm trying to reproduce, but no success. So
>>> ext4->btrfs->raid5, crash
>>> btrfs->raid5, no crash
>>> right? does subvolume matter? When you create the raid5 array, does adding
>>> '--assume-clean' option change the behavior? I'd like to narrow down the issue.
>>> If you can capture the blktrace to the raid5 array, it would be great to hint
>>> us what kind of IO it is.
>> Yes Correct.
>> The subvolume doesn't matter.
>> -- assume-clean doesn't change the behaviour.
> so it's not a resync issue.
>
>> Don't forget that the system needs to be running on xen to crash, without
>> (on native kernel) it doesn't crash (or at least, I was not able to make it
>> crash).
>>>> Regarding your patch, I can't find it. Is it the one sent by Konstantin
>>>> Khlebnikov ?
>>> Right.
>> It doesn't help :(. Maybe the crash is happening a little bit later.
> ok, the patch is unlikely to help, since the IO size isn't very big.
>
> I don't have a good idea yet. My best guess so far is that the virtual
> machine introduces extra delay, which might trigger some race conditions
> that aren't seen on native.  I'll check if I can find something locally.
>
> Thanks,
> Shaohua


View attachment "Config_Crash.txt" of type "text/plain" (110513 bytes)

View attachment "Config_NoCrash.txt" of type "text/plain" (121929 bytes)