Date:   Thu, 5 Jan 2017 15:16:53 +0100
From:   MasterPrenium <masterprenium.lkml@...il.com>
To:     Shaohua Li <shli@...nel.org>
Cc:     linux-kernel@...r.kernel.org, xen-users@...ts.xen.org,
        linux-raid@...r.kernel.org,
        "MasterPrenium@...il.com" <MasterPrenium@...il.com>,
        xen-devel@...ts.xenproject.org
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

Hi Shaohua,

Thanks for your reply.

Let me explain what I mean by "huge". With a low-rate random I/O stream 
(< 1 MB written/sec) I don't get a crash, but with random I/O of about 
20 MB/sec the kernel crashes within a few minutes (for example when 
running an rsync, or even when synchronising my DRBD stack).
I don't know if this helps, but in most cases, when the kernel crashes, 
my RAID 5 array is re-synchronizing after the reboot.
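
For reference, fio can cap the write bandwidth to probe the threshold 
between the two cases. This is only a sketch; the 20 MB/s rate, size and 
file path here are assumptions, not the exact workload I ran:

fio --name=ratetest --ioengine=libaio --iodepth=1 --rw=randwrite --bs=1M 
--rate=20m --size=10G --runtime=600 --filename=/tmp/ext4/ratetest.dat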

I'm not able to reproduce the crash with a raw RAID5 stack (with dd/fio 
...).

It seems I need to stack filesystems to help reproduce it:

Here is a test configuration, with the command lines showing how I'm 
able to reproduce the crash. Everything is done in dom0:
- mdadm --create /dev/md10 --raid-devices=3 --level=5 /dev/sdc1 
/dev/sdd1 /dev/sde1
- mkfs.btrfs /dev/md10
- mkdir /tmp/btrfs /mnt/XenVM /tmp/ext4
- mount /dev/md10 /tmp/btrfs
- btrfs subvolume create /tmp/btrfs/XenVM
- umount /tmp/btrfs
- mount /dev/md10 /mnt/XenVM -osubvol=XenVM
- truncate /mnt/XenVM/VMTestFile.dat -s 800G
- mkfs.ext4 /mnt/XenVM/VMTestFile.dat
- mount /mnt/XenVM/VMTestFile.dat /tmp/ext4

-> Doing this doesn't seem to crash the kernel:
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite 
--rwmixwrite=95 --bs=1M --direct=1 --size=80G --numjobs=8 --runtime=600 
--group_reporting --filename=/mnt/XenVM/Fio.dat

-> Doing this crashes the kernel within a few minutes:
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite 
--rwmixwrite=95 --bs=1M --direct=1 --size=80G --numjobs=8 --runtime=600 
--group_reporting --filename=/tmp/ext4/ext4.dat
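
For reference, here is the crashing workload as a fio job file (just a 
transcription of the command line above, nothing added):

[randwrite]
ioengine=libaio
iodepth=1
rw=randwrite
rwmixwrite=95
bs=1M
direct=1
size=80G
numjobs=8
runtime=600
group_reporting
filename=/tmp/ext4/ext4.dat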

Note: --direct=1 or --direct=0 doesn't seem to change the behaviour. 
Whether the RAID 5 array is re-synchronizing or already synchronized 
also makes no difference.

Here is another "crash": http://pastebin.com/uqLzL4fn

Regarding your patch, I can't find it. Is it the one sent by Konstantin 
Khlebnikov?

Do you want the "ext4.dat" fio file? It would be really difficult for me 
to send it to you, as I only have a slow ADSL connection.

Thanks for your help,

MasterPrenium

On 04/01/2017 at 23:30, Shaohua Li wrote:
> On Fri, Dec 23, 2016 at 07:25:56PM +0100, MasterPrenium wrote:
>> Hello Guys,
>>
>> I'm having some trouble with a new system I'm setting up. I'm getting a kernel BUG message; it seems to be related to the use of Xen (when I boot the system _without_ Xen, I don't get any crash).
>> Here is the configuration:
>> - 3x hard drives in a software RAID 5 array created by mdadm
>> - On top of it, DRBD for replication to another node (active/passive cluster)
>> - On top of it, a BTRFS filesystem with a few subvolumes
>> - On top of it, XEN VMs running.
>>
>> The BUG happens when I'm doing "huge" I/O (20 MB/s with an rsync, for example) on the RAID 5 stack.
>> I have to reset the system to make it work again.
> What do you mean by 'huge' I/O (20 MB/s)? Is it possible for you to reproduce the
> issue with a raw raid5 array? It would be even better if you could give me a fio
> job file that triggers the issue, so I can easily debug it.
>
> Also, please check whether the upstream patch (e8d7c33 "md/raid5: limit request size
> according to implementation limits") helps.
>
> Thanks,
> Shaohua
