Date:	Thu, 8 Nov 2007 07:44:36 -0500 (EST)
From:	Justin Piszcz <jpiszcz@...idpixels.com>
To:	BERTRAND Joël <joel.bertrand@...tella.fr>
cc:	Chuck Ebbert <cebbert@...hat.com>, Neil Brown <neilb@...e.de>,
	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org
Subject: Re: 2.6.23.1: mdadm/raid5 hung/d-state



On Thu, 8 Nov 2007, BERTRAND Joël wrote:

> BERTRAND Joël wrote:
>> Chuck Ebbert wrote:
>>> On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
>>>> Neil Brown wrote:
>>>>> On Sunday November 4, jpiszcz@...idpixels.com wrote:
>>>>>> # ps auxww | grep D
>>>>>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME 
>>>>>> COMMAND
>>>>>> root       273  0.0  0.0      0     0 ?        D    Oct21  14:40
>>>>>> [pdflush]
>>>>>> root       274  0.0  0.0      0     0 ?        D    Oct21  13:00
>>>>>> [pdflush]
>>>>>> 
>>>>>> After several days/weeks, this is the second time this has happened:
>>>>>> while doing regular file I/O (decompressing a file), everything on
>>>>>> the device went into D-state.
>>>>> At a guess (I haven't looked closely) I'd say it is the bug that was
>>>>> meant to be fixed by
>>>>> 
>>>>> commit 4ae3f847e49e3787eca91bced31f8fd328d50496
>>>>> 
>>>>> except that patch applied badly and needed to be fixed with
>>>>> the following patch (not in git yet).
>>>>> These have been sent to stable@ and should be in the queue for 2.6.23.2
>>>>     My linux-2.6.23/drivers/md/raid5.c has contained your patch for a
>>>> long time:
>>>> 
>>>> ...
>>>>         spin_lock(&sh->lock);
>>>>         clear_bit(STRIPE_HANDLE, &sh->state);
>>>>         clear_bit(STRIPE_DELAYED, &sh->state);
>>>>
>>>>         s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
>>>>         s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
>>>>         s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
>>>>         /* Now to look around and see what can be done */
>>>>
>>>>         /* clean-up completed biofill operations */
>>>>         if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
>>>>                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
>>>>                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
>>>>                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
>>>>         }
>>>>
>>>>         rcu_read_lock();
>>>>         for (i=disks; i--; ) {
>>>>                 mdk_rdev_t *rdev;
>>>>                 struct r5dev *dev = &sh->dev[i];
>>>> ...
>>>> 
>>>> but it doesn't fix this bug.
>>>> 
>>> 
>>> Did that chunk starting with "clean-up completed biofill operations" end
>>> up where it belongs? The patch with the big context moves it to a 
>>> different
>>> place from where the original one puts it when applied to 2.6.23...
>>> 
>>> Lately I've seen several problems where the context isn't enough to make
>>> a patch apply properly when some offsets have changed. In some cases a
>>> patch won't apply at all because two nearly-identical areas are being
>>> changed and the first chunk gets applied where the second one should,
>>> leaving nowhere for the second chunk to apply.
>>
>>     I always apply this kind of patch by hand, not with the patch command.
>> The last patch sent here seems to fix this bug:
>> 
>> gershwin:[/usr/scripts] > cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md7 : active raid1 sdi1[2] md_d0p1[0]
>>       1464725632 blocks [2/1] [U_]
>>       [=====>...............]  recovery = 27.1% (396992504/1464725632) 
>> finish=1040.3min speed=17104K/sec
>
> 	Resync done. The patch fixes this bug.
>
> 	Regards,
>
> 	JKB
>

Excellent!

I cannot easily reproduce the bug on my system, so I will wait for the
next stable patch set to include it and let everyone know if it happens
again, thanks.
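
For the situation Chuck describes above, where a hunk can silently land next
to a nearly-identical block, one way to rehearse a patch and then confirm
where the biofill clean-up ended up is sketched below; the file name
raid5-biofill-fixup.patch is hypothetical, standing in for the follow-up fix
posted to the thread.

    cd linux-2.6.23
    # Rehearse the patch without touching the tree; patch reports any
    # offset or fuzz per hunk, so a misplaced hunk stands out.
    patch -p1 --dry-run --verbose < raid5-biofill-fixup.patch

    # After applying for real, confirm the clean-up block sits right
    # after the STRIPE_EXPAND_READY test, as in the excerpt quoted above.
    grep -n -A 3 "clean-up completed biofill operations" drivers/md/raid5.c

Applying by hand, as Joël did, avoids the misplacement risk entirely; the
dry run is just a quicker first check.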

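For reference, the finish estimate in the quoted /proc/mdstat output follows
directly from the numbers it prints: remaining blocks divided by the reported
speed. A quick shell check (1K blocks, speed in K/sec):

    echo $(( (1464725632 - 396992504) / 17104 / 60 ))   # ~1040 min, matching finish=1040.3min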