Message-ID: <4734C4BB.1050000@Lessem.org>
Date:	Fri, 09 Nov 2007 13:36:11 -0700
From:	Jeff Lessem <Jeff@...sem.org>
To:	Dan Williams <dan.j.williams@...el.com>
CC:	Bill Davidsen <davidsen@....com>,
	BERTRAND Joël <joel.bertrand@...tella.fr>,
	Justin Piszcz <jpiszcz@...idpixels.com>,
	Neil Brown <neilb@...e.de>, linux-kernel@...r.kernel.org,
	linux-raid@...r.kernel.org
Subject: Re: 2.6.23.1: mdadm/raid5 hung/d-state

Dan Williams wrote:
 > On 11/8/07, Bill Davidsen <davidsen@....com> wrote:
 >> Jeff Lessem wrote:
 >>> Dan Williams wrote:
 >>>> The following patch, also attached, cleans up cases where the code
 >>>> looks at sh->ops.pending when it should be looking at the consistent
 >>>> stack-based snapshot of the operations flags.
 >>> I tried this patch (against a stock 2.6.23), and it did not work for
 >>> me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
 >>> also I/O to all other disks.  I was not able to capture any debugging
 >>> information, but I should be able to do that tomorrow when I can hook
 >>> a serial console to the machine.
 >> That can't be good! This is worrisome because Joel is giddy with joy
 >> because it fixes his iSCSI problems. I was going to try it with nbd, but
 >> perhaps I'll wait a week or so and see if others have more information.
 >> Applying patches before a holiday weekend is a good way to avoid time
 >> off. :-(
 >
 > We need to see more information on the failure that Jeff is seeing,
 > and whether it goes away with the two known patches applied.  He
 > applied this most recent patch against stock 2.6.23 which means that
 > the platform was still open to the first biofill flags issue.

I applied both of the patches.  The biofill one did not apply cleanly:
it adds biofill handling in one section and removes it from another,
and the removal hunk does not appear to be needed against a stock
2.6.23 kernel.  The second patch applied with a slight offset, but with
no errors.
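
Roughly how I applied them, for reference (the patch file names below
are placeholders for the two attachments, not their real names):

cd linux-2.6.23
patch -p1 < raid5-biofill-flags.patch         # first fix; removal hunk fixed up by hand
patch -p1 < raid5-ops-pending-snapshot.patch  # second fix; applied with a slight offset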

I can report success so far with both patches applied.  I created a
1100GB RAID5 array, formatted it with XFS, and successfully ran a
"tar c | tar x" pipe to copy 895GB of data onto it.  I'm also in the
process of rsync-ing the 895GB of data from the (slightly changed)
original.  In the past, I would always get a hang within the first
0-50GB of data transfer.
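
The copy itself was a tar pipe along these lines (source and target
paths are placeholders, not the real ones):

(cd /data/original && tar cf - .) | (cd /mnt/md3 && tar xf -)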

For each member drive in the RAID ($i below), and for the array itself
(/dev/md3), I also set:

echo 128 > /sys/block/"$i"/queue/max_sectors_kb    # limit per-request size to 128k
echo 512 > /sys/block/"$i"/queue/nr_requests       # allow more requests to queue at the block layer
echo 1 > /sys/block/"$i"/device/queue_depth        # drop the device queue depth to 1
blockdev --setra 65536 /dev/md3                    # array readahead: 65536 512-byte sectors = 32MB
echo 16384 > /sys/block/md3/md/stripe_cache_size   # enlarge the raid5 stripe cache

These settings, together with a RAID5 chunk size of 1024k, appear to
improve performance, but on their own (without the patches) they do not
fix the hang.
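
For reference, a 1024k chunk is what you get from an mdadm create along
these lines (device names and member count are placeholders):

mdadm --create /dev/md3 --level=5 --chunk=1024 --raid-devices=4 /dev/sd[bcde]1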
