linux-kernel - Re: Problem with shared interrupt latency with a RAID6 array?

Message-ID: <mtudh6hhtqr33e97nkggeiug68pgdlfdab@4ax.com>
Date:	Sun, 26 Dec 2010 19:40:19 +1100
From:	Grant Coady <gcoady.lk@...il.com>
To:	Robert Hancock <hancockrwd@...il.com>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Problem with shared interrupt latency with a RAID6 array?

On Fri, 24 Dec 2010 16:19:07 -0600, you wrote:

>On 12/22/2010 05:57 AM, Grant Coady wrote:
>> Hi there,
>>
>> Built my first RAID6 array with 5 x 1TB SATA drives.
>>
>> I notice this odd number in the SMART values for the last two drives on the
>> array.  The drives connect to an Intel ICH9R chip, the mobo has a 2.13GHz
>> Core2Duo CPU and 4GB memory, running Slackware64-13.1 with 2.6.36.2a kernel.
>>
>> While feeding data into the array from a USB 2.0 attached drive, the box's
>> load average was about 3.5, the box was very responsive and I transferred
>> over 900GB into the RAID6 array.
>>
>> The fourth and fifth drives report lots of command timeouts in the SMART
>> data.  Is this a problem?
>>
>> Is it because the drives share an interrupt?
>>
>> Extract from dmesg:
>>
>> root@...h:~# egrep -e '^(ahci|ata)' /var/log/dmesg
>> ahci 0000:00:1f.2: version 3.0
>> ahci 0000:00:1f.2: PCI INT B ->  GSI 19 (level, low) ->  IRQ 19
>> ahci 0000:00:1f.2: irq 40 for MSI/MSI-X
>> ahci: SSS flag set, parallel bus scan disabled
>> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
>> ahci 0000:00:1f.2: flags: 64bit ncq sntf stag pm led clo pmp pio slum part ccc ems
>> ahci 0000:00:1f.2: setting latency timer to 64
>> ata1: SATA max UDMA/133 abar m2048@...6386000 port 0xf6386100 irq 40
>> ata2: SATA max UDMA/133 abar m2048@...6386000 port 0xf6386180 irq 40
>> ata3: SATA max UDMA/133 abar m2048@...6386000 port 0xf6386200 irq 40
>> ata4: SATA max UDMA/133 abar m2048@...6386000 port 0xf6386280 irq 40
>> ata5: SATA max UDMA/133 abar m2048@...6386000 port 0xf6386300 irq 40
>> ata6: SATA max UDMA/133 abar m2048@...6386000 port 0xf6386380 irq 40
>> ata7: PATA max UDMA/100 cmd 0xc000 ctl 0xc100 bmdma 0xc400 irq 16
>> ata8: PATA max UDMA/100 cmd 0xc200 ctl 0xc300 bmdma 0xc408 irq 16
>> ata7.00: ATAPI: PIONEER DVD-RW  DVR-110D, 1.41, max UDMA/66
>> ata7.00: configured for UDMA/66
>> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata1.00: ATA-8: ST31000528AS, CC46, max UDMA/133
>> ata1.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> ata1.00: configured for UDMA/133
>> ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata2.00: ATA-8: ST31000528AS, CC46, max UDMA/133
>> ata2.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> ata2.00: configured for UDMA/133
>> ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata3.00: ATA-8: ST31000528AS, CC46, max UDMA/133
>> ata3.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> ata3.00: configured for UDMA/133
>> ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata4.00: ATA-8: ST31000528AS, CC46, max UDMA/133
>> ata4.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> ata4.00: configured for UDMA/133
>> ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata5.00: ATA-8: ST31000528AS, CC46, max UDMA/133
>> ata5.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
>> ata5.00: configured for UDMA/133
>> ata6: SATA link down (SStatus 0 SControl 300)
>>
>> And here's SMART's command timeout numbers:
>>
>> root@...h:~# for d in a b c d e; do smartctl -a /dev/sd${d} |grep Command_Timeout; done
>> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
>> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
>> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
>> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       65537
>> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       65537
>>
>> Is this a problem?  Is there something I can change in the .config?
>
>Well, if it is a problem it's presumably hardware related. Are those 
>command timeout numbers increasing?

No, it's not increasing.  I just noticed the number there one day.  The 
drives were purchased over a period of several weeks, and the last two 
drives were bought specifically for building the RAID array.  More info:

root@...h:~# for d in a b c d e; do smartctl -a /dev/sd${d} |gawk '/Seri/{print};/Reall|Start_|Power_O|Power_C|Comman/{printf"    %-22s %d\n",$2,$10}'; done
Serial Number:    9VP7PVAZ
    Start_Stop_Count       70
    Reallocated_Sector_Ct  0
    Power_On_Hours         353
    Power_Cycle_Count      35
    Command_Timeout        0
Serial Number:    9VP7RR7A
    Start_Stop_Count       146
    Reallocated_Sector_Ct  0
    Power_On_Hours         512
    Power_Cycle_Count      70
    Command_Timeout        0
Serial Number:    9VP7PJ62
    Start_Stop_Count       121
    Reallocated_Sector_Ct  0
    Power_On_Hours         456
    Power_Cycle_Count      58
    Command_Timeout        0
Serial Number:    9VP7PYDY
    Start_Stop_Count       79
    Reallocated_Sector_Ct  0
    Power_On_Hours         330
    Power_Cycle_Count      35
    Command_Timeout        65537
Serial Number:    9VP7QJJM
    Start_Stop_Count       72
    Reallocated_Sector_Ct  0
    Power_On_Hours         305
    Power_Cycle_Count      31
    Command_Timeout        65537

> If so, then you might look at 
>anything that might be common to those two drives - things like having 
>too many hard drives on one power cable coming from the power supply 
>have caused drive problems for some people in the past. In some cases 
>power supply problems can occur when running multiple hard drives in a 
>machine, especially in a RAID configuration where all drives are likely 
>to be accessed at once.

Box has a 600W power supply, which has been quite reliable.  I can add 
filter caps to the power rails.  

I no longer suspect interrupt latency, but I have no clue why those 
timeouts happened -- might've been a mistyped dd zero drive command or 
something?
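
One possible reading of the identical 65537 on both drives, offered as an 
assumption rather than anything confirmed in this thread: the raw value of 
attribute 188 may be several 16-bit counters packed together rather than a 
single count.  65537 is 0x10001, which would split into two fields of 1 
each.  A minimal sketch of that split (decode_raw is a hypothetical helper 
name, and the field meanings are not documented here):

```shell
# Split a SMART raw value into three 16-bit words, low word first.
# Assumption: the drive packs separate counters into the raw value.
decode_raw() {
    raw=$1
    echo "$(( $raw & 0xFFFF )) $(( ($raw >> 16) & 0xFFFF )) $(( ($raw >> 32) & 0xFFFF ))"
}

decode_raw 65537    # prints: 1 1 0
```

If that reading is right, smartctl's "-v 188,raw16" option should show the 
same split directly, and the figure would mean one or two events per drive 
rather than tens of thousands.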

After a couple of days of data I/O I've had no RAID errors.  The only 
problem is getting the speed up: it seems to run at half speed, about 
43MB/s max.  I thought it would go much faster, about twice that.  I've 
still to look at the scheduler, timebase rate and preemption -- do they 
make a difference?

I turned off NCQ; load average seems to drop as queue depth gets closer 
to 1, though I've yet to script a formal benchmark of the effect, say 
queue depths of 1, 3, 7, 15, 31 --> data rate and load average.
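
A rough sketch of what such a benchmark script could look like.  Everything 
specific in it is an assumption, not something measured in this thread: the 
array device name /dev/md0, the 1 GiB direct read used as the workload, and 
the helper names run and queue_depth_sweep.  It defaults to a dry run that 
only prints the commands; set DRY_RUN=0 and run as root to execute them.

```shell
# Sweep the per-drive queue depth and record read throughput and load average.
ARRAY=${ARRAY:-/dev/md0}          # assumed array device, adjust to suit
DRIVES=${DRIVES:-"a b c d e"}     # sda..sde, as in the smartctl loops above
DEPTHS=${DEPTHS:-"1 3 7 15 31"}
DRY_RUN=${DRY_RUN:-1}             # 1 = print commands only, 0 = execute

# Either print a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

queue_depth_sweep() {
    for depth in $DEPTHS; do
        # Set the queue depth on every member drive via sysfs.
        for d in $DRIVES; do
            run sh -c "echo $depth > /sys/block/sd$d/device/queue_depth"
        done
        echo "depth=$depth"
        # dd reports throughput on stderr; keep only its summary line.
        run sh -c "dd if=$ARRAY of=/dev/null bs=1M count=1024 iflag=direct 2>&1 | tail -n 1"
        run cat /proc/loadavg
    done
}

queue_depth_sweep
```

The first field of /proc/loadavg is a one-minute average, so a read this 
short will not let it settle; a longer run per depth (a larger count) 
would give more meaningful load figures.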

Thanks,
Grant.

>
>>
>> Config and full dmesg are at:
>>
>>    http://bugsplatter.id.au/kernel/boxen/pooh/config-2.6.36.2a.gz
>>    http://bugsplatter.id.au/kernel/boxen/pooh/dmesg-2.6.36.2a.gz
>>
>> Ask, and I'll provide more info, do tests and so on.
>>
>> Could this issue be related to RAID6 unreliability reports one finds for
>> some Linux based NAS devices on the 'net?
>>
>> Thanks,
>> Grant.