Message-Id: <20090407003524.41c9b666.akpm@linux-foundation.org>
Date: Tue, 7 Apr 2009 00:35:24 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: adobra@...e.ufl.edu
Cc: linux-kernel@...r.kernel.org
Subject: Re: I/O throughput problem in newer kernels
On Thu, 2 Apr 2009 11:06:08 -0400 (EDT) adobra@...e.ufl.edu wrote:
> While putting together a not so average machine for database research, I
> bumped into the following performance problem with the newer kernels (I
> tested 2.6.27.11, 2.6.29): the aggregate throughput drops drastically when
> more than 20 hard drives are involved in the operation. This problem does
> not occur on 2.6.22.9 or 2.6.20 (I did not test other kernels).
Well that's bad. You'd at least expect the throughput to level out.
> Since I am not subscribed to the mailing list, I would appreciate you
> cc-ing me on any reply or discussion.
>
> 1. Description of the machine
> -----------------------------------------------
> 8 Quad-Core AMD Opteron(tm) Processor 8346 HE
> Each processor has independent memory banks (16GB in each bank for 128GB
> total)
> Two PCI busses (connected in different places in the NUMA architecture)
> 8 hard drives installed into the base system on SATA interfaces
> First hard drive dedicated to the OS
> 7 Western Digital hard drives (90 MB/s max throughput)
> Nvidia SATA chipset
> 4 Adaptec 5805 RAID cards installed in PCI-E 16X slots (all running at 8X
> speed)
> The 4 cards live on two separate PCI busses
> 6 IBM EXP3000 disk enclosures
> 2 cards connect to 2 enclosures each, the other 2 to 1 enclosure
> 8 Western Digital Velociraptor HD in each enclosure
> Max measured throughput 110-120 MB/s
>
> Total number of hard drives used in the tests: 7+47=54, or subsets thereof.
> The Adaptec cards are configured to expose each disk individually to the
> OS. Any RAID configuration seems to limit the throughput to 300-350 MB/s,
> which is too low for the purpose of this system.
>
> 2. Throughput tests
> --------------------------------
> I did two types of tests: using dd (spawning parallel dd jobs that lasted
> at least 10s) and using a multi-threaded program that simulates the
> intended usage of the system. Results from both are consistent, so I will
> only report the results from the custom program. Both the dd test and the
> custom one do reads in large chunks (256K/request at least). All requests
> in the custom program are made with the read() system call into page-aligned
> memory (allocated with mmap to make sure of that). The kernel must be doing
> zero-copy to user space, otherwise the observed speeds would not be possible.
>
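
For reference, a rough sketch of the sort of per-disk reader you describe
(page-aligned buffer from mmap(), large read()s, a run of roughly ten
seconds); /dev/sdb, the 256K chunk size and the 10s cutoff below are
placeholders, not taken from your program:

/* One reader's loop: mmap() gives a page-aligned buffer, then issue
 * large sequential read()s against a single disk for ~10 seconds and
 * report the throughput. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (256 * 1024)
#define SECONDS 10.0

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* mmap() hands back page-aligned memory */
    char *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    double start = now();
    unsigned long long total = 0;
    ssize_t n;

    while (now() - start < SECONDS && (n = read(fd, buf, CHUNK)) > 0)
        total += n;

    printf("%.1f MB/s\n", total / (1024.0 * 1024.0) / (now() - start));
    close(fd);
    return 0;
}
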
> Here is what I observed in terms of throughput:
> a. Speed/WD disk: 90 MB/s
> b. Speed/Velociraptor disk: 110 MB/s
> c. Speed of all WD disks in base system: 700 MB/s
> d. Speed of disks in one enclosure: 750 MB/s
> e. Speed of disks connected to one Adaptec card: 1000 MB/s
> f. Speed of disks connected on a single PCI bus: 2000 MB/s
>
> The above numbers look good and are consistent on all kernels that I tried.
>
> THE PROBLEM: when the number of disks exceeds 20, the throughput plummets
> on newer kernels.
>
> g. SPEED OF ALL DISKS: 600 MB/s on newer kernels, 2700 MB/s on older kernels
> The throughput drops drastically the moment 20-25 hard drives are involved
>
> 3. Tests I performed to ensure the number of hard drives is the culprit
> ----------------------------------------------------------------------------------------------------------------
> a. Took 1, 2, 3 and 4 disks from each enclosure to ensure uniform load on
> buses
> Performance went up as expected until 20 drives were reached, then dropped
>
> b. Involved combinations of the regular WD drives and the Velociraptors.
> Had no major influence on the observation
>
> c. Involved combinations of enclosures
> No influence
>
> d. Used the hard drives in decreasing order of measured speed (as reported
> by hdparm)
> Only a minor influence, and still a drastic drop at 20 drives
>
> e. Changed the I/O scheduler used for the hard drives
> No influence
>
> 4. Things that I do not think are wrong
> --------------------------------------------------------------
> a. aacraid or sata_nv drivers
> The problem depends only on the number of hard drives, not on which
> particular drives are involved
>
> b. Limitations on the buses
> The measured speeds of the subsystems indicate that no bottleneck on
> individual buses is reached. Even if that were the case, the throughput
> should level off, not drop dramatically
>
> c. Failures in the system
> No errors reported in /var/log/messages or other logs related to I/O
>
> Of course, this raises the question: WHAT IS WRONG?
>
> I would be more than happy to run any tests you suggest on my system to
> find the problem.
>
Did you monitor the CPU utilisation?
It would be interesting to test with O_DIRECT (dd iflag=direct) to
remove the page allocator and page reclaim from the picture.
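
Something along the lines of the sketch below (each disk opened with
O_DIRECT and read into a suitably aligned buffer) takes the page cache
out of the loop entirely; the /dev/sdb path, the 4096-byte alignment
and the 1GB cap are assumptions, adjust them for your setup. For a
quick check with dd, iflag=direct on each dd instance does the same
thing.

/* Same read loop as above, but with O_DIRECT so the page cache,
 * page allocator and reclaim are bypassed.  O_DIRECT requires the
 * buffer, transfer size and file offset to be aligned; 4096 bytes
 * is assumed to be a safe alignment here. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (256 * 1024)
#define ALIGN 4096

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, ALIGN, CHUNK)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    unsigned long long total = 0;
    ssize_t n;

    /* stop after ~1GB just to keep the example bounded */
    while (total < (1ULL << 30) && (n = read(fd, buf, CHUNK)) > 0)
        total += n;

    printf("read %llu bytes with O_DIRECT\n", total);
    free(buf);
    close(fd);
    return 0;
}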