linux-kernel - Performance problems when writing large files on CCISS hardware

Message-ID: <195127.4908.qm@web32605.mail.mud.yahoo.com>
Date:	Wed, 23 Jan 2008 06:46:49 -0800 (PST)
From:	Martin Knoblauch <spamtrap@...bisoft.de>
To:	linux-kernel@...r.kernel.org
Cc:	mike.miller@...com, iss_storagedev@...com
Subject: Performance problems when writing large files on CCISS hardware

Please CC me on replies, as I am not subscribed.

Hi,

 For a while now I have been having problems writing large files sequentially to EXT2 filesystems on CCISS-based boxes. In non-DIO mode, writing multiple files in parallel is extremely slow compared to writing a single file. With DIO, the scaling is almost "perfect". The problem shows up in RHEL4 kernels (2.6.9-X) and in mainline kernels up to 2.6.24-rc8.

 The systems in question are HP/DL380G4 with 2 CPUs, 8 GB memory, SmartArray6i (CCISS) with BBWC and 4x72GB@...rpm disks in a RAID5 configuration. The environment is 64-bit RHEL4.3.

 The problem can be reproduced by running 1, 2 or 3 parallel "dd" processes, or "iozone" with 1, 2 or 3 threads. Curiously, there was a period from 2.6.24-rc1 until 2.6.24-rc5 where the problem went away. It turned out that this was due to a "regression" that was "fixed" by the commit below. Unfortunately the fix is not good for my systems, but it might shed some light on the underlying problem:

> #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> #Author: Mel Gorman <mel@....ul.ie>
> #Date:   Mon Dec 17 16:20:05 2007 -0800
> #
> #    mm: fix page allocation for larger I/O segments
> #
> #    In some cases the IO subsystem is able to merge requests if the pages are
> #    adjacent in physical memory.  This was achieved in the allocator by having
> #    expand() return pages in physically contiguous order in situations were a
> #    large buddy was split.  However, list-based anti-fragmentation changed the
> #    order pages were returned in to avoid searching in buffered_rmqueue() for a
> #    page of the appropriate migrate type.
> #
> #    This patch restores behaviour of rmqueue_bulk() preserving the physical
> #    order of pages returned by the allocator without incurring increased search
> #    costs for anti-fragmentation.
> #
> #    Signed-off-by: Mel Gorman <mel@....ul.ie>
> #    Cc: James Bottomley <James.Bottomley@...eleye.com>
> #    Cc: Jens Axboe <jens.axboe@...cle.com>
> #    Cc: Mark Lord <mlord@...ox.com>
> #    Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> #    Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
> diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
> --- linux-2.6.24-rc5/mm/page_alloc.c    2007-12-21 04:14:11.305633890 +0000
> +++ linux-2.6.24-rc6/mm/page_alloc.c    2007-12-21 04:14:17.746985697 +0000
> @@ -847,8 +847,19 @@
>                 struct page *page = __rmqueue(zone, order, migratetype);
>                 if (unlikely(page == NULL))
>                         break;
> +
> +               /*
> +                * Split buddy pages returned by expand() are received here
> +                * in physical page order. The page is added to the callers and
> +                * list and the list head then moves forward. From the callers
> +                * perspective, the linked list is ordered by page number in
> +                * some conditions. This is useful for IO devices that can
> +                * merge IO requests if the physical pages are ordered
> +                * properly.
> +                */
>                 list_add(&page->lru, list);
>                 set_page_private(page, migratetype);
> +               list = &page->lru;
>         }
>         spin_unlock(&zone->lock);
>         return i;
> 

 Reverting this patch from 2.6.24-rc8 gives the good performance reported below (rc8*). So, apparently CCISS is very sensitive to the page ordering.
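
 To make the effect of the one-line change in the quoted diff easier to see, here is a small userspace sketch. It is not kernel code: struct fake_page, fill_list(), collect_pfns() and the frame numbers are invented for illustration, and list_add() is a simplified stand-in with the same "insert right after head" semantics as the kernel helper. It only shows how the list order changes when the insertion point follows the last page added. It says nothing about why the controller reacts to that order.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical stand-in for struct page: a list link and a frame number. */
struct fake_page {
    struct list_head lru;   /* must stay the first member, see page_of() */
    unsigned long pfn;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Same semantics as the kernel's list_add(): insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

static struct fake_page *page_of(struct list_head *node)
{
    return (struct fake_page *)node; /* valid because lru is the first member */
}

/*
 * Mimics the loop touched by the diff: pages arrive in ascending physical
 * order and are added to the caller's list. With advance_cursor set, the
 * insertion point moves to the page just added ("list = &page->lru").
 */
static void fill_list(struct list_head *head, struct fake_page *pages,
                      size_t count, int advance_cursor)
{
    struct list_head *list = head;

    for (size_t i = 0; i < count; i++) {
        pages[i].pfn = 100 + i;          /* made-up ascending frame numbers */
        list_add(&pages[i].lru, list);
        if (advance_cursor)
            list = &pages[i].lru;
    }
}

/* Copy frame numbers into out[] in list order and return how many were copied. */
static size_t collect_pfns(const struct list_head *head,
                           unsigned long *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    for (struct list_head *node = head->next; node != head && n < max;
         node = node->next)
        out[n++] = page_of(node)->pfn;
    return n;
}
```

 Filling a list of four pages with advance_cursor == 0 leaves the list in descending frame order (103, 102, 101, 100), which corresponds to the loop with the commit reverted. With advance_cursor == 1 the list reads in ascending order (100, 101, 102, 103), which corresponds to the loop with the commit applied.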

 Here are the numbers (MB/sec), including sync time. I compare 2.6.24-rc8 (rc8) with 2.6.24-rc8 with the above commit reverted (rc8*). Reported is the combined throughput for 1, 2 and 3 iozone threads, with the DIO numbers for reference. Raw numbers are attached.

Test           rc8     rc8*
----------------------------
1x3GB          56      90
1x3GB-DIO      86      86
2x1.5GB         9.5    87
2x1.5GB-DIO    80      85
3x1GB          16.5    85
3x1GB-DIO      85      85

 One can see that in mainline/rc8 all non-DIO numbers are lower than the corresponding DIO numbers, and lower than the non-DIO numbers from rc8*. With 2 and 3 threads, mainline/rc8 drops to 9.5 and 16.5 MB/sec, which is just bad.
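
 The size of the gap can be worked out directly from the table. The sketch below only restates the non-DIO rows and divides rc8* by rc8. The struct and function names are made up for this illustration and are not part of any tool used for the measurements.

```c
#include <assert.h>
#include <stddef.h>

/* One non-DIO row from the table: combined throughput in MB/sec. */
struct result {
    const char *test;
    double rc8;          /* 2.6.24-rc8 */
    double rc8_reverted; /* 2.6.24-rc8 with the commit reverted (rc8*) */
};

/* Non-DIO rows copied from the table above. */
static const struct result buffered_results[] = {
    { "1x3GB",   56.0, 90.0 },
    { "2x1.5GB",  9.5, 87.0 },
    { "3x1GB",   16.5, 85.0 },
};

/* How many times faster rc8* is than rc8 for a given row. */
static double speedup(const struct result *r)
{
    assert(r != NULL && r->rc8 > 0.0);
    return r->rc8_reverted / r->rc8;
}
```

 Calling speedup() on the three rows gives roughly 1.6 for one thread, 9.2 for two threads and 5.2 for three threads.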

 Of course I have the option to revert commit ....54b6d on my systems, but I think a more general solution would be better. If I can help track down the real problem, I am open to suggestions.

Cheers
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


View attachment "cciss-rc8-bad.log" of type "text/x-log" (10205 bytes)

View attachment "cciss-rc8-good.log" of type "text/x-log" (10205 bytes)

Download attachment "config-2.6.24-rc8" of type "application/octet-stream" (44328 bytes)