Message-Id: <20070806103317.5688d6df.akpm@linux-foundation.org>
Date: Mon, 6 Aug 2007 10:33:17 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Dimitrios Apostolou <jimis@....net>
Cc: linux-kernel@...r.kernel.org
Subject: Re: high system cpu load during intense disk i/o
On Mon, 06 Aug 2007 16:20:30 +0200 Dimitrios Apostolou <jimis@....net> wrote:
> Andrew Morton wrote:
> > On Sun, 5 Aug 2007 19:03:12 +0300 Dimitrios Apostolou <jimis@....net> wrote:
> >
> >> was my report so complicated?
> >
> > We're bad.
> >
> > Seems that your context switch rate when running two instances of
> > badblocks against two different disks went batshit insane. It doesn't
> > happen here.
> >
> > Please capture the `vmstat 1' output while running the problematic
> > workload.
> >
> > The oom-killing could have been unrelated to the CPU load problem. iirc
> > badblocks uses a lot of memory, so it might have been genuine. Keep an eye
> > on the /proc/meminfo output and send the kernel dmesg output from the
> > oom-killing event.
>
> Please see the attached files. Unfortunately I don't see any useful info
> in them:
> *_before: before running any badblocks process
> *_while: while the badblocks processes were running, but before any
> cron job had kicked in
> *_bad: 5 minutes later, after some cron jobs had kicked in
>
> About the OOM killer, I indeed believe it is unrelated. It started
> killing after about 2 days, by which point hundreds of processes were
> stuck in the running state and taking up memory, so I suppose the
> 256 MB of RAM was genuinely full. I just mentioned it because its
> behaviour is completely unhelpful: it doesn't touch the badblocks
> processes, it rarely touches the stuck cron jobs, but it kills other,
> irrelevant processes. If you still want the killing logs, tell me and
> I'll search for them.
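(For the record, the capture being asked for above amounts to something
like this, run in a second terminal while the two badblocks instances are
going; a minimal sketch, with the log file names being arbitrary:)

    vmstat 1 > vmstat.log &
    while true; do
            cat /proc/meminfo >> meminfo.log
            sleep 10
    done
    # and after an oom-kill, grab the kernel messages:
    dmesg > dmesg-oom.log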
ah. Your context-switch rate during the dual-badblocks run is not high at
all.
I suspect I was fooled by the oprofile output, which showed tremendous
amounts of load in schedule() and switch_to(). The percentages which
opreport shows are percentages of non-halted CPU time. So if a kernel
function is using 1% of the total CPU time while the CPU is halted for
95% of the time, opreport makes it appear that the function is taking 20%
of the CPU!
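In other words, opreport reports the function's share of the non-halted
time only:

    apparent share = actual share / non-halted fraction
                   = 1% / (100% - 95%)
                   = 1% / 5%
                   = 20%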
The fix for that is to boot with the "idle=poll" boot parameter, to make
the CPU spin when it has nothing else to do.
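With a grub-legacy setup, for example, that means appending it to the
kernel line in /boot/grub/menu.lst (the kernel image name and root device
here are only illustrative):

    kernel /boot/vmlinuz-2.6.22 root=/dev/sda1 ro idle=poll

Note that with idle=poll the idle loop burns power and generates heat, so
it is a debugging-only measure.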
I suspect that your machine is simply stuck in D state, waiting for the disk.
Did we have a sysrq-T trace?
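If not, one can be generated like so (assuming sysrq support is compiled
in; the traces all land in the kernel log):

    echo 1 > /proc/sys/kernel/sysrq    # enable sysrq if it isn't already
    echo t > /proc/sysrq-trigger       # dump every task's state and stack
    dmesg                              # the traces appear in the kernel log

Or hit Alt-SysRq-T on the console.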