Message-ID: <20070801004117.GE31006@wotan.suse.de>
Date: Wed, 1 Aug 2007 02:41:18 +0200
From: Nick Piggin <npiggin@...e.de>
To: "Siddha, Suresh B" <suresh.b.siddha@...el.com>
Cc: Christoph Lameter <clameter@....com>, linux-kernel@...r.kernel.org,
arjan@...ux.intel.com, mingo@...e.hu, ak@...e.de,
jens.axboe@...cle.com, James.Bottomley@...elEye.com,
andrea@...e.de, akpm@...ux-foundation.org,
andrew.vasquez@...gic.com
Subject: Re: [rfc] direct IO submission and completion scalability issues
On Tue, Jul 31, 2007 at 10:14:03AM -0700, Suresh B wrote:
> On Tue, Jul 31, 2007 at 06:19:17AM +0200, Nick Piggin wrote:
> > On Mon, Jul 30, 2007 at 01:35:19PM -0700, Suresh B wrote:
> > > So any suggestions for making this clean and acceptable to everyone?
> >
> > It is obviously a good idea to hand over the IO at the point which
> > requires the least number of cachelines to be moved, and I think doing
> > it in the block layer is right. Mostly you have to convince the block
> > and driver maintainers I guess.
>
> Yes. Implementation is the challenging part I guess.
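To make the handoff idea above concrete, I was imagining something along
these lines (completely untested sketch: blk_complete_on_submitter() and
the rq->submit_cpu field are made up, and a real version would probably
defer via the block softirq rather than IPI from hard-irq context):

/* Sketch: remember the submitting CPU in the request and, on completion,
 * push the heavy end-of-IO work back there so the request's cachelines
 * stay local to the submitter.  rq->submit_cpu is hypothetical. */
static void complete_on_submitter(void *data)
{
	struct request *rq = data;

	/* ... usual end-of-IO processing for rq goes here ... */
}

static void blk_complete_on_submitter(struct request *rq)
{
	int cpu = rq->submit_cpu;	/* recorded at submit time */

	if (cpu == smp_processor_id())
		complete_on_submitter(rq);
	else
		/* recent kernels take (cpu, func, info, wait); older ones
		 * have an extra "nonatomic" argument */
		smp_call_function_single(cpu, complete_on_submitter, rq, 0);
}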
>
> > The scheduler really should be made interrupt-load aware anyway, so I
> > don't have a problem with changing that; or scheduling kblockd at a
> > higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
> > it be done in a softirq instead?
>
> Yes, softirq context is one way. But I just didn't want to penalize the
> running task by taking away some of its CPU time. With CFS micro-accounting,
> perhaps we can track irq and softirq time and avoid penalizing the running
> task's CPU time.
But you "penalize" the running task in the completion handler anyway.
Doing this with a SCHED_FIFO task is sort of like doing interrupt
threading, which AFAIK has not been accepted (yet).
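For reference, the SCHED_FIFO variant would boil down to something like
this (sketch only; boost_kblockd() is made up, and getting hold of the
per-cpu kblockd worker's task_struct is hand-waved):

/* Sketch: boost a kblockd worker to realtime priority so the unplug
 * work it runs preempts ordinary tasks on that CPU. */
static int boost_kblockd(struct task_struct *worker)
{
	struct sched_param param = { .sched_priority = 1 };

	return sched_setscheduler(worker, SCHED_FIFO, &param);
}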
> > Latency for IO migration could be the most difficult problem to solve
> > really. You don't give much details of the workload, profiles, etc... I
> > hope this is for a real world test?
>
> The improvement numbers quoted are from an OLTP database workload. We can
> look into other workloads.
>
> > Can the locking be improved in simpler ways first?
> >
> > Just some random questions...
> >
> > It looks like the main source of cacheline bouncing you're eliminating
> > is from the initial starting of IO from an empty queue (ie. unplug).
> > From then on, the submission is driven by completion, right?
> >
> > Why is the queue allowed to go empty in the first place in an IO critical
> > workload?
>
> This workload is using direct IO and there is no batching at the block layer
> for direct IO. IO is submitted to the HW as it arrives.
So you aren't putting concurrent requests into the queue? Sounds like
userspace should be improved.
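i.e. keep a fixed queue depth of O_DIRECT requests outstanding from
userspace, something like this libaio sketch (submit_batch(), DEPTH and
the buffer handling are illustrative only; the buffers must be aligned
suitably for O_DIRECT):

#include <libaio.h>
#include <sys/types.h>

#define DEPTH	32	/* illustrative queue depth */

/* Submit DEPTH concurrent direct-IO reads in one go so the device
 * queue does not drain between submissions. */
static int submit_batch(int fd, void **bufs, size_t bs)
{
	io_context_t ctx = 0;
	struct iocb cbs[DEPTH], *cbp[DEPTH];
	struct io_event events[DEPTH];
	int i, ret;

	if (io_setup(DEPTH, &ctx) < 0)
		return -1;

	for (i = 0; i < DEPTH; i++) {
		io_prep_pread(&cbs[i], fd, bufs[i], bs, (long long)i * bs);
		cbp[i] = &cbs[i];
	}

	/* one io_submit() call queues all DEPTH requests */
	if (io_submit(ctx, DEPTH, cbp) != DEPTH) {
		io_destroy(ctx);
		return -1;
	}

	ret = io_getevents(ctx, DEPTH, DEPTH, events, NULL);
	io_destroy(ctx);
	return ret;
}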
> > Are you loading up each CPU with as many disks as it can possibly handle
> > plus a few more? If so, is that realistic? (I honestly don't know).
>
> There is 3-4% iowait time in the system, so the CPUs are not 100% busy,
> but there is quite a bit of direct IO going on.
>
> > You say that you'd like to do this for direct IO only, but if it is more
> > efficient, why not for buffered IO as well? (or is it not more efficient
> > for buffered IO? if not, why?)
>
> It is applicable to both direct IO and buffered IO, but the implementations
> will differ. For example, for buffered IO we can set things up so that the
> block plug timeout function runs on the IO completion CPU.
It would be nice to be doing that anyway. But unplug via request submission
rather than timeout is fairly common in buffered loads too.
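Arming the queue's unplug timer on the completing CPU would be the obvious
way to do that, something like (sketch only; blk_arm_unplug_on() is made
up, the unplug_timer/unplug_delay field names are from memory of the
existing unplug code, and this assumes the timer isn't already pending):

/* Sketch: fire the plug timeout on the CPU that completed the last IO,
 * so the deferred submission runs there and the queue's cachelines
 * stay local.  Assumes the timer is not already pending. */
static void blk_arm_unplug_on(struct request_queue *q, int cpu)
{
	q->unplug_timer.expires = jiffies + q->unplug_delay;
	add_timer_on(&q->unplug_timer, cpu);
}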
> > AFAIKS, you'd still have significant queue_lock contention from other
> > CPUs inserting requests into the list?
>
> Correct. There is more potential to explore; the current implementation
> is very elementary.
>
> > What IO scheduler are you using? I assume noop...
>
> yes.
>
> > as a crazy experiment, what happens if you create per-cpu request queues?
>
> or in other words, each kblockd thread catering to multiple request queues
> (perhaps one per CPU, or one per group of CPUs).
>
> softirq context and having each kblockd thread handle multiple request
> queues will lead to further improvements.
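Roughly what I had in mind for the softirq / per-cpu direction, sketched
below (percpu_rq_list and the two functions are made up, list
initialisation and softirq registration are omitted, and reusing
BLOCK_SOFTIRQ here is purely illustrative):

#include <linux/blkdev.h>
#include <linux/interrupt.h>
#include <linux/percpu.h>

/* Sketch: per-cpu request lists drained from softirq context instead of
 * a dedicated realtime kblockd thread. */
static DEFINE_PER_CPU(struct list_head, percpu_rq_list);

static void percpu_rq_add(struct request *rq)
{
	unsigned long flags;

	local_irq_save(flags);
	list_add_tail(&rq->queuelist, &__get_cpu_var(percpu_rq_list));
	raise_softirq_irqoff(BLOCK_SOFTIRQ);
	local_irq_restore(flags);
}

static void percpu_rq_softirq(struct softirq_action *h)
{
	LIST_HEAD(local);
	struct request *rq;

	/* steal this CPU's pending list with irqs off, then dispatch */
	local_irq_disable();
	list_replace_init(&__get_cpu_var(percpu_rq_list), &local);
	local_irq_enable();

	while (!list_empty(&local)) {
		rq = list_entry(local.next, struct request, queuelist);
		list_del_init(&rq->queuelist);
		/* ... hand rq to the low-level driver here ... */
	}
}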