[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <48C11BF3.2060609@weinigel.se>
Date: Fri, 05 Sep 2008 13:45:55 +0200
From: Christer Weinigel <christer@...nigel.se>
To: linux-kernel@...r.kernel.org
CC: Ben Dooks <ben-linux@...ff.org>,
Pierre Ossman <drzeus-list@...eus.cx>
Subject: Proposed SDIO layer rework
Hi Pierre,
I said I wanted to argue a bit about the SDIO layer, and here it is. :-)
Ben, this is about the s3cmci driver, so I guess you might be interested.
I've been trying to make SDIO run and run fast on a small embedded
system with a Samsung S3C24A0 processor with 200 MHz ARM926 core. While
doing this I've discovered some changes that I'd like to make to the
generic SDIO layer.
First some measurements. I've connected the SDIO bus to an Agilent
scope to get some pretty plots. To measure CPU load I have a small test
program which just counts how many gettimeofday calls I can do in one
second. On on idle system I get the following:
269648 kcycles in 1.000 s, 269 kcycles/s
I've written a test firmware for one of our SDIO chips which streams out
data using the SDIO Bluetooth Type A protocol, at 250 packets per
second, 2 kByte per packet for a total of about 4 MBit transfer rate.
The device driver I'm using is the s3cmci driver from Ben Dooks together
with my patches to do SDIO and with a working SDIO interrupt. The
driver does not support DMA yet. I've limited the SDIO bus clock to
10 MHz just to avoid any errors when connecting the scope probes.
If I use a test driver which just copies these packets to userspace and
then discards them (the logic in the test driver is basically the same
as in the btsdio driver), this is what I see on a scope:
http://zoo.weinigel.se/s3cmci/timing-normal.png
D4 is the CMD line and D5 is the CLK line. D0 to D3 are the data lines.
The SDIO interrupt is signalled on D1 and then you can see the
different SDIO transactions on the bus:
1. SDIO layer reads the interrupt register in function 0
2. Driver reads the BT function nterrupt register
3. Driver clears the BT function interrupt register and
you can see the interrupt line going high again.
4. Driver reads the SDIO BT Type A header, you can see
the data transaction on the D0-D3 wires.
5. Driver reads the payload which is 2kBytes.
6. Driver acknowledges the packet.
Each transaction requires the SDIO irq thread to wake up, and at 250
packets times 6 operations per packet, that is 1500 wakeups per second,
not including the initial wakeup for the SDIO interrupt itself. The gaps
when the CLK line stops in the 2kByte data transfer is probably due to
scheduling latencies before the pio_tasklet in the s3cmci driver is run.
As you can see, the latency from the interrupt being signalled to the
interrupt being cleared is about 340 us. The cpu load program reports:
130674 kcycles in 1.000 s, 154 kcycles/s
so about 45% of the CPU is spent just transferring data into kernel
space. Another problem is that the reference board I'm using has a SMC
ethernet chip which is extremely slow, it can spend up to a dozen
milliseconds just reading data from the chip. Since network traffic has
higher priority than the SDIO irq worker thread, if there is almost any
network traffic at all, the SDIO interrupt latency is awful, it often
reaches 10 or 20 ms.
To speed things up slightly, I removed the pio_tasklet in the s3cmci
driver and just called the function to read from the fifo directly from
the interrupt handler:
http://zoo.weinigel.se/s3cmci/timing-no-tasklet.png
156078 kcycles in 1.000 s, 156 kcycles/s
So the CPU load is slightly lower and it looks as if the clock never
stops during the 2kByte transfer. The latency is 320 us but this
difference is negligible, it is well within the variation I see when
looking at the the scope.
So, what I did next was to go through the whole SDIO layer, adding
asynchronous I/O operations all over the place. So an asynchronouse
call to sdio_readb looks like this:
...
sdio_readb_start(func, address, done_callback);
}
And then I of course need a callback:
void done_callback(struct sdio_func *func)
{
u8 intrd;
int ret;
value = sdio_readb_complete(func, &ret);
...
So instead of sleeping, it registers a callback function which will be
called directly from the host interrupt handler when the SDIO transfer
finishes. I also changed the SDIO interrupt handling so that the
interrupt handler is called directly from the mmc_signal_sdio_irq
function, instead of running from a thread. This made an enormous
difference on the latency, it's down to 132 us, and the CPU load is down
as well:
http://zoo.weinigel.se/s3cmci/timing-async.png
185000 kcycles in 1.000 s, 185 kcycles/s
Most of the CPU is probably spent doing PIO transfers to the SDIO
controller, if DMA starts working in the s3cmci driver, the CPU load
difference will be even larger.
Since more and more high speed chips are being connected to embedded
devices using SDIO, I belive that to get good SDIO performance, the SDIO
layer has to be changed not to use a worker thread. This is
unfortunately rather painful since it means that we have to add
asynchronous variants of all the I/O operations, and the sdio_irq thread
has to be totally redone, and the sdio IRQ enabling and disablin turned
out to be a bit tricky.
So, do you think that it is worth it to make such a major change to the
SDIO subsystem? I belive so, but I guess I'm a bit biased. I can clean
up the work I've done and make sure that everything is backwards
compatible so that existing SDIO function drivers keep working (my
current hack to the sdio_irq thread breaks existing drivers, and it is
too ugly to live anyway so I don't even want to show it to you yet), and
add a demo driver which shows how to use the asynchronous functions.
But if you don't like this idea at all, I'll probably just keep it as a
patch and not bother as much with backwards compatibility. This is a
lot more work for me though, and I'd much prefer to get it into the
official kernel.
Comments?
/Christer
ps. I've tested the same driver on a Lenovo laptop with the SDHCI driver
and it works fine, but it doesn't make as big a difference there. The
CPU load isn't big enough to matter even with the normal drivers, but
the latency gets slightly but not significantly better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists