lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Mon, 9 Jun 2008 19:48:59 +0200 From: Haavard Skinnemoen <haavard.skinnemoen@...el.com> To: David Brownell <david-b@...bell.net> Cc: lkml <linux-kernel@...r.kernel.org>, linux-mtd@...ts.infradead.org, Nicolas Ferre <nicolas.ferre@....atmel.com> Subject: Re: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}() David Brownell <david-b@...bell.net> wrote: > On Monday 09 June 2008, Haavard Skinnemoen wrote: > > David Brownell <david-b@...bell.net> wrote: > > > This uses __raw_{read,write}s{b,w}() primitives to access data on NAND > > > chips for more efficient I/O. > > > > > > On an arm926 with memory clocked at 100 MHz, this reduced the elapsed > > > time for a 64 MByte read by 16%. ("dd" /dev/mtd0 to /dev/null, with > > > an 8-bit NAND using hardware ECC and 128KB blocksize.) > > > > Nice. Here are some numbers from my setup (256 MB, 8-bit, software ECC). > > > > Before: > > real 2m38.131s > > user 0m0.228s > > sys 2m37.740s > > > > After: > > real 2m27.404s > > user 0m0.180s > > sys 2m27.068s > > > > which is a 6.8% speedup. I guess hardware ECC helps... > > The AVR32 versions of readsb/writesb didn't look to me as if they'd > be quite as fast as the ARM ones either. If AVR32 has some analogue > of "stmia r1!, {r3 - r6}" for burst 16 byte stores, it's not using > it right now. (What was the bug you found in its readsb?) Note that I'm talking about the __raw_ versions of those, which are a bit more optimized than the non-raw versions. They do 1: ldins.b r8:t, r12[0] ldins.b r8:u, r12[0] ldins.b r8:l, r12[0] ldins.b r8:b, r12[0] st.w r11++, r8 sub r10, 4 brge 1b I don't think we have an instruction that can store multiple registers to the same address...it would of course be acceptable to store to incrementing addresses when dealing with NAND flash, but I don't think it's a good idea in a general __raw_readsb implementation. Here's the bug I found, btw: --- a/arch/avr32/lib/io-readsb.S +++ b/arch/avr32/lib/io-readsb.S @@ -41,7 +41,7 @@ __raw_readsb: 2: sub r10, -4 reteq r12 -3: ld.uh r8, r12[0] +3: ld.ub r8, r12[0] sub r10, 1 st.b r11++, r8 brne 3b Not sure how easy it is to trigger since that code is only executed for odd sizes. > Yes, I'd think the win would be most visible with hardware ECC, since > without it you've still got a second manual scan of each block. (And > I see you observed this too, after applying a workaround for an ECC > erratum you just learned about...) My numbers for one pair of trials > (the "16%" was an average of 6 runs) had a *lot* less system time. > Which oddly enough went *up* after the switch to readsb/writesb: > > Before: > real 0m24.199s > user 0m0.000s > sys 0m5.630s > > After: > real 0m20.226s > user 0m0.010s > sys 0m6.000s Hmm, that's odd. What's the CPU doing during the remaining 14 seconds? It can't possibly be sleeping? Ah, it's I/O wait, isn't it? Because you're going through the block layer? > However, the fact that you got a win even with soft ECC (and, I'm > guessing, slower RAM and slower readsb) suggests that this speedup > should be pretty generally applicable! Yes, I would think so...although I've seen gcc generate somewhat crappy code for the I/O accessors, and we do some address mangling in the non-raw I/O accessors on avr32 which might explain some of the difference. > > though I can't > > seem to get it to work properly. Is there anything I need to do besides > > flash_eraseall when changing the ECC layout? > > I wouldn't know. Just be sure not to lose all your badblocks data > when you convert ... Seems like flash_eraseall skips the bad blocks as it should. > > Also, I wonder if we can use the DMA engine framework to get rid of all > > that "sys" time...? > > It's another one of those cases where the framework overhead has to be > low enough to make that practical. Last time I looked, the overhead to > set up and wait for a DMA of a couple KBytes was a significant chunk of > the cost to readsb()/writesb() the same data ... and that's even before > the data starts transferring. Right. I guess we should take a look at how to reduce that overhead at some point... > Plus, the MTD layer currently assumes DMA is never used. Some of the > buffers it passes are not suitable for dma_map_single() since they > come from vmalloc. Aw...the MTD layer uses vmalloc() all over the place :-( > Sounds fair to me. Thanks; this has been sitting in my tree for many > months now, I finally made time to measure it and was pleasantly > surprised by the size of the win! Yeah...I'm still not sure where to send it though, since it touches three different subsystems. I can set up a separate tree for it like I've done a couple of times before...though I'm not sure if anyone ever pulls it. Haavard -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@...r.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists