Message-ID: <alpine.LFD.2.00.0904130747440.4583@localhost.localdomain>
Date: Mon, 13 Apr 2009 08:10:40 -0700 (PDT)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Avi Kivity <avi@...hat.com>
cc: Alan Cox <alan@...rguk.ukuu.org.uk>,
Szabolcs Szakacsits <szaka@...s-3g.com>,
Grant Grundler <grundler@...gle.com>,
Linux IDE mailing list <linux-ide@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Jens Axboe <jens.axboe@...cle.com>,
Arjan van de Ven <arjan@...radead.org>
Subject: Re: Implementing NVMHCI...
On Mon, 13 Apr 2009, Avi Kivity wrote:
> >
> > - create a big file,
>
> Just creating a 5GB file in a 64KB filesystem was interesting - Windows
> was throwing out 256KB I/Os even though I was generating 1MB writes (and
> cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
Heh, ok. So the "big file" really only needed to be big enough to not be
cached, and 5GB was probably overkill. In fact, if there's some way to
blow the cache, you could have made it much smaller. But 5G certainly
works ;)
And yeah, I'm not surprised it limits the size of the IO. Linux will
generally do the same. I forget what our default maximum bio size is, but
I suspect it is in that same kind of range.
There are often problems with bigger IOs (latency being one, actual
controller bugs being another), and even if the hardware has no bugs and
its limits are higher, you usually don't want to have excessively large
DMA mapping tables _and_ the advantage of bigger IO is usually not that
big once you pass the "reasonably sized" limit (which is 64kB+). Plus they
happen seldom enough in practice anyway that it's often not worth
optimizing for.
> > then rewrite just a few bytes in it, and look at the IO pattern of the
> > result. Does it actually do the rewrite IO as one 16kB IO, or does it
> > do sub-blocking?
>
> It generates 4KB writes (I was generating aligned 512 byte overwrites).
> What's more interesting, it was also issuing 32KB reads to fill the
> cache, not 64KB. Since the number of reads and writes per second is
> almost equal, it's not splitting a 64KB read into two.
Ok, that sounds pretty much _exactly_ like the Linux IO patterns would
likely be.
The 32kB read likely has nothing to do with any filesystem layout issues
(especially as you used a 64kB cluster size), but is simply because
(a) Windows caches things with a 4kB granularity, so the 512-byte write
turned into a read-modify-write
(b) the read was really for just 4kB, but once you start reading you want
to do read-ahead anyway since it hardly gets any more expensive to
read a few pages than to read just one.
So once it had to do the read anyway, Windows just read 8 pages instead of
one - very reasonable.
> > If the latter, then the 16kB thing is just a filesystem layout
> > issue, not an internal block-size issue, and WNT would likely have
> > exactly the same issues as Linux.
>
> A 1 byte write on an ordinary file generates a RMW, same as a 4KB write on a
> 16KB block. So long as the filesystem is just a layer behind the pagecache
> (which I think is the case on Windows), I don't see what issues it can have.
Right. It's all very straightforward from a filesystem layout issue. The
problem is all about managing memory.
You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
your example!). It's a total disaster. Imagine what would happen to user
application performance if kmalloc() always returned 16kB-aligned chunks
of memory, all sized as integer multiples of 16kB? It would absolutely
_suck_. Sure, it would be fine for your large allocations, but any time
you handle strings, you'd allocate 16kB of memory for any small 5-byte
string. You'd have horrible cache behavior, and you'd run out of memory
much too quickly.
The same is true in the kernel. The single biggest memory user under
almost all normal loads is the disk cache. That _is_ the normal allocator
for any OS kernel. Everything else is almost details (ok, so Linux in
particular does cache metadata very aggressively, so the dcache and inode
cache are seldom "just details", but the page cache is still generally the
most important part).
So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
system does that. It's only useful if you absolutely _only_ work with
large files - ie you're a database server. For just about any other
workload, that kind of granularity is totally unacceptable.
So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
block size is 4kB is easy - we just have to do it anyway.
Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is
also _doable_, and from the IO pattern standpoint it is no different. But
from a memory allocation pattern standpoint it's a disaster - because now
you're always working with chunks that are just 'too big' to be good
building blocks of a reasonable allocator.
If you always allocate 64kB for file caches, and you work with lots of
small files (like a source tree), you will literally waste all your
memory.
And if you have some "dynamic" scheme, you'll have tons and tons of really
nasty cases when you have to grow a 4kB allocation to a 64kB one when the
file grows. Imagine doing "realloc()", but doing it in a _threaded_
environment, where any number of threads may be using the old allocation
at the same time. And that's a kernel - it has to be _the_ most
threaded program on the whole machine, because otherwise the kernel
would be the scaling bottleneck.
And THAT is why 64kB blocks is such a disaster.
> > - can you tell how many small files it will cache in RAM without doing
> > IO? If it always uses 16kB blocks for caching, it will be able to cache a
> > _lot_ fewer files in the same amount of RAM than with a smaller block
> > size.
>
> I'll do this later, but given the 32KB reads for the test above, I'm guessing
> it will cache pages, not blocks.
Yeah, you don't need to.
I can already guarantee that Windows does caching on a page granularity.
I can also pretty much guarantee that that is why Windows stops
compressing files once the blocksize is bigger than 4kB: because at that
point, the block compressions would need to handle _multiple_ cache
entities, and that's really painful for all the same reasons that bigger
sectors would be really painful - you'd always need to make sure that you
always have all of those cache entries in memory together, and you could
never treat your cache entries as individual entities.
> > Of course, the _really_ conclusive thing (in a virtualized environment) is
> > to just make the virtual disk only able to do 16kB IO accesses (and with
> > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size,
> > and reporting a 16kB sector size to the READ CAPACITY command. If it works
> > then, then clearly WNT has no issues with bigger sectors.
>
> I don't think IDE supports this? And Windows 2008 doesn't like the LSI
> emulated device we expose.
Yeah, you'd have to have the OS use the SCSI commands for disk discovery,
so at least a SATA interface. With IDE disks, the sector size always has
to be 512 bytes, I think.
Linus