Message-ID: <49E45E9C.1020105@redhat.com>
Date: Tue, 14 Apr 2009 12:59:56 +0300
From: Avi Kivity <avi@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
CC: Alan Cox <alan@...rguk.ukuu.org.uk>,
Szabolcs Szakacsits <szaka@...s-3g.com>,
Grant Grundler <grundler@...gle.com>,
Linux IDE mailing list <linux-ide@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Jens Axboe <jens.axboe@...cle.com>,
Arjan van de Ven <arjan@...radead.org>
Subject: Re: Implementing NVMHCI...

Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>
>>> - create a big file,
>>>
>> Just creating a 5GB file in a 64KB filesystem was interesting - Windows
>> was throwing out 256KB I/Os even though I was generating 1MB writes (and
>> cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
>>
>
> Heh, ok. So the "big file" really only needed to be big enough to not be
> cached, and 5GB was probably overkill. In fact, if there's some way to
> blow the cache, you could have made it much smaller. But 5G certainly
> works ;)
>
I wanted to make sure my later random writes wouldn't get coalesced. A 1GB
file, half of which is cached (I used a 1GB guest), offers lots of
chances for coalescing if Windows delays the writes sufficiently. At
5GB, Windows can only cache about 10% of the file, so it will be
continuously flushing.
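
A rough sketch of that arithmetic, assuming (as stated above) that about
half of the 1GB guest's memory ends up holding file data. The function
name and the 0.5GB figure are illustrative, not measurements:

```python
def cached_fraction(file_gb: float, cache_gb: float) -> float:
    """Upper bound on the fraction of a file that can sit in cache at once."""
    return min(1.0, cache_gb / file_gb)

# Assumption: roughly half of the 1GB guest's RAM is usable as file cache.
CACHE_GB = 0.5

for file_gb in (1, 5):
    frac = cached_fraction(file_gb, CACHE_GB)
    print(f"{file_gb}GB file: at most {frac:.0%} cached")
# prints:
# 1GB file: at most 50% cached
# 5GB file: at most 10% cached
```

The smaller the cached fraction, the less room the OS has to hold dirty
pages back and merge neighbouring writes before flushing.
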
>
> (a) Windows caches things with a 4kB granularity, so the 512-byte write
> turned into a read-modify-write
>
>
[...]
> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
> your example!). It's a total disaster. Imagine what would happen to user
> application performance if kmalloc() always returned 16kB-aligned chunks
> of memory, all sized as integer multiples of 16kB? It would absolutely
> _suck_. Sure, it would be fine for your large allocations, but any time
> you handle strings, you'd allocate 16kB of memory for any small 5-byte
> string. You'd have horrible cache behavior, and you'd run out of memory
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under
> almost all normal loads is the disk cache. That _is_ the normal allocator
> for any OS kernel. Everything else is almost details (ok, so Linux in
> particular does cache metadata very aggressively, so the dcache and inode
> cache are seldom "just details", but the page cache is still generally the
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
> system does that. It's only useful if you absolutely _only_ work with
> large files - ie you're a database server. For just about any other
> workload, that kind of granularity is totally unacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
> block size is 4kB is easy - we just have to do it anyway.
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is
> also _doable_, and from the IO pattern standpoint it is no different. But
> from a memory allocation pattern standpoint it's a disaster - because now
> you're always working with chunks that are just 'too big' to be good
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of
> small files (like a source tree), you will literally waste all your
> memory.
>
>
Well, no one is talking about 64KB granularity for in-core files. As
you noticed, Windows uses the MMU page size. We could keep doing that,
and still have 16KB+ sector sizes. It just means a read-modify-write
(RMW) if you don't happen to have the adjoining clean pages in cache.

Sure, on a rotating disk that's a disaster, but we're talking about SSDs
here, so while you're doubling your access time, you're doubling a fairly
small quantity. The controller would do the same RMW internally if it
exposed smaller sectors, so there's no huge loss.
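
A minimal sketch of the scheme described above: caching stays at 4KB page
granularity while the device only accepts whole 16KB sectors. Everything
here (the Device class, write_page, the dict used as a page cache) is
hypothetical and only models the I/O pattern. It also assumes the cached
neighbour pages are clean, i.e. identical to what is on the device:

```python
PAGE = 4 * 1024      # MMU page size, used as the cache granularity
SECTOR = 16 * 1024   # hypothetical large device sector

class Device:
    """Toy block device that only transfers whole sectors."""
    def __init__(self, nsectors):
        self.data = bytearray(nsectors * SECTOR)
        self.reads = 0
        self.writes = 0

    def read_sector(self, n):
        self.reads += 1
        return bytes(self.data[n * SECTOR:(n + 1) * SECTOR])

    def write_sector(self, n, buf):
        assert len(buf) == SECTOR
        self.writes += 1
        self.data[n * SECTOR:(n + 1) * SECTOR] = buf

def write_page(dev, cache, page_no, payload):
    """Write one 4KB page; cache maps page number -> bytes (assumed clean)."""
    assert len(payload) == PAGE
    cache[page_no] = payload
    per_sector = SECTOR // PAGE
    sector = page_no // per_sector
    first = sector * per_sector
    pages = range(first, first + per_sector)
    if all(p in cache for p in pages):
        # All adjoining pages are cached: build the sector with no read.
        buf = b"".join(cache[p] for p in pages)
    else:
        # Otherwise read the sector, patch in the page, and write it back.
        old = bytearray(dev.read_sector(sector))
        off = (page_no - first) * PAGE
        old[off:off + PAGE] = payload
        buf = bytes(old)
    dev.write_sector(sector, buf)
```

The extra read in the second branch is the "doubling" of access time: one
read plus one write instead of a single write.
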

We still lose on disk storage efficiency, but I'm guessing that for a
modern tree with some object files with debug information and a .git
directory, it won't be such a great hit. For more mainstream uses, it
would be negligible.
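
One way to put a number on the storage-efficiency loss is to round each
file up to whole blocks and compare against the bytes actually used. This
is only a sketch: the helper names are made up, and the file sizes below
are invented placeholders, not data from any real tree:

```python
def allocated(size, block):
    """Bytes occupied on disk when a file is rounded up to whole blocks."""
    return -(-size // block) * block  # ceiling division

def overhead(sizes, block):
    """Wasted space as a fraction of the bytes the files actually contain."""
    used = sum(sizes)
    return (sum(allocated(s, block) for s in sizes) - used) / used

# Invented mix: many small sources plus a few large object/pack files.
sizes = [3_000] * 1000 + [40_000] * 200 + [5_000_000] * 20

for block in (4 * 1024, 16 * 1024, 64 * 1024):
    print(f"{block // 1024}KB blocks: {overhead(sizes, block):.1%} overhead")
```

The large files dominate the total, so the more of them a tree contains,
the smaller the relative hit from the rounding on the small ones.
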
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.