Message-ID: <49E45E9C.1020105@redhat.com>
Date:	Tue, 14 Apr 2009 12:59:56 +0300
From:	Avi Kivity <avi@...hat.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
CC:	Alan Cox <alan@...rguk.ukuu.org.uk>,
	Szabolcs Szakacsits <szaka@...s-3g.com>,
	Grant Grundler <grundler@...gle.com>,
	Linux IDE mailing list <linux-ide@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <jens.axboe@...cle.com>,
	Arjan van de Ven <arjan@...radead.org>
Subject: Re: Implementing NVMHCI...

Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>   
>>>  - create a big file,
>>>       
>> Just creating a 5GB file in a 64KB-block filesystem was interesting - Windows 
>> was throwing out 256KB I/Os even though I was generating 1MB writes (and 
>> cached too).  Looks like a paranoid IDE driver (qemu exposes a PIIX4).
>>     
>
> Heh, ok. So the "big file" really only needed to be big enough to not be 
> cached, and 5GB was probably overkill. In fact, if there's some way to 
> blow the cache, you could have made it much smaller. But 5G certainly 
> works ;)
>   

I wanted to make sure my random writes later don't get coalesced.  A 1GB 
file, half of which is cached (I used a 1GB guest), offers lots of 
chances for coalescing if Windows delays the writes sufficiently.  At 
5GB, Windows can only cache 10% of the file, so it will be continuously 
flushing.
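
To make the arithmetic explicit, here is a rough sketch of the cache-coverage estimate. The 50% cache share is an assumption taken from the "half of which is cached" figure above, and cached_fraction is a made-up helper name, not anything Windows or qemu provides.

```python
# Rough estimate of how much of a test file the guest can keep cached.
# Assumption: about half of guest RAM ends up holding file data.
GIB = 1024 ** 3


def cached_fraction(file_bytes, guest_ram_bytes, cache_share=0.5):
    """Return the fraction of the file that fits in the guest's cache."""
    cache_bytes = guest_ram_bytes * cache_share
    return min(1.0, cache_bytes / file_bytes)


if __name__ == "__main__":
    for size_gib in (1, 5):
        frac = cached_fraction(size_gib * GIB, 1 * GIB)
        print(f"{size_gib} GiB file on a 1 GiB guest: about {frac:.0%} cacheable")
```

The smaller the cacheable fraction, the less room the guest has to hold back random writes and merge them.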


>
>  (a) Windows caches things with a 4kB granularity, so the 512-byte write 
>      turned into a read-modify-write
>   
>   
[...]

> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
> your example!). It's a total disaster. Imagine what would happen to user 
> application performance if kmalloc() always returned 16kB-aligned chunks 
> of memory, all sized as integer multiples of 16kB? It would absolutely 
> _suck_. Sure, it would be fine for your large allocations, but any time 
> you handle strings, you'd allocate 16kB of memory for any small 5-byte 
> string. You'd have horrible cache behavior, and you'd run out of memory 
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under 
> almost all normal loads is the disk cache. That _is_ the normal allocator 
> for any OS kernel. Everything else is almost details (ok, so Linux in 
> particular does cache metadata very aggressively, so the dcache and inode 
> cache are seldom "just details", but the page cache is still generally the 
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane 
> system does that. It's only useful if you absolutely _only_ work with 
> large files - ie you're a database server. For just about any other 
> workload, that kind of granularity is totally unacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the 
> block size is 4kB is easy - we just have to do it anyway. 
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is 
> also _doable_, and from the IO pattern standpoint it is no different. But 
> from a memory allocation pattern standpoint it's a disaster - because now 
> you're always working with chunks that are just 'too big' to be good 
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of 
> small files (like a source tree), you will literally waste all your 
> memory.
>
>   

Well, no one is talking about 64KB granularity for in-core files.  Like 
you noticed, Windows uses the mmu page size.  We could keep doing that, 
and still have 16KB+ sector sizes.  It just means a RMW if you don't 
happen to have the adjoining clean pages in cache.
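
As an illustration only, the RMW decision could look roughly like this. The 4KB page size, the 16KB sector size, and the pages_to_read helper are all assumptions for the sketch, not existing kernel interfaces.

```python
# Sketch: caching stays at mmu page granularity, the device wants whole
# large sectors. Before writing back one dirty page, find which of the
# neighbouring pages in the same sector are missing from the cache and
# therefore have to be read first (the "read" part of read-modify-write).
PAGE = 4 * 1024      # assumed mmu page size
SECTOR = 16 * 1024   # hypothetical large device sector


def pages_to_read(dirty_page, cached_pages, page=PAGE, sector=SECTOR):
    """Return page indexes that must be read before the sector holding
    dirty_page can be written out whole."""
    per_sector = sector // page
    first = (dirty_page // per_sector) * per_sector
    neighbours = range(first, first + per_sector)
    return [p for p in neighbours
            if p != dirty_page and p not in cached_pages]


if __name__ == "__main__":
    # Page 5 lives in the sector covering pages 4..7.
    print(pages_to_read(5, {4, 6, 7}))  # all neighbours cached: plain write
    print(pages_to_read(5, {4}))        # pages 6 and 7 must be read first
```

An empty result means the sector can be written directly; anything else costs one extra read of that sector.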

Sure, on a rotating disk that's a disaster, but we're talking SSD here, 
so while you're doubling your access time, you're doubling a fairly 
small quantity.  The controller would do the same if it exposed smaller 
sectors, so there's no huge loss.

We still lose on disk storage efficiency, but I'm guessing that for a 
modern tree with some object files with debug information and a .git 
directory, it won't be such a great hit.  For more mainstream uses, it 
would be negligible.
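
A quick way to check that guess against a real tree would be something like the sketch below. It assumes every file is simply rounded up to a whole number of blocks, ignoring metadata, tail packing and sparse files, and the sample sizes are invented for illustration.

```python
# Sketch: on-disk slack when every file is rounded up to whole blocks.
def allocated(size, block):
    """Bytes occupied by a file of `size` bytes with `block`-byte blocks."""
    return -(-size // block) * block  # ceiling division, then scale


def overhead(sizes, block):
    """Wasted space as a fraction of the total payload."""
    total = sum(sizes)
    return (sum(allocated(s, block) for s in sizes) - total) / total


if __name__ == "__main__":
    # Made-up sizes: a few small sources, some objects, one pack file.
    sizes = [200, 3_000, 20_000, 500_000, 40_000_000]
    for block in (4 * 1024, 16 * 1024, 64 * 1024):
        print(f"{block // 1024:>2} KB blocks: {overhead(sizes, block):.2%} slack")
```

Feeding it the file sizes of an actual checkout (for example collected with os.walk and os.path.getsize) would show how much the large files dominate the total.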

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
