Message-ID: <4B9F66AC.5080400@redhat.com>
Date:	Tue, 16 Mar 2010 13:08:28 +0200
From:	Avi Kivity <avi@...hat.com>
To:	Christoph Hellwig <hch@....de>
CC:	Chris Webb <chris@...chsys.com>, balbir@...ux.vnet.ibm.com,
	KVM development list <kvm@...r.kernel.org>,
	Rik van Riel <riel@...riel.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Kevin Wolf <kwolf@...hat.com>
Subject: Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/16/2010 12:44 PM, Christoph Hellwig wrote:
> On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
>    
>> Are you talking about direct volume access or qcow2?
>>      
> Doesn't matter.
>
>    
>> For direct volume access, I still don't get it.  The number of barriers
>> issued by the host must equal (or exceed, but that's pointless) the
>> number of barriers issued by the guest.  cache=writeback allows the host
>> to reorder writes, but so does cache=none.  Where does the difference
>> come from?
>>
>> Put it another way.  In an unvirtualized environment, if you implement a
>> write cache in a storage driver (not device), and sync it on a barrier
>> request, would you expect to see a performance improvement?
>>      
> cache=none only allows very limited reordering in the host.  O_DIRECT
> is synchronous on the host, so there's just some very limited reordering
> going on in the elevator if we have other I/O going on in parallel.
>    

Presumably there is lots of I/O going on, or we wouldn't be having this 
conversation.

> In addition to that the disk writecache can perform limited reordering
> and caching, but the disk cache has a rather limited size.  The host
> pagecache gives a much wider opportunity to reorder, especially if
> the guest workload is not cache flush heavy.  If the guest workload
> is extremely cache flush heavy the usefulness of the pagecache is rather
> limited, as we'll only use very little of it, but pay by having to do
> a data copy.  If the workload is not cache flush heavy, and we have
> multiple guests doing I/O to the same spindles it will allow the host
> to do much more efficient data writeout by being able to do better
> ordered (less seeky) and bigger I/O (especially if the host has real
> storage compared to ide for the guest).
>    

Let's assume the guest has virtio (I agree that with IDE we need 
reordering on the host).  The guest sends batches of I/O separated by 
cache flushes.  If the batches are smaller than the virtio queue length, 
ideally things look like:

  io_submit(..., batch_size_1);
  io_getevents(..., batch_size_1);
  fdatasync();
  io_submit(..., batch_size_2);
  io_getevents(..., batch_size_2);
  fdatasync();
  io_submit(..., batch_size_3);
  io_getevents(..., batch_size_3);
  fdatasync();

(certainly that won't happen today, but it could in principle).

How does a write cache give any advantage?  The host kernel sees 
_exactly_ the same information as it would from a bunch of threaded 
pwritev()s followed by fdatasync().
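
To make the comparison concrete, here is a rough sketch of the 
"pwritev()s followed by fdatasync()" pattern.  It is illustrative only: 
the struct and function names are made up for this example, the writes 
are issued sequentially rather than from threads, and a short write is 
simply treated as an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical descriptor for one guest write (not from any real API). */
struct guest_write {
    const void *buf;
    size_t len;
    off_t offset;
};

/*
 * Issue every write of one batch, then flush once.
 * Returns 0 on success, -1 on any error or short write.
 */
int write_batch_and_flush(int fd, const struct guest_write *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        struct iovec iov = {
            .iov_base = (void *)reqs[i].buf,
            .iov_len = reqs[i].len,
        };
        ssize_t done = pwritev(fd, &iov, 1, reqs[i].offset);
        if (done < 0 || (size_t)done != reqs[i].len)
            return -1;
    }
    /* The flush is the only ordering point between batches. */
    return fdatasync(fd);
}

/* Usage sketch: two batches separated by flushes, on a scratch file. */
int demo_two_batches(void)
{
    char path[] = "/tmp/batch-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file goes away once fd is closed */

    const struct guest_write batch1[] = { { "abcd", 4, 4 }, { "wxyz", 4, 0 } };
    const struct guest_write batch2[] = { { "1234", 4, 8 } };
    char readback[12];

    int ok = write_batch_and_flush(fd, batch1, 2) == 0
          && write_batch_and_flush(fd, batch2, 1) == 0
          && pread(fd, readback, sizeof(readback), 0) == (ssize_t)sizeof(readback)
          && memcmp(readback, "wxyzabcd1234", sizeof(readback)) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```

Within a batch the kernel is free to reorder the writes however it 
likes; only the fdatasync() call constrains ordering, which is the same 
constraint the io_submit()/io_getevents() sequence expresses.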

(wish: IO_CMD_ORDERED_FDATASYNC)

If the batch size is larger than the virtio queue size, or if there are 
no flushes at all, then yes the huge write cache gives more opportunity 
for reordering.  But we're already talking hundreds of requests here.

Let's say the virtio queue size was unlimited.  What merging/reordering 
opportunity are we missing on the host?  Again we have exactly the same 
information: either the pagecache lru + radix tree that identifies all 
dirty pages in disk order, or the block queue with pending requests that 
contains exactly the same information.
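
As a toy illustration of that point (not actual kernel code, and with 
hypothetical names throughout): given only the set of pending writes, 
sorting them by disk offset and coalescing adjacent ones yields the same 
merged extents regardless of whether the set came from a list of dirty 
pages or from a queue of block requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pending write: byte offset and length on disk. */
struct pending_write {
    unsigned long long offset;
    unsigned long long len;
};

/* qsort comparator: ascending disk offset. */
static int by_offset(const void *a, const void *b)
{
    const struct pending_write *x = a;
    const struct pending_write *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort reqs in place by offset and coalesce requests that touch or
 * overlap.  The merged extents end up at the front of reqs; the return
 * value is how many there are.
 */
size_t sort_and_merge(struct pending_write *reqs, size_t n)
{
    if (n == 0)
        return 0;
    assert(reqs != NULL);

    qsort(reqs, n, sizeof(*reqs), by_offset);

    size_t out = 0; /* index of the extent currently being grown */
    for (size_t i = 1; i < n; i++) {
        unsigned long long end = reqs[out].offset + reqs[out].len;
        if (reqs[i].offset <= end) {
            /* Adjacent or overlapping: extend the current extent. */
            unsigned long long new_end = reqs[i].offset + reqs[i].len;
            if (new_end > end)
                reqs[out].len = new_end - reqs[out].offset;
        } else {
            /* Gap on disk: start a new extent. */
            reqs[++out] = reqs[i];
        }
    }
    return out + 1;
}
```

For example, four 4 KiB writes at offsets 8192, 0, 4096 and 65536 
collapse into two extents: one 12 KiB write at offset 0 and one 4 KiB 
write at offset 65536.  The result depends only on the set of requests, 
not on which data structure held them.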

Something is wrong.  Maybe it's my understanding, but on the other hand 
it may be a piece of kernel code.

-- 
error compiling committee.c: too many arguments to function
