netdev - Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work


Message-id: <461A41CA.9080201@qumranet.com>
Date:	Mon, 09 Apr 2007 16:38:18 +0300
From:	Avi Kivity <avi@...ranet.com>
To:	Rusty Russell <rusty@...tcorp.com.au>
Cc:	Ingo Molnar <mingo@...e.hu>, kvm-devel@...ts.sourceforge.net,
	netdev <netdev@...r.kernel.org>
Subject: Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

Rusty Russell wrote:
> On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> 	I'm a little puzzled by your response.  Hmm...
>>>
>>> 	lguest's userspace network frontend does exactly as many copies as
>>> Ingo's in-host-kernel code.  One from the Guest, one to the Guest.
>>>       
>> kvm pvnet is suboptimal now.  The number of copies could be reduced by 
>> two (to zero), by constructing an skb that points to guest memory.  
>> Right now, this can only be done in-kernel.
>>     
>
> Sorry, you lost me here.  You mean both input and output copies can be
> eliminated?  Or are you talking about another two copies somewhere?
>   

On the transmit path, current kvm pvnet has two copies:

1. on the guest side, the driver copies the skb data into the shared ring
2. on the host side, the device copies the data from the ring into a
   newly allocated skb

Both of these copies can be eliminated with a host-side kernel.  With 
current userspace interfaces, only one copy can be eliminated.

Similar logic applies to receive, except that one copy must remain.
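
As a rough illustration of the two transmit copies listed above, here is a
toy Python model.  It is only a sketch: nothing in it is kvm or pvnet code,
and every name (guest_memory, ring, transmit_two_copies, transmit_zero_copy)
is hypothetical.  A bytearray stands in for guest memory and a memoryview
stands in for an skb that points at guest pages.

```python
# Toy model of the transmit path. Not kvm code; all names are hypothetical.

def transmit_two_copies(guest_memory, offset, length, ring):
    """Model of the current path: one copy on each side of the shared ring."""
    copies = 0
    # Copy 1: the guest driver copies the skb data into the shared ring.
    ring[:length] = guest_memory[offset:offset + length]
    copies += 1
    # Copy 2: the host device copies from the ring into a newly allocated skb.
    host_skb = bytes(ring[:length])
    copies += 1
    return host_skb, copies


def transmit_zero_copy(guest_memory, offset, length):
    """Model of the in-kernel idea: the 'skb' only references guest memory."""
    host_skb = memoryview(guest_memory)[offset:offset + length]
    return host_skb, 0


guest_memory = bytearray(b"....payload-1500....")
ring = bytearray(64)

skb_a, copies_a = transmit_two_copies(guest_memory, 4, 12, ring)
skb_b, copies_b = transmit_zero_copy(guest_memory, 4, 12)
```

Both variants hand the host the same bytes; the difference is only in how
many times the payload is duplicated on the way.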

> But I don't get this "we can enhance the kernel but not userspace" vibe
> 8(
>   

I've been waiting for network aio since ~2003.  If it arrives in the 
next few days, I'm all for it; much more than just kvm could use it 
profitably.  But I'm not going to write that interface myself.

Moreover, some things just don't lend themselves to a userspace 
abstraction.  If we want to expose TSO (TCP segmentation offload), we 
can easily do so with a kernel driver since the kernel interfaces are 
all TSO aware.  Tacking on TSO awareness to tun/tap is doable, but at 
the very least weird.

>   
>> With current userspace networking interfaces, one cannot build a network 
>> device that has less than one copy on transmit, because sendmsg() *must* 
>> copy the data (as there is no completion notification).
>>     
>
> Why are you talking about sendmsg()?  Perhaps this is where we're
> getting tangled up.
>
> We're dealing with the tun/tap device here, not a socket.
>
>   

Hmm.  tun actually has aio_write implemented, but it seems synchronous.  
The same goes for the read path.

If these are made truly asynchronous, and the write path is also made 
copyless, then we might have something workable.  I still cringe at 
having a pagetable walk in order to deliver a 1500-byte packet.
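
To illustrate why a copyless write only works together with completion
notification, here is a toy Python sketch.  It is not a tun/tap or aio
interface; the CopylessQueue class and its methods are hypothetical, and the
"wire" is just a list.  The point is that without a copy, the sender must
leave the buffer alone until it is told the transmit has finished.

```python
# Toy model; names are hypothetical, not a real tun/tap or aio API.
from collections import deque


class CopylessQueue:
    """Queues references to caller buffers instead of copying them."""

    def __init__(self):
        self.pending = deque()
        self.wire = []  # what actually went out

    def submit(self, buf, on_complete):
        # No copy here: keep only a view of the caller's buffer.
        self.pending.append((memoryview(buf), on_complete))

    def transmit_one(self):
        view, on_complete = self.pending.popleft()
        self.wire.append(bytes(view))  # the "NIC" reads the data only now
        view.release()
        on_complete()  # from here on the caller may reuse the buffer


queue = CopylessQueue()
completed = []

# Unsafe: the buffer is modified before the transmit has completed.
buf = bytearray(b"packet-1")
queue.submit(buf, lambda: completed.append("first"))
buf[0] = ord("X")
queue.transmit_one()

# Safe: the buffer is reused only after the completion callback has run.
buf2 = bytearray(b"packet-2")
queue.submit(buf2, lambda: completed.append("second"))
queue.transmit_one()
buf2[:] = b"reused.."
```

In the first case the modified data goes out on the wire; in the second the
original packet is sent intact.  A synchronous, copying write avoids the
problem only by paying for the copy.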

>>  sendfilev(), 
>> even if it existed, cannot be used: it is copyless, but lacks completion 
>> notification.  It is useful only on unchanging data like read-only files.
>>     
>
> Again, sendfile is a *much* harder problem than sending a single packet
> once, which is the question here.
>   

sendfile() is a *different* problem.  It doesn't need completion because 
the data is assumed not to change under it.

Consider that the guest may be issuing a megabyte-sized sendfile() which 
is broken into 17 TSO frames.  We need to preserve the large structures 
as much as possible or we end up repeating the simple "single packet 
once" path roughly 700 times.
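
The frame counts above can be checked with a little arithmetic.  The sketch
below rests on assumptions that are not stated in this message: a 1 MiB
transfer, a 1500-byte MTU with 40 bytes of IPv4 and TCP headers (so a
1460-byte MSS), and TSO frames limited to the 65535-byte IP datagram size
and carrying a whole number of segments.

```python
import math

# All constants below are assumptions made for this back-of-envelope check.
SENDFILE_BYTES = 1 << 20      # "megabyte-sized": 1 MiB
MTU = 1500                    # standard Ethernet MTU
HEADERS = 40                  # IPv4 (20) + TCP (20), no options
MAX_IP_DATAGRAM = 65535       # upper bound for one TSO frame

mss = MTU - HEADERS                                   # payload per wire packet
segs_per_frame = (MAX_IP_DATAGRAM - HEADERS) // mss   # whole segments per TSO frame
tso_payload = segs_per_frame * mss                    # payload per TSO frame

tso_frames = math.ceil(SENDFILE_BYTES / tso_payload)
wire_packets = math.ceil(SENDFILE_BYTES / mss)

print(f"TSO frames: {tso_frames}, MTU-sized packets: {wire_packets}")
```

Under these assumptions the result is 17 TSO frames against 719 MTU-sized
packets, in line with the "17" and "roughly 700" figures.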

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
