Message-id: <461A41CA.9080201@qumranet.com>
Date: Mon, 09 Apr 2007 16:38:18 +0300
From: Avi Kivity <avi@...ranet.com>
To: Rusty Russell <rusty@...tcorp.com.au>
Cc: Ingo Molnar <mingo@...e.hu>, kvm-devel@...ts.sourceforge.net,
netdev <netdev@...r.kernel.org>
Subject: Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote:
> On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
>
>> Rusty Russell wrote:
>>
>>> I'm a little puzzled by your response. Hmm...
>>>
>>> lguest's userspace network frontend does exactly as many copies as
>>> Ingo's in-host-kernel code. One from the Guest, one to the Guest.
>>>
>> kvm pvnet is suboptimal now. The number of copies could be reduced by
>> two (to zero), by constructing an skb that points to guest memory.
>> Right now, this can only be done in-kernel.
>>
>
> Sorry, you lost me here. You mean both input and output copies can be
> eliminated? Or are you talking about another two copies somewhere?
>
On the transmit path, current kvm pvnet has two copies:
1. on the guest side, the driver copies the skb data into the shared ring
2. on the host side, the device copies the data from the ring into a
newly allocated skb
Both of these copies can be eliminated with a host-side kernel driver. With
current userspace interfaces, only one copy can be eliminated.
Similar logic applies to receive, except that one copy must remain.
> But I don't get this "we can enhance the kernel but not userspace" vibe
> 8(
>
I've been waiting for network aio since ~2003. If it arrives in the
next few days, I'm all for it; much more than kvm can use it
profitably. But I'm not going to write that interface myself.
Moreover, some things just don't lend themselves to a userspace
abstraction. If we want to expose tso (tcp segmentation offload), we
can easily do so with a kernel driver since the kernel interfaces are
all tso aware. Tacking on tso awareness to tun/tap is doable, but at
the very least weird.
>
>> With current userspace networking interfaces, one cannot build a network
>> device that has less than one copy on transmit, because sendmsg() *must*
>> copy the data (as there is no completion notification).
>>
>
> Why are you talking about sendmsg()? Perhaps this is where we're
> getting tangled up.
>
> We're dealing with the tun/tap device here, not a socket.
>
>
Hmm. tun actually has aio_write implemented, but it seems synchronous.
So does the read path.
If these are made truly asynchronous, and the write path is also made
copyless, then we might have something workable. I still
cringe at having a pagetable walk in order to deliver a 1500-byte packet.
>> sendfilev(),
>> even if it existed, cannot be used: it is copyless, but lacks completion
>> notification. It is useful only on unchanging data like read-only files.
>>
>
> Again, sendfile is a *much* harder problem than sending a single packet
> once, which is the question here.
>
sendfile() is a *different* problem. It doesn't need completion because
the data is assumed not to change under it.
Consider that the guest may be issuing a megabyte-sized sendfile() which
is broken into 17 tso frames. We need to preserve the large structures
as much as possible or we end up repeating the simple "single packet
once" path 700 times.
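
The frame counts above can be checked with a rough back-of-the-envelope
sketch. The sizes used here are assumptions, not figures taken from kvm
pvnet: a tso frame is taken to carry at most 65495 bytes of payload (the
65535-byte IPv4 length limit minus 40 bytes of IPv4 and TCP headers
without options), and a plain packet is taken to carry roughly 1500
bytes. The helper name frames_needed is hypothetical.

```python
# Rough estimate only: how many frames does a 1 MiB send break into?
# All sizes below are illustrative assumptions, not kvm pvnet constants.

MEGABYTE = 1 << 20               # 1 MiB sendfile() issued by the guest
TSO_PAYLOAD = 65535 - 40         # assumed max payload of one tso frame
PACKET_PAYLOAD = 1500            # assumed payload of one ordinary packet


def frames_needed(total_bytes: int, payload_per_frame: int) -> int:
    """Return the number of frames needed, rounding up (ceiling division)."""
    if payload_per_frame <= 0:
        raise ValueError("payload_per_frame must be positive")
    return (total_bytes + payload_per_frame - 1) // payload_per_frame


if __name__ == "__main__":
    print(frames_needed(MEGABYTE, TSO_PAYLOAD))     # 17
    print(frames_needed(MEGABYTE, PACKET_PAYLOAD))  # 700
```

With a more realistic 1460-byte MSS the per-packet count comes out a
little higher, but the ratio is the point: about forty trips through the
per-packet path for every tso frame.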
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.