[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <200909211637.23299.rusty@rustcorp.com.au>
Date: Mon, 21 Sep 2009 16:37:22 +0930
From: Rusty Russell <rusty@...tcorp.com.au>
To: virtualization@...ts.linux-foundation.org
Cc: Stephen Hemminger <shemminger@...tta.com>,
"Xin, Xiaohui" <xiaohui.xin@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"mst@...hat.com" <mst@...hat.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"hpa@...or.com" <hpa@...or.com>, "mingo@...e.hu" <mingo@...e.hu>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
Subject: Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Wed, 2 Sep 2009 01:35:18 am Stephen Hemminger wrote:
> On Tue, 1 Sep 2009 14:58:19 +0800
> "Xin, Xiaohui" <xiaohui.xin@...el.com> wrote:
>
> > [RFC] Virtual Machine Device Queues (VMDq) support on KVM
> >
> > Network adapter with VMDq technology presents multiple pairs of tx/rx queues,
> > and renders network L2 sorting mechanism based on MAC addresses and VLAN tags
> > for each tx/rx queue pair. Here we present a generic framework, in which network
> > traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without
> > any software copy.
> >
> > Actually this framework can apply to traditional network adapters which have
> > just one tx/rx queue pair. And applications using the same user/kernel interface
> > can utilize this framework to send/receive network traffic directly thru a tx/rx
> > queue pair in a network adapter.
> >
> > We use virtio-net architecture to illustrate the framework.
> >
> >
> > |--------------------| pop add_buf |----------------|
> > | Qemu process | <--------- TX <---------- | Guest Kernel |
> > | | ---------> ----------> | |
> > | Virtio-net | push get_buf | |
> > | (Backend service) | ---------> RX ----------> | Virtio-net |
> > | | <--------- <---------- | driver |
> > | | push get_buf | |
> > |--------------------| |----------------|
> > |
> > |
> > | AIO (read & write) combined with Direct I/O
> > | (which substitute synced file operations)
> > |-----------------------------------------------------------------------|
> > | Host kernel | read: copy-less with directly mapped user |
> > | | space to kernel, payload directly DMAed |
> > | | into user space |
> > | | write: copy-less with directly mapped user |
> > | | space to kernel, payload directly hooked |
> > | | to a skb |
> > | | |
> > | (a likely | |
> > | queue pair | |
> > | instance) | |
> > | | | |
> > | NIC driver <--> TUN/TAP driver |
> > |-----------------------------------------------------------------------|
> > |
> > |
> > traditional adapter or a tx/rx queue pair
> >
> > The basic idea is to utilize the kernel Asynchronous I/O combined with Direct
> > I/O to implements copy-less TUN/TAP device. AIO and Direct I/O is not new to
> > kernel, we still can see it in SCSI tape driver.
> >
> > With traditional file operations, a copying of payload contents from/to the
> > kernel DMA address to/from a user buffer is needed. That's what the copying we
> > want to save.
> >
> > The proposed framework is like this:
> > A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair in
> > host side. KVM virto-net Backend service, the user space program submits
> > asynchronous read/write I/O requests to the host kernel through TUN/TAP device.
> > The requests are corresponding to the vqueue elements include both transmission
> > & receive. They can be queued in one AIO request and later, the completion will
> > be notified through the underlying packets tx/rx processing of the rx/tx queue
> > pair.
> >
> > Detailed path:
> >
> > To guest Virtio-net driver, packets receive corresponding to asynchronous read
> > I/O requests of Backend service.
> >
> > 1) Guest Virtio-net driver provides header and payload address through the
> > receive vqueue to Virtio-net backend service.
> >
> > 2) Virtio-net backend service encapsulates multiple vqueue elements into
> > multiple AIO control blocks and composes them into one AIO read request.
> >
> > 3) Virtio-net backend service uses io_submit() syscall to pass the request to
> > the TUN/TAP device.
> >
> > 4) Virtio-net backend service uses io_getevents() syscall to check the
> > completion of the request.
> >
> > 5) The TUN/TAP driver receives packets from the queue pair of NIC, and prepares
> > for Direct I/O.
> > A modified NIC driver may render a skb which header is allocated in host
> > kernel, but the payload buffer is directly mapped from user space buffer which
> > are rendered through the AIO request by the Backend service. get_user_pages()
> > may do this. For one AIO read request, the TUN/TAP driver maintains a list for
> > the directly mapped buffers, and a NIC driver tries to get the buffers as
> > payload buffer to compose the new skbs. Of course, if getting the buffers
> > fails, then kernel allocated buffers are used.
> >
> > 6) Modern NIC cards now mostly have the header split feature. The NIC queue
> > pair then may directly DMA the payload into the user spaces mapped payload
> > buffers.
> > Thus a zero-copy for payload is implemented in packet receiving.
> >
> > 7) The TUN/TAP driver manually copy the host header to space user mapped.
> >
> > 8) aio_complete() to notify the Virtio-net backend service for io_getevents().
> >
> >
> > To guest Virtio-net driver, packets send corresponding to asynchronous write
> > I/O requests of backend. The path is similar to packet receive.
> >
> > 1) Guest Virtio-net driver provides header and payload address filled with
> > contents through the transmit vqueue to Virtio-net backed service.
> >
> > 2) Virtio-net backend service encapsulates the vqueue elements into multiple
> > AIO control blocks and composes them into one AIO write request.
> >
> > 3) Virtio-net backend service uses the io_submit() syscall to pass the
> > requests to the TUN/TAP device.
> >
> > 4) Virtio-net backend service uses io_getevents() syscall to check the request
> > completion.
> >
> > 5) The TUN/TAP driver gets the write requests and allocates skbs for it. The
> > header contents are copied into the skb header. The directly mapped user space
> > buffer is easily hooked into skb. Thus a zero copy for payload is implemented
> > in packet sending.
> >
> > 6) aio_complete() to notify the Virtio-net backend service for io_getevents().
> >
> > The proposed framework is described as above.
> >
> > Consider the modifications to the kernel and qemu:
> >
> > To kernel:
> > 1) The TUN/TAP driver may be modified a lot to implement AIO device operations
> > and to implement directly user space mapping into kernel. Code to maintain the
> > directly mapped user buffers should be in. It's just a modification for driver.
> >
> > 2) The NIC driver may be modified to compose skb differently and slightly data
> > structure change to add user directly mapped buffer pointer.
> > Here, maybe it's better for a NIC driver to present an interface for an rx/tx
> > queue pair instance which will also apply to traditional hardware, the kernel
> > interface should not be changed to make the other components happy.
> > The abstraction is useful, though it is not needed immediately here.
> >
> > 3) The skb shared info structure may be modified a little to contain the user
> > directly mapped info.
> >
> > To Qemu:
> > 1) The Virtio-net backend service may be modified to handle AIO read/write
> > requests from the vqueues.
> > 2) Maybe a separate pthread to handle the AIO request triggering is needed.
> >
> > Any comments are appreciated here.
>
> * Code is easier to review than bullet points.
>
> * Direct I/O has to be safe when page is shared by multiple threads,
> and has to be non-blocking since network I/O can take indeterminately
> long (think big queue's, tunneling, ...)
>
> * In the past attempts at Direct I/O on network have always had SMP
> TLB issues. The page has to be flipped or marked as COW on all CPU's
> and the cost of the Inter Processor Interrupt to steal the page has
> been slower than copying
The Guest shouldn't touch the packet, until the virtio net protocol says
it's finished with (just like a real NIC). Even if it's being nasty, if we
have hw csum it'll get the right csum on the wire, and if we don't we copy
internally anyway.
So I think this isn't a problem...
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists