lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 21 Sep 2009 16:37:22 +0930
From:	Rusty Russell <rusty@...tcorp.com.au>
To:	virtualization@...ts.linux-foundation.org
Cc:	Stephen Hemminger <shemminger@...tta.com>,
	"Xin, Xiaohui" <xiaohui.xin@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"mst@...hat.com" <mst@...hat.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"hpa@...or.com" <hpa@...or.com>, "mingo@...e.hu" <mingo@...e.hu>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
Subject: Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM

On Wed, 2 Sep 2009 01:35:18 am Stephen Hemminger wrote:
> On Tue, 1 Sep 2009 14:58:19 +0800
> "Xin, Xiaohui" <xiaohui.xin@...el.com> wrote:
> 
> >               [RFC] Virtual Machine Device Queues (VMDq) support on KVM
> > 
> > Network adapter with VMDq technology presents multiple pairs of tx/rx queues,
> > and renders network L2 sorting mechanism based on MAC addresses and VLAN tags
> > for each tx/rx queue pair. Here we present a generic framework, in which network
> > traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without
> > any software copy.
> > 
> > Actually this framework can apply to traditional network adapters which have
> > just one tx/rx queue pair. And applications using the same user/kernel interface
> > can utilize this framework to send/receive network traffic directly thru a tx/rx
> > queue pair in a network adapter.
> > 
> > We use virtio-net architecture to illustrate the framework.
> > 
> > 
> > |--------------------|     pop               add_buf    |----------------|
> > |    Qemu process    |  <---------    TX   <----------  | Guest Kernel   |
> > |                    |  --------->         ---------->  |                |
> > |    Virtio-net      |     push              get_buf    |                |
> > |  (Backend service) |  --------->    RX   ---------->  |  Virtio-net    |
> > |                    |  <---------         <----------  |    driver      |
> > |                    |     push              get_buf    |                |
> > |--------------------|                                  |----------------|
> >                    |
> >                    |
> >                    | AIO (read & write) combined with Direct I/O
> >                    |   (which substitute synced file operations)
> > |-----------------------------------------------------------------------|
> > |     Host kernel  | read: copy-less with directly mapped user          |
> > |                  |       space to kernel, payload directly DMAed      |
> > |                  |       into user space                              |
> > |                  | write: copy-less with directly mapped user         |
> > |                  |       space to kernel, payload directly hooked     |
> > |                  |       to a skb                                     |
> > |                  |                                                    |
> > |  (a likely       |                                                    |
> > |   queue pair     |                                                    |
> > |   instance)      |                                                    |
> > |      |           |                                                    |
> > | NIC driver <-->  TUN/TAP driver                                       |
> > |-----------------------------------------------------------------------|
> >        |
> >        |
> >    traditional adapter or a tx/rx queue pair
> > 
> > The basic idea is to utilize the kernel Asynchronous I/O combined with Direct
> > I/O to implements copy-less TUN/TAP device. AIO and Direct I/O is not new to
> > kernel, we still can see it in SCSI tape driver.
> > 
> > With traditional file operations, a copying of payload contents from/to the
> > kernel DMA address to/from a user buffer is needed. That's what the copying we
> > want to save.
> > 
> > The proposed framework is like this:
> > A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair in
> > host side. KVM virto-net Backend service, the user space program submits
> > asynchronous read/write I/O requests to the host kernel through TUN/TAP device.
> > The requests are corresponding to the vqueue elements include both transmission
> > & receive. They can be queued in one AIO request and later, the completion will
> > be notified through the underlying packets tx/rx processing of the rx/tx queue
> > pair.
> > 
> > Detailed path:
> > 
> > To guest Virtio-net driver, packets receive corresponding to asynchronous read
> > I/O requests of Backend service.
> > 
> > 1) Guest Virtio-net driver provides header and payload address through the
> > receive vqueue to Virtio-net backend service.
> > 
> > 2) Virtio-net backend service encapsulates multiple vqueue elements into
> > multiple AIO control blocks and composes them into one AIO read request.
> > 
> > 3) Virtio-net backend service uses io_submit() syscall to pass the request to
> > the TUN/TAP device.
> > 
> > 4) Virtio-net backend service uses io_getevents() syscall to check the
> > completion of the request.
> > 
> > 5) The TUN/TAP driver receives packets from the queue pair of NIC, and prepares
> > for Direct I/O.
> >    A modified NIC driver may render a skb which header is allocated in host
> > kernel, but the payload buffer is directly mapped from user space buffer which
> > are rendered through the AIO request by the Backend service. get_user_pages()
> > may do this. For one AIO read request, the TUN/TAP driver maintains a list for
> > the directly mapped buffers, and a NIC driver tries to get the buffers as
> > payload buffer to compose the new skbs. Of course, if getting the buffers
> > fails, then kernel allocated buffers are used.
> > 
> > 6) Modern NIC cards now mostly have the header split feature. The NIC queue
> > pair then may directly DMA the payload into the user spaces mapped payload
> > buffers.
> > Thus a zero-copy for payload is implemented in packet receiving.
> > 
> > 7) The TUN/TAP driver manually copy the host header to space user mapped.
> > 
> > 8) aio_complete() to notify the Virtio-net backend service for io_getevents().
> > 
> > 
> > To guest Virtio-net driver, packets send corresponding to asynchronous write
> > I/O requests of backend. The path is similar to packet receive.
> > 
> > 1) Guest Virtio-net driver provides header and payload address filled with
> > contents through the transmit vqueue to Virtio-net backed service.
> > 
> > 2) Virtio-net backend service encapsulates the vqueue elements into multiple
> > AIO control blocks and composes them into one AIO write request.
> > 
> > 3) Virtio-net backend service uses the io_submit() syscall to pass the
> > requests to the TUN/TAP device.
> > 
> > 4) Virtio-net backend service uses io_getevents() syscall to check the request
> > completion.
> > 
> > 5) The TUN/TAP driver gets the write requests and allocates skbs for it. The
> > header contents are copied into the skb header. The directly mapped user space
> > buffer is easily hooked into skb. Thus a zero copy for payload is implemented
> > in packet sending.
> > 
> > 6) aio_complete() to notify the Virtio-net backend service for io_getevents().
> > 
> > The proposed framework is described as above.
> > 
> > Consider the modifications to the kernel and qemu:
> > 
> > To kernel:
> > 1) The TUN/TAP driver may be modified a lot to implement AIO device operations
> > and to implement directly user space mapping into kernel. Code to maintain the
> > directly mapped user buffers should be in. It's just a modification for driver.
> > 
> > 2) The NIC driver may be modified to compose skb differently and slightly data
> > structure change to add user directly mapped buffer pointer.
> > Here, maybe it's better for a NIC driver to present an interface for an rx/tx
> > queue pair instance which will also apply to traditional hardware, the kernel
> > interface should not be changed to make the other components happy.
> > The abstraction is useful, though it is not needed immediately here.
> > 
> > 3) The skb shared info structure may be modified a little to contain the user
> > directly mapped info.
> > 
> > To Qemu:
> > 1) The Virtio-net backend service may be modified to handle AIO read/write
> > requests from the vqueues.
> > 2) Maybe a separate pthread to handle the AIO request triggering is needed.
> > 
> > Any comments are appreciated here.
> 
> * Code is easier to review than bullet points.
> 
> * Direct I/O has to be safe when page is shared by multiple threads,
>   and has to be non-blocking since network I/O can take indeterminately
>   long (think big queue's, tunneling, ...)
> 
> * In the past attempts at Direct I/O on network have always had SMP
>   TLB issues. The page has to be flipped or marked as COW on all CPU's
>   and the cost of the Inter Processor Interrupt to steal the page has
>   been slower than copying

The Guest shouldn't touch the packet, until the virtio net protocol says
it's finished with (just like a real NIC).  Even if it's being nasty, if we
have hw csum it'll get the right csum on the wire, and if we don't we copy
internally anyway.

So I think this isn't a problem...
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists