linux-kernel - Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Thu, 28 Jun 2012 13:31:24 +0800
From:	Jason Wang <jasowang@...hat.com>
To:	Sridhar Samudrala <sri@...ibm.com>
CC:	"Michael S. Tsirkin" <mst@...hat.com>, habanero@...ux.vnet.ibm.com,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
	krkumar2@...ibm.com, tahm@...ux.vnet.ibm.com, akong@...hat.com,
	davem@...emloft.net, shemminger@...tta.com, mashirle@...ibm.com
Subject: Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support

On 06/28/2012 12:52 PM, Sridhar Samudrala wrote:
> On 6/27/2012 8:02 PM, Jason Wang wrote:
>> On 06/27/2012 04:44 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 27, 2012 at 01:16:30PM +0800, Jason Wang wrote:
>>>> On 06/26/2012 06:42 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Jun 26, 2012 at 11:42:17AM +0800, Jason Wang wrote:
>>>>>> On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>>>>>>>> This patch adds multiqueue support for tap device. This is done 
>>>>>>>> by abstracting
>>>>>>>> each queue as a file/socket and allowing multiple sockets to be 
>>>>>>>> attached to the
>>>>>>>> tuntap device (an array of tun_file were stored in the 
>>>>>>>> tun_struct). Userspace
>>>>>>>> could write and read from those files to do the parallel packet
>>>>>>>> sending/receiving.
>>>>>>>>
>>>>>>>> Unlike the previous single queue implementation, the socket and 
>>>>>>>> device were
>>>>>>>> loosely coupled, each of them were allowed to go away first. In 
>>>>>>>> order to let the
>>>>>>>> tx path lockless, netif_tx_loch_bh() is replaced by 
>>>>>>>> RCU/NETIF_F_LLTX to
>>>>>>>> synchronize between data path and system call.
>>>>>>> Don't use LLTX/RCU. It's not worth it.
>>>>>>> Use something like netif_set_real_num_tx_queues.
>>>>>>>
>>>>>>>> The tx queue selecting is first based on the recorded rxq index 
>>>>>>>> of an skb, it
>>>>>>>> there's no such one, then choosing based on rx hashing 
>>>>>>>> (skb_get_rxhash()).
>>>>>>>>
>>>>>>>> Signed-off-by: Jason Wang<jasowang@...hat.com>
>>>>>>> Interestingly macvtap switched to hashing first:
>>>>>>> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
>>>>>>> (the commit log is corrupted but see what it
>>>>>>> does in the patch).
>>>>>>> Any idea why?
>>>>>> Yes, so tap should be changed to behave same as macvtap. I remember
>>>>>> the reason we do that is to make sure the packet of a single flow to
>>>>>> be queued to a fixed socket/virtqueues. As 10g cards like ixgbe
>>>>>> choose the rx queue for a flow based on the last tx queue where the
>>>>>> packets of that flow comes. So if we are using recored rx queue in
>>>>>> macvtap, the queue index of a flow would change as vhost thread
>>>>>> moves amongs processors.
>>>>> Hmm. OTOH if you override this, if TX is sent from VCPU0, RX might 
>>>>> land
>>>>> on VCPU1 in the guest, which is not good, right?
>>>> Yes, but better than making the rx moves between vcpus when we use
>>>> recorded rx queue.
>>> Why isn't this a problem with native TCP?
>>> I think what happens is one of the following:
>>> - moving between CPUs is more expensive with tun
>>>    because it can queue so much data on xmit
>>> - scheduler makes very bad decisions about VCPUs
>>>    bouncing them around all the time
>>
>> For usual native TCP/host process, as it reads and writes tcp 
>> sockets, so it make make sense to move rx to the porcessor where the 
>> process moves. But vhost does not do tcp stuffs and ixgbe would still 
>> move rx when vhost process moves, and we can't even make sure the 
>> vhost process that handling rx is running on processor that handle rx 
>> interrupt.
>
> We also saw this behavior with the default ixgbe configuration. If 
> vhost is pinned to a CPU all
> packets for that VM are received on a single RX queue.
> So even if the VM is doing multiple TCP_RR sessions, packets for all 
> the flows are received
> on a single RX queue. Without pinning, vhost moves around and so does 
> the packets across
> the RX queues.
>
> I think
>         ethtool -K ethX ntuple on
> will disable this behavior and it should be possible to program the 
> flow director using ethtool -U.
> This way we can split the packets across the host NIC RX queues based 
> on the flows, but it is not
> clear if this would help with the current model of single vhost per 
> device.
> With per-cpu vhost,  each RX queue can be handled by the matching 
> vhost, but if we have only
> 1 queue in the VMs virtio-net device, that could become the bottleneck.

Yes, I've been thinking about this. And instead of using ethtool -U 
(maybe possible for macvtap but hard for tuntap), we can 'teach' the 
ixgbe of the rxq it would used for a flow because ixgbe_select_queue() 
would first select the txq based on the recorded rxq. So if we want the 
flow using a dedicated rxq say N, we can record N to the rxq in tuntap 
before we passing the skb to bridge.

> Multi-queue virtio-net should help here, but we need the same number 
> of queues in VM's virtio-net
> device as the host's NIC so that each vhost can handle the 
> corresponding virtio queue.
> But if the VM has only 2 vcpus, i think it is not efficient to have 8 
> virtio-net queues.(to match a host
> with 8 physical cpus and 8 RX queues in the NIC).

Ideally, if we can 2 queues in guest, it's better to only use 2 queues 
in host to avoid extra contention.
>
> Thanks
> Sridhar
>
>>
>>> Could we isolate which it is? Does the problem
>>> still happen if you pin VCPUs to host cpus?
>>> If not it's the queue depth.
>>
>> It may not help as tun does not record the vcpu/queue that send the 
>> stream, so it can't transmit the packets back the same vcpu/queue.
>>>> Flow steering is needed to make sure the tx and
>>>> rx on the same vcpu.
>>> That involves IPI between processes, so it might be
>>> very expensive for kvm.
>>>
>>>>>> But during test tun/tap, one interesting thing I find is that even
>>>>>> ixgbe has recorded the queue index during rx, it seems be lost when
>>>>>> tap tries to transmit skbs to userspace.
>>>>> dev_pick_tx does this I think but ndo_select_queue
>>>>> should be able to get it without trouble.
>>>>>
>>>>>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/