linux-kernel - Re: virtio-blk: should num_vqs be limited by num_possible

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a3f24cfa-5160-0ccc-decc-a48362c8de19@oracle.com>
Date:   Thu, 14 Mar 2019 14:12:32 +0800
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     Cornelia Huck <cohuck@...hat.com>, mst@...hat.com,
        jasowang@...hat.com
Cc:     virtualization@...ts.linux-foundation.org,
        linux-block@...r.kernel.org, axboe@...nel.dk,
        linux-kernel@...r.kernel.org
Subject: Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?



On 3/13/19 5:39 PM, Cornelia Huck wrote:
> On Wed, 13 Mar 2019 11:26:04 +0800
> Dongli Zhang <dongli.zhang@...cle.com> wrote:
> 
>> On 3/13/19 1:33 AM, Cornelia Huck wrote:
>>> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
>>> Dongli Zhang <dongli.zhang@...cle.com> wrote:
>>>   
>>>> I observed that there is one msix vector for config and one shared vector
>>>> for all queues in below qemu cmdline, when the num-queues for virtio-blk
>>>> is more than the number of possible cpus:
>>>>
>>>> qemu: "-smp 4" while "-device virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
>>>>
>>>> # cat /proc/interrupts 
>>>>            CPU0       CPU1       CPU2       CPU3
>>>> ... ...
>>>>  24:          0          0          0          0   PCI-MSI 65536-edge      virtio0-config
>>>>  25:          0          0          0         59   PCI-MSI 65537-edge      virtio0-virtqueues
>>>> ... ...
>>>>
>>>>
>>>> However, when num-queues is the same as number of possible cpus:
>>>>
>>>> qemu: "-smp 4" while "-device virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
>>>>
>>>> # cat /proc/interrupts 
>>>>            CPU0       CPU1       CPU2       CPU3
>>>> ... ... 
>>>>  24:          0          0          0          0   PCI-MSI 65536-edge      virtio0-config
>>>>  25:          2          0          0          0   PCI-MSI 65537-edge      virtio0-req.0
>>>>  26:          0         35          0          0   PCI-MSI 65538-edge      virtio0-req.1
>>>>  27:          0          0         32          0   PCI-MSI 65539-edge      virtio0-req.2
>>>>  28:          0          0          0          0   PCI-MSI 65540-edge      virtio0-req.3
>>>> ... ...
>>>>
>>>> In above case, there is one msix vector per queue.  
>>>
>>> Please note that this is pci-specific...
>>>   
>>>>
>>>>
>>>> This is because the max number of queues is not limited by the number of
>>>> possible cpus.
>>>>
>>>> By default, nvme (regardless about write_queues and poll_queues) and
>>>> xen-blkfront limit the number of queues with num_possible_cpus().  
>>>
>>> ...and these are probably pci-specific as well.  
>>
>> Not pci-specific, but per-cpu as well.
> 
> Ah, I meant that those are pci devices.
> 
>>
>>>   
>>>>
>>>>
>>>> Is this by design on purpose, or can we fix with below?
>>>>
>>>>
>>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>>>> index 4bc083b..df95ce3 100644
>>>> --- a/drivers/block/virtio_blk.c
>>>> +++ b/drivers/block/virtio_blk.c
>>>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>>>  	if (err)
>>>>  		num_vqs = 1;
>>>>  
>>>> +	num_vqs = min(num_possible_cpus(), num_vqs);
>>>> +
>>>>  	vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>>>>  	if (!vblk->vqs)
>>>>  		return -ENOMEM;  
>>>
>>> virtio-blk, however, is not pci-specific.
>>>
>>> If we are using the ccw transport on s390, a completely different
>>> interrupt mechanism is in use ('floating' interrupts, which are not
>>> per-cpu). A check like that should therefore not go into the generic
>>> driver.
>>>   
>>
>> So far there seems two options.
>>
>> The 1st option is to ask the qemu user to always specify "-num-queues" with the
>> same number of vcpus when running x86 guest with pci for virtio-blk or
>> virtio-scsi, in order to assign a vector for each queue.
> 
> That does seem like an extra burden for the user: IIUC, things work
> even if you have too many queues, it's just not optimal. It sounds like
> something that can be done by a management layer (e.g. libvirt), though.
> 
>> Or, is it fine for virtio folks to add a new hook to 'struct virtio_config_ops'
>> so that different platforms (e.g., pci or ccw) would use different ways to limit
>> the max number of queues in guest, with something like below?
> 
> That sounds better, as both transports and drivers can opt-in here.
> 
> However, maybe it would be even better to try to come up with a better
> strategy of allocating msix vectors in virtio-pci. More vectors in the
> num_queues > num_cpus case, even if they still need to be shared?
> Individual vectors for n-1 cpus and then a shared one for the remaining
> queues?
> 
> It might even be device-specific: Have some low-traffic status queues
> share a vector, and provide an individual vector for high-traffic
> queues. Would need some device<->transport interface, obviously.
> 

This sounds a little bit similar to multiple hctx maps?

So far, as virtio-blk only supports set->nr_maps = 1, no matter how many hw
queues are assigned for virtio-blk, blk_mq_alloc_tag_set() would use at most
nr_cpu_ids hw queues.

2981 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
... ...
3021         /*
3022          * There is no use for more h/w queues than cpus if we just have
3023          * a single map
3024          */
3025         if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
3026                 set->nr_hw_queues = nr_cpu_ids;

Even the block layer would limit the number of hw queues by nr_cpu_ids when
(set->nr_maps == 1).

That's why I think virtio-blk should use the similar solution as nvme
(regardless about write_queues and poll_queues) and xen-blkfront.

Added Jason again. I do not know why the mailing list of
virtualization@...ts.linux-foundation.org always filters out Jason's email...


Dongli Zhang