Message-ID: <03837ec6-e0e1-4ba8-4b1d-0a125c059f23@oracle.com>
Date: Sun, 13 Aug 2023 22:13:44 -0500
From: michael.christie@...cle.com
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: hch@...radead.org, stefanha@...hat.com, jasowang@...hat.com,
sgarzare@...hat.com, virtualization@...ts.linux-foundation.org,
brauner@...nel.org, ebiederm@...ssion.com,
torvalds@...ux-foundation.org, konrad.wilk@...cle.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 8/13/23 2:01 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
>> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
>>> On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.christie@...cle.com wrote:
>>>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>>>> For vhost workers we use the kthread API, which inherits its values from
>>>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>>>> being checked, so while tools like libvirt try to control the number of
>>>>>> threads based on the nproc rlimit setting, we can end up creating more
>>>>>> threads than the user wanted.
>>>>>>
>>>>>> This patch has us use the vhost_task helpers, which inherit their
>>>>>> values/checks from the thread that owns the device, similar to if we did
>>>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>>>> so we can remove those calls.
>>>>>>
>>>>>> Signed-off-by: Mike Christie <michael.christie@...cle.com>
>>>>>> Acked-by: Michael S. Tsirkin <mst@...hat.com>
>>>>>
>>>>>
>>>>> Hi Mike,
>>>>> So this seems to have caused a measurable regression in networking
>>>>> performance (about 30%). Take a look here, and there's a zip file
>>>>> with detailed measurements attached:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
>>>>>
>>>>>
>>>>> Could you take a look please?
>>>>> You can also ask reporter questions there assuming you
>>>>> have or can create a (free) account.
>>>>>
>>>>
>>>> Sorry for the late reply. I just got home from vacation.
>>>>
>>>> The account creation link seems to be down. I keep getting an
>>>> "unable to establish SMTP connection to bz-exim-prod port 25" error.
>>>>
>>>> Can you give me Quan's email?
>>>>
>>>> I think I can replicate the problem. I just need some extra info from Quan:
>>>>
>>>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>>>> 2. The kernel config
>>>> 3. Any tuning that was done. Is tuned running in the guest and/or on the host
>>>> running the VMs, and what profile is being used in each?
>>>> 4. Number of vCPUs and virtqueues being used.
>>>> 5. Can they dump the contents of:
>>>>
>>>> /sys/kernel/debug/sched
>>>>
>>>> and
>>>>
>>>> sysctl -a
>>>>
>>>> on the host running the VMs.
>>>>
>>>> 6. With the 6.4 kernel, can they also run a quick test and tell me what
>>>> happens if they set the scheduler to batch:
>>>>
>>>> ps -T -o comm,pid,tid $QEMU_THREAD
>>>>
>>>> then for each vhost thread do:
>>>>
>>>> chrt -b -p 0 $VHOST_THREAD
>>>>
>>>> Does that end up increasing perf? When I do this I see throughput go up by
>>>> around 50% vs 6.3 when the session count was 16 or more (16 was the number of
>>>> vCPUs and virtqueues per net device in the VM). Note that I'm not saying that
>>>> is a fix. It's just a difference I noticed when running some other tests.
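>>>>
>>>> Something like the following is what I have in mind (rough sketch, untested;
>>>> it assumes $QEMU_PID is the main QEMU process and that the vhost workers
>>>> show up in ps with a comm starting with "vhost-"):
>>>>
>>>> # switch every vhost worker thread of the QEMU process to SCHED_BATCH
>>>> for tid in $(ps -T -o comm,tid --no-headers -p "$QEMU_PID" | \
>>>>              awk '/^vhost-/ {print $2}'); do
>>>>     chrt -b -p 0 "$tid"
>>>> done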
>>>
>>>
>>> Mike, I'm unsure what to do at this point. Regressions are not nice,
>>> but if the kernel is released with the new userspace API we won't
>>> be able to revert. So what's the plan?
>>>
>>
>> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
>> 6.4 perform the same for me. I've tried your setup and settings, with different
>> combos of things like tuned and irqbalance.
>>
>> I can sort of force the issue. In 6.4, the vhost thread inherits its settings
>> from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
>> would then reset the sched settings. So in 6.4, if I just tune the parent
>> differently I can cause different performance. If we want the 6.3 behavior we
>> can do the patch below.
>>
>> However, I don't think you guys are hitting this, because you are just running
>> qemu from the normal shell and aren't doing anything fancy with the sched
>> settings.
>>
>>
>> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
>> index da35e5b7f047..f2c2638d1106 100644
>> --- a/kernel/vhost_task.c
>> +++ b/kernel/vhost_task.c
>> @@ -2,6 +2,7 @@
>> /*
>> * Copyright (C) 2021 Oracle Corporation
>> */
>> +#include <uapi/linux/sched/types.h>
>> #include <linux/slab.h>
>> #include <linux/completion.h>
>> #include <linux/sched/task.h>
>> @@ -22,9 +23,16 @@ struct vhost_task {
>>
>> static int vhost_task_fn(void *data)
>> {
>> + static const struct sched_param param = { .sched_priority = 0 };
>> struct vhost_task *vtsk = data;
>> bool dead = false;
>>
>> + /*
>> + * Don't inherit the parent's sched info, so we maintain compat from
>> + * when we used kthreads and it reset this info.
>> + */
>> + sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>> +
>> for (;;) {
>> bool did_work;
>>
>>
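>> A quick way to verify the worker after a change like this (assuming
>> $VHOST_TID is one of the vhost worker tids from the ps output above):
>>
>> chrt -p "$VHOST_TID"    # should report SCHED_OTHER, priority 0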
>>
>
> Yes, seems unlikely; still, attach this to bugzilla so it can be
> tested?
>
> And what will help you debug? Any traces to enable?
I added the patch and asked for a perf trace.
>
> Also, wasn't there another issue with a non-standard config?
> Maybe if we fix that it will by chance fix this one too?
>
That was when CONFIG_RT_GROUP_SCHED was enabled in the kernel config; with
that set I would see a large drop in IOPS/throughput.
In the current 6.5-rc6 I don't see the problem anymore. I haven't had a
chance to narrow down what fixed it.
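
In case it helps with reproducing, a quick way to check whether a kernel
was built with that option (assuming the config is installed in the usual
place, or exposed via /proc/config.gz):

grep CONFIG_RT_GROUP_SCHED /boot/config-$(uname -r)
# or, if CONFIG_IKCONFIG_PROC is enabled:
zgrep CONFIG_RT_GROUP_SCHED /proc/config.gz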