linux-kernel - Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACGkMEuSGT-e-i-8U7hum-N_xEnsEKL+_07Mipf6gMLFFhj2Aw@mail.gmail.com>
Date:   Mon, 11 Dec 2023 15:26:46 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     Tobias Huschle <huschle@...ux.ibm.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>,
        kvm@...r.kernel.org, virtualization@...ts.linux.dev,
        netdev@...r.kernel.org
Subject: Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6
 sched/fair: Add lag based placement)

On Sat, Dec 9, 2023 at 6:42 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>
> On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> > On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > >
> > > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > > the terminology.
> > > > > >
> > > > > > - vhost is looping through available queues.
> > > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > > - kworker does something with that queue and terminates quickly.
> > > > > >
> > > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > > if the kworker is not woken up, the vhost just keeps looping accross
> > > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > > it will just keep on looping until it is migrated off the CPU.
> > > > >
> > > > >
> > > > > Normally it takes the buffers off the queue and is done with it.
> > > > > I am guessing that at the same time guest is running on some other
> > > > > CPU and keeps adding available buffers?
> > > > >
> > > >
> > > > It seems to do just that, there are multiple other vhost instances
> > > > involved which might keep filling up thoses queues.
> > > >
> > >
> > > No vhost is ever only draining queues. Guest is filling them.
> > >
> > > > Unfortunately, this makes the problematic vhost instance to stay on
> > > > the CPU and prevents said kworker to get scheduled. The kworker is
> > > > explicitly woken up by vhost, so it wants it to do something.

It looks to me vhost doesn't use workqueue but the worker by itself.

> > > >
> > > > At this point it seems that there is an assumption about the scheduler
> > > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > > far, it seems like EEVDF does what is intended to do.
> > > >
> > > > Shouldn't there be a more explicit mechanism in use that allows the
> > > > kworker to be scheduled in favor of the vhost?

Vhost did a brunch of copy_from_user() which should trigger
__might_fault() so a __might_sleep() most of the case.

> > > >
> > > > It is also concerning that the vhost seems cannot be preempted by the
> > > > scheduler while executing that loop.
> > >
> > >
> > > Which loop is that, exactly?
> >
> > The loop continously passes translate_desc in drivers/vhost/vhost.c
> > That's where I put the trace statements.
> >
> > The overall sequence seems to be (top to bottom):
> >
> > handle_rx
> > get_rx_bufs
> > vhost_get_vq_desc
> > vhost_get_avail_head
> > vhost_get_avail
> > __vhost_get_user_slow
> > translate_desc               << trace statement in here
> > vhost_iotlb_itree_first
>
> I wonder why do you keep missing cache and re-translating.
> Is pr_debug enabled for you? If not could you check if it
> outputs anything?
> Or you can tweak:
>
> #define vq_err(vq, fmt, ...) do {                                  \
>                 pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
>                 if ((vq)->error_ctx)                               \
>                                 eventfd_signal((vq)->error_ctx, 1);\
>         } while (0)
>
> to do pr_err if you prefer.
>
> > These functions show up as having increased overhead in perf.
> >
> > There are multiple loops going on in there.
> > Again the disclaimer though, I'm not familiar with that code at all.
>
>
> So there's a limit there: vhost_exceeds_weight should requeue work:
>
>         } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
>
> then we invoke scheduler each time before re-executing it:
>
>
> {
>         struct vhost_worker *worker = data;
>         struct vhost_work *work, *work_next;
>         struct llist_node *node;
>
>         node = llist_del_all(&worker->work_list);
>         if (node) {
>                 __set_current_state(TASK_RUNNING);
>
>                 node = llist_reverse_order(node);
>                 /* make sure flag is seen after deletion */
>                 smp_wmb();
>                 llist_for_each_entry_safe(work, work_next, node, node) {
>                         clear_bit(VHOST_WORK_QUEUED, &work->flags);
>                         kcov_remote_start_common(worker->kcov_handle);
>                         work->fn(work);
>                         kcov_remote_stop();
>                         cond_resched();
>                 }
>         }
>
>         return !!node;
> }
>
> These are the byte and packet limits:
>
> /* Max number of bytes transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> /* Max number of packets transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others with small
>  * pkts.
>  */
> #define VHOST_NET_PKT_WEIGHT 256
>
>
> Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

Or a dirty hack like cond_resched() in translate_desc().

Thanks


>
> --
> MST
>