Message-ID: <20210507154332.hiblsd6ot5wzwkdj@steredhat>
Date: Fri, 7 May 2021 17:43:32 +0200
From: Stefano Garzarella <sgarzare@...hat.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
Stefan Hajnoczi <stefanha@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Jason Wang <jasowang@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, kvm@...r.kernel.org,
virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
Joel Fernandes <joelaf@...gle.com>,
Linux Trace Devel <linux-trace-devel@...r.kernel.org>
Subject: Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with
vhost tasks
On Fri, May 07, 2021 at 10:40:36AM -0400, Steven Rostedt wrote:
>On Fri, 7 May 2021 16:11:20 +0200
>Stefano Garzarella <sgarzare@...hat.com> wrote:
>
>> Hi Steven,
>>
>> On Wed, May 05, 2021 at 04:38:55PM -0400, Steven Rostedt wrote:
>> >The new trace-cmd 3.0 (which is almost ready to be released) allows for
>> >tracing between host and guests with timestamp synchronization such that
>> >the events on the host and the guest can be interleaved in the proper order
>> >that they occur. KernelShark now has a plugin that visualizes this
>> >interaction.
>> >
>> >The implementation requires that the guest has a vsock CID assigned, and on
>> >the guest a "trace-cmd agent" is running, that will listen on a port for
>> >the CID. Then on the host a "trace-cmd record -A guest@cid:port -e events"
>> >can be called and the host will connect to the guest agent through the
>> >cid/port pair and have the agent enable tracing on behalf of the host and
>> >send the trace data back down to it.
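Just to restate the workflow for readers: the guest runs "trace-cmd
agent", and the host then does something like

  $ trace-cmd record -A guest@3:823 -e sched

where the CID 3 and the port 823 are only example values.
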
>> >
>> >The problem is that there is no sure fire way to find the CID for a guest.
>> >Currently, the user must know the cid, or we have a hack that looks for the
>> >qemu process and parses the --guest-cid parameter from it. But this is
>> >prone to error and does not work with other implementations (I was told
>> >that crosvm does not use qemu).
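
FWIW, I presume the hack boils down to something like

  $ ps -eo args | grep -oE 'guest-cid=[0-9]+'
  guest-cid=3

which clearly breaks as soon as the VMM isn't qemu or spells the option
differently.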
>>
>> For debugging, I think it could be useful to link the vhost-vsock
>> kthread to the CID, but from the user's point of view it may be better
>> to query the VM management layer. For example, if you're using libvirt,
>> you can easily do:
>>
>> $ virsh dumpxml fedora34 | grep cid
>> <cid auto='yes' address='3'/>
>
>We looked into going this route, but then that means trace-cmd host/guest
>tracing needs a way to handle every layer, as some people use libvirt
>(myself included), some people use straight qemu, some people use Xen, and
>some people use crosvm. We need to support all of them. Which is why I'm
>looking at doing this from the lowest common denominator, and since vsock
>is a requirement from trace-cmd to do this tracing, getting the thread
>that's related to the vsock is that lowest denominator.
Makes sense.
Just a note: there are some VMMs, like Firecracker, Cloud Hypervisor, or
QEMU with vhost-user-vsock, that don't use vhost-vsock in the host;
instead they implement a hybrid vsock over a Unix domain socket:
https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md
So in that case neither this approach nor netlink/devlink would work, but
applications on the host can't use a vsock socket there anyway, so maybe
it isn't a problem.
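
To illustrate, a host application would do roughly this (a minimal
sketch; the socket path and the guest port are made up, see the
Firecracker doc above for the exact handshake):

  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int main(void)
  {
          /* Plain AF_UNIX: the kernel's vhost-vsock never sees a CID */
          struct sockaddr_un addr = { .sun_family = AF_UNIX };
          char buf[64];
          int fd = socket(AF_UNIX, SOCK_STREAM, 0);

          strcpy(addr.sun_path, "/tmp/fc-vsock.sock"); /* hypothetical */
          connect(fd, (struct sockaddr *)&addr, sizeof(addr));

          /* Text handshake: ask the VMM to forward to guest port 823 */
          dprintf(fd, "CONNECT 823\n");
          read(fd, buf, sizeof(buf)); /* expect "OK <host_port>\n" */

          close(fd);
          return 0;
  }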
>
>>
>> >
>> >As I can not find a way to discover CIDs assigned to guests via any kernel
>> >interface, I decided to create this one. Note, I'm not attached to it. If
>> >there's a better way to do this, I would love to have it. But since I'm not
>> >an expert in the networking layer nor virtio, I decided to stick to what I
>> >know and add a debugfs interface that simply lists all the registered CIDs
>> >and the worker task that they are associated with. The worker task at
>> >least has the PID of the task it represents.
>>
>> I honestly don't know if it's the best interface; as I said, maybe for
>> debugging it's fine, but if we want to expose it to the user in some
>> way, we could support devlink/netlink to provide information about the
>> vsock devices currently in use.
>
>Ideally, a devlink/netlink is the right approach. I just had no idea on how
>to implement that ;-) So I went with what I know, which is debugfs files!
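
For what it's worth, the end result should then be usable with just a
cat; the exact path and the output format below are guesses based on the
patch, so take them as illustration only:

  $ cat /sys/kernel/debug/vsock_list
  3	vhost-1234 (pid 1235)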
>
>
>
>> >Signed-off-by: Steven Rostedt (VMware) <rostedt@...dmis.org>
>> >---
>> >diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> >index 5e78fb719602..4f03b25b23c1 100644
>> >--- a/drivers/vhost/vsock.c
>> >+++ b/drivers/vhost/vsock.c
>> >@@ -15,6 +15,7 @@
>> > #include <linux/virtio_vsock.h>
>> > #include <linux/vhost.h>
>> > #include <linux/hashtable.h>
>> >+#include <linux/debugfs.h>
>> >
>> > #include <net/af_vsock.h>
>> > #include "vhost.h"
>> >@@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = {
>> > .fops = &vhost_vsock_fops,
>> > };
>> >
>> >+static struct dentry *vsock_file;
>> >+
>> >+struct vsock_file_iter {
>> >+ struct hlist_node *node;
>> >+ int index;
>> >+};
>> >+
>> >+
>> >+static void *vsock_next(struct seq_file *m, void *v, loff_t *pos)
>> >+{
>> >+ struct vsock_file_iter *iter = v;
>> >+ struct vhost_vsock *vsock;
>> >+
>> >+ if (pos)
>> >+ (*pos)++;
>> >+
>> >+ if (iter->index >= (int)HASH_SIZE(vhost_vsock_hash))
>> >+ return NULL;
>> >+
>> >+ if (iter->node)
>> >+ iter->node = rcu_dereference_raw(hlist_next_rcu(iter->node));
>> >+
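>> >+	/*
>> >+	 * Walk the hash buckets, skipping entries that don't have a
>> >+	 * guest CID assigned yet; when the current chain is exhausted,
>> >+	 * move on to the first entry of the next bucket.
>> >+	 */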
>> >+ for (;;) {
>> >+ if (iter->node) {
>> >+ vsock = hlist_entry_safe(rcu_dereference_raw(iter->node),
>> >+ struct vhost_vsock, hash);
>> >+ if (vsock->guest_cid)
>> >+ break;
>> >+			iter->node = rcu_dereference_raw(hlist_next_rcu(iter->node));
>> >+ continue;
>> >+ }
>> >+ iter->index++;
>> >+ if (iter->index >= HASH_SIZE(vhost_vsock_hash))
>> >+ return NULL;
>> >+
>> >+ iter->node = rcu_dereference_raw(hlist_first_rcu(&vhost_vsock_hash[iter->index]));
>> >+ }
>> >+ return iter;
>> >+}
>> >+
>> >+static void *vsock_start(struct seq_file *m, loff_t *pos)
>> >+{
>> >+ struct vsock_file_iter *iter = m->private;
>> >+ loff_t l = 0;
>> >+ void *t;
>> >+
>> >+ rcu_read_lock();
>>
>> Instead of keeping this rcu lock between vsock_start() and vsock_stop(),
>> maybe it's better to make a dump here of the bindings (pid/cid), save it
>> in an array, and iterate it in vsock_next().
>
>The start/stop of a seq_file() is made for taking locks. I do this with all
>my code in ftrace. Yeah, there's a while loop between the two, but that's
>just to fill the buffer. It's not that long and it never goes to userspace
>between the two. You can even use this for spin locks (but I wouldn't
>recommend doing it for raw ones).
Ah okay, thanks for the clarification!
I was worried because, when building with `make C=2`, I got these warnings:
../drivers/vhost/vsock.c:944:13: warning: context imbalance in 'vsock_start' - wrong count at exit
../drivers/vhost/vsock.c:963:13: warning: context imbalance in 'vsock_stop' - unexpected unlock
Maybe we need to annotate the functions somehow.
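Untested, but since sparse tracks the lock context through function
annotations, something along these lines might be enough:

	static void *vsock_start(struct seq_file *m, loff_t *pos)
		__acquires(rcu)
	{
		rcu_read_lock();
		/* ... */
	}

	static void vsock_stop(struct seq_file *m, void *v)
		__releases(rcu)
	{
		rcu_read_unlock();
	}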
>
>>
>> >+
>> >+ iter->index = -1;
>> >+ iter->node = NULL;
>> >+ t = vsock_next(m, iter, NULL);
>> >+
>> >+ for (; iter->index < HASH_SIZE(vhost_vsock_hash) && l < *pos;
>> >+ t = vsock_next(m, iter, &l))
>> >+ ;
>>
>> A while() maybe was more readable...
>
>Again, I just cut and pasted from my other code.
>
>If you have a good idea on how to implement this with netlink (something
>that ss or netstat can display), I think that's the best way to go.
Okay, I'll take a look and get back to you.
If it's too complicated, we can go ahead with this patch.
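One possible starting point: AF_VSOCK already has a sock_diag module
(net/vmw_vsock/diag.c), which is what "ss --vsock" talks to, so it could
serve as a template; though it reports sockets, not the CID-to-vhost-task
mapping this patch is after.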
Thanks,
Stefano