[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <467e1b51-e335-4280-befa-adc2cf283ab5@linux.microsoft.com>
Date: Fri, 23 Jan 2026 13:11:06 -0800
From: Nuno Das Neves <nunodasneves@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"skinsburskii@...ux.microsoft.com" <skinsburskii@...ux.microsoft.com>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>,
"longli@...rosoft.com" <longli@...rosoft.com>,
"prapal@...ux.microsoft.com" <prapal@...ux.microsoft.com>,
"mrathor@...ux.microsoft.com" <mrathor@...ux.microsoft.com>,
"paekkaladevi@...ux.microsoft.com" <paekkaladevi@...ux.microsoft.com>,
Jinank Jain <jinankjain@...rosoft.com>
Subject: Re: [PATCH v4 7/7] mshv: Add debugfs to view hypervisor statistics
On 1/23/2026 9:09 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@...ux.microsoft.com> Sent: Wednesday, January 21, 2026 1:46 PM
>>
>> Introduce a debugfs interface to expose root and child partition stats
>> when running with mshv_root.
>>
>> Create a debugfs directory "mshv" containing 'stats' files organized by
>> type and id. A stats file contains a number of counters depending on
>> its type. e.g. an excerpt from a VP stats file:
>>
>> TotalRunTime : 1997602722
>> HypervisorRunTime : 649671371
>> RemoteNodeRunTime : 0
>> NormalizedRunTime : 1997602721
>> IdealCpu : 0
>> HypercallsCount : 1708169
>> HypercallsTime : 111914774
>> PageInvalidationsCount : 0
>> PageInvalidationsTime : 0
>>
>> On a root partition with some active child partitions, the entire
>> directory structure may look like:
>>
>> mshv/
>> stats # hypervisor stats
>> lp/ # logical processors
>> 0/ # LP id
>> stats # LP 0 stats
>> 1/
>> 2/
>> 3/
>> partition/ # partition stats
>> 1/ # root partition id
>> stats # root partition stats
>> vp/ # root virtual processors
>> 0/ # root VP id
>> stats # root VP 0 stats
>> 1/
>> 2/
>> 3/
>> 42/ # child partition id
>> stats # child partition stats
>> vp/ # child VPs
>> 0/ # child VP id
>> stats # child VP 0 stats
>> 1/
>> 43/
>> 55/
>>
>> On L1VH, some stats are not present as it does not own the hardware
>> like the root partition does:
>> - The hypervisor and lp stats are not present
>> - L1VH's partition directory is named "self" because it can't get its
>> own id
>> - Some of L1VH's partition and VP stats fields are not populated, because
>> it can't map its own HV_STATS_AREA_PARENT page.
>>
>> Co-developed-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
>> Co-developed-by: Praveen K Paladugu <prapal@...ux.microsoft.com>
>> Signed-off-by: Praveen K Paladugu <prapal@...ux.microsoft.com>
>> Co-developed-by: Mukesh Rathor <mrathor@...ux.microsoft.com>
>> Signed-off-by: Mukesh Rathor <mrathor@...ux.microsoft.com>
>> Co-developed-by: Purna Pavan Chandra Aekkaladevi
>> <paekkaladevi@...ux.microsoft.com>
>> Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@...ux.microsoft.com>
>> Co-developed-by: Jinank Jain <jinankjain@...rosoft.com>
>> Signed-off-by: Jinank Jain <jinankjain@...rosoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@...ux.microsoft.com>
>> Reviewed-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
>> ---
>> drivers/hv/Makefile | 1 +
>> drivers/hv/hv_counters.c | 1 +
>> drivers/hv/hv_synic.c | 177 +++++++++
>
> This new file hv_synic.c seems to be spurious. It looks like you unintentionally
> picked up this new file from the build tree where you were creating the patches
> for this series.
>
Oh, that's embarrassing! Yes, it's a half-baked, unrelated work-in-progress...
Please ignore!
<snip>
>> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
>> new file mode 100644
>> index 000000000000..72eb0ae44e4b
>> --- /dev/null
>> +++ b/drivers/hv/mshv_debugfs.c
>> @@ -0,0 +1,703 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (c) 2026, Microsoft Corporation.
>> + *
>> + * The /sys/kernel/debug/mshv directory contents.
>> + * Contains various statistics data, provided by the hypervisor.
>> + *
>> + * Authors: Microsoft Linux virtualization team
>> + */
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/stringify.h>
>> +#include <asm/mshyperv.h>
>> +#include <linux/slab.h>
>> +
>> +#include "mshv.h"
>> +#include "mshv_root.h"
>> +
>> +#include "hv_counters.c"
>> +
>> +#define U32_BUF_SZ 11
>> +#define U64_BUF_SZ 21
>> +#define NUM_STATS_AREAS (HV_STATS_AREA_PARENT + 1)
>
> This is sort of weak in that it doesn't really guard against
> changes in the enum that defines HV_STATS_AREA_PARENT.
> It would work if it were defined as part of the enum, but then
> you are changing the code coming from the Windows world,
> which I know is a different problem.
>
> The enum is part of the hypervisor ABI and hence isn't likely to
> change, but it still feels funny to define NUM_STATS_AREAS like
> this. I would suggest dropping this and just using
> HV_STATS_AREA_COUNT for the memory allocations even
> though doing so will allocate space for a stats area pointer
> that isn't used by this code. It's only a few bytes.
>
That would work, but then I'd want to have a comment explaining
that the decision is intentional, otherwise I think it's just as
confusing to have unexplained wasted space.
Alternatively, the usage of SELF and PARENT (but not INTERNAL)
could be made explicit by a compile-time check, and a comment to
clarify:
/* Only support SELF and PARENT areas */
#define NUM_STATS_AREAS 2
static_assert(HV_STATS_AREA_SELF == 0 && HV_STATS_AREA_PARENT == 1,
"SELF and PARENT areas must be usable as indices into an array of size NUM_STATS_AREAS")
>> +
>> +static struct dentry *mshv_debugfs;
>> +static struct dentry *mshv_debugfs_partition;
>> +static struct dentry *mshv_debugfs_lp;
>> +static struct dentry **parent_vp_stats;
>> +static struct dentry *parent_partition_stats;
>> +
>> +static u64 mshv_lps_count;
>> +static struct hv_stats_page **mshv_lps_stats;
>> +
>> +static int lp_stats_show(struct seq_file *m, void *v)
>> +{
>> + const struct hv_stats_page *stats = m->private;
>> + struct hv_counter_entry *entry = hv_lp_counters;
>> + int i;
>> +
>> + for (i = 0; i < ARRAY_SIZE(hv_lp_counters); i++, entry++)
>> + seq_printf(m, "%-29s: %llu\n", entry->name,
>> + stats->data[entry->idx]);
>> +
>> + return 0;
>> +}
>> +DEFINE_SHOW_ATTRIBUTE(lp_stats);
>> +
>> +static void mshv_lp_stats_unmap(u32 lp_index)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .lp.lp_index = lp_index,
>> + .lp.stats_area_type = HV_STATS_AREA_SELF,
>> + };
>> + int err;
>> +
>> + err = hv_unmap_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR,
>> + mshv_lps_stats[lp_index], &identity);
>> + if (err)
>> + pr_err("%s: failed to unmap logical processor %u stats, err: %d\n",
>> + __func__, lp_index, err);
>
> Perhaps set mshv_lps_stats[lp_index] to NULL? I don't think it's actually
> required, but similar code later in this file sets some pointers to NULL
> just as good hygiene.
>
Good idea, I'll do that.
>> +}
>> +
<snip>
>> +
>> +static int __init mshv_debugfs_lp_create(struct dentry *parent)
>> +{
>> + struct dentry *lp_dir;
>> + int err, lp_index;
>> +
>> + mshv_lps_stats = kcalloc(mshv_lps_count,
>> + sizeof(*mshv_lps_stats),
>> + GFP_KERNEL_ACCOUNT);
>> +
>> + if (!mshv_lps_stats)
>> + return -ENOMEM;
>> +
>> + lp_dir = debugfs_create_dir("lp", parent);
>> + if (IS_ERR(lp_dir)) {
>> + err = PTR_ERR(lp_dir);
>> + goto free_lp_stats;
>> + }
>> +
>> + for (lp_index = 0; lp_index < mshv_lps_count; lp_index++) {
>> + err = lp_debugfs_create(lp_index, lp_dir);
>> + if (err)
>> + goto remove_debugfs_lps;
>> + }
>> +
>> + mshv_debugfs_lp = lp_dir;
>> +
>> + return 0;
>> +
>> +remove_debugfs_lps:
>> + for (lp_index -= 1; lp_index >= 0; lp_index--)
>> + mshv_lp_stats_unmap(lp_index);
>> + debugfs_remove_recursive(lp_dir);
>> +free_lp_stats:
>> + kfree(mshv_lps_stats);
>
> Set mshv_lps_stats to NULL?
>
Agreed, thanks.
>> +
>> + return err;
>> +}
>> +
<snip>
>> +
>> +static void mshv_debugfs_parent_partition_remove(void)
>> +{
>> + int idx;
>> +
>> + for_each_online_cpu(idx)
>> + parent_vp_debugfs_remove(idx,
>
> The first parameter here ("idx") should be translated through the
> hv_vp_index[] array like is done in mshv_debugfs_parent_partition_create().
>
Ok, thanks
>> + parent_vp_stats[idx]);
>> +
>> + partition_debugfs_remove(hv_current_partition_id,
>> + parent_partition_stats);
>> + kfree(parent_vp_stats);
>> + parent_vp_stats = NULL;
>> + parent_partition_stats = NULL;
>> +
>
> Extra blank line.
>
Ack
>> +}
>> +
>> +static int __init parent_vp_debugfs_create(u32 vp_index,
>> + struct dentry **vp_stats_ptr,
>> + struct dentry *parent)
>> +{
>> + struct hv_stats_page **pstats;
>> + int err;
>> +
>> + pstats = kcalloc(2, sizeof(struct hv_stats_page *), GFP_KERNEL_ACCOUNT);
>
> Another case of using "2" that should be changed.
>
Ack
>> + if (!pstats)
>> + return -ENOMEM;
>> +
>> + err = mshv_vp_stats_map(hv_current_partition_id, vp_index, pstats);
>> + if (err)
>> + goto cleanup;
>> +
>> + err = vp_debugfs_create(hv_current_partition_id, vp_index, pstats,
>> + vp_stats_ptr, parent);
>> + if (err)
>> + goto unmap_vp_stats;
>> +
>> + return 0;
>> +
>> +unmap_vp_stats:
>> + mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
>> +cleanup:
>> + kfree(pstats);
>> + return err;
>> +}
>> +
>> +static int __init mshv_debugfs_parent_partition_create(void)
>> +{
>> + struct dentry *vp_dir;
>> + int err, idx, i;
>> +
>> + mshv_debugfs_partition = debugfs_create_dir("partition",
>> + mshv_debugfs);
>> + if (IS_ERR(mshv_debugfs_partition))
>> + return PTR_ERR(mshv_debugfs_partition);
>> +
>> + err = partition_debugfs_create(hv_current_partition_id,
>> + &vp_dir,
>> + &parent_partition_stats,
>> + mshv_debugfs_partition);
>> + if (err)
>> + goto remove_debugfs_partition;
>> +
>> + parent_vp_stats = kcalloc(num_possible_cpus(),
>
> num_possible_cpus() should not be used to allocate an array that is
> then indexed by the Linux CPU number. Use nr_cpu_ids instead when
> allocating the array. See commit 16b18fdf6bc7 for the full explanation.
> As explained in that commit message, using num_possible_cpus()
> doesn't break things now, but it might in the future.
>
Thanks, will do
>> + sizeof(*parent_vp_stats),
>> + GFP_KERNEL);
>> + if (!parent_vp_stats) {
>> + err = -ENOMEM;
>> + goto remove_debugfs_partition;
>> + }
>> +
>> + for_each_online_cpu(idx) {
>> + err = parent_vp_debugfs_create(hv_vp_index[idx],
>> + &parent_vp_stats[idx],
>> + vp_dir);
>> + if (err)
>> + goto remove_debugfs_partition_vp;
>> + }
>> +
>> + return 0;
>> +
>> +remove_debugfs_partition_vp:
>> + for_each_online_cpu(i) {
>> + if (i >= idx)
>> + break;
>> + parent_vp_debugfs_remove(i, parent_vp_stats[i]);
>> + }
>> + partition_debugfs_remove(hv_current_partition_id,
>> + parent_partition_stats);
>> +
>> + kfree(parent_vp_stats);
>> + parent_vp_stats = NULL;
>> + parent_partition_stats = NULL;
>> +
>> +remove_debugfs_partition:
>> + debugfs_remove_recursive(mshv_debugfs_partition);
>> + mshv_debugfs_partition = NULL;
>> + return err;
>> +}
>> +
>> +static int hv_stats_show(struct seq_file *m, void *v)
>> +{
>> + const struct hv_stats_page *stats = m->private;
>> + struct hv_counter_entry *entry = hv_hypervisor_counters;
>> + int i;
>> +
>> + for (i = 0; i < ARRAY_SIZE(hv_hypervisor_counters); i++, entry++)
>> + seq_printf(m, "%-25s: %llu\n", entry->name,
>> + stats->data[entry->idx]);
>> +
>> + return 0;
>> +}
>> +DEFINE_SHOW_ATTRIBUTE(hv_stats);
>> +
>> +static void mshv_hv_stats_unmap(void)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .hv.stats_area_type = HV_STATS_AREA_SELF,
>> + };
>> + int err;
>> +
>> + err = hv_unmap_stats_page(HV_STATS_OBJECT_HYPERVISOR, NULL, &identity);
>> + if (err)
>> + pr_err("%s: failed to unmap hypervisor stats: %d\n",
>> + __func__, err);
>> +}
>> +
>> +static void * __init mshv_hv_stats_map(void)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .hv.stats_area_type = HV_STATS_AREA_SELF,
>> + };
>> + struct hv_stats_page *stats;
>> + int err;
>> +
>> + err = hv_map_stats_page(HV_STATS_OBJECT_HYPERVISOR, &identity, &stats);
>> + if (err) {
>> + pr_err("%s: failed to map hypervisor stats: %d\n",
>> + __func__, err);
>> + return ERR_PTR(err);
>> + }
>> + return stats;
>> +}
>> +
>> +static int __init mshv_debugfs_hv_stats_create(struct dentry *parent)
>> +{
>> + struct dentry *dentry;
>> + u64 *stats;
>> + int err;
>> +
>> + stats = mshv_hv_stats_map();
>> + if (IS_ERR(stats))
>> + return PTR_ERR(stats);
>> +
>> + dentry = debugfs_create_file("stats", 0400, parent,
>> + stats, &hv_stats_fops);
>> + if (IS_ERR(dentry)) {
>> + err = PTR_ERR(dentry);
>> + pr_err("%s: failed to create hypervisor stats dentry: %d\n",
>> + __func__, err);
>> + goto unmap_hv_stats;
>> + }
>> +
>> + mshv_lps_count = num_present_cpus();
>
> This method of setting mshv_lps_count, and the iteration through the lp_index
> in mshv_debugfs_lp_create() and mshv_debugfs_lp_remove(), seems risky. The
> lp_index gets passed to the hypervisor, so it must be the hypervisor's concept
> of the lp_index. Is that always guaranteed to be the same as Linux's numbering
> of the present CPUs? There may be edge cases where it is not. For example, what
> if Linux in the root partition were booted with the "nosmt" kernel boot option,
> such that Linux ignores all the 2nd hyper-threads in a core? Could that create
> a numbering mismatch?
>
Ah, this was using the hypervisor stats page before; HvLogicalProcessors. But
I removed the enum, so I thought this would be a reasonable way to get the number
of LPs, but I think I'm mistaken.
For context, there is a fix to how LP and VP numbers are assigned in
hv_smp_prepare_cpus(), but it's part of a future patchset. That fix ensures the
LP indices are dense. The code looks like:
/* create dense LPs from 0-N for all apicids */
i = next_smallest_apicid(apicids, 0);
for (lpidx = 1; i != INT_MAX; lpidx++) {
node = __apicid_to_node[i];
if (node == NUMA_NO_NODE)
node = 0;
/* params: node num, lp index, apic id */
ret = hv_call_add_logical_proc(node, lpidx, i);
BUG_ON(ret);
i = next_smallest_apicid(apicids, i);
}
/* create a VP for each present CPU */
lpidx = 1; /* skip BSP cpu 0 */
for_each_present_cpu(i) {
if (i == 0)
continue;
/* params: node num, domid, vp index, lp index */
ret = hv_call_create_vp(numa_cpu_node(i),
hv_current_partition_id, lpidx, lpidx);
BUG_ON(ret);
lpidx++;
}
For what it's worth, with that fix^ I tested with "nosmt" and things worked as I
would expect: All LPs were displayed in debugfs, but every second LP was not in
use by Linux, as evidenced by e.g. the number of timer interrupts not going up:
LpTimerInterrupts : 1
Also, only every second VP was created (0, 2, 4, 6...) since the others aren't
in the present mask at boot.
> Note that for vp_index, we have the hv_vp_index[] array for translating from
> Linux's concept of a CPU number to Hyper-V's concept of vp_index. For
> example, mshv_debugfs_parent_partition_create() correctly goes through
> this translation. And presumably when the VMM code does the
> MSHV_CREATE_VP ioctl, it is passing in a hypervisor vp_index.
>
> Everything may work fine "as is" for the moment, but the lp functions here
> are still conflating the hypervisor's LP numbering with Linux's CPU numbering,
> and that seems like a recipe for trouble somewhere down the road. I'm
> not sure how the hypervisor interprets the "lp_index" part of the identity
> argument passed to a hypercall, so I'm not sure what the fix is.
>
The simplest thing for now might be to bring back that enum value
HvLogicalProcessors just for this one usage. I'll admit I'm not familiar with
all the nuances here so there are still probably edge cases here.
>> +
>> + return 0;
>> +
>> +unmap_hv_stats:
>> + mshv_hv_stats_unmap();
>> + return err;
>> +}
>> +
<snip>
Powered by blists - more mailing lists