linux-kernel - Re: "statsfs" API design

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191109154952.GA1365674@kroah.com>
Date:   Sat, 9 Nov 2019 16:49:52 +0100
From:   Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     KVM list <kvm@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Alex Williamson <alex.williamson@...hat.com>,
        Peter Feiner <pfeiner@...gle.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: "statsfs" API design

On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
> 
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs.  Reading need not be particularly lightweight, but writing
> must be fast.  Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs.  However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the DRM
patches recently for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this.  Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs.  Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API.  I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
> 
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
> 
> A value represents a single quantity that is gathered by the statsfs
> client.  It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
> 
> Values are described by a struct like this one:
> 
> 	struct statsfs_value {
> 		const char *name;
> 		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
> 		u16 aggr_kind;		/* Bitmask with zero or more of
> 					 * STAT_AGGR_{MIN,MAX,SUM,...}
> 					 */
> 		u16 mode;		/* File mode */
> 		int offset;		/* Offset from base address
> 					 * to field containing the value
> 					 */
> 	};
> 
> As you can see, values are basically integers stored somewhere in a
> struct.   The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.

What can userspace do with that info?

> Sources form the bulk of the statsfs API.  They can include two kinds of
> elements:
> 
> - values as described above.  The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
> 
> - subordinate sources
> 
> Adding a subordinate source has two effects:
> 
> - it creates a subdirectory for each subordinate source
> 
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too.  If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated.  That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
> 
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
> 
> Sources are manipulated with these four functions:
> 
> 	struct statsfs_source *statsfs_source_create(const char *fmt,
> 						     ...);
> 	void statsfs_source_add_values(struct statsfs_source *source,
> 				       struct statsfs_value *stat,
> 				       int n, void *ptr);
> 	void statsfs_source_add_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 	void statsfs_source_remove_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
> 
> 	void statsfs_source_get(struct statsfs_source *);
> 	void statsfs_source_put(struct statsfs_source *);
> 
> Finally,
> 
> 	void statsfs_source_register(struct statsfs_source *source);
> 
> lets you create a toplevel statsfs directory.
> 
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
> 
> /* Globals */
> 	struct statsfs_value vcpu_stats[] = ...;
> 	struct statsfs_value vm_stats[] = ...;
> 	static struct statsfs_source *kvm_source;
> 
> /* On module creation */
> 	kvm_source = statsfs_source_create("kvm");
> 	statsfs_source_register(kvm_source);
> 
> /* On VM creation */
> 	kvm->src = statsfs_source_create("%d-%d\n",
> 				         task_pid_nr(current), fd);
> 	statsfs_source_add_values(kvm->src, vm_stats,
> 				  ARRAY_SIZE(vm_stats),
> 				  &kvm->stats);
> 	statsfs_source_add_subordinate(kvm_source, kvm->src);
> 
> /* On vCPU creation */
> 	vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
> 	statsfs_source_add_values(vcpu_src, vcpu_stats,
> 				  ARRAY_SIZE(vcpu_stats),
> 				  &vcpu->stats);
> 	statsfs_source_add_subordinate(kvm->src, vcpu_src);
> 	/*
> 	 * No need to keep the vcpu_src around since there's no
> 	 * separate vCPU deletion event; rely on refcount
> 	 * exclusively.
> 	 */
> 	statsfs_source_put(vcpu_src);
> 
> /* On VM deletion */
> 	statsfs_source_remove_subordinate(kvm_source, kvm->src);
> 	statsfs_source_put(kvm->src);
> 
> /* On KVM exit */
> 	statsfs_source_put(kvm_source);
> 
> How does this look?

Where does the actual values get changed that get reflected in the
filesystem?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independant of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace?  The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files and tough to get "at this moment in
time these are all the different values" snapshots easily.  How will
this be addressed here?

thanks,

greg k-h