Message-ID: <20250226-portieren-staudamm-10823e224307@brauner>
Date: Wed, 26 Feb 2025 13:14:32 +0100
From: Christian Brauner <brauner@...nel.org>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: Oleg Nesterov <oleg@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>, "Michael S. Tsirkin" <mst@...hat.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>, linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [GIT PULL] KVM changes for Linux 6.14
On Wed, Feb 05, 2025 at 12:49:30PM +0100, Christian Brauner wrote:
> On Tue, Feb 04, 2025 at 05:05:06PM +0100, Paolo Bonzini wrote:
> > On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@...nel.org> wrote:
> > >
> > > On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> > > > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@...hat.com> wrote:
> > > > > On 01/26, Linus Torvalds wrote:
> > > > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@...hat.com> wrote:
> > > > > > >
> > > > > > > I don't think we even need to detect the /proc/self/ or /proc/thread-self/
> > > > > > > case, next_tid() can just check same_thread_group,
> > > > > >
> > > > > > That was my thinking yes.
> > > > > >
> > > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > > > > hide it from some management tool and be used for nefarious purposes
> > > > >
> > > > > Agreed,
> > > > >
> > > > > > (even if they then show up elsewhere that the tool wouldn't look at).
> > > > >
> > > > > Even if we move them from /proc/*/task to /proc ?
> > > >
> > > > Indeed---as long as they show up somewhere, it's not worse than it
> > > > used to be. The reason why I'd prefer them to stay in /proc/*/task is
> > > > that moving them away at least partly negates the benefits of the
> > > > "workers are children of the starter" model. For example it
> > > > complicates measuring their cost within the process that runs the VM.
> > > > Maybe it's more of a romantic thing than a real practical issue,
> > > > because in the real world resource accounting for VMs is done via
> > > > cgroups. But unlike the lazy creation in KVM, which is overall pretty
> > > > self-contained, I am afraid the ugliness in procfs would be much worse
> > > > compared to the benefit, if there's a benefit at all.
> > > >
> > > > > Perhaps, I honestly do not know what will/can confuse userspace more.
> > > >
> > > > At the very least, marking workers as "Kthread: 1" makes sense and
>
> You mean in /proc/<pid>/status? Yeah, we can do that. This expands the
> definition of Kthread a bit. It would now mean anything that the kernel
> spawned for userspace. But that is probably fine.
>
> But it won't help with the problem of just checking /proc/<pid>/task/ to
> figure out whether the caller is single-threaded or not. If the caller
> has more than 1 entry in there they need to walk through all of them and
> parse through /proc/<pid>/status to discount them if they're kernel
> threads.
>
> > > > should not cause too much confusion. I wouldn't go beyond that unless
> > > > we get more reports of similar issues, and I'm not even sure how
> > > > common it is for userspace libraries to check for single-threadedness.
> > >
> > > Sorry, just saw this thread now.
> > >
> > > What if we did what Linus suggests and hide (odd) user workers from
> > > /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
> > > would only list the workers that got spawned by the kernel for that
> > > particular task? This would acknowledge their somewhat special status
> > > and allow userspace to still detect them as "Hey, I didn't actually
> > > spawn those, they got shoved into my workload by the kernel for me.".
> >
> > Wouldn't the workers then disappear completely from ps, top or other
> > tools that look at /proc/$PID/task? That seems a bit too underhanded
> > towards userspace...
>
> So maybe, but then there's also the possibility to do:
>
> - Have /proc/<pid>/status list all tasks.
> - Have /proc/<pid>/worker only list user workers spawned by the kernel for userspace.
>
> count(/proc/<pid>/status) - count(/proc/<pid>/worker) == 1 => (userspace) single threaded
>
> My wider point is that I would prefer we add something that is
> consistent and doesn't have to give the caller a different view than a
> foreign task. I think that will just create confusion in the long run.
So what I had in mind is the rough draft below (quickly sketched). It
unconditionally skips PF_USER_WORKER tasks in /proc/<pid>/task and lists
them only in /proc/<pid>/worker.
fs/proc/base.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 121 insertions(+), 5 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd89e956c322..60e6b2cea259 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3315,10 +3315,13 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
  * Thread groups
  */
 static const struct file_operations proc_task_operations;
+static const struct file_operations proc_worker_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct inode_operations proc_worker_inode_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
+	DIR("worker",     S_IRUGO|S_IXUGO, proc_worker_inode_operations, proc_worker_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
@@ -3835,11 +3838,14 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	fs_info = proc_sb_info(dentry->d_sb);
 	ns = fs_info->pid_ns;
-	rcu_read_lock();
-	task = find_task_by_pid_ns(tid, ns);
-	if (task)
-		get_task_struct(task);
-	rcu_read_unlock();
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (task->flags & PF_USER_WORKER)
+				goto out;
+			get_task_struct(task);
+		}
+	}
 	if (!task)
 		goto out;
 	if (!same_thread_group(leader, task))
@@ -3949,7 +3955,10 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	tid = (int)(intptr_t)file->private_data;
 	file->private_data = NULL;
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
 	     task;
 	     task = next_tid(task), ctx->pos++) {
 		char name[10 + 1];
 		unsigned int len;
+
+		if (task->flags & PF_USER_WORKER)
+			continue; /* only listed in /proc/<pid>/worker */
@@ -3987,6 +3993,100 @@ static int proc_task_getattr(struct mnt_idmap *idmap,
 	return 0;
 }
 
+static struct dentry *
+proc_worker_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	struct task_struct *task;
+	struct task_struct *leader = get_proc_task(dir);
+	unsigned tid;
+	struct proc_fs_info *fs_info;
+	struct pid_namespace *ns;
+	struct dentry *result = ERR_PTR(-ENOENT);
+
+	if (!leader)
+		goto out_no_task;
+
+	tid = name_to_int(&dentry->d_name);
+	if (tid == ~0U)
+		goto out;
+
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (!(task->flags & PF_USER_WORKER))
+				goto out;
+			get_task_struct(task);
+		}
+	}
+	if (!task)
+		goto out;
+	if (!same_thread_group(leader, task))
+		goto out_drop_task;
+
+	result = proc_task_instantiate(dentry, task, NULL);
+out_drop_task:
+	put_task_struct(task);
+out:
+	put_task_struct(leader);
+out_no_task:
+	return result;
+}
+
+static int proc_worker_getattr(struct mnt_idmap *idmap, const struct path *path,
+			       struct kstat *stat, u32 request_mask,
+			       unsigned int query_flags)
+{
+	generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(path->dentry), stat);
+	return 0;
+}
+
+static int proc_worker_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *task;
+	struct pid_namespace *ns;
+	int tid;
+
+	if (proc_inode_is_dead(inode))
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	/* We cache the tid value that the last readdir call couldn't
+	 * return and lseek resets it to 0.
+	 */
+	ns = proc_pid_ns(inode->i_sb);
+	tid = (int)(intptr_t)file->private_data;
+	file->private_data = NULL;
+	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
+	     task;
+	     task = next_tid(task), ctx->pos++) {
+		char name[10 + 1];
+		unsigned int len;
+
+		if (!(task->flags & PF_USER_WORKER))
+			continue; /* regular threads are listed in /proc/<pid>/task */
+
+		tid = task_pid_nr_ns(task, ns);
+		if (!tid)
+			continue; /* The task has just exited. */
+		len = snprintf(name, sizeof(name), "%u", tid);
+		if (!proc_fill_cache(file, ctx, name, len,
+				     proc_task_instantiate, task, NULL)) {
+			/* returning this tid failed, save it as the first
+			 * tid for the next readdir call */
+			file->private_data = (void *)(intptr_t)tid;
+			put_task_struct(task);
+			break;
+		}
+	}
+
+	return 0;
+}
+
/*
* proc_task_readdir() set @file->private_data to a positive integer
* value, so casting that to u64 is safe. generic_llseek_cookie() will
@@ -4005,6 +4102,19 @@ static loff_t proc_dir_llseek(struct file *file, loff_t offset, int whence)
return off;
}
+static const struct inode_operations proc_worker_inode_operations = {
+	.lookup		= proc_worker_lookup,
+	.getattr	= proc_worker_getattr,
+	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
+};
+
+static const struct file_operations proc_worker_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_worker_readdir,
+	.llseek		= proc_dir_llseek,
+};
+
 static const struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
 	.getattr	= proc_task_getattr,
--
2.47.2