linux-kernel - Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZCRjulHDZN3U5bUY@cmpxchg.org>
Date:   Wed, 29 Mar 2023 12:13:46 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Domenico Cerasuolo <cerasuolodomenico@...il.com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        surenb@...gle.com, brauner@...nel.org, chris@...isdown.name
Subject: Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s
 period

On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> PSI offers 2 mechanisms to get information about a specific resource
> pressure. One is reading from /proc/pressure/<resource>, which gives
> average pressures aggregated every 2s. The other is creating a pollable
> fd for a specific resource and cgroup.
> 
> The trigger creation requires CAP_SYS_RESOURCE, and gives the
> possibility to pick specific time window and threshold, spawing an RT
> thread to aggregate the data.
> 
> Systemd would like to provide containers the option to monitor pressure
> on their own cgroup and sub-cgroups. For example, if systemd launches a
> container that itself then launches services, the container should have
> the ability to poll() for pressure in individual services. But neither
> the container nor the services are privileged.
> 
> This patch implements a mechanism to allow unprivileged users to create
> pressure triggers. The difference with privileged triggers creation is
> that unprivileged ones must have a time window that's a multiple of 2s.
> This is so that we can avoid unrestricted spawning of rt threads, and
> use instead the same aggregation mechanism done for the averages, which
> runs independently of any triggers.
> 
> Suggested-by: Johannes Weiner <hannes@...xchg.org>
> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@...il.com>

Overall it looks good to me. Thanks for adding the comment on the
privilege check, it's much easier to understand now.

A few nitpicks below:

> @@ -151,6 +151,9 @@ struct psi_trigger {
>  
>  	/* Deferred event(s) from previous ratelimit window */
>  	bool pending_event;
> +
> +	/* Used to differentiate destruction action*/
> +	enum psi_aggregators aggregator;
>  };

The comment is a bit mysterious. How about

	/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */

> @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
>  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>  	group->avg_last_update = sched_clock();
>  	group->avg_next_update = group->avg_last_update + psi_period;
> -	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
>  	mutex_init(&group->avgs_lock);
> -	/* Init trigger-related members */
> +
> +	/* Init rtpoll trigger-related members */
>  	atomic_set(&group->rtpoll_scheduled, 0);
>  	mutex_init(&group->rtpoll_trigger_lock);
>  	INIT_LIST_HEAD(&group->rtpoll_triggers);
> @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
>  	init_waitqueue_head(&group->rtpoll_wait);
>  	timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
>  	rcu_assign_pointer(group->rtpoll_task, NULL);
> +
> +	/* Init avg trigger-related members */
> +	INIT_LIST_HEAD(&group->avg_triggers);
> +	memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> +	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
>  }

Can you move those above the rtpoll inits?

It helps with navigating the code and spotting missing inits when the
init sequence follows the order of the struct members.

> @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>  	return growth;
>  }
>  
> -static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total)
> +static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total,
> +						   enum psi_aggregators aggregator)
>  {
>  	struct psi_trigger *t;
> -	u64 *total = group->total[PSI_POLL];
> +	u64 *total = group->total[aggregator];
> +	struct list_head *triggers = aggregator == PSI_AVGS ? &group->avg_triggers
> +			: &group->rtpoll_triggers;
> +	u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;
>  	*update_total = false;

These lines are a bit too long. When the init part gets that long,
it's preferable to move it outside of the decl block:

	if (aggregator == PSI_AVGS) {
		triggers = &group->avg_triggers;
		aggregator_total = group->avg_total;
	} else {
		triggers = &group->rtpoll_triggers;
		aggregator_total = group->rtpoll_total;
	}

> @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
>  		u64 period = ULLONG_MAX;
>  
>  		list_del(&t->node);
> -		group->rtpoll_nr_triggers[t->state]--;
> -		if (!group->rtpoll_nr_triggers[t->state])
> -			group->rtpoll_states &= ~(1 << t->state);
> -		/* reset min update period for the remaining triggers */
> -		list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> -			period = min(period, div_u64(tmp->win.size,
> -					UPDATES_PER_WINDOW));
> -		group->rtpoll_min_period = period;
> -		/* Destroy rtpoll_task when the last trigger is destroyed */
> -		if (group->rtpoll_states == 0) {
> -			group->rtpoll_until = 0;
> -			task_to_destroy = rcu_dereference_protected(
> -					group->rtpoll_task,
> -					lockdep_is_held(&group->rtpoll_trigger_lock));
> -			rcu_assign_pointer(group->rtpoll_task, NULL);
> -			del_timer(&group->rtpoll_timer);
> +		if (t->aggregator == PSI_AVGS) {
> +			group->avg_nr_triggers[t->state]--;
> +		} else {
> +			group->rtpoll_nr_triggers[t->state]--;
> +			if (!group->rtpoll_nr_triggers[t->state])
> +				group->rtpoll_states &= ~(1 << t->state);
> +			/* reset min update period for the remaining triggers */
> +			list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> +				period = min(period, div_u64(tmp->win.size,
> +						UPDATES_PER_WINDOW));
> +			group->rtpoll_min_period = period;
> +			/* Destroy rtpoll_task when the last trigger is destroyed */
> +			if (group->rtpoll_states == 0) {
> +				group->rtpoll_until = 0;
> +				task_to_destroy = rcu_dereference_protected(
> +						group->rtpoll_task,
> +						lockdep_is_held(&group->rtpoll_trigger_lock));
> +				rcu_assign_pointer(group->rtpoll_task, NULL);
> +				del_timer(&group->rtpoll_timer);

These lines are quite long too, we usually shoot for a line length of
80 characters. Can you do

		if (t->aggregator == PSI_AVGS) {
			group->avg_nr_triggers[t->state]--;
			return;
		}

		/* Else, it's an rtpoll trigger */
		group->rtpoll_nr_triggers[t->state]--;
		...

With that, please add

Acked-by: Johannes Weiner <hannes@...xchg.org>