linux-kernel - Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <ZCSL/OBkpw5EClUu@cmpxchg.org>
Date:   Wed, 29 Mar 2023 15:05:32 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Domenico Cerasuolo <cerasuolodomenico@...il.com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        surenb@...gle.com, brauner@...nel.org, chris@...isdown.name
Subject: Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s
 period

On Wed, Mar 29, 2023 at 08:32:12PM +0200, Domenico Cerasuolo wrote:
> On Wed, Mar 29, 2023 at 6:13 PM Johannes Weiner <hannes@...xchg.org> wrote:
> 
> > On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> > > PSI offers 2 mechanisms to get information about a specific resource
> > > pressure. One is reading from /proc/pressure/<resource>, which gives
> > > average pressures aggregated every 2s. The other is creating a pollable
> > > fd for a specific resource and cgroup.
> > >
> > > The trigger creation requires CAP_SYS_RESOURCE, and gives the
> > > possibility to pick specific time window and threshold, spawing an RT
> > > thread to aggregate the data.
> > >
> > > Systemd would like to provide containers the option to monitor pressure
> > > on their own cgroup and sub-cgroups. For example, if systemd launches a
> > > container that itself then launches services, the container should have
> > > the ability to poll() for pressure in individual services. But neither
> > > the container nor the services are privileged.
> > >
> > > This patch implements a mechanism to allow unprivileged users to create
> > > pressure triggers. The difference with privileged triggers creation is
> > > that unprivileged ones must have a time window that's a multiple of 2s.
> > > This is so that we can avoid unrestricted spawning of rt threads, and
> > > use instead the same aggregation mechanism done for the averages, which
> > > runs independently of any triggers.
> > >
> > > Suggested-by: Johannes Weiner <hannes@...xchg.org>
> > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@...il.com>
> >
> > Overall it looks good to me. Thanks for adding the comment on the
> > privilege check, it's much easier to understand now.
> >
> > A few nitpicks below:
> >
> > > @@ -151,6 +151,9 @@ struct psi_trigger {
> > >
> > >       /* Deferred event(s) from previous ratelimit window */
> > >       bool pending_event;
> > > +
> > > +     /* Used to differentiate destruction action*/
> > > +     enum psi_aggregators aggregator;
> > >  };
> >
> > The comment is a bit mysterious. How about
> >
> >         /* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
> >
> > > @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> > >               seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > >       group->avg_last_update = sched_clock();
> > >       group->avg_next_update = group->avg_last_update + psi_period;
> > > -     INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >       mutex_init(&group->avgs_lock);
> > > -     /* Init trigger-related members */
> > > +
> > > +     /* Init rtpoll trigger-related members */
> > >       atomic_set(&group->rtpoll_scheduled, 0);
> > >       mutex_init(&group->rtpoll_trigger_lock);
> > >       INIT_LIST_HEAD(&group->rtpoll_triggers);
> > > @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> > >       init_waitqueue_head(&group->rtpoll_wait);
> > >       timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> > >       rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +
> > > +     /* Init avg trigger-related members */
> > > +     INIT_LIST_HEAD(&group->avg_triggers);
> > > +     memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> > > +     INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >  }
> >
> > Can you move those above the rtpoll inits?
> >
> > It helps with navigating the code and spotting missing inits when the
> > init sequence follows the order of the struct members.
> >
> > > @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win,
> > u64 now, u64 value)
> > >       return growth;
> > >  }
> > >
> > > -static u64 update_triggers(struct psi_group *group, u64 now, bool
> > *update_total)
> > > +static u64 update_triggers(struct psi_group *group, u64 now, bool
> > *update_total,
> > > +                                                enum psi_aggregators
> > aggregator)
> > >  {
> > >       struct psi_trigger *t;
> > > -     u64 *total = group->total[PSI_POLL];
> > > +     u64 *total = group->total[aggregator];
> > > +     struct list_head *triggers = aggregator == PSI_AVGS ?
> > &group->avg_triggers
> > > +                     : &group->rtpoll_triggers;
> > > +     u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total
> > : group->rtpoll_total;
> > >       *update_total = false;
> >
> > These lines are a bit too long. When the init part gets that long,
> > it's preferable to move it outside of the decl block:
> >
> >         if (aggregator == PSI_AVGS) {
> >                 triggers = &group->avg_triggers;
> >                 aggregator_total = group->avg_total;
> >         } else {
> >                 triggers = &group->rtpoll_triggers;
> >                 aggregator_total = group->rtpoll_total;
> >         }
> >
> > > @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >               u64 period = ULLONG_MAX;
> > >
> > >               list_del(&t->node);
> > > -             group->rtpoll_nr_triggers[t->state]--;
> > > -             if (!group->rtpoll_nr_triggers[t->state])
> > > -                     group->rtpoll_states &= ~(1 << t->state);
> > > -             /* reset min update period for the remaining triggers */
> > > -             list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > -                     period = min(period, div_u64(tmp->win.size,
> > > -                                     UPDATES_PER_WINDOW));
> > > -             group->rtpoll_min_period = period;
> > > -             /* Destroy rtpoll_task when the last trigger is destroyed
> > */
> > > -             if (group->rtpoll_states == 0) {
> > > -                     group->rtpoll_until = 0;
> > > -                     task_to_destroy = rcu_dereference_protected(
> > > -                                     group->rtpoll_task,
> > > -
> >  lockdep_is_held(&group->rtpoll_trigger_lock));
> > > -                     rcu_assign_pointer(group->rtpoll_task, NULL);
> > > -                     del_timer(&group->rtpoll_timer);
> > > +             if (t->aggregator == PSI_AVGS) {
> > > +                     group->avg_nr_triggers[t->state]--;
> > > +             } else {
> > > +                     group->rtpoll_nr_triggers[t->state]--;
> > > +                     if (!group->rtpoll_nr_triggers[t->state])
> > > +                             group->rtpoll_states &= ~(1 << t->state);
> > > +                     /* reset min update period for the remaining
> > triggers */
> > > +                     list_for_each_entry(tmp, &group->rtpoll_triggers,
> > node)
> > > +                             period = min(period, div_u64(tmp->win.size,
> > > +                                             UPDATES_PER_WINDOW));
> > > +                     group->rtpoll_min_period = period;
> > > +                     /* Destroy rtpoll_task when the last trigger is
> > destroyed */
> > > +                     if (group->rtpoll_states == 0) {
> > > +                             group->rtpoll_until = 0;
> > > +                             task_to_destroy =
> > rcu_dereference_protected(
> > > +                                             group->rtpoll_task,
> > > +
> >  lockdep_is_held(&group->rtpoll_trigger_lock));
> > > +                             rcu_assign_pointer(group->rtpoll_task,
> > NULL);
> > > +                             del_timer(&group->rtpoll_timer);
> >
> > These lines are quite long too, we usually shoot for a line length of
> > 80 characters. Can you do
> >
> >                 if (t->aggregator == PSI_AVGS) {
> >                         group->avg_nr_triggers[t->state]--;
> >                         return;
> >                 }
> >
> >                 /* Else, it's an rtpoll trigger */
> >                 group->rtpoll_nr_triggers[t->state]--;
> >                 ...
> >
> Can't return there I think, the function doesn't end after the else branch,
> should I put a `goto out` instead to jump the rtpoll code?

You're right, I missed the bottom part beyond the diff.

Looking closer, I think trigger_create and trigger_destroy are
actually buggy. They have to protect against update_trigger(), so both
creation and destruction must take the right lock - avgs_lock or
rtpoll_trigger_lock - before modifying the list. They're both taking
only the rtpoll_trigger_lock right now.

IOW the trigger type distinction needs to be higher up in general.

	if (t->aggregator == PSI_AVGS) {
		mutex_lock(&group->avgs_lock);
		...
	} else {
		mutex_lock(&group->rtpoll_trigger_lock);
		...
	}