Message-ID: <506eaa3d-be84-c51e-3252-2979847054fe@redhat.com>
Date: Wed, 8 Jun 2022 14:16:45 -0400
From: Waiman Long <longman@...hat.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: Tejun Heo <tj@...nel.org>, Jens Axboe <axboe@...nel.dk>,
cgroups@...r.kernel.org, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, Ming Lei <ming.lei@...hat.com>
Subject: Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
On 6/8/22 12:57, Michal Koutný wrote:
> Hello.
>
> On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long <longman@...hat.com> wrote:
>> As it is likely that not all the percpu blkg_iostat_set's have been
>> updated since the last flush, those stale blkg_iostat_set's don't need
>> to be flushed in this case.
> Yes, there's no point in flushing stats for idle devices if there can be
> many of them. Good idea.
>
>> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
>> +{
>> + return xchg(&lhead->first, &llist_last);
>> +}
>> +
>> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
>> +{
>> + struct llist_node *next = READ_ONCE(lnode->next);
>> + struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
>> + lnode)->blkg;
>> +
>> + WRITE_ONCE(lnode->next, NULL);
>> + percpu_ref_put(&blkg->refcnt);
>> + return next;
>> +}
> Idea/just asking: would it make sense to generalize this into llist.c
> (this is basically llist_del_first() + llist_del_all() with a sentinel)?
> For the sake of reusability.
I have thought about that. It can be done as a follow-up patch to add a
sentinel version into llist and use that instead. Of course, I can also
update this patchset to include that.
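
For illustration only, here is a minimal userspace sketch of what a sentinel-terminated lockless list could look like. It uses C11 atomics in place of the kernel's xchg()/cmpxchg(), and every name in it (snode, shead, slist_*) is hypothetical and is neither from llist.c nor from this patchset. The point it shows is that, with a sentinel at the tail, next == NULL can double as an "is this node queued?" flag. Unlike the per-cpu code in the patch, the sketch claims the node with a compare-and-swap, which is an assumption made so that it is safe without disabling interrupts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for llist_node / llist_head. */
struct snode {
	_Atomic(struct snode *) next;	/* NULL means "not on any list" */
};

struct shead {
	_Atomic(struct snode *) first;
};

/* Sentinel terminating every list, so a queued node never has next == NULL. */
static struct snode slist_last;

static void slist_init(struct shead *h)
{
	atomic_init(&h->first, &slist_last);
}

/* Queue the node unless it is already queued; returns 1 if it was added. */
static int slist_add_once(struct snode *n, struct shead *h)
{
	struct snode *expected = NULL;
	struct snode *first = atomic_load(&h->first);

	/* Claim the node: only one caller can move next away from NULL. */
	if (!atomic_compare_exchange_strong(&n->next, &expected, first))
		return 0;

	/* Lock-free push; on failure 'first' is reloaded and next refreshed. */
	while (!atomic_compare_exchange_weak(&h->first, &first, n))
		atomic_store(&n->next, first);
	return 1;
}

/* Detach the whole list, leaving only the sentinel behind. */
static struct snode *slist_del_all(struct shead *h)
{
	return atomic_exchange(&h->first, &slist_last);
}

/* Return the successor and mark the node as off-list again. */
static struct snode *slist_unlink_next(struct snode *n)
{
	return atomic_exchange(&n->next, NULL);
}

/* Detach everything and count the nodes that were queued. */
static int slist_drain(struct shead *h)
{
	int count = 0;
	struct snode *n = slist_del_all(h);

	while (n != &slist_last) {
		n = slist_unlink_next(n);
		count++;
	}
	return count;
}

/* Single-threaded walk-through; returns the number of nodes drained. */
static int slist_demo(void)
{
	static struct snode a, b;	/* zero-initialized: not queued */
	struct shead h;
	int added;

	slist_init(&h);
	added = slist_add_once(&a, &h) + slist_add_once(&b, &h);
	added += slist_add_once(&a, &h);	/* duplicate, rejected */
	assert(added == 2);
	return slist_drain(&h);
}
```

After slist_drain() each node's next pointer is NULL again, so the same node can be queued on a later update.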
>
>> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt) \
>> + for (; (node != &llist_last) && \
>> + (pos = llist_entry(node, struct blkg_iostat_set, lnode), \
>> + nxt = fetch_delete_lnode_next(node), true); \
>> + node = nxt)
>> +
> It's good hygiene to parenthesize the args.
I am aware of that. I will certainly add that if it is a generic macro
that can have many users.
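
As a small illustration of the hygiene concern, the sketch below uses two made-up macros (NOT_LAST_BAD and NOT_LAST_OK, not part of the patch) and plain integers instead of list nodes. It shows how an unparenthesized argument can change the meaning of the comparison when the caller passes an expression such as a conditional.

```c
#include <assert.h>

/* Hypothetical macros for illustration; neither is from the patch. */
#define NOT_LAST_BAD(node, last)	(node != last)
#define NOT_LAST_OK(node, last)		((node) != (last))

/* Integers stand in for node pointers to keep the example short. */
static int not_last_bad(int use_first)
{
	int x = 5, y = 7, last = 5;

	/* Expands to (use_first ? x : y != last),
	 * which parses as use_first ? x : (y != last). */
	return NOT_LAST_BAD(use_first ? x : y, last);
}

static int not_last_ok(int use_first)
{
	int x = 5, y = 7, last = 5;

	/* Expands to ((use_first ? x : y) != (last)), as intended. */
	return NOT_LAST_OK(use_first ? x : y, last);
}
```

With use_first set, the unparenthesized form yields x itself (nonzero, so "not last") even though x equals last, while the parenthesized form yields 0.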
>
>> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>> }
>> bis->cur.ios[rwd]++;
>>
>> + if (!READ_ONCE(bis->lnode.next)) {
>> + struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> + llist_add(&bis->lnode, lhead);
>> + percpu_ref_get(&bis->blkg->refcnt);
>> + }
>> +
> When a blkg's cgroup is rmdir'd, what happens with the lhead list?
> We have cgroup_rstat_exit() in css_free_rwork_fn() that ultimately flushes rstats.
> init_and_link_css, however, adds a reference from blkcg->css to cgroup->css.
> The blkcg->css would be (transitively) pinned by the lhead list and
> hence would prevent the final flush (when refs drop to zero). Seems like
> a cyclic dependency.
>
> Luckily, there's also per-subsys flushing in css_release which could be
> moved after rmdir (offlining) but before last ref is gone:
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..d830e6a8fb3b 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5165,11 +5165,6 @@ static void css_release_work_fn(struct work_struct *work)
>
> if (ss) {
> /* css release path */
> - if (!list_empty(&css->rstat_css_node)) {
> - cgroup_rstat_flush(cgrp);
> - list_del_rcu(&css->rstat_css_node);
> - }
> -
> cgroup_idr_replace(&ss->css_idr, NULL, css->id);
> if (ss->css_released)
> ss->css_released(css);
> @@ -5279,6 +5274,11 @@ static void offline_css(struct cgroup_subsys_state *css)
> css->flags &= ~CSS_ONLINE;
> RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
>
> + if (!list_empty(&css->rstat_css_node)) {
> + cgroup_rstat_flush(css->cgroup);
> + list_del_rcu(&css->rstat_css_node);
> + }
> +
> wake_up_all(&css->cgroup->offline_waitq);
> }
>
> (not tested)
Good point.
Your change may not be enough, since there could be an update after the
flush, which would pin the blkg, and hence the blkcg, again. I guess one
possible solution may be to abandon the llist and revert to list iteration
when offline. I need to think a bit more about that.
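
The concern can be pictured with a toy userspace model. It is only a sketch under simplifying assumptions: a plain integer stands in for the percpu refcount, a boolean stands in for list membership, and all names (model_blkg, model_update, model_flush, model_late_update) are invented for the example. It shows that an update arriving after the offline flush leaves a reference behind, and that skipping the queueing once offline, roughly in the spirit of falling back to list iteration, lets the count reach zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct model_blkg {
	int refcnt;	/* stands in for the percpu refcount */
	bool queued;	/* stands in for lnode.next != NULL */
	bool online;
};

/* Stat update: queue the entry and pin the blkg if not queued yet. */
static void model_update(struct model_blkg *g, bool skip_when_offline)
{
	if (skip_when_offline && !g->online)
		return;		/* rely on full iteration instead */
	if (!g->queued) {
		g->queued = true;
		g->refcnt++;
	}
}

/* Flush: dequeue the entry and drop the reference taken at queue time. */
static void model_flush(struct model_blkg *g)
{
	if (g->queued) {
		g->queued = false;
		g->refcnt--;
	}
}

/* Returns the refcount left after the owner drops its own reference. */
static int model_late_update(bool skip_when_offline)
{
	struct model_blkg g = { .refcnt = 1, .queued = false, .online = true };

	model_update(&g, skip_when_offline);	/* I/O while online */
	g.online = false;			/* cgroup is rmdir'd */
	model_flush(&g);			/* flush at offline time */
	model_update(&g, skip_when_offline);	/* late update after the flush */
	g.refcnt--;				/* owner drops the base reference */
	return g.refcnt;
}
```

In this model a result of 1 means the entry stays pinned with no further flush scheduled to release it, and a result of 0 means the final release can proceed.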
>
>
>> u64_stats_update_end_irqrestore(&bis->sync, flags);
>> if (cgroup_subsys_on_dfl(io_cgrp_subsys))
>> - cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
>> + cgroup_rstat_updated(blkcg->css.cgroup, cpu);
> Maybe bundle the lhead list maintenance with cgroup_rstat_updated() under
> cgroup_subsys_on_dfl()? The stats can be read on v1 anyway.
I don't quite understand this. The change is not specific to v1 or v2.
What do you mean by the stats being readable on v1?
Cheers,
Longman