linux-kernel - Re: [PATCH v2 2/2] blk-cgroup: Optimize blkcg_rstat

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ca091a5c-4ae1-e973-403e-4086d4527102@redhat.com>
Date:   Wed, 1 Jun 2022 14:15:46 -0400
From:   Waiman Long <longman@...hat.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Jens Axboe <axboe@...nel.dk>, cgroups@...r.kernel.org,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        Ming Lei <ming.lei@...hat.com>
Subject: Re: [PATCH v2 2/2] blk-cgroup: Optimize blkcg_rstat_flush()


On 6/1/22 13:48, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 01, 2022 at 12:53:24PM -0400, Waiman Long wrote:
>> +static struct llist_node llist_last;	/* Last sentinel node of llist */
> Can you please add comment explaining why we need the special sentinel and
> empty helper?

It was mentioned in the commit log, but I will add a comment to repeat 
that. It is because lnode.next is used as a flag to indicate its 
presence in the lockless list. By default, the first one that go into 
the lockless list will have a NULL value in its next pointer. So I have 
to put a sentinel node that to make sure that the next pointer is always 
non-NULL.


>
>> +static inline bool blkcg_llist_empty(struct llist_head *lhead)
>> +{
>> +	return lhead->first == &llist_last;
>> +}
>> +
>> +static inline void init_blkcg_llists(struct blkcg *blkcg)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
>> +}
>> +
>> +static inline struct llist_node *
>> +fetch_delete_blkcg_llist(struct llist_head *lhead)
>> +{
>> +	return xchg(&lhead->first, &llist_last);
>> +}
>> +
>> +/*
>> + * The retrieved blkg_iostat_set is immediately marked as not in the
>> + * lockless list by clearing its node->next pointer. It could be put
>> + * back into the list by a parallel update before the iostat's are
>> + * finally flushed. So being in the list doesn't always mean it has new
>> + * iostat's to be flushed.
>> + */
> Isn't the above true for any sort of mechanism which tracking pending state?
> You gotta clear the pending state before consuming so that you don't miss
> the events which happen while data is being consumed.

That is true. I was about thinking what race conditions can happen with 
these changes. The above comment is for the race that can happen which 
is benign. I am remove it if you think it is necessary.

>
>> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
>> +	for (; (node != &llist_last) &&					\
>> +	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
>> +		nxt = node->next, node->next = NULL, true);		\
>> +		node = nxt)
>> +
>>   /**
>>    * blkcg_css - find the current css
>>    *
> ...
>> @@ -852,17 +888,26 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
>>   static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>>   {
>>   	struct blkcg *blkcg = css_to_blkcg(css);
>> -	struct blkcg_gq *blkg;
>> +	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +	struct llist_node *lnode, *lnext;
>> +	struct blkg_iostat_set *bisc;
>>   
>>   	/* Root-level stats are sourced from system-wide IO stats */
>>   	if (!cgroup_parent(css->cgroup))
>>   		return;
>>   
>> -	rcu_read_lock();
>> +	if (blkcg_llist_empty(lhead))
>> +		return;
>>   
>> -	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
>> +	lnode = fetch_delete_blkcg_llist(lhead);
>> +
>> +	/*
>> +	 * No RCU protection is needed as it is assumed that blkg_iostat_set's
>> +	 * in the percpu lockless list won't go away until the flush is done.
>> +	 */
> Can you please elaborate on why this is safe?

You are right that the comment is probably not quite right. I will put 
the rcu_read_lock/unlock() back in. However, we don't have a rcu 
iterator for the lockless list. On the other hand, blkcg_rstat_flush() 
is now called with irq disabled. So rcu_read_lock() is not technically 
needed.

Will send out a v3 soon.

Thanks,
Longman

>
>> +	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
>> +		struct blkcg_gq *blkg = bisc->blkg;
>>   		struct blkcg_gq *parent = blkg->parent;
>> -		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
>>   		struct blkg_iostat cur, delta;
>>   		unsigned long flags;
>>   		unsigned int seq;
> Overall, looks fantastic to me. Thanks a lot for working on it.
>