Message-ID: <87sf1b6o9w.ffs@tglx>
Date: Thu, 29 Feb 2024 20:52:59 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, mingo@...nel.org
Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal
int
On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>> So but I just noticed that there is actually an issue with this:
>>
>>> unsigned int nr_iowait_cpu(int cpu)
>>> {
>>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>> + struct rq *rq = cpu_rq(cpu);
>>> +
>>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>
>> The access to rq->nr_iowait is not protected by the runqueue lock and
>> therefore a data race when @cpu is not the current CPU.
>>
>> This needs to be properly annotated and explained why it does not
>> matter.
>
> But that was always racy before as well, if someone else is inc/dec'ing
> ->nr_iowait while it's being read, you could get either the before or
> after value. This doesn't really change that. I could've sworn I
> mentioned that in the commit message, but I did not.
There are actually two issues here:

  1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
     is consistent in itself.

     Non-atomic inc/dec does not guarantee that a concurrent read sees a
     consistent value, as the compiler is free to do store/load
     tearing. Unlikely, but not guaranteed to never happen.

     KCSAN will complain about it sooner rather than later, and then
     someone has to go and do the analysis and the annotation (a sketch
     of what that could look like is below). I'd rather let you do the
     reasoning now than chase you down later :)
  2) What's worse is that the result can be completely bogus:

     i.e.

     CPU0                                CPU1               CPU2
     a = rq(CPU1)->nr_iowait;        // 0
                                         rq->nr_iowait++;
                                                            rq(CPU1)->nr_iowait_remote++;
     b = rq(CPU1)->nr_iowait_remote; // 1

     r = a - b;                      // -1
     return (unsigned int) r;        // UINT_MAX
     The consumers of this interface might be upset. :)

     While with a single atomic_t it's guaranteed that the result is
     always greater than or equal to zero.
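
To illustrate the annotation point in 1) above, a minimal sketch of what
the remote read could look like if the split stays (the write side would
then need WRITE_ONCE() for the increments/decrements):

	unsigned int nr_iowait_cpu(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/*
		 * Lockless read of a plain int which is modified under the
		 * runqueue lock on another CPU. READ_ONCE() prevents load
		 * tearing and documents the intentional data race for KCSAN.
		 */
		return READ_ONCE(rq->nr_iowait) - atomic_read(&rq->nr_iowait_remote);
	}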
>> So s/Reviewed-by/Un-Reviewed-by/
>>
>> Though thinking about it some more. Is this split a real benefit over
>> always using the atomic? Do you have numbers to show?
>
> It was more on Peter's complaint that now we're trading a single atomic
> for two, hence I got to thinking about nr_iowait in general. I don't
> have numbers showing it matters, as mentioned in another email the most
> costly part about this seems to be fetching task->in_iowait and not the
> actual atomic.
On the write side (except for the remote case) the cache line is already
dirty on the current CPU and I doubt that the atomic will be
noticeable. If there is concurrent remote access to the runqueue then the
cache line is bouncing no matter what.
On the read side there is always an atomic operation required, so it's
not really different.
I assume Peter's complaint was about the extra nr_iowait_acct part. I
think that's solvable without the extra atomic_t member and with a
single atomic_add()/sub(). atomic_t is 32bit wide, so what about
splitting the thing and adding/subtracting both in one go?
While sketching this I noticed that prepare/finish can be written w/o
any conditionals.
	int io_schedule_prepare(void)
	{
		/* Low bit: in_iowait, upper 16 bits: in_iowait_acct */
		int flags = current->in_iowait + (current->in_iowait_acct << 16);

		current->in_iowait = 1;
		current->in_iowait_acct = 1;
		blk_flush_plug(current->plug, true);
		return flags;
	}

	void io_schedule_finish(int flags)
	{
		current->in_iowait = flags & 0x01;
		current->in_iowait_acct = flags >> 16;
	}
Now __schedule():
	if (prev->in_iowait) {
		int x = 1 + (prev->in_iowait_acct << 16);

		atomic_add(x, &rq->nr_iowait);
		delayacct_blkio_start();
	}
and ttwu_do_activate():
	if (p->in_iowait) {
		int x = 1 + (p->in_iowait_acct << 16);

		delayacct_blkio_end(p);
		atomic_sub(x, &task_rq(p)->nr_iowait);
	}
and try_to_wake_up():
	int x = 1 + (p->in_iowait_acct << 16);

	delayacct_blkio_end(p);
	atomic_sub(x, &task_rq(p)->nr_iowait);
nr_iowait_acct_cpu() becomes:

	return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;

and nr_iowait_cpu():

	return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
Obviously written with proper inline wrappers and defines, but you get
the idea.
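
E.g. something along these lines (the helper name and defines are just
made up for illustration):

	#define NR_IOWAIT_ACCT_SHIFT	16
	#define NR_IOWAIT_MASK		((1 << NR_IOWAIT_ACCT_SHIFT) - 1)

	/* Value to add to/subtract from rq->nr_iowait for task @p */
	static inline int task_iowait_val(struct task_struct *p)
	{
		return p->in_iowait + (p->in_iowait_acct << NR_IOWAIT_ACCT_SHIFT);
	}

so that __schedule() and the wakeup paths boil down to
atomic_add(task_iowait_val(prev), &rq->nr_iowait) resp.
atomic_sub(task_iowait_val(p), &task_rq(p)->nr_iowait).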
Hmm?
Thanks,
tglx