Message-ID: <f1fc564c-528a-47d0-8b68-d596d6681eb5@kernel.dk>
Date: Thu, 29 Feb 2024 15:30:05 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, mingo@...nel.org
Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int
On 2/29/24 12:52 PM, Thomas Gleixner wrote:
> On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
>> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>>> But I just noticed that there is actually an issue with this:
>>>
>>>> unsigned int nr_iowait_cpu(int cpu)
>>>> {
>>>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>>> + struct rq *rq = cpu_rq(cpu);
>>>> +
>>>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>>
>>> The access to rq->nr_iowait is not protected by the runqueue lock and
>>> therefore a data race when @cpu is not the current CPU.
>>>
>>> This needs to be properly annotated and explained why it does not
>>> matter.
>>
>> But that was always racy before as well, if someone else is inc/dec'ing
>> ->nr_iowait while it's being read, you could get either the before or
>> after value. This doesn't really change that. I could've sworn I
>> mentioned that in the commit message, but I did not.
>
> There are actually two issues here:
>
> 1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
> is consistent in itself.
>
> Non-atomic inc/dec does not guarantee that a concurrent read sees a
> consistent value, as the compiler is free to do store/load
> tearing. Unlikely, but not guaranteed to never happen.
>
> KCSAN will complain about it sooner or later, and then someone has
> to go and do the analysis and the annotation. I'd rather have you do
> the reasoning now than chase you down later :)
Fair enough.
> 2) What's worse is that the result can be completely bogus:
>
> i.e.
>
>    CPU0                          CPU1               CPU2
>    a = rq(CPU1)->nr_iowait;                                  // 0
>                                  rq->nr_iowait++;
>                                                     rq(CPU1)->nr_iowait_remote++;
>    b = rq(CPU1)->nr_iowait_remote;                           // 1
>
>    r = a - b;                    // -1
>    return (unsigned int) r;      // UINT_MAX
>
> The consumers of this interface might be upset. :)
>
> While with a single atomic_t it's guaranteed that the result is
> always greater than or equal to zero.
Yeah OK, this is a real problem...
>>> So s/Reviewed-by/Un-Reviewed-by/
>>>
>>> Though thinking about it some more. Is this split a real benefit over
>>> always using the atomic? Do you have numbers to show?
>>
>> It was more about Peter's complaint that now we're trading a single
>> atomic for two, which got me thinking about nr_iowait in general. I
>> don't have numbers showing it matters; as mentioned in another email,
>> the most costly part about this seems to be fetching task->in_iowait,
>> not the actual atomic.
>
> On the write side (except for the remote case) the cache line is already
> dirty on the current CPU and I doubt that the atomic will be
> noticeable. If there is concurrent remote access to the runqueue then the
> cache line is bouncing no matter what.
That was my exact thinking too: it's the same cacheline, and back-to-back
atomics on it don't really matter vs a single atomic.
> On the read side there is always an atomic operation required, so it's
> not really different.
>
> I assume Peter's complaint was about the extra nr_iowait_acct part. I
> think that's solvable without the extra atomic_t member and with a
> single atomic_add()/sub(). atomic_t is 32bit wide, so what about
> splitting the thing and adding/subtracting both in one go?
>
> While sketching this I noticed that prepare/finish can be written w/o
> any conditionals.
>
> int io_schedule_prepare(void)
> {
> 	int flags = current->in_iowait + (current->in_iowait_acct << 16);
>
> 	current->in_iowait = 1;
> 	current->in_iowait_acct = 1;
> 	blk_flush_plug(current->plug, true);
> 	return flags;
> }
>
> void io_schedule_finish(int old_wait_flags)
> {
> 	current->in_iowait = old_wait_flags & 0x01;
> 	current->in_iowait_acct = old_wait_flags >> 16;
> }
>
> Now __schedule():
>
> 	if (prev->in_iowait) {
> 		int x = 1 + (prev->in_iowait_acct << 16);
>
> 		atomic_add(x, &rq->nr_iowait);
> 		delayacct_blkio_start();
> 	}
>
> and ttwu_do_activate():
>
> 	if (p->in_iowait) {
> 		int x = 1 + (p->in_iowait_acct << 16);
>
> 		delayacct_blkio_end(p);
> 		atomic_sub(x, &task_rq(p)->nr_iowait);
> 	}
>
>
> and try_to_wake_up():
>
> 	delayacct_blkio_end(p);
>
> 	int x = 1 + (p->in_iowait_acct << 16);
>
> 	atomic_add(x, &task_rq(p)->nr_iowait);
>
> nr_iowait_acct_cpu() becomes:
>
> return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;
>
> and nr_iowait_cpu():
>
> return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
>
> Obviously written with proper inline wrappers and defines, but you get
> the idea.
I'll play with this a bit, but do we want to switch to an atomic_long_t
for this? 2^16 tasks in iowait seems extreme, but it definitely seems
possible to overflow it.
--
Jens Axboe