Message-ID: <87d0bjdj88.fsf@nanos.tec.linutronix.de>
Date: Fri, 17 Jan 2020 01:23:03 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Ming Lei <ming.lei@...hat.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Peter Xu <peterx@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>
Subject: Re: [PATCH] sched/isolation: isolate from handling managed interrupt
Ming,
Ming Lei <ming.lei@...hat.com> writes:
> On Thu, Jan 16, 2020 at 01:08:17PM +0100, Thomas Gleixner wrote:
>> Ming Lei <ming.lei@...hat.com> writes:
>> > - ret = chip->irq_set_affinity(data, mask, force);
>> > + zalloc_cpumask_var(&tmp_mask, GFP_ATOMIC);
>>
>> I clearly told you:
>>
>> "That's wrong. This code is called with interrupts disabled, so
>> GFP_KERNEL is wrong. And NO, we won't do a GFP_ATOMIC allocation
>> here."
>>
>> Is that last sentence unclear in any way?
>
> Yeah, it is clear.
>
> But GFP_ATOMIC is usually allowed in atomic context, could you
> explain it a bit why it can't be done in this case?
You could have asked that question before sending this patch :)
> We still can fallback to current behavior if the allocation fails.
Allocation failure is not the main concern here. In general we avoid atomic
allocations wherever we can, but even more so in contexts which are
fully atomic even on PREEMPT_RT, simply because PREEMPT_RT cannot support
allocations there.
Regular spin_lock(); GFP_ATOMIC; spin_unlock(); sections are not a
problem because spinlocks turn into sleepable 'spinlocks' on RT and are
preemptible.
irq_desc::lock is a raw spinlock, which is a true spinlock on RT for
obvious reasons, and there any form of memory allocation is a no-no.
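To make that concrete, here is a minimal illustrative sketch (not taken
from the patch; the lock names are made up) of the two patterns:

#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(normal_lock);      /* becomes a sleeping lock on RT */
static DEFINE_RAW_SPINLOCK(rawlock);      /* spins even on PREEMPT_RT */

static void fine_on_rt(void)
{
        void *p;

        spin_lock(&normal_lock);
        /* Preemptible section on RT, so invoking the allocator works. */
        p = kmalloc(32, GFP_ATOMIC);
        spin_unlock(&normal_lock);
        kfree(p);
}

static void broken_on_rt(void)
{
        void *p;

        raw_spin_lock(&rawlock);
        /*
         * Truly atomic section, also on RT (same situation as holding
         * irq_desc::lock). The allocator may take sleeping locks on RT,
         * so calling it from here is not acceptable.
         */
        p = kmalloc(32, GFP_ATOMIC);
        raw_spin_unlock(&rawlock);
        kfree(p);
}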
> Or could you suggest to solve the issue in other way if GFP_ATOMIC
> can't be done?
Just use the same mechanism as irq_setup_affinity() for now. Not pretty,
but it does the job. I have a plan for how to avoid that, but that's a
larger surgery.
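For reference, the mechanism in question is a static scratch cpumask
serialized by a raw spinlock, roughly along these lines (condensed
sketch of kernel/irq/manage.c; the actual mask computation is elided):

int irq_setup_affinity(struct irq_desc *desc)
{
        static DEFINE_RAW_SPINLOCK(mask_lock);
        static struct cpumask mask;
        int ret;

        raw_spin_lock(&mask_lock);
        /* ... compute the desired affinity into 'mask' ... */
        ret = irq_do_set_affinity(&desc->irq_data, &mask, false);
        raw_spin_unlock(&mask_lock);
        return ret;
}

The same trick should work in irq_do_set_affinity() as well, since the
scratch mask would only ever be touched with the raw spinlock held.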
Let me give you a few comments on the rest of the patch while I'm at it.
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -20,6 +20,7 @@
> #include <linux/sched/task.h>
> #include <uapi/linux/sched/types.h>
> #include <linux/task_work.h>
> +#include <linux/sched/isolation.h>
Can you please move this include next to the other linux/sched/ one?
> @@ -212,12 +213,29 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> {
> struct irq_desc *desc = irq_data_to_desc(data);
> struct irq_chip *chip = irq_data_get_irq_chip(data);
> + const struct cpumask *housekeeping_mask =
> + housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
Bah. This is unreadable garbage. What's wrong with defining the variable
and retrieving the pointer later on? Especially as this is only required
when there is an actual managed interrupt.
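In other words, roughly (sketch only, inside irq_do_set_affinity()):

        const struct cpumask *hk_mask;

        if (irqd_affinity_is_managed(data)) {
                hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
                /* ... intersection handling goes here ... */
        }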
> + /*
> + * Userspace can't change managed irq's affinity, make sure that
> + * isolated CPU won't be selected as the effective CPU if this
> + * irq's affinity includes at least one housekeeping CPU.
> + *
> + * This way guarantees that isolated CPU won't be interrupted if
> + * IO isn't submitted from isolated CPU.
This comment is more confusing than helpful. What about:
/*
* If this is a managed interrupt check whether the requested
* affinity mask intersects with a housekeeping CPU. If so, then
* remove the isolated CPUs from the mask and just keep the
* housekeeping CPU(s). This prevents the affinity setter from
* routing the interrupt to an isolated CPU to avoid that I/O
* submitted from a housekeeping CPU causes interrupts on an
* isolated one.
*
* If the masks do not intersect then keep the requested mask.
* The isolated target CPUs are only receiving interrupts
* when the I/O operation was submitted directly from them.
*/
Or something to that effect. Hmm?
> + */
> + if (irqd_affinity_is_managed(data) && tmp_mask &&
> + cpumask_intersects(mask, housekeeping_mask))
> + cpumask_and(tmp_mask, mask, housekeeping_mask);
While writing the above comment, the following interesting scenario came
to mind:
Housekeeping CPUs 0-2, Isolated CPUs 3-7
Device has 4 queues. So the spreading results in:
q0: CPU 0/1, q1: CPU 2/3, q2: CPU 4/5, q3: CPU 6/7
q1 is the interesting one. It's the one which gets the housekeeping mask
applied, so CPU3 is removed and only CPU2 remains.
If CPU2 is offline when that mask is applied, then the affinity setting
operation is going to return an error code because the resulting mask
contains no online CPU.
Admittedly this is a weird corner case, but it's bound to happen and the
resulting bug report is going to be hard to decode unless the reporter
can provide a 100% reproducer or a very accurate description of the
scenario.
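One conceivable guard, sketched as a fragment of irq_do_set_affinity()
with hk_mask standing for the housekeeping cpumask pointer and tmp_mask
for the scratch mask from the hunk above, is to fall back to the
requested mask when the trimmed mask has no online CPU left:

        const struct cpumask *prog_mask = mask;

        if (irqd_affinity_is_managed(data) &&
            cpumask_intersects(mask, hk_mask)) {
                cpumask_and(tmp_mask, mask, hk_mask);
                /*
                 * If none of the remaining housekeeping CPUs is online
                 * (CPU2 offline in the q1 example above), keep the
                 * requested mask instead of failing with a mask which
                 * contains no online CPU.
                 */
                if (cpumask_intersects(tmp_mask, cpu_online_mask))
                        prog_mask = tmp_mask;
        }
        ret = chip->irq_set_affinity(data, prog_mask, force);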
Thanks,
tglx