linux-kernel - [debug PATCHes] Re: smp_call_function

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150331105656.GA25180@gmail.com>
Date:	Tue, 31 Mar 2015 12:56:56 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Chris J Arges <chris.j.arges@...onical.com>,
	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: [debug PATCHes] Re: smp_call_function_single lockups


* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
> bit clear" seems to be a real issue.

It's interesting in particular when it happens with an edge-triggered 
interrupt source: it's much harder to miss level triggered IRQs, which 
stay around until actively handled. Edge triggered irqs are more 
fragile to loss of event processing.

> > Anyway, maybe this sheds some more light on this issue. I can 
> > reproduce this at will, so let me know of other experiments to do.

Btw., could you please describe (again) what your current best method 
for reproduction is? It's been a long discussion ...

> Somebody else who knows the apic needs to also take a look, but I'd 
> love to hear what the actual hardware interrupt is (from that 
> "vector_irq[vector]" thing above.
> 
> I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
> are 0xf0-0xff), so it sounds like an external one. But that then 
> requires the whole mapping table thing.
> 
> Ingo/Peter/Jiang - is there anything else useful we could print out? 
> I worry about the irq movement code. Can we add printk's to when an 
> irq is chasing from one CPU to another and doing that 
> "move_in_progress" thing? I've always been scared of that code.

1)

So the irq_cfg::move_in_progress field originates from:

  610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")

and I agree that it looks fragile, so it would be nice to see how much 
(if at all?) it gets used, by sticking a few pr_info()s in it.

Debug patch for the vector movement state machine attached below. 
Untested.

2)

But ... having taken a quick look at the vector movement handling, I 
am scared more than ever,and I cannot convince myself that it's race 
free. It's possibly harmless, but still, famous last words ...

For example, most ->move_in_progress transitions happen with the 
vector_lock held - with the exception of send_cleanup_vector(): there 
we are only holding the desc->lock, but that's not enough! For 
example, this codepath:

static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
{
        unsigned me;

        if (likely(!cfg->move_in_progress))
                return;

        me = smp_processor_id();

        if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
                send_cleanup_vector(cfg);

...

void send_cleanup_vector(struct irq_cfg *cfg)
{
...

        cfg->move_in_progress = 0;
}

is blatantly racy, unless I missed something obvious?

2b)

Initially I thought this only affects slowpaths like driver unregister 
or CPU hotplug, but I think this would be relevant for the fastpath of 
edge triggered irqs as well:

void apic_ack_edge(struct irq_data *data)
{
        irq_complete_move(irqd_cfg(data));
        irq_move_irq(data);
        ack_APIC_irq();
}

and irq_complete_move() checks and clears cfg->move_in_progress as 
listed above.

So in most scenarios this is probably harmless, because it can in the 
worst case result in the missing of a ->move_in_progress flag setting.

But it does not look harmless in the apic_set_affinity() code path: 
there we call into assign_irq_vector() without the desc->lock held, 
due to this gem in the IRQ core:

int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
{
        unsigned long flags;
        struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);

        if (!desc)
                return -EINVAL;
        desc->affinity_hint = m;
        irq_put_desc_unlock(desc, flags);
        /* set the initial affinity to prevent every interrupt being on CPU0 */
        if (m)
                __irq_set_affinity(irq, m, false);
        return 0;
}

argh!

Now if this ever crosses path with an assign_irq_vector() call on 
another CPU, then I really cannot see what would save us from deep 
trouble: we use cfg->vector while that is possibly being modified, 
right?

So I'd suggest to put a printk into irq_set_affinity_hint() as well, 
to make sure it's not used during this test. In particular:

drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, NULL);
drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, mask);

might be triggering it in the virtualization code...

The second patch below adds the relevant pr_info(). (untested)

Now this too might be unrelated, because the affinity hint was added 
ages ago, in:

e7a297b0d7d6 ("genirq: Add CPU mask affinity hint")

and the potentially racy nature of calling into set_affinity without 
the desc lock held was probably not realized back then.

VirtIO's use of the affinity-hint was added a while ago as well, four 
years ago, in:

  75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue")

2c)

The other thing that worries me here is that we apparently send an 
IRQ-move IPI while having an un-acked local APIC (!). I have vague 
memories that this was problematic and fragile before.

To ease my worries in this area I've attached a third patch below that 
changes the order around.

This too is ancient behavior - but might be more prominent on certain 
hardware (and virtualization) variants, so it has a chance to be 
relevant.

3)

Furthermore, I also noticed that the whole vector_lock()/unlock() 
business is actively misleading IMHO, the vector_lock looks local to 
vector.c, while we have these two calls that essentially export it ...

Some of that damage was inflicted in this ancient commit:

  d388e5fdc461 ("x86: Restore proper vector locking during cpu hotplug")

the proper approach would be to protect against hotplug events in 
assign_irq_vector(), not the other way around!

Probably not a functional problem, yet somewhat disgusting.

-------------------------------

So ... all in one, these various problems might be related to the 
regression or not, but they sure need addressing.

Thanks,

	Ingo

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..1dd1de9dd690 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
 			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
 			break;
 		}
@@ -178,6 +180,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cpumask_copy(cfg->old_domain, cfg->domain);
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector);
 		}
 		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq;
@@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg)
 		}
 	}
 	cfg->move_in_progress = 0;
+	pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector);
 	raw_spin_unlock_irqrestore(&vector_lock, flags);
 }
 
@@ -391,6 +396,7 @@ void send_cleanup_vector(struct irq_cfg *cfg)
 		free_cpumask_var(cleanup_mask);
 	}
 	cfg->move_in_progress = 0;
+	pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector);
 }
 
 asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index c0a1100d911f..18300a7b8ed2 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -250,8 +250,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
 	desc->affinity_hint = m;
 	irq_put_desc_unlock(desc, flags);
 	/* set the initial affinity to prevent every interrupt being on CPU0 */
-	if (m)
+	if (m) {
+		pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq);
 		__irq_set_affinity(irq, m, false);
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(irq_set_affinity_hint);

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..833a981c5420 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
+	ack_APIC_irq();
+
+	/* Might generate IPIs, so do this after having ACKed the APIC: */
 	irq_complete_move(irqd_cfg(data));
 	irq_move_irq(data);
-	ack_APIC_irq();
 }
 
 /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/