[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bab7c70a-f279-ac6c-9b23-f8ca97b9f17d@ozlabs.ru>
Date: Sat, 14 Nov 2020 14:55:54 +1100
From: Alexey Kardashevskiy <aik@...abs.ru>
To: Cédric Le Goater <clg@...d.org>,
linux-kernel@...r.kernel.org
Cc: Marc Zyngier <maz@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Michael Ellerman <mpe@...erman.id.au>, Qian Cai <cai@....pw>,
Rob Herring <robh@...nel.org>,
Frederic Barrat <fbarrat@...ux.ibm.com>,
Michal Suchánek <msuchanek@...e.de>
Subject: Re: [PATCH kernel v3] genirq/irqdomain: Add reference counting to
IRQs
On 14/11/2020 05:19, Cédric Le Goater wrote:
> On 11/9/20 10:46 AM, Alexey Kardashevskiy wrote:
>> PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
>> Device drivers map/unmap hardware interrupts via irq_create_mapping()/
>> irq_dispose_mapping(). The problem with that these interrupts are
>> shared and when performing hot unplug, we need to unmap the interrupt
>> only when the last device is released.
>
> The background context for such a need is that the POWER9 and POWER10
> processors have a new XIVE interrupt controller which uses MMIO pages
> for interrupt management. Each interrupt has a pair of pages which are
> required to be unmapped in some environment, like PHB removal. And so,
> all interrupts need to be unmmaped.
>
>>
>> This reuses already existing irq_desc::kobj for this purpose.
>> The refcounter is naturally 1 when the descriptor is allocated already;
>> this adds kobject_get() in places where already existing mapped virq
>> is returned.
>>
>> This reorganizes irq_dispose_mapping() to release the kobj and let
>> the release callback do the cleanup.
>>
>> As kobject_put() is called directly now (not via RCU), it can also handle
>> the early boot case (irq_kobj_base==NULL) with the help of
>> the kobject::state_in_sysfs flag and without additional irq_sysfs_del().
>
> Could this change be done in a following patch ?
No. Before this patch, we remove from sysfs - call kobject_del() -
before calling kobject_put() which we do via RCU. After the patch, this
kobject_del() is called from the very last kobject_put() and when we get
to this release handler - the sysfs node is already removed and we get a
message about the missing parent.
>> While at this, clean up the comment at where irq_sysfs_del() was called.>
>> Quick grep shows no sign of irq reference counting in drivers. Drivers
>> typically request mapping when probing and dispose it when removing;
>
> Some ARM drivers call directly irq_alloc_descs() and irq_free_descs().
> Is that a problem ?
Kind of, I'll need to go through these places and replace
irq_free_descs() with kobject_put() (may be some wrapper or may be
change irq_free_descs() to do kobject_put()).
>> platforms tend to dispose only if setup failed and the rest seems
>> calling one dispose per one mapping. Except (at least) PPC/pseries
>> which needs https://lkml.org/lkml/2020/10/27/259
>>
>> Cc: Cédric Le Goater <clg@...d.org>
>> Cc: Marc Zyngier <maz@...nel.org>
>> Cc: Michael Ellerman <mpe@...erman.id.au>
>> Cc: Qian Cai <cai@....pw>
>> Cc: Rob Herring <robh@...nel.org>
>> Cc: Frederic Barrat <fbarrat@...ux.ibm.com>
>> Cc: Michal Suchánek <msuchanek@...e.de>
>> Cc: Thomas Gleixner <tglx@...utronix.de>
>> Signed-off-by: Alexey Kardashevskiy <aik@...abs.ru>
>
> I used this patch and the ppc one doing the LSI removal:
>
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20201027090655.14118-3-aik@ozlabs.ru/
>
> on different P10 and P9 systems, on a large system (>1K HW theads),
> KVM guests and pSeries machines. Checked that PHB removal was OK.
>
> Tested-by: Cédric Le Goater <clg@...d.org>
>
> But IRQ subsystem covers much more than these systems.
Indeed. But doing our own powerpc-only reference counting on top of
irs_desc is just ugly.
>
> Some comments below,
>
>> ---
>>
>> This is what it is fixing for powerpc:
>>
>> There was a comment about whether hierarchical IRQ domains should
>> contribute to this reference counter and I need some help here as
>> I cannot see why.
>> It is reverse now - IRQs contribute to domain->mapcount and
>> irq_domain_associate/irq_domain_disassociate take necessary steps to
>> keep this counter in order. What might be missing is that if we have
>> cascade of IRQs (as in the IOAPIC example from
>> Documentation/core-api/irq/irq-domain.rst ), then a parent IRQ should
>> contribute to the children IRQs and it is up to
>> irq_domain_ops::alloc/free hooks, and they all seem to be eventually
>> calling irq_domain_alloc_irqs_xxx/irq_domain_free_irqs_xxx which seems
>> right.
>>
>> Documentation/core-api/irq/irq-domain.rst also suggests there is a lot
>> to see in debugfs about IRQs but on my thinkpad there nothing about
>> hierarchy.
>>
>> So I'll ask again :)
>>
>> What is the easiest way to get irq-hierarchical hardware?
>> I have a bunch of powerpc boxes (no good) but also a raspberry pi,
>> a bunch of 32/64bit orange pi's, an "armada" arm box,
>> thinkpads - is any of this good for the task?
>>
>>
>>
>> ---
>> Changes:
>> v3:
>> * removed very wrong kobject_get/_put from irq_domain_associate/
>> irq_domain_disassociate as these are called from kobject_release so
>> irq_descs were never actually released
>> * removed irq_sysfs_del as 1) we do not seem to need it with changed
>> counting 2) produces a "no parent" warning as it would be called from
>> kobject_release which removes sysfs nodes itself
>>
>> v2:
>> * added more get/put, including irq_domain_associate/irq_domain_disassociate
>> ---
>> kernel/irq/irqdesc.c | 55 ++++++++++++++++++------------------------
>> kernel/irq/irqdomain.c | 37 ++++++++++++++++------------
>> 2 files changed, 46 insertions(+), 46 deletions(-)
>>
>> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
>> index 1a7723604399..79c904ebfd5c 100644
>> --- a/kernel/irq/irqdesc.c
>> +++ b/kernel/irq/irqdesc.c
>> @@ -295,18 +295,6 @@ static void irq_sysfs_add(int irq, struct irq_desc *desc)
>> }
>> }
>>
>> -static void irq_sysfs_del(struct irq_desc *desc)
>> -{
>> - /*
>> - * If irq_sysfs_init() has not yet been invoked (early boot), then
>> - * irq_kobj_base is NULL and the descriptor was never added.
>> - * kobject_del() complains about a object with no parent, so make
>> - * it conditional.
>> - */
>> - if (irq_kobj_base)
>> - kobject_del(&desc->kobj);
>> -}
>> -
>> static int __init irq_sysfs_init(void)
>> {
>> struct irq_desc *desc;
>> @@ -337,7 +325,6 @@ static struct kobj_type irq_kobj_type = {
>> };
>>
>> static void irq_sysfs_add(int irq, struct irq_desc *desc) {}
>> -static void irq_sysfs_del(struct irq_desc *desc) {}
>>
>> #endif /* CONFIG_SYSFS */
>>
>> @@ -419,20 +406,40 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
>> return NULL;
>> }
>>
>> +static void delayed_free_desc(struct rcu_head *rhp);
>
> Can you move delayed_free_desc() here ? It is small.
Yes, I kept it this way to make review slightly easier.
>
>> static void irq_kobj_release(struct kobject *kobj)
>> {
>> struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
>> +#ifdef CONFIG_IRQ_DOMAIN
>
> may be, we could use IS_ENABLED(CONFIG_IRQ_DOMAIN) and add a specific routine
> for the domain case.
May be.
>> + struct irq_domain *domain;
>> + unsigned int virq = desc->irq_data.irq;
>>
>> - free_masks(desc);
>> - free_percpu(desc->kstat_irqs);
>> - kfree(desc);
>> + domain = desc->irq_data.domain;
>> + if (domain) {
>> + if (irq_domain_is_hierarchy(domain)) {
>> + irq_domain_free_irqs(virq, 1);
>> + } else {
>> + irq_domain_disassociate(domain, virq);
>> + irq_free_desc(virq);
>> + }
>> + }
>> +#endif
>> + /*
>> + * We free the descriptor, masks and stat fields via RCU. That
>> + * allows demultiplex interrupts to do rcu based management of
>> + * the child interrupts.
>> + * This also allows us to use rcu in kstat_irqs_usr().
>> + */
>> + call_rcu(&desc->rcu, delayed_free_desc);
>> }
>>
>> static void delayed_free_desc(struct rcu_head *rhp)
>> {
>> struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
>>
>> - kobject_put(&desc->kobj);
>> + free_masks(desc);
>> + free_percpu(desc->kstat_irqs);
>> + kfree(desc);
>> }
>>
>> static void free_desc(unsigned int irq)>
>> @@ -443,24 +450,10 @@ static void free_desc(unsigned int irq)
>> unregister_irq_proc(irq, desc);
>>
>> /*
>> - * sparse_irq_lock protects also show_interrupts() and
>> - * kstat_irq_usr(). Once we deleted the descriptor from the
>> - * sparse tree we can free it. Access in proc will fail to
>> - * lookup the descriptor.
>> - *
>> * The sysfs entry must be serialized against a concurrent
>> * irq_sysfs_init() as well.
>> */
>> - irq_sysfs_del(desc);
>
> I would leave that change in another patch.
Explained above.
>
>> delete_irq_desc(irq);
>> -
>> - /*
>> - * We free the descriptor, masks and stat fields via RCU. That
>> - * allows demultiplex interrupts to do rcu based management of
>> - * the child interrupts.
>> - * This also allows us to use rcu in kstat_irqs_usr().
>> - */
>> - call_rcu(&desc->rcu, delayed_free_desc);
>> }
>>
>> static int alloc_descs(unsigned int start, unsigned int cnt, int node,
>> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
>> index cf8b374b892d..d79d679bec35 100644
>> --- a/kernel/irq/irqdomain.c
>> +++ b/kernel/irq/irqdomain.c
>> @@ -638,6 +638,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>> {
>> struct device_node *of_node;
>> int virq;
>> + struct irq_desc *desc;
>>
>> pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq);
>>
>> @@ -655,7 +656,9 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>> /* Check if mapping already exists */
>> virq = irq_find_mapping(domain, hwirq);
>> if (virq) {
>> + desc = irq_to_desc(virq);
>> pr_debug("-> existing mapping on virq %d\n", virq);
>> + kobject_get(&desc->kobj);
>
> Could we have an inline helper irq_desc_get() to do :
>
> struct irq_desc *desc = irq_to_desc(virq);
> kobject_get(&desc->kobj);
>
> It will remove quite a few lines.
>
> Whether it takes a 'struct irq_desc' or 'int virq' as a parameter, is up to
> you I guess.
>
> It's good way to hide the mapping counter used to get a mapping reference
> also. We might change it to its own variable if we find issues.
>
>> return virq;
>> }
>>
>> @@ -751,6 +754,7 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
>> irq_hw_number_t hwirq;
>> unsigned int type = IRQ_TYPE_NONE;
>> int virq;
>> + struct irq_desc *desc;
>>
>> if (fwspec->fwnode) {
>> domain = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_WIRED);
>> @@ -787,8 +791,11 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
>> * current trigger type then we are done so return the
>> * interrupt number.
>> */
>> - if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq))
>> + if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq)) {
>> + desc = irq_to_desc(virq);
>> + kobject_get(&desc->kobj);
>> return virq;
>> + }
>>
>> /*
>> * If the trigger type has not been set yet, then set
>> @@ -800,6 +807,8 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
>> return 0;
>>
>> irqd_set_trigger_type(irq_data, type);
>> + desc = irq_to_desc(virq);
>> + kobject_get(&desc->kobj);
>> return virq;
>> }
>>
>> @@ -852,22 +861,12 @@ EXPORT_SYMBOL_GPL(irq_create_of_mapping);
>> */
>> void irq_dispose_mapping(unsigned int virq)
>> {
>> - struct irq_data *irq_data = irq_get_irq_data(virq);
>> - struct irq_domain *domain;
>> + struct irq_desc *desc = irq_to_desc(virq);
>>
>> - if (!virq || !irq_data)
>> + if (!virq || !desc)
>> return;
>>
>> - domain = irq_data->domain;
>> - if (WARN_ON(domain == NULL))
>> - return;
>> -
>> - if (irq_domain_is_hierarchy(domain)) {
>> - irq_domain_free_irqs(virq, 1);
>> - } else {
>> - irq_domain_disassociate(domain, virq);
>> - irq_free_desc(virq);
>> - }
>> + kobject_put(&desc->kobj);
>> }
>> EXPORT_SYMBOL_GPL(irq_dispose_mapping);
>>
>> @@ -1413,6 +1412,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
>> bool realloc, const struct irq_affinity_desc *affinity)
>> {
>> int i, ret, virq;
>> + bool get_ref = false;
>
> This needs a comment on why we are not always getting a ref and
> anyhow this looks hacky.
>
>>
>> if (domain == NULL) {
>> domain = irq_default_domain;
>> @@ -1422,6 +1422,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
>>
>> if (realloc && irq_base >= 0) {
>> virq = irq_base;
>> + get_ref = true;
>
> The realloc flag is only used for old x86 HW and the IRQ IPI API.
>
> Could we make special IRQ routines for these callers and let them
> handle the get ref ?
I'll try this. Thanks for the review!
>
> C.
>
>
>> } else {
>> virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
>> affinity);
>> @@ -1453,8 +1454,14 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
>> }
>> }
>>
>> - for (i = 0; i < nr_irqs; i++)
>> + for (i = 0; i < nr_irqs; i++) {
>> irq_domain_insert_irq(virq + i);
>> + if (get_ref) {
>> + struct irq_desc *desc = irq_to_desc(virq + i);
>> +
>> + kobject_get(&desc->kobj);
>> + }
>> + }
>> mutex_unlock(&irq_domain_mutex);
>>
>> return virq;
>>
>
--
Alexey
Powered by blists - more mailing lists