[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6cefe7ca-842b-d3af-0299-588b9307703b@gmail.com>
Date: Wed, 23 Feb 2022 09:54:26 -0800
From: Florian Fainelli <f.fainelli@...il.com>
To: Peter Robinson <pbrobinson@...il.com>
Cc: Doug Berger <opendmb@...il.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
bcm-kernel-feedback-list@...adcom.com, netdev@...r.kernel.org,
Javier Martinez Canillas <javierm@...hat.com>
Subject: Re: [PATCH] net: bcmgenet: Return not supported if we don't have a
WoL IRQ
On 2/23/2022 9:45 AM, Peter Robinson wrote:
>>> The top two are pre/post plugging an ethernet cable with the patched
>>> kernel, the last two are the broken kernel. There doesn't seem to be a
>>> massive difference in interrupts but you likely know more of what
>>> you're looking for.
>>
>> There is not a difference in the hardware interrupt numbers being
>> claimed by GENET which are both GIC interrupts 189 and 190 (157 + 32 and
>> 158 + 32). In the broken case we can see that the second interrupt line
>> (interrupt 190), which is the one that services the non-default TX
>> queues does not fire up at all whereas it does in the patched case.
>>
>> The transmit queue timeout makes sense given that transmit queue 2
>> (which is not the default one, default is 0) has its interrupt serviced
>> by the second interrupt line (190). We can see it not firing up, hence
>> the timeout.
>>
>> What I *think* might be happening here is the following:
>>
>> - priv->wol_irq = platform_get_irq_optional(pdev, 2) returns a negative
>> error code we do not install the interrupt handler for the WoL interrupt
>> since it is not valid
>>
>> - bcmgenet_set_wol() is called, we do not check priv->wol_irq, so we
>> call enable_irq_wake(priv->wol_irq) and somehow irq_set_irq_wake() is
>> able to resolve that irq number to a valid interrupt descriptor
>>
>> - eventually we just mess up the interrupt descriptor for interrupt 49
>> and it stops working
>>
>> Now since this appears to be an ACPI-enabled system, we may be hitting
>> this part of the code in platform_get_irq_optional():
>>
>> r = platform_get_resource(dev, IORESOURCE_IRQ, num);
>> if (has_acpi_companion(&dev->dev)) {
>> if (r && r->flags & IORESOURCE_DISABLED) {
>> ret = acpi_irq_get(ACPI_HANDLE(&dev->dev),
>> num, r);
>> if (ret)
>> goto out;
>> }
>> }
>>
>> and then I am not clear what interrupt this translates into here, or
>> whether it is possible to get a valid interrupt descriptor here.
>>
>> The patch is fine in itself, but I would really prefer that we get to
>> the bottom of this rather than have a superficial understanding of the
>> nature of the problem.
>
> I have no problems working with you to improve the driver, the problem
> I have is this is currently a regression in 5.17 so I would like to
> see something land, whether it's reverting the other patch, landing
> thing one or another straight forward fix and then maybe revisit as
> whole in 5.18.
Understood and I won't require you or me to complete this investigating
before fixing the regression, this is just so we understand where it
stemmed from and possibly fix the IRQ layer if need be. Given what I
just wrote, do you think you can sprinkle debug prints throughout the
kernel to figure out whether enable_irq_wake() somehow messes up the
interrupt descriptor of interrupt and test that theory? We can do that
offline if you want.
--
Florian
Powered by blists - more mailing lists