lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LRH.2.21.1710112227110.21191@linux.fjfi.cvut.cz>
Date:   Wed, 11 Oct 2017 22:48:26 +0200 (CEST)
From:   David Kozub <zub@...ux.fjfi.cvut.cz>
To:     Daniel Lezcano <daniel.lezcano@...aro.org>
cc:     Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org
Subject: Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX
 2c3 - null call

On Wed, 11 Oct 2017, Daniel Lezcano wrote:

> On 10/10/2017 23:19, David Kozub wrote:
>> On Tue, 10 Oct 2017, Daniel Lezcano wrote:
>>
>>> On 09/10/2017 21:33, David Kozub wrote:
>>>> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
>>>>
>>>>> On 07/10/2017 23:26, David Kozub wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>>>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>>>>
>>>>>> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>>>>> registered
>>>>>> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>>>>> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>>>>> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>>>>> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>>>>> event, using IRQ 7
>>>>>> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
>>>>>> at   (null)
>>>>>> [    2.412698] IP:   (null)
>>>>>> [    2.412702] *pde = 00000000
>>>>>> [    2.412713] Oops: 0000 [#1]
>>>>>> [    2.412716] Modules linked in:
>>>>>> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>>>>> 4.14.0-rc3-humel-test17 #36
>>>>>> [    2.412739] task: c0010000 task.stack: c008e000
>>>>>> [    2.412744] EIP:   (null)
>>>>>> [    2.412749] EFLAGS: 00210093 CPU: 0
>>>>>> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>>>>> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>>>>> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>>>>> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>>>>> [    2.412793] Call Trace:
>>>>>> [    2.412800]  <IRQ>
>>>>>> [    2.412825]  ? mfgpt_tick+0x5d/0x81
>>>>>> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
>>>>>> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
>>>>>> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
>>>>>> [    2.412892]  handle_irq_event+0x1d/0x29
>>>>>> [    2.412905]  handle_level_irq+0x57/0xc6
>>>>>> [    2.412924]  handle_irq+0x47/0x52
>>>>>> [    2.412929]  </IRQ>
>>>>>> [    2.412934]  <SOFTIRQ>
>>>>>> [    2.412949]  do_IRQ+0x32/0x9b
>>>>>> [    2.412963]  ? __irqentry_text_end+0x7/0x7
>>>>>> [    2.412976]  common_interrupt+0x2e/0x34
>>>>>> ...
>>>>>>
>>>>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>>>>> mfgpt_tick() on line 132:
>>>>>> ...
>>>>>>     cs5535_clockevent.event_handler(&cs5535_clockevent);
>>>>>> ...
>>>>>> cs5535_clockevent.event_handler is null.
>>>>>>
>>>>>> Adding some more traces I see mfgpt_tick() gets called before
>>>>>> clockevents_config_and_register() finishes (invoked from
>>>>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>>>>> event_handler, it's NULL. Wrapping the event_handler call on line
>>>>>> 132 in
>>>>>> a null pointer check results in a working system.
>>>>>>
>>>>>> The issue is present at least also in kernel 4.13.5. In kernel
>>>>>> versions
>>>>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>>>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>>>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>>>>
>>>>>> My guess is that the timer interrupt is enabled too early. If I change
>>>>>> the order of clockevents_config_and_register() and
>>>>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>>>>> cs5535_mfgpt_init() like this:
>>>>>> ...
>>>>>>     /* Set up the clock event */
>>>>>>     printk(KERN_INFO DRV_NAME
>>>>>>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>>>>>         timer_irq);
>>>>>>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>>>>>                     0xF, 0xFFFE);
>>>>>>
>>>>>>     /* Set the clock scale and enable the event mode for CMP2 */
>>>>>>     val = MFGPT_SCALE | (3 << 8);
>>>>>>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>>>>> ...
>>>>>> the system boots and seems to be OK.
>>>>>
>>>>> What happens if instead of inverting those two lines, you add
>>>>> mfgpt_shutdown() early in the init function ?
>>>>
>>>> Hi Daniel,
>>>>
>>>> I can't call mfgpt_shutdown() before clockevents_config_and_register()
>>>> as mfgpt_shutdown() wants a struct clock_event_device* which is only
>>>> available after clockevents_config_and_register(). But I tried calling
>>>> disable_timer():
>>>>
>>>> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
>>>>         }
>>>>         cs5535_event_clock = timer;
>>>>
>>>> +       disable_timer(cs5535_event_clock);
>>>> +
>>>>         /* Set up the IRQ on the MFGPT side */
>>>>         if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>>>                 printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>>>>
>>>> which should do the same thing. Unfortunately I still get the same BUG.
>>>>
>>>> I added some more traces and I think the IRQ comes just after:
>>>>     ret = setup_irq(timer_irq, &mfgptirq);
>>>
>>> Mmh, the interrupt flag tells it is a shared interrupt, bad :/
>>
>> While there is IRQF_SHARED, I think on the ALIX 2c3 the IRQ is not
>> really shared:
>>
>> $ cat /proc/interrupts
>>            CPU0
>>   0:       9959    XT-PIC      timer
>>   2:          0    XT-PIC      cascade
>>   4:       1763    XT-PIC      ttyS0
>>   7:       4648    XT-PIC      cs5535-clockevt
>>   9:          0    XT-PIC      ath
>>  10:        151    XT-PIC      eth0
>>  11:          0    XT-PIC      eth1
>>  12:          1    XT-PIC      ehci_hcd:usb1, ohci_hcd:usb2
>>  14:       4063    XT-PIC      pata_cs5536
>>  15:          0    XT-PIC      eth2
>>
>> (if this is sufficient proof)
>
> After digging a bit, the interrupt can be shared. There is a comment in
> the drivers/misc/cs5535-mfgpt.c:cs5535_mfgpt_set_irq() function.
>
> Moreover, the kernel can boot with a module parameter to change the irq
> number. By config, the default is 7 but it could be another interrupt
> shared with a device.
>
> But your setup, if I refer the output of /proc/interrupts, does not show
> the interrupt timer is shared.
>
>>> The interrupt handler checks if the interrupt was for the timer.
>>>
>>> So here, there are two options:
>>>
>>> - the timer is in an inconsistent state, with an interrupt line up and
>>> enabled. In this case, it should be reseted correctly before going
>>> further in the init function (disable, ack irq if any, and then request
>>> irq and register).
>>>
>>> .. or ..
>>>
>>> - the check at the beginning of the interrupt handler:
>>>        /* See if the interrupt was for us */
>>>        if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
>>>         MFGPT_SETUP_CMP1)))
>>>                return IRQ_NONE;
>>>
>>>   does not work and we try to handle an interrupt which is not for us.
>>
>> If this interrupt is not shared, this option can be ruled out, right?
>
> Yes.
>
>>> That could be any of these two (or both) and I suspect the commit
>>> 8f9327cbb brings to light this problem as before the interrupt handler
>>> had the check:
>>>     if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
>>>         return IRQ_HANDLED;
>>>
>>> with:
>>>
>>> static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;
>>
>> I agree, in the previous version any early calls would be ignored.
>>
>> I see in http://pcengines.ch/pdf/alix2.pdf on page 10 in "Setup
>> options": "M: toggles MFGPT workaround – may be required to support high
>> speed timer. See AMD CS5536 data book section 5.16.3 for the gory
>> details. The system may hang during boot if you get it wrong…"
>>
>> I tried toggling the option and it didn't seem to have any effect on
>> this issue. Furthermore I did a BIOS update and now the menu option is
>> gone.
>>
>> The CD5536 Data Book is here:
>> https://support.amd.com/TechDocs/33238G_cs5536_db.pdf and that 5.16.3
>> issue doesn't seem to be relevant here.
>>
>> So I think the following must be happening:
>>
>>> - the timer is in an inconsistent state, with an interrupt line up and
>>> enabled. In this case, it should be reseted correctly before going
>>> further in the init function (disable, ack irq if any, and then request
>>> irq and register).
>>
>> Should I do something more to verify that it is this case?
>>
>> Could you please give me more detailed info on how to reset the IRQ to a
>> sane sate? Being a kernel noob that summary is not sufficient for me.
>
> The code should be already there in the driver.
>
> Usually, in order to prevent this kind issue, before calling
> clockevent_config_and_register(), we disable the timer and ack
> unconditionally any interrupt.
>
> So at the first glance, I would add:
>
> diff --git a/drivers/clocksource/cs5535-clockevt.c
> b/drivers/clocksource/cs5535-clockevt.c
> index a1df588..87e9581 100644
> --- a/drivers/clocksource/cs5535-clockevt.c
> +++ b/drivers/clocksource/cs5535-clockevt.c
> @@ -152,6 +152,9 @@ static int __init cs5535_mfgpt_init(void)
>        }
>        cs5535_event_clock = timer;
>
> +       disable_timer(timer);
> +       cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
> +
>        /* Set up the IRQ on the MFGPT side */
>        if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>                printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",

I tried that and the handler is still called. So I did some more random 
experiments and I found out that if I call disable_timer(timer) twice, 
then the issue is resolved (the handler is not called before the 
registration is finished.) And I don't have to set MFGPT_REG_COUNTER to 0.

I have no idea why do I have to call disable_timer twice.

Best regards,
David

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ