Message-ID: <CAP4=nvRTH5VxSO3VSDCospWcZagawTMs0L9J_kcKdGSkn7xT_Q@mail.gmail.com>
Date: Thu, 22 Aug 2024 11:32:07 +0200
From: Tomas Glozar <tglozar@...hat.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
jkacur@...hat.com, "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release
On Wed, Aug 21, 2024 at 22:02, Steven Rostedt <rostedt@...dmis.org> wrote:
>
> I'm able to reproduce this with the above. Unfortunately, I can still
> reproduce it after applying this patch :-(
>
Thank you for looking at this. At first I was not sure whether this
was the proper fix, but after some discussion with Luis (in CC), we
concluded that a double close of the timerlat_fd might explain the
crash, and this patch worked for both of us. Are you reproducing the
same bug (NULL pointer dereference in hrtimer_active) with the patch?
IIUC that should not happen anymore, since the patch explicitly checks
for zero in the hrtimer structure.
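To make the double-close idea concrete, here is a minimal userspace
sketch in C. It is not the kernel patch and does not use kernel APIs:
struct fake_timer, struct fake_tlat_var and fake_release() are
hypothetical stand-ins, and the "base" pointer only models a timer
field that is zero when the state has already been torn down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the kernel's structures. */
struct fake_timer {
    void *base;   /* NULL when never initialised or already torn down */
    int active;
};

struct fake_tlat_var {
    struct fake_timer timer;
};

/* Returns 1 if teardown was performed, 0 if it was skipped. */
static int fake_release(struct fake_tlat_var *v)
{
    /* Guard: a second release sees zeroed state and must not touch the timer. */
    if (v == NULL || v->timer.base == NULL)
        return 0;

    v->timer.active = 0;        /* stand-in for cancelling the timer */
    memset(v, 0, sizeof(*v));   /* leave zeroed state behind for later callers */
    return 1;
}

/* Usage: release the same state twice, as a double close would. */
static int fake_double_release_demo(void)
{
    static int dummy_base;
    struct fake_tlat_var v = { .timer = { .base = &dummy_base, .active = 1 } };

    int first = fake_release(&v);    /* performs the teardown */
    int second = fake_release(&v);   /* hits the guard and is skipped */

    assert(first == 1);
    return second;
}
```

With a guard of this kind, the second release becomes a no-op instead
of dereferencing state that the first release already cleared.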
However, I have also caught a different panic, in addition to the one
reported above, while testing "rtla: Support idle state disabling via
libcpupower in timerlat" on an EL9 RT kernel:
BUG: kernel NULL pointer dereference, address: 0000000000000014
CPU: 6 PID: 1 Comm: systemd Kdump: loaded Tainted: G W
------- --- 5.14.0-452.el9.x86_64+rt #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
RIP: 0010:task_dump_owner+0x3d/0x100
RSP: 0018:ffffadd6c0013aa8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa00c864f4580 RCX: ffffa00c87453e10
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa00c864f4580
RBP: ffffa00c87453e10 R08: ffffa00c87418e80 R09: ffffa00c87418e80
R10: ffffa00c88236600 R11: ffffffffb73f1868 R12: ffffa00c87453e0c
R13: 0000000000000000 R14: ffffa00cb5e430c0 R15: ffffa00cb5e430c8
FS: 00007f9336b41b40(0000) GS:ffffa00cffd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 00000000025ee002 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? proc_pid_make_inode+0xa0/0x110
? __die_body.cold+0x8/0xd
? page_fault_oops+0x140/0x180
? do_user_addr_fault+0x61/0x690
? kvm_read_and_reset_apf_flags+0x45/0x60
? exc_page_fault+0x65/0x180
? asm_exc_page_fault+0x22/0x30
? task_dump_owner+0x3d/0x100
? task_dump_owner+0x36/0x100
proc_pid_make_inode+0xa0/0x110
proc_pid_instantiate+0x21/0xb0
proc_pid_lookup+0x95/0x170
proc_root_lookup+0x1d/0x50
__lookup_slow+0x9c/0x150
walk_component+0x158/0x1d0
link_path_walk.part.0.constprop.0+0x24e/0x3c0
? path_init+0x326/0x4d0
path_openat+0xb1/0x280
do_filp_open+0xb2/0x160
? migrate_enable+0xd5/0x150
? rt_spin_unlock+0x13/0x40
do_sys_openat2+0x96/0xd0
__x64_sys_openat+0x53/0xa0
...
</TASK>
This was preceded by a WARN:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 6 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x74/0x110
CPU: 6 PID: 1 Comm: systemd Kdump: loaded Not tainted
5.14.0-452.el9.x86_64+rt #1
RIP: 0010:refcount_warn_saturate+0x74/0x110
Call Trace:
<TASK>
[ 78.184877] proc_pid_lookup+0x161/0x170
[ 78.184883] proc_root_lookup+0x1d/0x50
[ 78.184890] __lookup_slow+0x9c/0x150
[ 78.184899] walk_component+0x158/0x1d0
[ 78.184908] link_path_walk.part.0.constprop.0+0x24e/0x3c0
[ 78.184915] ? path_init+0x326/0x4d0
[ 78.184922] path_openat+0xb1/0x280
[ 78.184926] do_filp_open+0xb2/0x160
[ 78.184934] ? migrate_enable+0xd5/0x150
[ 78.184942] ? rt_spin_unlock+0x13/0x40
[ 78.184950] do_sys_openat2+0x96/0xd0
[ 78.184958] __x64_sys_openat+0x53/0xa0
[ 78.184964] do_syscall_64+0x5c/0xf0
[ 78.185011] entry_SYSCALL_64_after_hwframe+0x6e/0x76
...
</TASK>
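For context, "addition on 0" is what refcount_t reports when a
reference is taken on an object whose count has already dropped to
zero. The usual way to avoid that is to take the reference only while
the count is still non-zero. Below is a minimal userspace sketch of
that pattern using C11 atomics. It is an illustration only, not kernel
code: struct fake_obj, fake_get_unless_zero() and fake_put() are
hypothetical names, and nothing here claims to be where the bug above
actually is.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object; a count of 0 means it is already being freed. */
struct fake_obj {
    atomic_int refs;
};

/* Take a reference only if the object is still live. */
static bool fake_get_unless_zero(struct fake_obj *o)
{
    assert(o != NULL);

    int old = atomic_load(&o->refs);
    while (old != 0) {
        /* On failure, old is updated to the current value and we retry. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return true;
    }
    return false;   /* count was 0: taking a reference would be a use-after-free */
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool fake_put(struct fake_obj *o)
{
    assert(o != NULL);
    return atomic_fetch_sub(&o->refs, 1) == 1;
}
```

A lookup path built this way fails cleanly when it races with the
final put, rather than incrementing a count that has already hit zero.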
> Looking at the code, the logic for handling the kthread seems off. I'll
> spend a little time to see if I can figure it out.
>
Yeah, it seems there might be multiple bugs in the user workload
handling. The other NULL pointer dereference and the refcount warning
above might be related, but I have yet to reproduce them on an
upstream kernel. I'm also going to look at the code and will post any
findings here.
Tomas