linux-kernel - Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd

Message-ID: <CAP4=nvRTH5VxSO3VSDCospWcZagawTMs0L9J_kcKdGSkn7xT_Q@mail.gmail.com>
Date: Thu, 22 Aug 2024 11:32:07 +0200
From: Tomas Glozar <tglozar@...hat.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org, 
	jkacur@...hat.com, "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release

On Wed, 21 Aug 2024 at 22:02, Steven Rostedt <rostedt@...dmis.org> wrote:
>
> I'm able to reproduce this with the above. Unfortunately, I can still
> reproduce it after applying this patch :-(
>

Thank you for looking at this. At first I was not sure whether this
is the proper fix, but after some discussion with Luis (in CC), we
concluded that a double close of the timerlat_fd is a possible
explanation, and this patch worked for both of us. Are you reproducing
the same bug (NULL pointer dereference in hrtimer_active) with the
patch applied? IIUC that should not happen anymore, since the patch
explicitly checks for zero in the hrtimer structure.
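
To illustrate the kind of guard meant here, below is a minimal userspace
sketch, not the actual patch. The names demo_timer, demo_timer_init and
demo_release are hypothetical and only model the idea: a release handler
that first checks whether the timer state is still set up, so that a
second close becomes a no-op instead of touching cleared state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-CPU timer state; NOT the kernel's struct. */
struct demo_timer {
    void (*function)(void); /* NULL: never set up, or already torn down */
    bool armed;
};

static void demo_callback(void) {}

/* Set the timer state up, as an open() path would. */
static void demo_timer_init(struct demo_timer *t)
{
    assert(t != NULL);
    t->function = demo_callback;
    t->armed = true;
}

/*
 * Tear the timer state down, as a release() path would.
 * Returns true if it did the teardown, false if there was nothing to do.
 */
static bool demo_release(struct demo_timer *t)
{
    if (t == NULL || t->function == NULL)
        return false;   /* second close: state already cleared, skip it */

    t->armed = false;
    t->function = NULL; /* mark as torn down for any later caller */
    return true;
}
```

In this model, the first demo_release() call does the teardown and every
later call returns false without dereferencing anything that was cleared.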

However, in addition to the one reported above, I have caught a
different panic while testing "rtla: Support idle state disabling via
libcpupower in timerlat" on an EL9 RT kernel:

BUG: kernel NULL pointer dereference, address: 0000000000000014
CPU: 6 PID: 1 Comm: systemd Kdump: loaded Tainted: G        W -------  ---  5.14.0-452.el9.x86_64+rt #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:task_dump_owner+0x3d/0x100
RSP: 0018:ffffadd6c0013aa8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa00c864f4580 RCX: ffffa00c87453e10
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa00c864f4580
RBP: ffffa00c87453e10 R08: ffffa00c87418e80 R09: ffffa00c87418e80
R10: ffffa00c88236600 R11: ffffffffb73f1868 R12: ffffa00c87453e0c
R13: 0000000000000000 R14: ffffa00cb5e430c0 R15: ffffa00cb5e430c8
FS:  00007f9336b41b40(0000) GS:ffffa00cffd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 00000000025ee002 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? proc_pid_make_inode+0xa0/0x110
? __die_body.cold+0x8/0xd
? page_fault_oops+0x140/0x180
? do_user_addr_fault+0x61/0x690
? kvm_read_and_reset_apf_flags+0x45/0x60
? exc_page_fault+0x65/0x180
? asm_exc_page_fault+0x22/0x30
? task_dump_owner+0x3d/0x100
? task_dump_owner+0x36/0x100
proc_pid_make_inode+0xa0/0x110
proc_pid_instantiate+0x21/0xb0
proc_pid_lookup+0x95/0x170
proc_root_lookup+0x1d/0x50
__lookup_slow+0x9c/0x150
walk_component+0x158/0x1d0
link_path_walk.part.0.constprop.0+0x24e/0x3c0
? path_init+0x326/0x4d0
path_openat+0xb1/0x280
do_filp_open+0xb2/0x160
? migrate_enable+0xd5/0x150
? rt_spin_unlock+0x13/0x40
do_sys_openat2+0x96/0xd0
__x64_sys_openat+0x53/0xa0
...
</TASK>

This was preceded by a WARN:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 6 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x74/0x110
CPU: 6 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-452.el9.x86_64+rt #1
RIP: 0010:refcount_warn_saturate+0x74/0x110
Call Trace:
<TASK>
[   78.184877]  proc_pid_lookup+0x161/0x170
[   78.184883]  proc_root_lookup+0x1d/0x50
[   78.184890]  __lookup_slow+0x9c/0x150
[   78.184899]  walk_component+0x158/0x1d0
[   78.184908]  link_path_walk.part.0.constprop.0+0x24e/0x3c0
[   78.184915]  ? path_init+0x326/0x4d0
[   78.184922]  path_openat+0xb1/0x280
[   78.184926]  do_filp_open+0xb2/0x160
[   78.184934]  ? migrate_enable+0xd5/0x150
[   78.184942]  ? rt_spin_unlock+0x13/0x40
[   78.184950]  do_sys_openat2+0x96/0xd0
[   78.184958]  __x64_sys_openat+0x53/0xa0
[   78.184964]  do_syscall_64+0x5c/0xf0
[   78.185011]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
...
</TASK>
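
For context, the kernel's refcount_t prints "addition on 0;
use-after-free" when a counter that has already dropped to zero is
incremented. The sketch below is a simplified, single-threaded userspace
model of that rule, not kernel code. The demo_ref names are hypothetical,
and the counter is a plain integer rather than an atomic. It only shows
why a lookup path should refuse to take a reference on an object whose
count is already zero, which is what the kernel's refcount_inc_not_zero()
is for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, non-atomic model of a reference counter. */
struct demo_ref {
    unsigned int count;
};

/* Take a reference only if the object is still alive. */
static bool demo_ref_get_not_zero(struct demo_ref *r)
{
    if (r->count == 0)
        return false;   /* already released: a new ref would be a use-after-free */
    r->count++;
    return true;
}

/* Drop a reference; returns true when the last reference is gone. */
static bool demo_ref_put(struct demo_ref *r)
{
    assert(r->count > 0);   /* dropping below zero would be a bug */
    return --r->count == 0;
}
```

Once demo_ref_put() has returned true, every later
demo_ref_get_not_zero() call fails, so the caller has to treat the object
as gone instead of using it.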

> Looking at the code, the logic for handling the kthread seems off. I'll
> spend a little time to see if I can figure it out.
>

Yeah, it seems there might be multiple bugs in the user workload
handling. The other NULL pointer dereference and the refcount warning
above might be related, but I have yet to reproduce them on an upstream
kernel. I'm also going to look at the code and will post any findings
here.



Tomas