[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240823125426.404f2705@gandalf.local.home>
Date: Fri, 23 Aug 2024 12:54:26 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: tglozar@...hat.com
Cc: linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
jkacur@...hat.com, "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in
timerlat_fd_release
On Tue, 20 Aug 2024 15:00:01 +0200
tglozar@...hat.com wrote:
> From: Tomas Glozar <tglozar@...hat.com>
>
> When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
> NULL pointer dereference can be triggered by sending consequent SIGINT
> and SIGTERM signals to the workload process. That then causes
> timerlat_fd_release to be called twice in a row, and the second time,
> hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
> dereference.
>
> This can be reproduced using rtla:
> ```
> $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
> kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
> [1] 1675
> [1]+ Aborted (SIGTERM) rtla timerlat top -u -q
> [1] 1688
> client_loop: send disconnect: Broken pipe
> ```
> triggering the bug:
> ```
> BUG: kernel NULL pointer dereference, address: 0000000000000010
> Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 6 PID: 1679 Comm: timerlatu/6 Kdump: loaded Not tainted
> 6.10.0-rc2+ #1
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
> 04/01/2014
> RIP: 0010:hrtimer_active+0xd/0x50
> RSP: 0018:ffffa86641567cc0 EFLAGS: 00010286
> RAX: 000000000002e2c0 RBX: ffff994c6bf2e2c8 RCX: ffff994b0911ac18
> RDX: 0000000000000000 RSI: ffff994b02f10700 RDI: ffff994c6bf2e2c8
> RBP: ffff994c6bf2e340 R08: ffff994b158f7400 R09: ffff994b0911ac18
> R10: 0000000000000010 R11: ffff994b00d40f00 R12: ffff994c6bf2e2c8
> R13: ffff994b02049b20 R14: ffff994b011806c0 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff994c6bf00000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 0000000139020006 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die+0x24/0x70
> ? page_fault_oops+0x75/0x170
> ? mt_destroy_walk.isra.0+0x2b3/0x320
> ? exc_page_fault+0x70/0x160
> ? asm_exc_page_fault+0x26/0x30
> ? hrtimer_active+0xd/0x50
> hrtimer_cancel+0x15/0x40
> timerlat_fd_release+0x48/0xe0
> __fput+0xed/0x2c0
> task_work_run+0x59/0x90
> do_exit+0x275/0x4b0
> do_group_exit+0x30/0x80
> get_signal+0x917/0x960
> ? vfs_read+0xb7/0x340
> arch_do_signal_or_restart+0x29/0xf0
> ? syscall_exit_to_user_mode+0x70/0x1f0
> ? syscall_exit_work+0xf3/0x120
> syscall_exit_to_user_mode+0x1a0/0x1f0
> do_syscall_64+0x89/0x160
> ? clear_bhb_loop+0x25/0x80
> ? clear_bhb_loop+0x25/0x80
> ? clear_bhb_loop+0x25/0x80
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f75790fd9ec
> ...
> </TASK>
> ```
>
> Fix the NULL pointer dereference by checking tlat_var->kthread for zero
> first in timerlat_fd_release, before calling hrtimer_cancel.
> tlat_var->kthread is always non-zero unless the entire tlat_var is zero,
> since it is set to the TID of the userspace workload in timerlat_fd_open
> under a mutex.
>
So debugging this a bit more (with my persistent ring buffer!), I realize
this is just fixing a symptom and not the real issue (and I added my other
patch which fixes the kthread issue).
I added trace_printk() for every place the tlat->timer gets initialized,
started, canceled and memset to zero (via the memset() function). Those
looked like this:
for_each_cpu(cpu, cpu_online_mask) {
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
trace_printk("memset 0 %px\n", &tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
}
trace_printk("start timer %px for %px\n", &tlat->timer, tlat->kthread);
hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
etc.
And here's the result I found:
<...>-909 [005] ....1 69.830509: timerlat_fd_release: cancel timer ffff88823c6b5b28 for ffff88810dae6780
<...>-909 [005] ....1 69.830511: timerlat_fd_release: memset 0 ffff88823c6b5b28
<...>-902 [001] ....1 69.831296: timerlat_fd_release: cancel timer ffff88823c4b5b28 for ffff88810dae5640
<...>-902 [001] ....1 69.831300: timerlat_fd_release: memset 0 ffff88823c4b5b28
<...>-901 [000] ....1 69.831319: timerlat_fd_release: cancel timer ffff88823c435b28 for ffff888110b8d640
<...>-901 [000] ....1 69.831323: timerlat_fd_release: memset 0 ffff88823c435b28
<...>-905 [003] ....1 69.831371: timerlat_fd_release: cancel timer ffff88823c5b5b28 for ffff88810dae0000
<...>-905 [003] ....1 69.831375: timerlat_fd_release: memset 0 ffff88823c5b5b28
<...>-904 [002] ....1 69.831390: timerlat_fd_release: cancel timer ffff88823c535b28 for ffff88810dae2280
<...>-904 [002] ....1 69.831394: timerlat_fd_release: memset 0 ffff88823c535b28
<...>-907 [004] ....1 69.831442: timerlat_fd_release: cancel timer ffff88823c635b28 for ffff88810dae33c0
<...>-907 [004] ....1 69.831446: timerlat_fd_release: memset 0 ffff88823c635b28
<...>-911 [007] ....1 69.832689: timerlat_fd_release: cancel timer ffff88823c7b5b28 for ffff88810dae1140
<...>-911 [007] ....1 69.832693: timerlat_fd_release: memset 0 ffff88823c7b5b28
<...>-916 [005] ..... 69.838927: osnoise_workload_start: memset 0 ffff88823c435b28
<...>-916 [005] ..... 69.838931: osnoise_workload_start: memset 0 ffff88823c4b5b28
<...>-916 [005] ..... 69.838932: osnoise_workload_start: memset 0 ffff88823c535b28
<...>-916 [005] ..... 69.838933: osnoise_workload_start: memset 0 ffff88823c5b5b28
<...>-916 [005] ..... 69.838934: osnoise_workload_start: memset 0 ffff88823c635b28
<...>-916 [005] ..... 69.838935: osnoise_workload_start: memset 0 ffff88823c6b5b28
<...>-916 [005] ..... 69.838936: osnoise_workload_start: memset 0 ffff88823c735b28
<...>-916 [005] ..... 69.838938: osnoise_workload_start: memset 0 ffff88823c7b5b28
<...>-910 [006] ....1 69.841066: timerlat_fd_release: cancel timer ffff88823c735b28 for 0000000000000000
<...>-910 [006] d...1 69.841134: console: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
<...>-910 [006] d...1 69.844649: console: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
<...>-910 [006] d...1 69.846887: console: CPU: 6 UID: 0 PID: 910 Comm: timerlatu/6 Not tainted 6.11.0-rc3-test-00027-g014f473a3416-dirty #128 70de8528c323e5e87113ad73b5320a6d112a9ae6
<...>-910 [006] d...1 69.851008: console: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<...>-910 [006] d...1 69.853750: console: RIP: 0010:hrtimer_active+0x58/0x300
<...>-910 [006] d...1 69.855118: console: Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
<...>-910 [006] d...1 69.860562: console: RSP: 0018:ffff88810c7a7718 EFLAGS: 00010202
<...>-910 [006] d...1 69.860569: console:
<...>-910 [006] d...1 69.862099: console: RAX: 0000000000000002 RBX: ffff88823c735b28 RCX: ffffed10478e6b6b
<...>-910 [006] d...1 69.864181: console: RDX: dffffc0000000000 RSI: ffffed10478e6b6c RDI: ffff88823c735b28
<...>-910 [006] d...1 69.866262: console: RBP: 0000000000000000 R08: ffff88823c735b58 R09: ffff88823c735b60
<...>-910 [006] d...1 69.868359: console: R10: ffff88810dae4fe7 R11: ffffed102027af0a R12: ffff88823c735b28
<...>-910 [006] d...1 69.870459: console: R13: ffffffffa14f8d8b R14: 0000000000000010 R15: ffff88823c735b28
<...>-910 [006] d...1 69.872576: console: FS: 0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
<...>-910 [006] d...1 69.874764: console: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<...>-910 [006] d...1 69.876332: console: CR2: 00007fc741ad5f90 CR3: 00000001f729e006 CR4: 0000000000170ef0
<...>-910 [006] d...1 69.878275: console: Call Trace:
<...>-910 [006] d...1 69.878954: console: <TASK>
The timer that crashed (and the other timers would eventually too) was
ffff88823c735b28, and just looking at that timer from the above:
<...>-910 [006] ....1 69.827589: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500
<...>-910 [006] ....1 69.828557: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500
<...>-910 [006] ....1 69.829529: wait_next_period.isra.0: start timer ffff88823c735b28 for ffff88810dae4500
<...>-916 [005] ..... 69.838936: osnoise_workload_start: memset 0 ffff88823c735b28
<...>-910 [006] ....1 69.841066: timerlat_fd_release: cancel timer ffff88823c735b28 for 0000000000000000
It was the memset that cleared it and then calling hrtimer_cancel() on that
broke it. But notice something else here. The timer is never canceled!
That's a real bug.
This wasn't caused by the file descriptor closing twice and
timerlat_fd_release called again. There's no such thing otherwise there
would be lots of broken code in the kernel.
I added a trace_dump_stack():
for_each_cpu(cpu, cpu_online_mask) {
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
trace_printk("memset 0 %px\n", &tlat_var->timer);
trace_dump_stack(0);
memset(tlat_var, 0, sizeof(*tlat_var));
}
And have this:
<...>-924 [006] ..... 42.574419: osnoise_workload_start: memset 0 ffff88823c7b5b28
<...>-924 [006] ..... 42.574423: <stack trace>
=> __ftrace_trace_stack
=> osnoise_workload_start
=> timerlat_tracer_start
=> rb_simple_write
=> vfs_write
=> ksys_write
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-913 [007] ....1 42.580553: timerlat_fd_release: cancel timer ffff88823c7b5b28 for 0000000000000000
<...>-913 [007] d...1 42.580599: console: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
From your test case:
> $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
> kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
The "kill -INT $PID" caused the write to osnoise_workload_start(), and the
after 1ms you do the "kill -TERM $PID" that kill the process which closes
the file descriptor right after the reset.
The real fix here looks to be:
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..400a72cd6ab5 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -265,6 +265,8 @@ static inline void tlat_var_reset(void)
*/
for_each_cpu(cpu, cpu_online_mask) {
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
+ if (tlat_var->kthread)
+ hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
}
}
@@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
- hrtimer_cancel(&tlat_var->timer);
+ if (tlat_var->kthread)
+ hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
osn_var->sampling = 0;
I'll make this into a real patch and send it out.
-- Steve
Powered by blists - more mailing lists