[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170131070406.GA18460@ast-mbp.thefacebook.com>
Date: Mon, 30 Jan 2017 23:04:08 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: crash in perf_event_read
Hi Peter,
rarely I'm seeing the following crash:
[40196.164255] BUG: unable to handle kernel paging request at 000000000000a11a
[40196.179636] IP: perf_event_read+0xd3/0x1a0
[40196.188669] PGD 82e93a067
[40196.188670] PUD 7e1ddf067
[40196.194629] PMD 0
[40196.200589]
[40196.208284] Oops: 0000 [#1] SMP
[40196.208285] Modules linked in:
[40196.208299] CPU: 24 PID: 4423 Comm: dynoKernelMon Not tainted 4.10.0-rc5-01189-gc6e0ad0ee5b0 #599
[40196.208300] Hardware name: Quanta Mono Lake-M.2 SATA 20F20BU0270/Mono Lake-M.2 SATA, BIOS F20_3A12 10/24/2016
[40196.208301] task: ffff8807e3b65580 task.stack: ffffc90009748000
[40196.208302] RIP: 0010:perf_event_read+0xd3/0x1a0
[40196.208303] RSP: 0018:ffffc9000974bd48 EFLAGS: 00010202
[40196.208304] RAX: 000000000000a040 RBX: ffff8807b79fd000 RCX: 0000000000000000
[40196.208304] RDX: 0000000000000018 RSI: 0000000000000000 RDI: 00000000ffffffff
[40196.208305] RBP: ffffc9000974bd80 R08: 0000000000000000 R09: 0000000000000000
[40196.208305] R10: ffff8807cf8c7038 R11: 0000000000000000 R12: ffffc9000974bde8
[40196.208306] R13: 0000000000000000 R14: ffff8807b79fd000 R15: 00000000000004e0
[40196.208307] FS: 00007ff1b45ff700(0000) GS:ffff88085f400000(0000) knlGS:0000000000000000
[40196.208307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40196.208308] CR2: 000000000000a11a CR3: 0000000850298000 CR4: 00000000003406e0
[40196.208308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[40196.208309] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[40196.208309] Call Trace:
[40196.208313] ? __might_sleep+0x4a/0x80
[40196.208316] perf_event_read_value+0x45/0x130
[40196.208318] perf_read+0x84/0x2d0
[40196.208322] __vfs_read+0x28/0x110
[40196.208325] ? security_file_permission+0x9b/0xc0
[40196.208327] vfs_read+0xa5/0x170
[40196.208327] SyS_read+0x46/0xa0
[40196.208329] do_syscall_64+0x4d/0xb0
[40196.208332] entry_SYSCALL64_slow_path+0x25/0x25
[40196.208333] RIP: 0033:0x7ff1b6f5716d
[40196.208334] RSP: 002b:00007ff1b45fd080 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
[40196.208335] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff1b6f5716d
[40196.208335] RDX: 0000000000000020 RSI: 00007ff1b45fd090 RDI: 0000000000000054
[40196.208336] RBP: 00007ff1b45fd0e0 R08: 00007ff1b65e45c0 R09: 00007ff1b45fc9b1
[40196.208336] R10: 0000000000000020 R11: 0000000000000293 R12: 0000000000000000
[40196.208337] R13: 00007ff1b5687000 R14: 00000004a817c7fe R15: 00000000000004e0
[40196.208337] Code: 60 02 00 00 74 30 48 63 cf 48 c7 c0 40 a0 00 00 48 8b 34 cd c0 93 d0 81 48 63 ca 48 8b 0c cd c0 93 d0 81 0f b7 8c 08 da 00 00 00 <66> 39 8c 30 da 00 00 00 0f 44 fa 48 8d 55 d0 b9 01 00 00 00 48
[40196.208353] RIP: perf_event_read+0xd3/0x1a0 RSP: ffffc9000974bd48
[40196.208353] CR2: 000000000000a11a
The RIP points to this asm:
0xffffffff8115fc9b <+155>: mov %gs:0x7eeaa486(%rip),%edx # 0xa128 <cpu_number>
0xffffffff8115fca2 <+162>: testb $0x2,0x68(%rbx)
0xffffffff8115fca6 <+166>: mov 0x260(%rdi),%edi
0xffffffff8115fcac <+172>: je 0xffffffff8115fcde <perf_event_read+222>
0xffffffff8115fcae <+174>: movslq %edi,%rcx
0xffffffff8115fcb1 <+177>: mov $0xa040,%rax
0xffffffff8115fcb8 <+184>: mov -0x7e2f6c40(,%rcx,8),%rsi
0xffffffff8115fcc0 <+192>: movslq %edx,%rcx
0xffffffff8115fcc3 <+195>: mov -0x7e2f6c40(,%rcx,8),%rcx
0xffffffff8115fccb <+203>: movzwl 0xda(%rax,%rcx,1),%ecx
0xffffffff8115fcd3 <+211>: cmp %cx,0xda(%rax,%rsi,1)
which is this C code:
perf_event_read():
local_cpu = get_cpu();
cpu_to_read = find_cpu_to_read(event, local_cpu);
put_cpu();
find_cpu_to_read():
event_pkg = topology_physical_package_id(event_cpu);
local_pkg = topology_physical_package_id(local_cpu);
if (event_pkg == local_pkg)
If I read the asm correctly at the time of the crash
event_cpu == RDI == 00000000ffffffff
or in other words event->oncpu == -1
which I think is technically possible here.
Any suggestions how to fix this?
Happy to test any patches, though I don't know how to reproduce reliably.
Thanks
Powered by blists - more mailing lists