linux-kernel - [BUG] possible deadlock in __schedule (with reproducer available)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <24481522-69BF-4CE7-A05D-1E7398400D80@u.nus.edu>
Date: Sat, 23 Nov 2024 03:39:45 +0000
From: Ruan Bonan <bonan.ruan@...us.edu>
To: "peterz@...radead.org" <peterz@...radead.org>, "mingo@...hat.com"
	<mingo@...hat.com>, "will@...nel.org" <will@...nel.org>, "longman@...hat.com"
	<longman@...hat.com>, "boqun.feng@...il.com" <boqun.feng@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"kpsingh@...nel.org" <kpsingh@...nel.org>, "mattbobrowski@...gle.com"
	<mattbobrowski@...gle.com>, "ast@...nel.org" <ast@...nel.org>,
	"daniel@...earbox.net" <daniel@...earbox.net>, "andrii@...nel.org"
	<andrii@...nel.org>, "martin.lau@...ux.dev" <martin.lau@...ux.dev>,
	"eddyz87@...il.com" <eddyz87@...il.com>, "song@...nel.org" <song@...nel.org>,
	"yonghong.song@...ux.dev" <yonghong.song@...ux.dev>,
	"john.fastabend@...il.com" <john.fastabend@...il.com>, "sdf@...ichev.me"
	<sdf@...ichev.me>, "haoluo@...gle.com" <haoluo@...gle.com>,
	"jolsa@...nel.org" <jolsa@...nel.org>, "rostedt@...dmis.org"
	<rostedt@...dmis.org>, "mhiramat@...nel.org" <mhiramat@...nel.org>,
	"mathieu.desnoyers@...icios.com" <mathieu.desnoyers@...icios.com>,
	"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
	"linux-trace-kernel@...r.kernel.org" <linux-trace-kernel@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC: Fu Yeqi <e1374359@...us.edu>
Subject: [BUG] possible deadlock in __schedule (with reproducer available)

Hi there,

Recently we found a possible deadlock in __schedule, which is triggered together with BPF functionalities. The reproducer is available and we successfully reproduce this bug with the up-to-date source code of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git (commit: fcc79e1714e8c2b8e216dc3149812edd37884eef).

The final locking scenario is very similar to one existing report from syzbot: https://lore.kernel.org/bpf/00000000000034c0520608242de6@google.com/T/, while the call traces are different, and the aforementioned one seems to be ignored due to the lack of reproducer.

Syzkaller hit 'possible deadlock in __schedule' bug.

 </TASK>
FAULT_INJECTION: forcing a failure.
name fail_usercopy, interval 1, probability 0, space 0, times 0
======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc7-00144-g66418447d27b #8 Not tainted
------------------------------------------------------
syz-executor144/330 is trying to acquire lock:
ffffffffbcd2da38 ((console_sem).lock){....}-{2:2}, at: down_trylock+0x20/0xa0 kernel/locking/semaphore.c:139

but task is already holding lock:
ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested kernel/sched/core.c:598 [inline]
ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock kernel/sched/sched.h:1506 [inline]
ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: rq_lock kernel/sched/sched.h:1805 [inline]
ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x140/0x1e70 kernel/sched/core.c:6592

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&rq->__lock){-.-.}-{2:2}:
       lock_acquire+0x138/0x300 kernel/locking/lockdep.c:5825
       _raw_spin_lock_nested+0x29/0x40 kernel/locking/spinlock.c:378
       raw_spin_rq_lock_nested kernel/sched/core.c:598 [inline]
       raw_spin_rq_lock kernel/sched/sched.h:1506 [inline]
       __task_rq_lock+0xcf/0x410 kernel/sched/core.c:676
       wake_up_new_task+0x4db/0x9b0 kernel/sched/core.c:4825
       kernel_clone+0x356/0x5a0 kernel/fork.c:2817
       user_mode_thread+0xd6/0x120 kernel/fork.c:2864
       rest_init+0x23/0x300 init/main.c:712
       start_kernel+0x4a3/0x540 init/main.c:1105
       x86_64_start_reservations+0x24/0x30 arch/x86/kernel/head64.c:507
       x86_64_start_kernel+0x79/0x80 arch/x86/kernel/head64.c:488
       common_startup_64+0x12c/0x137

-> #1 (&p->pi_lock){-.-.}-{2:2}:
       lock_acquire+0x138/0x300 kernel/locking/lockdep.c:5825
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x8e/0xc0 kernel/locking/spinlock.c:162
       class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
       try_to_wake_up+0x5f/0xe10 kernel/sched/core.c:4165
       up+0x72/0x90 kernel/locking/semaphore.c:191
       __up_console_sem kernel/printk/printk.c:343 [inline]
       __console_unlock+0xc3/0xf0 kernel/printk/printk.c:2844
       __console_flush_and_unlock kernel/printk/printk.c:3241 [inline]
       console_unlock+0x16c/0x390 kernel/printk/printk.c:3279
       vprintk_emit+0x670/0xb90 kernel/printk/printk.c:2407
       dev_vprintk_emit+0x2bc/0x360 drivers/base/core.c:4942
       dev_printk_emit+0x82/0xb0 drivers/base/core.c:4953
       _dev_warn+0xcb/0xf0 drivers/base/core.c:5009
       _request_firmware+0xe4e/0xf00 drivers/base/firmware_loader/main.c:938
       request_firmware_work_func+0xda/0x1f0 drivers/base/firmware_loader/main.c:1195
       process_one_work kernel/workqueue.c:3229 [inline]
       process_scheduled_works+0x93b/0x13a0 kernel/workqueue.c:3310
       worker_thread+0xa53/0xfc0 kernel/workqueue.c:3391
       kthread+0x301/0x3b0 kernel/kthread.c:389
       ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244

-> #0 ((console_sem).lock){....}-{2:2}:
       check_prev_add kernel/locking/lockdep.c:3161 [inline]
       check_prevs_add kernel/locking/lockdep.c:3280 [inline]
       validate_chain+0x1b82/0x5750 kernel/locking/lockdep.c:3904
       __lock_acquire+0x139d/0x2040 kernel/locking/lockdep.c:5202
       lock_acquire+0x138/0x300 kernel/locking/lockdep.c:5825
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x8e/0xc0 kernel/locking/spinlock.c:162
       down_trylock+0x20/0xa0 kernel/locking/semaphore.c:139
       __down_trylock_console_sem+0x99/0x120 kernel/printk/printk.c:326
       console_trylock kernel/printk/printk.c:2827 [inline]
       console_trylock_spinning kernel/printk/printk.c:1990 [inline]
       vprintk_emit+0x414/0xb90 kernel/printk/printk.c:2406
       _printk+0x7a/0xa0 kernel/printk/printk.c:2432
       fail_dump lib/fault-inject.c:46 [inline]
       should_fail_ex+0x3be/0x570 lib/fault-inject.c:154
       strncpy_from_user+0x36/0x230 lib/strncpy_from_user.c:118
       strncpy_from_user_nofault+0x71/0x140 mm/maccess.c:186
       bpf_probe_read_user_str_common kernel/trace/bpf_trace.c:215 [inline]
       ____bpf_probe_read_user_str kernel/trace/bpf_trace.c:224 [inline]
       bpf_probe_read_user_str+0x2a/0x70 kernel/trace/bpf_trace.c:221
       bpf_prog_bc7c5c6b9645592f+0x3e/0x40
       bpf_dispatcher_nop_func include/linux/bpf.h:1265 [inline]
       __bpf_prog_run include/linux/filter.h:701 [inline]
       bpf_prog_run include/linux/filter.h:708 [inline]
       __bpf_trace_run kernel/trace/bpf_trace.c:2316 [inline]
       bpf_trace_run4+0x30b/0x4d0 kernel/trace/bpf_trace.c:2359
       __bpf_trace_sched_switch+0x1c6/0x2c0 include/trace/events/sched.h:222
       trace_sched_switch+0x12a/0x190 include/trace/events/sched.h:222
       __schedule+0xade/0x1e70 kernel/sched/core.c:6690
       __schedule_loop kernel/sched/core.c:6770 [inline]
       schedule+0xab/0x260 kernel/sched/core.c:6785
       exit_to_user_mode_loop kernel/entry/common.c:102 [inline]
       exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
       __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
       syscall_exit_to_user_mode+0x11d/0x250 kernel/entry/common.c:218
       do_syscall_64+0xed/0x1c0 arch/x86/entry/common.c:89
       entry_SYSCALL_64_after_hwframe+0x67/0x6f

other info that might help us debug this:

Chain exists of:
  (console_sem).lock --> &p->pi_lock --> &rq->__lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->__lock);
                               lock(&p->pi_lock);
                               lock(&rq->__lock);
  lock((console_sem).lock);

 *** DEADLOCK ***

2 locks held by syz-executor144/330:
 #0: ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested kernel/sched/core.c:598 [inline]
 #0: ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock kernel/sched/sched.h:1506 [inline]
 #0: ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: rq_lock kernel/sched/sched.h:1805 [inline]
 #0: ffff888065cbd718 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x140/0x1e70 kernel/sched/core.c:6592
 #1: ffffffffbce55000 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #1: ffffffffbce55000 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #1: ffffffffbce55000 (rcu_read_lock){....}-{1:2}, at: __bpf_trace_run kernel/trace/bpf_trace.c:2315 [inline]
 #1: ffffffffbce55000 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x21e/0x4d0 kernel/trace/bpf_trace.c:2359



Syzkaller reproducer:
# {Threaded:false Repeat:true RepeatTimes:0 Procs:1 Slowdown:1 Sandbox: SandboxArg:0 Leak:false NetInjection:false NetDevices:false NetReset:false Cgroups:false BinfmtMisc:false CloseFDs:false KCSAN:false DevlinkPCI:false NicVF:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:false Swap:false UseTmpDir:false HandleSegv:false Trace:false LegacyOptions:{Collide:false Fault:false FaultCall:0 FaultNth:0}}
r0 = bpf$PROG_LOAD(0x5, &(0x7f00000000c0)={0x11, 0xb, &(0x7f0000000640)=ANY=[@ANYBLOB="18000000000000000000000000000000180100002020702500000000002020207b1af8ff00000000bfa100000000000007010000f8ffffffb702000004000000b703000000000000850000007200000095"], &(0x7f0000000200)='GPL\x00', 0x0, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, @fallback, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @void, @value}, 0x90)
bpf$BPF_RAW_TRACEPOINT_OPEN(0x11, &(0x7f0000000400)={&(0x7f0000000040)='sched_switch\x00', r0}, 0x10)
bpf$MAP_CREATE(0x0, 0x0, 0x0) (fail_nth: 1)

... Omitted ...


C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#ifndef __NR_bpf
#define __NR_bpf 321
#endif

static void sleep_ms(uint64_t ms)
{
  usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts))
    exit(1);
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static bool write_file(const char* file, const char* what, ...)
{
  char buf[1024];
  va_list args;
  va_start(args, what);
  vsnprintf(buf, sizeof(buf), what, args);
  va_end(args);
  buf[sizeof(buf) - 1] = 0;
  int len = strlen(buf);
  int fd = open(file, O_WRONLY | O_CLOEXEC);
  if (fd == -1)
    return false;
  if (write(fd, buf, len) != len) {
    int err = errno;
    close(fd);
    errno = err;
    return false;
  }
  close(fd);
  return true;
}

static int inject_fault(int nth)
{
  int fd;
  fd = open("/proc/thread-self/fail-nth", O_RDWR);
  if (fd == -1)
    exit(1);
  char buf[16];
  sprintf(buf, "%d", nth);
  if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
    exit(1);
  return fd;
}

static void kill_and_wait(int pid, int* status)
{
  kill(-pid, SIGKILL);
  kill(pid, SIGKILL);
  for (int i = 0; i < 100; i++) {
    if (waitpid(-1, status, WNOHANG | __WALL) == pid)
      return;
    usleep(1000);
  }
  DIR* dir = opendir("/sys/fs/fuse/connections");
  if (dir) {
    for (;;) {
      struct dirent* ent = readdir(dir);
      if (!ent)
        break;
      if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
        continue;
      char abort[300];
      snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
               ent->d_name);
      int fd = open(abort, O_WRONLY);
      if (fd == -1) {
        continue;
      }
      if (write(fd, abort, 1) < 0) {
      }
      close(fd);
    }
    closedir(dir);
  } else {
  }
  while (waitpid(-1, status, __WALL) != pid) {
  }
}

static void setup_test()
{
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  write_file("/proc/self/oom_score_adj", "1000");
}

static const char* setup_fault()
{
  int fd = open("/proc/self/make-it-fail", O_WRONLY);
  if (fd == -1)
    return "CONFIG_FAULT_INJECTION is not enabled";
  close(fd);
  fd = open("/proc/thread-self/fail-nth", O_WRONLY);
  if (fd == -1)
    return "kernel does not have systematic fault injection support";
  close(fd);
  static struct {
    const char* file;
    const char* val;
    bool fatal;
  } files[] = {
      {"/sys/kernel/debug/failslab/ignore-gfp-wait", "N", true},
      {"/sys/kernel/debug/fail_futex/ignore-private", "N", false},
      {"/sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem", "N", false},
      {"/sys/kernel/debug/fail_page_alloc/ignore-gfp-wait", "N", false},
      {"/sys/kernel/debug/fail_page_alloc/min-order", "0", false},
  };
  unsigned i;
  for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
    if (!write_file(files[i].file, files[i].val)) {
      if (files[i].fatal)
        return "failed to write fault injection file";
    }
  }
  return NULL;
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
  int iter = 0;
  for (;; iter++) {
    int pid = fork();
    if (pid < 0)
      exit(1);
    if (pid == 0) {
      setup_test();
      execute_one();
      exit(0);
    }
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      sleep_ms(10);
      if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
        break;
      if (current_time_ms() - start < 5000)
        continue;
      kill_and_wait(pid, &status);
      break;
    }
  }
}

uint64_t r[1] = {0xffffffffffffffff};

void execute_one(void)
{
  intptr_t res = 0;
  if (write(1, "executing program\n", sizeof("executing program\n") - 1)) {
  }
  *(uint32_t*)0x200000c0 = 0x11;
  *(uint32_t*)0x200000c4 = 0xb;
  *(uint64_t*)0x200000c8 = 0x20000640;
  memcpy((void*)0x20000640,
         "\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18"
         "\x01\x00\x00\x20\x20\x70\x25\x00\x00\x00\x00\x00\x20\x20\x20\x7b\x1a"
         "\xf8\xff\x00\x00\x00\x00\xbf\xa1\x00\x00\x00\x00\x00\x00\x07\x01\x00"
         "\x00\xf8\xff\xff\xff\xb7\x02\x00\x00\x04\x00\x00\x00\xb7\x03\x00\x00"
         "\x00\x00\x00\x00\x85\x00\x00\x00\x72\x00\x00\x00\x95",
         81);
  *(uint64_t*)0x200000d0 = 0x20000200;
  memcpy((void*)0x20000200, "GPL\000", 4);
  *(uint32_t*)0x200000d8 = 0;
  *(uint32_t*)0x200000dc = 0;
  *(uint64_t*)0x200000e0 = 0;
  *(uint32_t*)0x200000e8 = 0;
  *(uint32_t*)0x200000ec = 0;
  memset((void*)0x200000f0, 0, 16);
  *(uint32_t*)0x20000100 = 0;
  *(uint32_t*)0x20000104 = 0;
  *(uint32_t*)0x20000108 = -1;
  *(uint32_t*)0x2000010c = 0;
  *(uint64_t*)0x20000110 = 0;
  *(uint32_t*)0x20000118 = 0;
  *(uint32_t*)0x2000011c = 0;
  *(uint64_t*)0x20000120 = 0;
  *(uint32_t*)0x20000128 = 0;
  *(uint32_t*)0x2000012c = 0;
  *(uint32_t*)0x20000130 = 0;
  *(uint32_t*)0x20000134 = 0;
  *(uint64_t*)0x20000138 = 0;
  *(uint64_t*)0x20000140 = 0;
  *(uint32_t*)0x20000148 = 0;
  *(uint32_t*)0x2000014c = 0;
  *(uint32_t*)0x20000150 = 0;
  res = syscall(__NR_bpf, /*cmd=*/5ul, /*arg=*/0x200000c0ul, /*size=*/0x90ul);
  if (res != -1)
    r[0] = res;
  *(uint64_t*)0x20000400 = 0x20000040;
  memcpy((void*)0x20000040, "sched_switch\000", 13);
  *(uint32_t*)0x20000408 = r[0];
  *(uint32_t*)0x2000040c = 0;
  *(uint64_t*)0x20000410 = 0;
  syscall(__NR_bpf, /*cmd=*/0x11ul, /*arg=*/0x20000400ul, /*size=*/0x10ul);
  inject_fault(1);
  syscall(__NR_bpf, /*cmd=*/0ul, /*arg=*/0ul, /*size=*/0ul);
}
int main(void)
{
  syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
          /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
          /*offset=*/0ul);
  syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
          /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
          /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
          /*offset=*/0ul);
  syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
          /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
          /*offset=*/0ul);
  const char* reason;
  (void)reason;
  if ((reason = setup_fault()))
    printf("the reproducer may not work as expected: fault injection setup "
           "failed: %s\n",
           reason);
  loop();
  return 0;
}

Best,
Bonan