lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 4 Feb 2014 12:05:00 -0500
From:	Steven Rostedt <rostedt@...dmis.org>
To:	LKML <linux-kernel@...r.kernel.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Al Viro <viro@...IV.linux.org.uk>,
	David Smith <dsmith@...hat.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Igor Zhbanov <i.zhbanov@...sung.com>,
	Christoph Hellwig <hch@...radead.org>,
	Paul Menage <menage@...gle.com>
Subject: [RFC][PATCH] exec: Fix use after free of tracepoint
 trace_sched_process_exec

Igor reported garbage in a trace that looked like this:

           login-1718  [000] ...1  1340.195118: sched_process_exec: filename=/bin/login pid=1718 old_pid=1718
 systemd-cgroups-2061  [006] ...1  1341.672207: sched_process_exec: filename=/usr/lib/systemd/systemd-cgroups-agent pid=2061 old_pid=2061
 systemd-cgroups-2063  [004] ...1  1341.673437: sched_process_exec: filename=ÀÈûÉ pid=2063 old_pid=2063

See the junk in the last line for filename.

Investigating this, I found that the problem is where the tracepoint
lies. In exec.c exec_binprm() we have:

	ret = search_binary_handler(bprm);
	if (ret >= 0) {
		audit_bprm(bprm);
		trace_sched_process_exec(current, old_pid, bprm);

The tracepoint uses bprm->filename to record the process that was
executed. The problem is that the handlers called in
search_binary_handler() can release the process' old memory for the
new memory. The bprm->filename is not in this memory, but the issue is
when this happens, if the parent used a vfork() it will wake up at
that moment. To explain exactly what happens, we have this:

cgroup_release_agent()
 agentbuf = kstrdup();
 argv[0] = argentbuf;
 call_usermodehelper(argv[0],...,UMH_WAIT_EXEC);
  call_usermodehelper_setup()
    sub_info->path = path; (argv[0] from parameter)
    sub_info->work = __call_usermodehelper;
  call_usermodehelper_exec()
    queue_work(subinfo->work);
    wait_for_completion();

Now the work gets called:

__call_usermodehelper
  kernel_thread(call_helper, CLONE_VFORK);

Notice the VFORK flag. This means that the parent will sleep till the
child exits or execs.

call_helper()
  ____call_usermodehelper()
    do_execve(subinfo->path)
      filename = subinfo->path; (filename is allocated agentbuf)
      do_execve_common(filename)
        bprm->filename = filename;
        exec_bprm(bprm)
          search_binary_handler(bprm)
            fmt->load_binary() (load_elf_binary);
              load_elf_binary()
                flush_old_exec()
                  exec_mmap()
                    mm_release()
                      if (tsk->vfork_done)
                         complete_vfork_done(tsk);

Here we wake up the parent. The one that called
kthread_thread(call_helper). Now that one continues to run:

 	case UMH_WAIT_EXEC:
		if (pid < 0)
			sub_info->retval = pid;
		umh_complete(sub_info);

Now this wakes up the process that's waiting for completion. Which
continues to run and does:

		kfree(agentbuf);

This is what bprm->filename points to.

When the search_binary_handler() returns, (with a possibility that
filename has been freed), when it calls the tracepoint, the tracepoint
will record bogus data. Luckily it didn't crash the kernel.

Adding a few strategic trace_printks, we can see this happening:

     kworker/3:1-51    [003] ....    59.700772: cgroup_release_agent: agentbuf = ffff880114fc28c0 /usr/lib/systemd/systemd-cgroups-agent
 systemd-cgroups-1925  [005] ....    59.700866: do_execve_common.isra.25: calling search_binary_handler (bprm->filename=ffff880114fc28c0 /usr/lib/systemd/systemd-cgroups-agent
 systemd-cgroups-1925  [005] ....    59.700867: search_binary_handler: fmt->load_binary = load_misc_binary+0x0/0x302
 systemd-cgroups-1925  [005] ....    59.700868: search_binary_handler: fmt->load_binary = load_script+0x0/0x1e8
 systemd-cgroups-1925  [005] ....    59.700868: search_binary_handler: fmt->load_binary = load_elf_binary+0x0/0x179b
 systemd-cgroups-1925  [005] ....    59.700876: flush_old_exec: calling exec_mmap bprm->filename=ffff880114fc28c0 /usr/lib/systemd/systemd-cgroups-agent
 systemd-cgroups-1925  [005] ....    59.700876: flush_old_exec: calling mm_release
 systemd-cgroups-1925  [005] ....    59.700876: mm_release: tsk->vfork_done = ffff880000091d68
     kworker/3:1-51    [003] ....    59.700904: cgroup_release_agent: freeing agentbuf
 systemd-cgroups-1925  [005] ....    59.700911: do_execve_common.isra.25: back from binary (bprm->filename=ffff880114fc28c0 @$ü..ˆÿÿ/systemd/systemd-cgroups-agent
 systemd-cgroups-1925  [005] ...1    59.700912: sched_process_exec: filename=@$ü..ˆÿÿ/systemd/systemd-cgroups-agent pid=1925 old_pid=1925


The junk starts after "freeing agentbuf"

Now to fix this we need to save the filename before calling
search_binary_handler(). But we don't want to save it if we are not
tracing. Why slow everyone else down?  To prevent this, I added
a static key branch (nop or unconditional jump) that gets enabled when
the tracepoint sched_process_exec is registered, and disabled when the
tracepoint is unregistered.

When the tracepoint is registered, the filename will be allocated by a
kstrdup and freed after the call. If filename fails to allocate, we
simply fall back to brpm->tcomm (just the name, not the path).

This works, but is rather ugly. Looking for any other suggestions here.

Reported-by: Igor Zhbanov <i.zhbanov@...sung.com>
[ I'm assuming this is the right person. All I have is a nick "IZh" on IRC]
Signed-off-by: Steven Rostedt <rostedt@...dmis.org>

diff --git a/fs/exec.c b/fs/exec.c
index e1529b4..9bb1b0f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1407,9 +1407,22 @@ int search_binary_handler(struct linux_binprm *bprm)
 }
 EXPORT_SYMBOL(search_binary_handler);
 
+static struct static_key sched_process_exec_key = STATIC_KEY_INIT_FALSE;
+
+void sched_process_exec_key_reg(void)
+{
+	static_key_slow_inc(&sched_process_exec_key);
+}
+
+void sched_process_exec_key_unreg(void)
+{
+	static_key_slow_dec(&sched_process_exec_key);
+}
+
 static int exec_binprm(struct linux_binprm *bprm)
 {
 	pid_t old_pid, old_vpid;
+	char *filename;
 	int ret;
 
 	/* Need to fetch pid before load_binary changes it */
@@ -1418,10 +1431,34 @@ static int exec_binprm(struct linux_binprm *bprm)
 	old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
 	rcu_read_unlock();
 
-	ret = search_binary_handler(bprm);
+	/*
+	 * When the sched_process_exec tracepoint is activated, it
+	 * enables the static key sched_process_exec_key to do some more
+	 * work. The issue is that the tracepoint wants to record the
+	 * filename of the bprm but only if search_binary_handler() returns
+	 * without an error. The problem is that the search_binary_handler()
+	 * may call mm_release() which can wake up the parent that did a vfork
+	 * (see cgroup call_usermodehelper()) and the parent may free bprm->filename.
+	 *
+	 * The filename must be saved before calling search_binary_handler().
+	 * But as this is only needed for tracing, only do this work if
+	 * the tracepoint is enabled.
+	 *
+	 * The static key usage removes conditional branches as well.
+	 */ 
+	if (static_key_false(&sched_process_exec_key)) {
+		/* If filename fails to alloc, just use tcomm */
+		filename = kstrdup(bprm->filename, GFP_KERNEL);
+		ret = search_binary_handler(bprm);
+		if (ret >= 0)
+			trace_sched_process_exec(current, old_pid,
+						 filename ? filename : bprm->tcomm);
+		kfree(filename);
+	} else
+		ret = search_binary_handler(bprm);
+
 	if (ret >= 0) {
 		audit_bprm(bprm);
-		trace_sched_process_exec(current, old_pid, bprm);
 		ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
 		proc_exec_connector(current);
 	}
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..a7eaa06 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -276,30 +276,35 @@ TRACE_EVENT(sched_process_fork,
 		__entry->child_comm, __entry->child_pid)
 );
 
+extern void sched_process_exec_key_reg(void);
+extern void sched_process_exec_key_unreg(void);
+
 /*
  * Tracepoint for exec:
  */
-TRACE_EVENT(sched_process_exec,
+TRACE_EVENT_FN(sched_process_exec,
 
-	TP_PROTO(struct task_struct *p, pid_t old_pid,
-		 struct linux_binprm *bprm),
+	TP_PROTO(struct task_struct *p, pid_t old_pid, char *filename),
 
-	TP_ARGS(p, old_pid, bprm),
+	TP_ARGS(p, old_pid, filename),
 
 	TP_STRUCT__entry(
-		__string(	filename,	bprm->filename	)
+		__string(	filename,	filename	)
 		__field(	pid_t,		pid		)
 		__field(	pid_t,		old_pid		)
 	),
 
 	TP_fast_assign(
-		__assign_str(filename, bprm->filename);
+		__assign_str(filename, filename);
 		__entry->pid		= p->pid;
 		__entry->old_pid	= old_pid;
 	),
 
 	TP_printk("filename=%s pid=%d old_pid=%d", __get_str(filename),
-		  __entry->pid, __entry->old_pid)
+		  __entry->pid, __entry->old_pid),
+
+	sched_process_exec_key_reg,
+	sched_process_exec_key_unreg
 );
 
 /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists