linux-kernel - Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

Message-ID: <20190319222652.GA105485@google.com>
Date:   Tue, 19 Mar 2019 18:26:52 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Christian Brauner <christian@...uner.io>
Cc:     Daniel Colascione <dancol@...gle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Sultan Alsawaf <sultan@...neltoast.com>,
        Tim Murray <timmurray@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Arve Hjønnevåg <arve@...roid.com>,
        Todd Kjos <tkjos@...roid.com>,
        Martijn Coenen <maco@...roid.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        "open list:ANDROID DRIVERS" <devel@...verdev.osuosl.org>,
        linux-mm <linux-mm@...ck.org>,
        kernel-team <kernel-team@...roid.com>,
        Oleg Nesterov <oleg@...hat.com>,
        Andy Lutomirski <luto@...capital.net>,
        "Serge E. Hallyn" <serge@...lyn.com>, keescook@...omium.org
Subject: Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

On Tue, Mar 19, 2019 at 11:14:17PM +0100, Christian Brauner wrote:
[snip] 
> > 
> > ---8<-----------------------
> > 
> > From: Joel Fernandes <joelaf@...gle.com>
> > Subject: [PATCH] Partial skeleton prototype of pidfd_wait frontend
> > 
> > Signed-off-by: Joel Fernandes <joelaf@...gle.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/syscalls.h               |  1 +
> >  include/uapi/asm-generic/unistd.h      |  4 +-
> >  kernel/signal.c                        | 62 ++++++++++++++++++++++++++
> >  kernel/sys_ni.c                        |  3 ++
> >  6 files changed, 71 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 1f9607ed087c..2a63f1896b63 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -433,3 +433,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > +428	i386	pidfd_wait		sys_pidfd_wait			__ia32_sys_pidfd_wait
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 92ee0b4378d4..cf2e08a8053b 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -349,6 +349,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	pidfd_wait		__x64_sys_pidfd_wait
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e446806a561f..62160970ed3f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -988,6 +988,7 @@ asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
> >  asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
> >  				       siginfo_t __user *info,
> >  				       unsigned int flags);
> > +asmlinkage long sys_pidfd_wait(int pidfd);
> >  
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..137aa8662230 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_pidfd_wait 428
> > +__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
> >  
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 428
> > +#define __NR_syscalls 429
> >  
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index b7953934aa99..ebb550b87044 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3550,6 +3550,68 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > +static ssize_t pidfd_wait_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +	/*
> > +	 * This is just a test string, it will contain the actual
> > +	 * status of the pidfd in the future.
> > +	 */
> > +	char buf[] = "status";
> > +
> > +	return copy_to_iter(buf, strlen(buf)+1, to);
> > +}
> > +
> > +static const struct file_operations pidfd_wait_file_ops = {
> > +	.read_iter	= pidfd_wait_read_iter,
> > +};
> > +
> > +static struct inode *pidfd_wait_get_inode(struct super_block *sb)
> > +{
> > +	struct inode *inode = new_inode(sb);
> > +
> > +	inode->i_ino = get_next_ino();
> > +	inode_init_owner(inode, NULL, S_IFREG);
> > +
> > +	inode->i_op		= &simple_dir_inode_operations;
> > +	inode->i_fop		= &pidfd_wait_file_ops;
> > +
> > +	return inode;
> > +}
> > +
> > +SYSCALL_DEFINE1(pidfd_wait, int, pidfd)
> > +{
> > +	struct fd f;
> > +	struct inode *inode;
> > +	struct file *file;
> > +	int new_fd;
> > +	struct pid_namespace *pid_ns;
> > +	struct super_block *sb;
> > +	struct vfsmount *mnt;
> > +
> > +	f = fdget_raw(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	sb = file_inode(f.file)->i_sb;
> > +	pid_ns = sb->s_fs_info;
> > +
> > +	inode = pidfd_wait_get_inode(sb);
> > +
> > +	mnt = pid_ns->proc_mnt;
> > +
> > +	file = alloc_file_pseudo(inode, mnt, "pidfd_wait", O_RDONLY,
> > +			&pidfd_wait_file_ops);
> 
> So I dislike the idea of allocating new inodes from the procfs super
> block. I would like to avoid pinning the whole pidfd concept exclusively
> to proc. The idea is that the pidfd API will be usable through procfs
> via open("/proc/<pid>") because that is what users expect and have
> wanted for a long time. So it makes sense to have this working.
> But it should really be usable without it. That's why translate_pid()
> and pidfd_clone() are on the table. What I'm saying is, once the pidfd
> API is "complete" you should be able to set CONFIG_PROC_FS=n - even
> though that's crazy - and still be able to use pidfds. This is also a
> point akpm asked about when I did the pidfd_send_signal work.

Oh, ok. Somehow 'proc' and 'pid' sound very similar in terminology so
naturally I felt the proc fs superblock would be a fit, but I see your point.

> So instead of going through proc we should probably do what David has
> been doing in the mount API and come to rely on anon_inodes. So
> something like:
> 
> fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags);
> 
> and stash information such as the pid namespace etc. in a pidfd struct
> or something similar that we can then stash in file->private_data of
> the new file. This also lets us avoid all the open coding done here.
> Another advantage is that anon_inodes is its own kernel-internal
> filesystem.

Thanks for the suggestion! Agreed, this is better, and I will do it this way.
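
To make the ownership pattern concrete, here is a minimal userspace model
of it. This is not kernel code: struct file_model, struct fops_model,
struct pidfd_model and every function below are hypothetical stand-ins
invented for illustration, and anon_file_model() only mimics the role
that anon_inode_getfd() would play (it returns a pointer rather than
installing a file descriptor). The point it sketches is that the per-fd
state lives in private_data, independent of any procfs superblock, and
that the release hook is what frees it.

```c
#include <stdio.h>
#include <stdlib.h>

struct file_model;

/* Stand-in for struct file_operations (hypothetical, heavily reduced). */
struct fops_model {
	int (*read)(struct file_model *f, char *buf, size_t len);
	void (*release)(struct file_model *f);
};

/* Stand-in for struct file: just the ops table and private_data. */
struct file_model {
	const struct fops_model *f_op;
	void *private_data;
};

/* Hypothetical per-fd state ("pid namespace etc."), reduced to two ints. */
struct pidfd_model {
	int pid;      /* would be a struct pid * in the kernel */
	int ns_level; /* would be a struct pid_namespace * in the kernel */
};

/* read hook: everything it needs comes from private_data. */
static int pidfd_model_read(struct file_model *f, char *buf, size_t len)
{
	const struct pidfd_model *p = f->private_data;

	return snprintf(buf, len, "pid=%d ns=%d", p->pid, p->ns_level);
}

/* release hook: the file owns private_data, so it is freed here. */
static void pidfd_model_release(struct file_model *f)
{
	free(f->private_data);
	f->private_data = NULL;
}

static const struct fops_model pidfd_model_fops = {
	.read    = pidfd_model_read,
	.release = pidfd_model_release,
};

/* Plays the role of anon_inode_getfd(): bind ops and private data. */
static struct file_model *anon_file_model(const struct fops_model *fops,
					  void *priv)
{
	struct file_model *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->f_op = fops;
	f->private_data = priv;
	return f;
}

/* Allocate the state, then hand ownership to the new file. */
static struct file_model *pidfd_model_open(int pid, int ns_level)
{
	struct pidfd_model *p = malloc(sizeof(*p));
	struct file_model *f;

	if (!p)
		return NULL;
	p->pid = pid;
	p->ns_level = ns_level;

	f = anon_file_model(&pidfd_model_fops, p);
	if (!f)
		free(p); /* ownership was never transferred */
	return f;
}

/* Models the final fput(): run the release hook, then free the file. */
static void file_model_close(struct file_model *f)
{
	f->f_op->release(f);
	free(f);
}
```

In this model, pidfd_model_open(1234, 0) followed by
f->f_op->read(f, buf, sizeof(buf)) fills buf with "pid=1234 ns=0", and
file_model_close(f) frees both the state and the file. A real version
would also have to take and drop references on the pid and its namespace,
which the model leaves out.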

thanks,

 - Joel