[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181005220518.747ri4q34obrnaoc@ast-mbp.dhcp.thefacebook.com>
Date:   Fri, 5 Oct 2018 15:05:20 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Al Viro <viro@...IV.linux.org.uk>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        "David S . Miller" <davem@...emloft.net>, daniel@...earbox.net,
        luto@...capital.net, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH bpf-next 1/6] bpf: introduce BPF_PROG_TYPE_FILE_FILTER
On Fri, Oct 05, 2018 at 05:46:59AM +0100, Al Viro wrote:
> On Wed, Oct 03, 2018 at 07:57:45PM -0700, Alexei Starovoitov wrote:
>  
> > @@ -15,6 +15,7 @@
> >  #include <linux/bpf.h>
> >  #include <linux/bpf-cgroup.h>
> >  #include <net/sock.h>
> > +#include <../fs/mount.h>
> 
> No.
I've considered splitting cgroup_file_filter_ctx_access() into
separate .c inside fs/ directory, but it felt odd to move just that
function whereas the rest of the bpf bits are in kernel/bpf/
only to avoid doing this "../fs/" hack.
How about calling this new file fs/bpf_file_filter.c ?
> > +	struct file *file = NULL;
> > +	struct inode *inode;
> > +	struct super_block *sb;
> > +	struct mount *mnt;
> 
> Fuck, no.
> 
> > +	case offsetof(struct bpf_file_info, mnt_id):
> > +		/* dst = real_mount(file->f_path.mnt)->mnt_id */
> > +		mnt = real_mount(LD_1(file->f_path.mnt));
> > +		LD_n(mnt->mnt_id);
> 
> NAK.  Anything in struct mount is private to just a couple of
> files in fs/*.c.  Don't do that.  And keep in mind that internal
> details can and will be changed at zero notice, so be careful
> with adding such stuff.
yes. The internal details of file and inode structs can and will change.
Just like all the sk_buff internals that are exposed to bpf
via the same context rewriting mechanism.
See commit 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
that exposed first 4 fields out of sk_buff to bpf progs via
'struct __sk_buff'.
Other fields were exposed later. Some of them are not even from sk_buff.
For networking programs 'struct __sk_buff' is the context.
For this new file_filter programs 'struct bpf_file_info' is the context.
That's the only thing bpf side can see.
> Another problem is your direct poking in ->i_ino.  It's not
> something directly exposed to userland at the moment and it should
> not become such.
The patch is not making i_ino directly exposed.
Only 'struct bpf_file_info' is exposed to user space / bpf programs.
In 'struct bpf_file_info' the field is called '__u64 inode'.
That is the abi.
The kernel side stays with 'unsigned long i_ino;' field.
Its size is different on 32/64 architectures, but to bpf progs
it's always seen as '__u64 inode'.
If in the future the kernel decides to change 'i_ino' to be u64,
nothing will change on bpf side.
> Filesystem has every right to have ->getattr()
> set ->ino (== ->st_ino value) in whichever way it likes; 
Essentially what cgroup_file_filter_ctx_access() is doing
is the same as generic_fillattr() + cp_statx() work, but via
on the fly rewriting of bpf instructions during
loading of bpf program.
See my other reply to Andy where I argued that certain
fields like uid/gid probably don't make sense to expose
to bpf via ctx rewriter mechanism and instead they can be
done via new bpf_get_file_statx() helper.
That's future work.
> the same
> goes for ->dev.
Notice how single kernel field file->f_inode->i_sb->s_dev
is exposed to bpf via two fields dev_major and dev_minor
inside 'struct bpf_file_info'.
For dev_minor the ctx rewriting is done as:
+       case offsetof(struct bpf_file_info, dev_minor):
+               /* dst = file->f_inode->i_sb->s_dev */
+               inode = LD_1(file->f_inode);
+               sb = LD_n(inode->i_sb);
+               LD_n(sb->s_dev);
+               *insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, MINORMASK);
+               break;
If the last AND instruction was not there, the whole 's_dev' field
would have been exposed to bpf side and that would be questionable
design choice, since s_dev has kernel internal encoding of major/minor.
Instead the AND instruction is masking minor bits.
So above ctx rewriting for 'dev_minor' field is equivalent
to line
  stat->dev = inode->i_sb->s_dev;
from generic_fillattr()
plus another line
  tmp.stx_dev_minor = MINOR(stat->dev);
from cp_statx()
This way the kernel internal field 's_dev' is split into
two dev_major/dev_minor bpf visible fields.
The end result is that no new kernel information is exposed.
The same information is seen via statx.
Powered by blists - more mailing lists
 
