netdev - Re: [PATCH v2 bpf-next 1/4] bpf: introduce task_vma bpf

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <C4D9D25A-C3DD-4081-9EAD-B7A5B6B74F45@fb.com>
Date:   Thu, 17 Dec 2020 22:08:31 +0000
From:   Song Liu <songliubraving@...com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
CC:     "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "ast@...nel.org" <ast@...nel.org>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "andrii@...nel.org" <andrii@...nel.org>,
        "john.fastabend@...il.com" <john.fastabend@...il.com>,
        "kpsingh@...omium.org" <kpsingh@...omium.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 1/4] bpf: introduce task_vma bpf_iter



> On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:
> 
> On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:
>> +/*
>> + * Key information from vm_area_struct. We need this because we cannot
>> + * assume the vm_area_struct is still valid after each iteration.
>> + */
>> +struct __vm_area_struct {
>> +	__u64 start;
>> +	__u64 end;
>> +	__u64 flags;
>> +	__u64 pgoff;
>> +};
> 
> Where it's inside .c or exposed in uapi/bpf.h it will become uapi
> if it's used this way. Let's switch to arbitrary BTF-based access instead.
> 
>> +static struct __vm_area_struct *
>> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
>> +{
>> +	struct pid_namespace *ns = info->common.ns;
>> +	struct task_struct *curr_task;
>> +	struct vm_area_struct *vma;
>> +	u32 curr_tid = info->tid;
>> +	bool new_task = false;
>> +
>> +	/* If this function returns a non-NULL __vm_area_struct, it held
>> +	 * a reference to the task_struct. If info->file is non-NULL, it
>> +	 * also holds a reference to the file. If this function returns
>> +	 * NULL, it does not hold any reference.
>> +	 */
>> +again:
>> +	if (info->task) {
>> +		curr_task = info->task;
>> +	} else {
>> +		curr_task = task_seq_get_next(ns, &curr_tid, true);
>> +		if (!curr_task) {
>> +			info->task = NULL;
>> +			info->tid++;
>> +			return NULL;
>> +		}
>> +
>> +		if (curr_tid != info->tid) {
>> +			info->tid = curr_tid;
>> +			new_task = true;
>> +		}
>> +
>> +		if (!curr_task->mm)
>> +			goto next_task;
>> +		info->task = curr_task;
>> +	}
>> +
>> +	mmap_read_lock(curr_task->mm);
> 
> That will hurt. /proc readers do that and it causes all sorts
> of production issues. We cannot take this lock.
> There is no need to take it.
> Switch the whole thing to probe_read style walking.
> And reimplement find_vma with probe_read while omitting vmacache.
> It will be short rbtree walk.
> bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id
> which will use probe_read equivalent underneath.

rw_semaphore is designed to avoid write starvation, so read_lock should not cause
problem unless the lock was taken for extended period. [1] was a recent fix that 
avoids /proc issue by releasing mmap_lock between iterations. We are using similar
mechanism here. BTW: I changed this to mmap_read_lock_killable() in the next version. 

On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the 
rbtree without taking any lock would not work. We can avoid taking the lock when 
some SPF like mechanism merged (hopefully soonish). 

Did I miss anything? 

We can improve bpf_iter with some mechanism to specify which task to iterate, so 
that we don't have to iterate through all tasks when the user only want to inspect 
vmas in one task. 

Thanks,
Song

[1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")