linux-kernel - Re: [PATCH v6 1/3] perf: add PERF_RECORD_NAMESPACES to include namespaces related info

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <55875cff-fe7f-a468-0f6a-29f569fe995d@linux.vnet.ibm.com>
Date:   Mon, 20 Feb 2017 09:41:48 +0530
From:   Hari Bathini <hbathini@...ux.vnet.ibm.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     ast@...com, lkml <linux-kernel@...r.kernel.org>, acme@...nel.org,
        alexander.shishkin@...ux.intel.com, mingo@...hat.com,
        daniel@...earbox.net, rostedt@...dmis.org,
        Ananth N Mavinakayanahalli <ananth@...ux.vnet.ibm.com>,
        ebiederm@...ssion.com, sargun@...gun.me,
        Aravinda Prasad <aravinda@...ux.vnet.ibm.com>,
        brendan.d.gregg@...il.com, jolsa@...hat.com
Subject: Re: [PATCH v6 1/3] perf: add PERF_RECORD_NAMESPACES to include
 namespaces related info

Hi Peter,


On Thursday 16 February 2017 03:58 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 02:01:24PM +0530, Hari Bathini wrote:
>> With the advert of container technologies like docker, that depend
>> on namespaces for isolation, there is a need for tracing support for
>> namespaces. This patch introduces new PERF_RECORD_NAMESPACES event
>> for tracing based on namespaces related info. This event records
>> the device and inode numbers for every namespace of all processes.
>>
>> While device number is same for all namespaces currently, that may
> 'While device numbers are the same ..." ?
>
>> change in future, to avoid the need for a namespace of namespaces.
> Unfinished sentence

Unnecessary comma spoiled the sentence, I guess.

>> Recording device number along with inode number will take care of such
>> scenario.
> More words on why this is so? Because I've no clue.

This should have read:

    "Each namespace has a combination of device and inode numbers. Though
     every namespace has the same device number currently, that may change
     in future to avoid the need for a namespace of namespaces. Considering
     such possibility, record both device and inode numbers separately for
     each namespace."

Will update changelog accordingly..

>> @@ -6584,6 +6590,135 @@ void perf_event_comm(struct task_struct *task, bool exec)
>>   }
>>   
>>   /*
>> + * namespaces tracking
>> + */
>> +
>> +struct namespaces_event_id {
>> +	struct perf_event_header	header;
>> +
>> +	u32				pid;
>> +	u32				tid;
>> +	u32				nr_namespaces;
>> +	struct perf_ns_link_info	link_info[NAMESPACES_MAX];
>> +};
> Contains the same hole and makes the record depend on architecture
> alignment requirements.

Ugh! Let me change nr_namespaces type to u64 and also drop name field
in perf_ns_link_info struct.

>> +static void perf_fill_ns_link_info(struct perf_ns_link_info *ns_link_info,
>> +				   struct task_struct *task,
>> +				   const struct proc_ns_operations *ns_ops)
>> +{
>> +	struct path ns_path;
>> +	struct inode *ns_inode;
>> +	void *error;
>> +
>> +	error = ns_get_path(&ns_path, task, ns_ops);
>> +	if (!error) {
>> +		snprintf(ns_link_info->name, NS_NAME_SIZE,
>> +			 "%s", ns_path.dentry->d_iname);
>> +
>> +		ns_inode = ns_path.dentry->d_inode;
>> +		ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev);
>> +		ns_link_info->ino = ns_inode->i_ino;
>> +	}
>> +}
>> +
>> +static void perf_event_namespaces_output(struct perf_event *event,
>> +					 void *data)
>> +{
>> +	struct perf_namespaces_event *namespaces_event = data;
>> +	struct perf_output_handle handle;
>> +	struct perf_sample_data sample;
>> +	struct namespaces_event_id *ei;
>> +	struct task_struct *task = namespaces_event->task;
>> +	int ret;
>> +
>> +	if (!perf_event_namespaces_match(event))
>> +		return;
>> +
>> +	ei = &namespaces_event->event_id;
>> +	perf_event_header__init_id(&ei->header, &sample, event);
>> +	ret = perf_output_begin(&handle, event,	ei->header.size);
>> +	if (ret)
>> +		return;
>> +
>> +	ei->pid = perf_event_pid(event, task);
>> +	ei->tid = perf_event_tid(event, task);
>> +
>> +	ei->nr_namespaces = NAMESPACES_MAX;
>> +
>> +	perf_fill_ns_link_info(&ei->link_info[MNT_NS_INDEX],
>> +			       task, &mntns_operations);
>> +
>> +#ifdef CONFIG_USER_NS
>> +	perf_fill_ns_link_info(&ei->link_info[USER_NS_INDEX],
>> +			       task, &userns_operations);
>> +#endif
>> +#ifdef CONFIG_NET_NS
>> +	perf_fill_ns_link_info(&ei->link_info[NET_NS_INDEX],
>> +			       task, &netns_operations);
>> +#endif
>> +#ifdef CONFIG_UTS_NS
>> +	perf_fill_ns_link_info(&ei->link_info[UTS_NS_INDEX],
>> +			       task, &utsns_operations);
>> +#endif
>> +#ifdef CONFIG_IPC_NS
>> +	perf_fill_ns_link_info(&ei->link_info[IPC_NS_INDEX],
>> +			       task, &ipcns_operations);
>> +#endif
>> +#ifdef CONFIG_PID_NS
>> +	perf_fill_ns_link_info(&ei->link_info[PID_NS_INDEX],
>> +			       task, &pidns_operations);
>> +#endif
>> +#ifdef CONFIG_CGROUPS
>> +	perf_fill_ns_link_info(&ei->link_info[CGROUP_NS_INDEX],
>> +			       task, &cgroupns_operations);
>> +#endif
> Two points here, since you've given these thingies a string identifier,
> do you still need to rely on their positional index? If not, you could
> generate smaller events if we lack some of these CONFIG knobs.

Dropping name field. So, I don't think this applies now..

> Secondly, does anything in here depend on @event? Afaict you're
> generating the exact same information for each event.
>
> The reason we set {pid,tid} for each event is because of PID_NS, we
> report the pid number as per the namespace associated with the event.
>
> But from what I can tell, there is no such namespace of namespaces
> dependency here and this should live in the function below.

Right. I will move it to perf_event_namespaces() function.

>> @@ -9667,6 +9804,11 @@ SYSCALL_DEFINE5(perf_event_open,
>>   			return -EACCES;
>>   	}
>>   
>> +	if (attr.namespaces) {
>> +		if (!capable(CAP_SYS_ADMIN))
>> +			return -EACCES;
>> +	}
> Many times we allow users access to such information; should this be
> qualified with something like perf_paranoid_kernel() or somesuch?

Hmmm.. Need access to namespace info of every process. Synthesized events
try and access /proc/1/ns/ and the like needing CAP_SYS_ADMIN. So, the
question is, should running perf tool be allowed with perf_paranoid_kernel()
check and complain later when access to /proc/1/ns and the like fails 
(on running
perf tool without CAP_SYS_ADMIN)?

>> +
>>   	if (attr.freq) {
>>   		if (attr.sample_freq > sysctl_perf_event_sample_rate)
>>   			return -EINVAL;
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 11c5c8a..fd77e67 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2289,6 +2289,9 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>   		free_fs_struct(new_fs);
>>   
>>   bad_unshare_out:
>> +	if (!err)
>> +		perf_event_namespaces(current);
>> +
>>   	return err;
>>   }
> That seems like a very odd place, why not put it right before the
> bad_unshare_cleanup_cred: label?
>

>> @@ -264,6 +265,10 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>>   	switch_task_namespaces(tsk, new_nsproxy);
>>   out:
>>   	fput(file);
>> +
>> +	if (!err)
>> +		perf_event_namespaces(tsk);
>> +
>>   	return err;
>>   }
> Same again..

My bad! Will change that..

Thanks
Hari