Message-ID: <Y7x3RUd67smv3EFQ@google.com>
Date:   Mon, 9 Jan 2023 12:21:25 -0800
From:   Namhyung Kim <namhyung@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Jiri Olsa <jolsa@...nel.org>,
        Kan Liang <kan.liang@...ux.intel.com>,
        Ravi Bangoria <ravi.bangoria@....com>, bpf@...r.kernel.org
Subject: Re: [PATCH 2/3] perf/core: Set data->sample_flags in
 perf_prepare_sample()

Hi Peter,

On Mon, Jan 09, 2023 at 01:14:31PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 29, 2022 at 12:41:00PM -0800, Namhyung Kim wrote:
> 
> So I like the general idea; I just think it's turned into a bit of a
> mess. That code is already overly branchy, which is known to hurt
> performance; we should really try not to make it worse than absolutely
> needed.

Agreed.

> 
> >  kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++------------
> >  1 file changed, 63 insertions(+), 23 deletions(-)
> > 
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index eacc3702654d..70bff8a04583 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -7582,14 +7582,21 @@ void perf_prepare_sample(struct perf_event_header *header,
> >  	filtered_sample_type = sample_type & ~data->sample_flags;
> >  	__perf_event_header__init_id(header, data, event, filtered_sample_type);
> >  
> > -	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
> > -		data->ip = perf_instruction_pointer(regs);
> > +	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> > +		/* attr.sample_type may not have PERF_SAMPLE_IP */
> 
> Right, but that shouldn't matter; IIRC it's OK to have more bits set in
> data->sample_flags than we have set in attr.sample_type. It just means
> we have data available for sample types we're (possibly) not using.
> 
> That is, I think you can simply write this like:
> 
> > +		if (!(data->sample_flags & PERF_SAMPLE_IP)) {
> > +			data->ip = perf_instruction_pointer(regs);
> > +			data->sample_flags |= PERF_SAMPLE_IP;
> > +		}
> > +	}
> 
> 	if (filtered_sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> 		data->ip = perf_instruction_pointer(regs);
> 		data->sample_flags |= PERF_SAMPLE_IP;
> 	}
> 
> 	...
> 
> 	if (filtered_sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) {
> 		data->code_page_size = perf_get_page_size(data->ip);
> 		data->sample_flags |= PERF_SAMPLE_CODE_PAGE_SIZE;
> 	}
> 
> Then after a single perf_prepare_sample() run we have:
> 
>   pre			|	post
>   ----------------------------------------
>   0			|	0
>   IP			|	IP
>   CODE_PAGE_SIZE	|	IP|CODE_PAGE_SIZE
>   IP|CODE_PAGE_SIZE	|	IP|CODE_PAGE_SIZE
> 
> So while data->sample_flags will have an extra bit set in the 3rd case,
> that will not affect perf_output_sample() which only looks at data->type
> (== attr.sample_type).
> 
> And since data->sample_flags will have both bits set, a second run will
> filter out both and avoid the extra work (except doing that will mess up
> the branch predictors).

Yeah, it'd be better to check filtered_sample_type in the first place.
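
To make the pre/post table above concrete, here is a minimal userspace
sketch of the two filtered_sample_type checks.  It is not kernel code:
the SKETCH_* bit values and both function names are made up for
illustration, and only the flag bookkeeping is modelled (no IP or page
size is actually computed).

```c
#include <assert.h>
#include <stdint.h>

/* Made-up bit values for this sketch; the real PERF_SAMPLE_* values differ. */
#define SKETCH_IP		(1u << 0)
#define SKETCH_CODE_PAGE_SIZE	(1u << 1)

/*
 * Model one perf_prepare_sample()-like pass over the two bits.
 * 'flags' plays the role of data->sample_flags on entry; the return
 * value is data->sample_flags on exit.
 */
static uint32_t sketch_prepare(uint32_t sample_type, uint32_t flags)
{
	uint32_t filtered = sample_type & ~flags;

	/* CODE_PAGE_SIZE needs the IP, so either bit triggers the IP lookup. */
	if (filtered & (SKETCH_IP | SKETCH_CODE_PAGE_SIZE))
		flags |= SKETCH_IP;

	if (filtered & SKETCH_CODE_PAGE_SIZE)
		flags |= SKETCH_CODE_PAGE_SIZE;

	return flags;
}

/* Walk the four rows of the pre/post table, then repeat each run. */
static void sketch_check_table(void)
{
	const uint32_t both = SKETCH_IP | SKETCH_CODE_PAGE_SIZE;
	const uint32_t pre[]  = { 0, SKETCH_IP, SKETCH_CODE_PAGE_SIZE, both };
	const uint32_t post[] = { 0, SKETCH_IP, both, both };

	for (int i = 0; i < 4; i++) {
		uint32_t flags = sketch_prepare(pre[i], 0);

		assert(flags == post[i]);
		/* Second pass: everything is filtered out, flags unchanged. */
		assert(sketch_prepare(pre[i], flags) == flags);
	}
}
```

Calling sketch_check_table() from a test main() exercises all four rows;
the second assert corresponds to the "second run will filter out both"
remark.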

Btw, I was thinking about a hypothetical scenario where the IP is set by
a PMU driver rather than taken from the regs.  In that case, having
CODE_PAGE_SIZE will overwrite the IP.  I don't think we need to worry
about that for now since PMU drivers update the regs (using
set_linear_ip).  But it seems like a possible scenario for something
like PEBS or IBS.
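
A rough userspace sketch of that hypothetical overwrite, assuming the
filtered_sample_type form of the check.  Everything here is a stand-in:
the SKETCH_* bits, struct sketch_ip_data and the function names are
invented for illustration, and regs_ip plays the role of
perf_instruction_pointer(regs).

```c
#include <assert.h>
#include <stdint.h>

/* Made-up bit values for this sketch; the real PERF_SAMPLE_* values differ. */
#define SKETCH_IP		(1u << 0)
#define SKETCH_CODE_PAGE_SIZE	(1u << 1)

struct sketch_ip_data {
	uint64_t ip;
	uint32_t sample_flags;
};

/* regs_ip stands in for perf_instruction_pointer(regs). */
static void sketch_prepare_ip(struct sketch_ip_data *data,
			      uint32_t sample_type, uint64_t regs_ip)
{
	uint32_t filtered = sample_type & ~data->sample_flags;

	if (filtered & (SKETCH_IP | SKETCH_CODE_PAGE_SIZE)) {
		data->ip = regs_ip;
		data->sample_flags |= SKETCH_IP;
	}
}

/* Driver pre-sets the IP, then the prepare pass runs; returns the final IP. */
static uint64_t sketch_ip_after_prepare(uint32_t sample_type,
					uint64_t driver_ip, uint64_t regs_ip)
{
	struct sketch_ip_data data = {
		.ip = driver_ip,
		.sample_flags = SKETCH_IP,
	};

	sketch_prepare_ip(&data, sample_type, regs_ip);
	return data.ip;
}

static void sketch_demo_overwrite(void)
{
	/* Only IP requested: already provided, so the driver's value stays. */
	assert(sketch_ip_after_prepare(SKETCH_IP, 0x1000, 0x2000) == 0x1000);

	/* CODE_PAGE_SIZE still pending: the IP is recomputed from regs. */
	assert(sketch_ip_after_prepare(SKETCH_IP | SKETCH_CODE_PAGE_SIZE,
				       0x1000, 0x2000) == 0x2000);
}
```

The second case is the one in question: the driver-provided value is
lost only when CODE_PAGE_SIZE is requested and the driver's IP differs
from what the regs report.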

> 
> 
> >  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> >  		int size = 1;
> >  
> > -		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN)
> > +		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> >  			data->callchain = perf_callchain(event, regs);
> > +			data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> > +		}
> >  
> >  		size += data->callchain->nr;
> >  
> 
> This, why can't this be:
> 
> 	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> 		data->callchain = perf_callchain(event, regs);
> 		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> 
> 		header->size += (1 + data->callchain->nr) * sizeof(u64);
> 	}
> 
> I suppose this is because perf_event_header lives on the stack of the
> overflow handler and all that isn't available / relevant for the BPF
> thing.

Right, it needs to calculate the data size for each sample data.

> 
> And we can't pull that out into another function without adding yet
> another branch fest.
> 
> However; inspired by your next patch; we can do something like so:
> 
> 	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> 		data->callchain = perf_callchain(event, regs);
> 		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> 
> 		data->size += (1 + data->callchain->nr) * sizeof(u64);
> 	}

This is fine as long as all other places that set the callchain (like
PMU drivers) also update the sample data size accordingly.  If not, we
get the callchain but the data size will be wrong.
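
One way to keep the two in sync, sketched in userspace C under the
assumption that every producer goes through a single helper.  The helper
name sketch_save_callchain(), the structs and the SKETCH_CALLCHAIN bit
are hypothetical and only model the bookkeeping; 'fallback' stands in
for the result of perf_callchain(event, regs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up bit value for this sketch. */
#define SKETCH_CALLCHAIN	(1u << 0)

struct sketch_callchain {
	uint64_t nr;		/* number of entries that follow */
};

struct sketch_cc_data {
	const struct sketch_callchain *callchain;
	uint32_t sample_flags;
	uint32_t size;		/* dynamic sample size accumulated so far */
};

/*
 * Hypothetical helper: the only place that stores a callchain, so the
 * flag and the size accounting cannot drift apart.
 */
static void sketch_save_callchain(struct sketch_cc_data *data,
				  const struct sketch_callchain *cc)
{
	assert(data != NULL && cc != NULL);

	data->callchain = cc;
	data->sample_flags |= SKETCH_CALLCHAIN;
	/* One u64 for nr plus one u64 per entry. */
	data->size += (uint32_t)((1 + cc->nr) * sizeof(uint64_t));
}

/* Core path: only fills in the callchain if nobody did so earlier. */
static void sketch_prepare_callchain(struct sketch_cc_data *data,
				     uint32_t sample_type,
				     const struct sketch_callchain *fallback)
{
	uint32_t filtered = sample_type & ~data->sample_flags;

	if (filtered & SKETCH_CALLCHAIN)
		sketch_save_callchain(data, fallback);
}
```

If a driver calls the helper first, the later prepare pass sees the flag
and leaves both the callchain and the size alone; if a driver instead
set data->callchain and the flag by hand, the size would be missing the
callchain bytes.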

> 
> And then have __perf_event_output() (or something thereabout) do:
> 
> 	perf_prepare_sample(data, event, regs);
> 	perf_prepare_header(&header, data, event);
> 	err = output_begin(&handle, data, event, header.size);
> 	if (err)
> 		goto exit;
> 	perf_output_sample(&handle, &header, data, event);
> 	perf_output_end(&handle);
> 
> With perf_prepare_header() being something like:
> 
> 	header->type = PERF_RECORD_SAMPLE;
> 	header->size = sizeof(*header) + event->header_size + data->size;
> 	header->misc = perf_misc_flags(regs);
> 	...
> 
> Hmm ?
> 
> (same for all the other sites)

Looks good.  But I'm confused by the tip-bot2 messages saying it's
merged.  Do you want me to work on it as a follow-up?

Thanks,
Namhyung