linux-kernel - RE: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth Monitoring (MBM) PMU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <06033C969873E840BDE9FC9B584F66B519441830@ORSMSX116.amr.corp.intel.com>
Date:	Thu, 20 Aug 2015 21:54:15 +0000
From:	"Juvva, Kanaka D" <kanaka.d.juvva@...el.com>
To:	Thomas Gleixner <tglx@...utronix.de>,
	Kanaka Juvva <kanaka.d.juvva@...ux.intel.com>
CC:	"Williamson, Glenn P" <glenn.p.williamson@...el.com>,
	"Fleming, Matt" <matt.fleming@...el.com>,
	"Auld, Will" <will.auld@...el.com>,
	Andi Kleen <andi@...stfloor.org>,
	LKML <linux-kernel@...r.kernel.org>,
	"Luck, Tony" <tony.luck@...el.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Tejun Heo <tj@...nel.org>, "x86@...nel.org" <x86@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	"Shivappa, Vikas" <vikas.shivappa@...el.com>
Subject: RE: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth Monitoring
 (MBM) PMU

Hi Thomas,

   Acknowledged. Perhaps some discussions are required in terms of your questions and our solutions.
  Mostly nothing was unresolved except some new things that are brought up for v3.  Seems like some 
 clarification  are required for upstream.

Regards,
-Kanaka

> -----Original Message-----
> From: Thomas Gleixner [mailto:tglx@...utronix.de]
> Sent: Wednesday, August 19, 2015 1:50 PM
> To: Kanaka Juvva
> Cc: Juvva, Kanaka D; Williamson, Glenn P; Fleming, Matt; Auld, Will; Andi
> Kleen; LKML; Luck, Tony; Peter Zijlstra; Tejun Heo; x86@...nel.org; Ingo
> Molnar; H. Peter Anvin; Shivappa, Vikas
> Subject: Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth
> Monitoring (MBM) PMU
> 
> On Fri, 7 Aug 2015, Kanaka Juvva wrote:
> > +#define MBM_CNTR_MAX		0xffffff
> > +#define MBM_SOCKET_MAX		8
> > +#define MBM_TIME_DELTA_MAX	1000
> > +#define MBM_TIME_DELTA_MIN	100
> 
> What are these constants for and how are they determined? Pulled out of
> thin air?
> 
> > +#define MBM_SCALING_FACTOR2	1
> > +#define MBM_SCALING_FACTOR	1000
> 
> Ditto
> 
> > +#define MBM_FIFO_SIZE_MIN	10
> > +#define MBM_FIFO_SIZE_MAX	300
> 
> Ditto
> 
> >  /**
> >   * struct intel_pqr_state - State cache for the PQR MSR @@ -42,6
> > +52,90 @@ struct intel_pqr_state {
> >   * interrupts disabled, which is sufficient for the protection.
> >   */
> >  static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> > +static DEFINE_PER_CPU(struct mbm_pmu *, mbm_pmu); static
> > +DEFINE_PER_CPU(struct mbm_pmu *, mbm_pmu_to_free);
> > +
> > +/*
> > + * mbm pmu is used for storing mbm_local and mbm_total
> > + * events
> > + */
> > +struct mbm_pmu {
> > +	/*
> > +	 * # of active events
> > +	 */
> 
> I told you last time:
> 
>  "If you want to document your struct members, please use kerneldoc
>   and not these hard to read tail comments."
> 
> I can't see how this documentation style is in any way related to kerneldoc.
> 
> > +	/*
> > +	 * time interval in ms between two consecutive samples
> > +	 */
> > +	ktime_t          timer_interval; /* in ktime_t unit */
> 
> 
> So you stick with your hard to read tail comments? Aside of that this one is
> completely pointless. What's the value of this?
> 
> > +	/*
> > +	 * fifoout  is  the start of the sliding window from which last 'n'
> > +	 *  bw values must be read
> > +	 */
> > +	u32  fifoout; /* mbmfifo out counter for sliding window */
> 
> Sigh
> 
> > +};
> > +
> > +/*
> > + * stats for total bw
> > + */
> > +static struct sample *mbm_total; /* curent stats for  mbm_total
> > +events */
> 
> Oh well...
> 
> >  #define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
> > +#define QOS_MBM_TOTAL_EVENT_ID	(1 << 1)
> > +#define QOS_MBM_LOCAL_EVENT_ID_HW	 0x3
> > +#define QOS_MBM_LOCAL_EVENT_ID	(1 << 2)
> 
> So we have ID values which are built with (1 << X) and then this HW variant in
> the middle with 0x3. Of course without any explanation what the heck this
> stuff is.
> 
> Last review:
> 
>      "So this wants a descriptive ID name and a comment."
> 
> 
> >  #define QOS_EVENT_MASK	QOS_L3_OCCUP_EVENT_ID
> >
> > @@ -176,6 +273,25 @@ static inline struct cqm_rmid_entry
> *__rmid_entry(u32 rmid)
> >  	return entry;
> >  }
> >
> > +/**
> > + * __mbm_reset_stats - reset stats for a given rmid for the current cpu
> > + * @rmid:	rmid value
> > + *
> > + * vrmid: array index for current core for the given rmid
> 
> Array index of what?
> 
> > + * mbs_total[] and mbm_local[] are linearly indexed by core * max
> > + #rmids per
> > + * core
> 
> The code tells a different story.
> 
> > + */
> > +static void __mbm_reset_stats(u32 rmid)
> 
> What's the point of the double underscores?
> 
> > +{
> > +	u32  vrmid =  topology_physical_package_id(smp_processor_id()) *
> > +						   cqm_max_rmid + rmid;
> 
> Again I asked you last time:
> 
>  "Can you please explain that magic calculation? It's far from obvious
>   why this would be correct."
> 
> There is still no explanation.
> 
> > +	if ((!cqm_mbm) || (!mbm_local) || (!mbm_total))
> > +		return;
> > +	memset(&mbm_local[vrmid], 0, sizeof(struct sample));
> > +	memset(&mbm_total[vrmid], 0, sizeof(struct sample)); }
> > +
> >  /*
> >   * Returns < 0 on fail.
> >   *
> > @@ -192,6 +308,7 @@ static u32 __get_rmid(void)
> >
> >  	entry = list_first_entry(&cqm_rmid_free_lru, struct
> cqm_rmid_entry, list);
> >  	list_del(&entry->list);
> > +	__mbm_reset_stats(entry->rmid);
> >
> >  	return entry->rmid;
> >  }
> > @@ -207,6 +324,7 @@ static void __put_rmid(u32 rmid)
> >
> >  	entry->queue_time = jiffies;
> >  	entry->state = RMID_YOUNG;
> > +	__mbm_reset_stats(rmid);
> >
> >  	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);  } @@ -232,6
> > +350,7 @@ static int intel_cqm_setup_rmid_cache(void)
> >
> >  		INIT_LIST_HEAD(&entry->list);
> >  		entry->rmid = r;
> > +
> 
> Stray newline
> 
> >  		cqm_rmid_ptrs[r] = entry;
> >
> >  		list_add_tail(&entry->list, &cqm_rmid_free_lru); @@ -494,6
> +613,165
> > @@ static bool intel_cqm_sched_in_event(u32 rmid)
> >  	return false;
> >  }
> >
> > +
> > +static u32 bw_sum_calc(struct sample *bw_stat, int rmid) {
> > +	u32 val = 0, i, j, index;
> > +
> > +	if (++bw_stat->fifoout >=  mbm_window_size)
> > +		bw_stat->fifoout =  0;
> > +	index =  bw_stat->fifoout;
> > +	for (i = 0; i < mbm_window_size - 1; i++) {
> > +		if (index + i >= mbm_window_size)
> > +			j = index + i - mbm_window_size;
> > +		else
> > +			j = index + i;
> > +		val += bw_stat->mbmfifo[j];
> > +	}
> 
> This math wants a explanatory comment.
> 
> > +	return val;
> > +}
> > +
> > +static u32 __mbm_fifo_sum_lastn_out(int rmid, bool is_localbw) {
> > +	if (is_localbw)
> > +		return bw_sum_calc(&mbm_local[rmid], rmid);
> > +	else
> > +		return bw_sum_calc(&mbm_total[rmid], rmid); }
> > +
> > +static void __mbm_fifo_in(struct sample *bw_stat, u32 val) {
> > +	bw_stat->mbmfifo[bw_stat->fifoin] = val;
> > +	if (++bw_stat->fifoin >= mbm_window_size)
> 
> How does that become greater than mbm_windowsize?
> 
> > +		bw_stat->fifoin = 0;
> > +}
> 
> > +/*
> > + * __rmid_read_mbm checks whether it is LOCAL or GLOBAL MBM event
> and
> > +reads
> > + * its MSR counter. Check whether overflow occurred and handles it.
> > +Calculates
> > + * currenet  BW and updates  running average.
> 
> currenet? And please get rid of the double spaces
> 
> > + *
> > + * Overflow Handling:
> > + * if (MSR current value < MSR previous value) it is an
> > + * overflow. and overflow is handled.
> 
> Wow. That's informative as hell!
> 
> > + *
> > + * Calculation of Current BW value:
> 
> BW == Body Weight?
> 
> > + * If MSR is read within last 100ms, then the value is ignored;
> > + * this will suppress small deltas. We don't process MBM samples that
> > + are
> > + * within 100ms.
> 
> WHY?
> 
> > + * Bandwidth is calculated as:
> > + * bandwidth = difference of last two msr counter values/time difference.
> > + *
> > + * cum_avg = Running Average bandwidth of last 'n' samples that are
> > + processed
> > + *
> > + * Sliding window is used to save the last 'n' samples. Hence,
> > + * n = sliding_window_size
> > + *
> > + *Scaling:
> > + * cum_avg is scaled down by a  factor MBM_SCALING_FACTOR2 and
> > + rounded to
> > + * nearest integer. User interface reads the BW in KB/sec or MB/sec.
> 
> And how is the scaling factor determined?
> 
> > + */
> > +static u64 __rmid_read_mbm(unsigned int rmid, bool read_mbm_local)
> 
> More pointless double underscores
> 
> > +{
> > +	u64 val, tmp, diff_time, cma, bytes, index;
> > +	bool overflow = false, first = false;
> > +	ktime_t cur_time;
> > +	u32 tmp32 = rmid;
> > +	struct sample *mbm_current;
> > +	u32 vrmid = topology_physical_package_id(smp_processor_id()) *
> > +						 cqm_max_rmid + rmid;
> > +
> > +	rmid = vrmid;
> 
> From my previous review:
> 
>   "This is completely backwards.
> 
>  	tmp32 = rmid;
>  	rmid = vrmid;
>  	do_stuff(rmid);
>  	rmid = tmp32;
>  	do_other_stuff(rmid);
> 
>    Why can't you use vrmid for do_stuff() and leave rmid alone? Just
>    because it would make the code simpler to read?"
> 
> Still applies.
> 
> > +	cur_time = ktime_get();
> > +	if (read_mbm_local) {
> > +		cma = mbm_local[rmid].cum_avg;
> > +		diff_time = ktime_ms_delta(cur_time,
> > +				   mbm_local[rmid].prev_time);
> > +		if (diff_time < 100)
> > +			return cma;
> > +		mbm_local[rmid].prev_time = ktime_set(0,
> > +				(unsigned long)ktime_to_ns(cur_time));
> 
> What's the actual purpose of this back and forth conversion? The only
> purpose I can see is to wreckage the time on 32bit machines.
> 
> > +		bytes = mbm_local[rmid].bytes;
> > +		index = mbm_local[rmid].index;
> > +		mbm_current = &mbm_local[rmid];
> > +		rmid = tmp32;
> > +		wrmsr(MSR_IA32_QM_EVTSEL,
> QOS_MBM_LOCAL_EVENT_ID_HW, rmid);
> > +	} else {
> > +		cma = mbm_total[rmid].cum_avg;
> > +		diff_time = ktime_ms_delta(cur_time,
> > +					   mbm_total[rmid].prev_time);
> > +		if (diff_time < 100)
> > +			return cma;
> > +		mbm_total[rmid].prev_time = ktime_set(0,
> > +					(unsigned
> long)ktime_to_ns(cur_time));
> > +		bytes = mbm_total[rmid].bytes;
> > +		index = mbm_total[rmid].index;
> > +		mbm_current = &mbm_total[rmid];
> > +		rmid = tmp32;
> > +		wrmsr(MSR_IA32_QM_EVTSEL,
> QOS_MBM_TOTAL_EVENT_ID, rmid);
> > +	}
> 
> So you got rid of ONE 'if (read_mbm_local)' but this can be done completely
> without this conditional and without the silly code duplication.
> 
> > +	rdmsrl(MSR_IA32_QM_CTR, val);
> > +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > +		return val;
> > +
> > +	tmp = val;
> 
> Again from the last round of review:
> 
>   "You really have a faible for random storage. tmp gets assigned to
>    mbm_*[rmid].bytes further down. This makes no sense at all."
> 
> > +	/* if current msr value <  previous msr value ,  it means overflow */
> > +	if (val < bytes) {
> > +		val = MBM_CNTR_MAX - bytes + val;
> > +		overflow = true;
> > +	} else
> > +		val = val - bytes;
> > +
> > +	val =  (val * MBM_TIME_DELTA_MAX) / diff_time;
> > +
> > +	if ((diff_time > MBM_TIME_DELTA_MAX) && (!cma))
> > +		/* First sample */
> > +		first = true;
> > +
> > +	rmid = vrmid;
> 
> And another time:
> 
>   "More obfuscation"
> 
> > +	if ((diff_time <= (MBM_TIME_DELTA_MAX +
> MBM_TIME_DELTA_MIN))  ||
> > +			overflow || first) {
> 
> I asked last time about an explanation for this time delta magic. There still is
> none.
> 
> > +		int bw, ret;
> > +
> > +		if (index   & (index < mbm_window_size))
> > +			val = cma * MBM_SCALING_FACTOR2  + val / index -
> > +					cma / index;
> > +		val = DIV_ROUND_UP(val, MBM_SCALING_FACTOR2);
> > +		if (index >= mbm_window_size) {
> > +			/*
> > +			 * Compute the sum of recent n-1 samples and slide
> the
> > +			 * window by 1
> > +			 */
> > +			ret = __mbm_fifo_sum_lastn_out(rmid,
> read_mbm_local);
> > +			/* recalculate the running average with current bw */
> > +			ret = (ret + val) / mbm_window_size;
> > +			bw = val;
> > +			val = ret;
> 
> Yet another time:
> 
>   "Your back and forth assignements of random variables is not making the
>    code any more readable."
> 
> > +		} else
> > +			bw = val;
> 
> Is still missing parentheses
> 
> > +		/* save the recent bw in fifo */
> > +		mbm_fifo_in(rmid, bw, read_mbm_local);
> > +		/* save the current sample */
> > +		mbm_current->index++;
> > +		mbm_current->cum_avg = val;
> > +		mbm_current->bytes = tmp;
> > +		mbm_current->prev_time = ktime_set(0,
> > +					(unsigned
> long)ktime_to_ns(cur_time));
> > +
> > +		return val;
> > +	}
> > +	return  cma;
> > +}
> 
> > +static void __intel_cqm_event_total_bw_count(void *info) {
> > +	struct rmid_read *rr = info;
> > +	u64 val;
> > +
> > +	val = __rmid_read_mbm(rr->rmid, false);
> > +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > +		return;
> > +	atomic64_add(val, &rr->value);
> > +}
> > +
> > +static void __intel_cqm_event_local_bw_count(void *info) {
> > +	struct rmid_read *rr = info;
> > +	u64 val;
> > +
> > +	val = __rmid_read_mbm(rr->rmid, true);
> > +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > +		return;
> > +	atomic64_add(val, &rr->value);
> > +}
> 
> And once more:
> 
>   "You're really a fan of copy and paste."
> 
> > +
> >  static void __intel_cqm_event_count(void *info)  {
> >  	struct rmid_read *rr = info;
> > @@ -975,7 +1316,21 @@ static u64 intel_cqm_event_count(struct
> perf_event *event)
> >  	if (!__rmid_valid(rr.rmid))
> >  		goto out;
> >
> > -	on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
> &rr, 1);
> > +	switch (event->attr.config) {
> > +	case  QOS_L3_OCCUP_EVENT_ID:
> > +		on_each_cpu_mask(&cqm_cpumask,
> __intel_cqm_event_count, &rr, 1);
> > +		break;
> > +	case  QOS_MBM_TOTAL_EVENT_ID:
> > +		on_each_cpu_mask(&cqm_cpumask,
> __intel_cqm_event_total_bw_count,
> 
> If you make the function names a little bit longer then this might need 3 lines
> and further enhance unreadability.
> 
> >  static void intel_cqm_event_start(struct perf_event *event, int mode)
> > {
> >  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state); @@
> > -1003,19 +1391,45 @@ static void intel_cqm_event_start(struct perf_event
> *event, int mode)
> >  	}
> >
> >  	state->rmid = rmid;
> > +	if (event->attr.config & QOS_L3_OCCUP_EVENT_ID) {
> > +		struct cqm_rmid_entry *entry;
> > +
> > +		entry = __rmid_entry(rmid);
> > +	}
> >  	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
> > +
> > +	if (((event->attr.config & QOS_MBM_TOTAL_EVENT_ID) ||
> > +	     (event->attr.config & QOS_MBM_LOCAL_EVENT_ID))  &&
> (cqm_mbm)) {
> > +		int cpu = smp_processor_id();
> > +		struct mbm_pmu *pmu = per_cpu(mbm_pmu, cpu);
> 
> So you got rid of get_cpu(), but what's wrong with this_cpu_read() ?
> 
> > @@ -1023,6 +1437,17 @@ static void intel_cqm_event_stop(struct
> perf_event *event, int mode)
> >  	} else {
> >  		WARN_ON_ONCE(!state->rmid);
> >  	}
> > +
> > +	if (pmu) {
> > +		if (pmu->n_active >  0) {
> 
> What's the purpose of this check? In the previous version there was a
> WARN_ON(), which made sense. Did it trigger and you decided to "work"
> around it?
> 
> > +			pmu->n_active--;
> 
> > +		if (pmu->n_active == 0)
> > +			mbm_stop_hrtimer(pmu);
> > +		if (!list_empty(&event->active_entry))
> > +			list_del(&event->active_entry);
> 
> These four lines are at the wrong indentation level.
> 
> > +		}
> > +	}
> > +
> >  }
> 
> > +#if MBM_SCALING_FACTOR2 == 1000
> 
> And how does MBM_SCALING_FACTOR2 change magically?
> 
> > +EVENT_ATTR_STR(llc_total_bw.unit, intel_cqm_llc_total_bw_unit,
> > +"MB/sec"); EVENT_ATTR_STR(llc_local_bw.unit,
> > +intel_cqm_llc_local_bw_unit, "MB/sec"); #else
> 
> So any value != 1000 results in KB/sec. Interesting math.
> 
> > +EVENT_ATTR_STR(llc_total_bw.unit, intel_cqm_llc_total_bw_unit,
> > +"KB/sec"); EVENT_ATTR_STR(llc_local_bw.unit,
> > +intel_cqm_llc_local_bw_unit, "KB/sec"); #endif
> 
> > +static ssize_t
> > +sliding_window_size_store(struct device *dev,
> > +			  struct device_attribute *attr,
> > +			  const char *buf, size_t count)
> > +{
> > +	unsigned int bytes;
> > +	int ret;
> > +
> > +	ret = kstrtouint(buf, 0, &bytes);
> > +	if (ret)
> > +		return ret;
> > +
> > +	mutex_lock(&cache_mutex);
> > +	if (bytes > 0 && bytes <= MBM_FIFO_SIZE_MAX)
> > +		mbm_window_size = bytes;
> 
> So, it's valid to set the window to X where 0 < X < MBM_FIFO_SIZE_MIN.
> What's the actual purpose of MBM_FIFO_SIZE_MIN?
> 
> > +	mutex_unlock(&cache_mutex);
> > +
> > +	return count;
> > +}
> 
> > -static void intel_cqm_cpu_prepare(unsigned int cpu)
> > +static int intel_cqm_cpu_prepare(unsigned int cpu)
> >  {
> >  	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> >  	struct cpuinfo_x86 *c = &cpu_data(cpu);
> > +	struct mbm_pmu *pmu = per_cpu(mbm_pmu, cpu);
> >
> >  	state->rmid = 0;
> >  	state->closid = 0;
> > @@ -1266,12 +1779,27 @@ static void intel_cqm_cpu_prepare(unsigned
> int
> > cpu)
> >
> >  	WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
> >  	WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
> > +
> > +	if ((!pmu) && (cqm_mbm)) {
> > +		pmu = kzalloc_node(sizeof(*mbm_pmu), GFP_KERNEL,
> NUMA_NO_NODE);
> > +		if (!pmu)
> > +			return  -ENOMEM;
> > +		INIT_LIST_HEAD(&pmu->active_list);
> > +		pmu->pmu = &intel_cqm_pmu;
> > +		pmu->n_active = 0;
> 
> It's already zero.
> 
> > +		pmu->timer_interval =
> ms_to_ktime(MBM_TIME_DELTA_MAX);
> > +		per_cpu(mbm_pmu, cpu) = pmu;
> > +		per_cpu(mbm_pmu_to_free, cpu) = NULL;
> 
> What's the point of this? If there is still something to be free'd its leaked.
> Otherwise that's redundant.
> 
> > +		mbm_hrtimer_init(pmu);
> > +	}
> > +	return 0;
> 
> s/0/NOTIFY_OK/ because you return that value directly.
> 
> >  }
> >
> >  static void intel_cqm_cpu_exit(unsigned int cpu)  {
> >  	int phys_id = topology_physical_package_id(cpu);
> >  	int i;
> > +	struct mbm_pmu *pmu = per_cpu(mbm_pmu, cpu);
> >
> >  	/*
> >  	 * Is @cpu a designated cqm reader?
> > @@ -1288,6 +1816,36 @@ static void intel_cqm_cpu_exit(unsigned int cpu)
> >  			break;
> >  		}
> >  	}
> > +
> > +	/* cancel overflow polling timer for CPU */
> > +	if (pmu)
> > +		mbm_stop_hrtimer(pmu);
> > +	kfree(mbm_local);
> > +	kfree(mbm_total);
> 
> Again you insist to ignore a review comment:
> 
>   "So this frees the module global storage if one cpu is unplugged and
>    leaves the pointer initialized so the rest of the code can happily
>    dereference it."
> 
> Oh, well!
> 
> > +}
> > +
> > +static void mbm_cpu_kfree(int cpu)
> > +{
> > +	struct mbm_pmu *pmu = per_cpu(mbm_pmu_to_free, cpu);
> > +
> > +	kfree(pmu);
> > +
> > +	per_cpu(mbm_pmu_to_free, cpu) = NULL; }
> > +
> > +static int mbm_cpu_dying(int cpu)
> > +{
> > +	struct mbm_pmu *pmu = per_cpu(mbm_pmu, cpu);
> 
> this_cpu_read(). It's called on the dying CPU.
> 
> > +static const struct x86_cpu_id intel_mbm_match[] = {
> > +	{ .vendor = X86_VENDOR_INTEL, .feature =
> X86_FEATURE_CQM_MBM_LOCAL },
> > +	{}
> > +};
> > +
> >  static int __init intel_cqm_init(void)  {
> >  	char *str, scale[20];
> > -	int i, cpu, ret;
> > +	int i, cpu, ret, array_size;
> >
> > -	if (!x86_match_cpu(intel_cqm_match))
> > +	if (!x86_match_cpu(intel_cqm_match) &&
> > +	   (!x86_match_cpu(intel_mbm_match)))
> >  		return -ENODEV;
> >
> >  	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> > +	cqm_max_rmid = boot_cpu_data.x86_cache_max_rmid;
> > +
> > +	if (x86_match_cpu(intel_cqm_match)) {
> > +		cqm_llc_occ = true;
> > +		intel_cqm_events_group.attrs = intel_cmt_events_attr;
> > +	}
> > +
> > +	if (x86_match_cpu(intel_mbm_match)) {
> > +		u32 mbm_scale_rounded = 0;
> > +
> > +		cqm_mbm = true;
> > +		cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> > +		/*
> > +		 * MBM counter values are  in bytes. To conver this to KB/sec
> > +		 * or MB/sec, we  scale the MBM scale factor by 1000.
> Another
> > +		 * MBM_SCALING_FACTOR2 factor scaling down is done
> > +		 * after reading the counter val i.e. in the function
> > +		 * __rmid_read_mbm()
> > +		 */
> > +		mbm_scale_rounded =  DIV_ROUND_UP(cqm_l3_scale,
> > +						MBM_SCALING_FACTOR);
> > +		cqm_max_rmid = boot_cpu_data.x86_cache_max_rmid;
> > +		snprintf(scale, sizeof(scale), "%u", mbm_scale_rounded);
> > +		str = kstrdup(scale, GFP_KERNEL);
> > +		if (!str) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		if (cqm_llc_occ)
> > +			intel_cqm_events_group.attrs =
> > +				intel_cmt_mbm_events_attr;
> > +		else
> > +			intel_cqm_events_group.attrs =
> intel_mbm_events_attr;
> > +
> > +		event_attr_intel_cqm_llc_local_bw_scale.event_str = str;
> > +		event_attr_intel_cqm_llc_total_bw_scale.event_str = str;
> > +
> > +		array_size = (cqm_max_rmid + 1) * MBM_SOCKET_MAX;
> > +		mbm_local = kzalloc_node(sizeof(struct sample) * array_size,
> > +					 GFP_KERNEL, NUMA_NO_NODE);
> > +		if (!mbm_local) {
> > +			ret = -ENOMEM;
> 
> Still leaks str.
> 
> > +			goto out_free;
> > +		}
> > +		mbm_total = kzalloc_node(sizeof(struct sample) *
> array_size,
> > +					 GFP_KERNEL, NUMA_NO_NODE);
> > +		if (!mbm_total) {
> > +
> 			ret = -ENOMEM;
> Ditto
> 
> > +			goto out_free;
> > +		}
> 
> I asked you to put the above into a separate function, which saves you one
> indentation level and makes the init function readable, but readability is not
> on your agenda, right?
> 
> > +	} else
> > +		cqm_mbm = false;
> 
> Still a completely useless exercise.
> 
> >
> >  	/*
> >  	 * It's possible that not all resources support the same number @@
> > -1336,44 +1961,48 @@ static int __init intel_cqm_init(void)
> >  	 */
> >  	cpu_notifier_register_begin();
> >
> > -	for_each_online_cpu(cpu) {
> > -		struct cpuinfo_x86 *c = &cpu_data(cpu);
> > +	if (cqm_llc_occ) {
> > +		for_each_online_cpu(cpu) {
> > +			struct cpuinfo_x86 *c = &cpu_data(cpu);
> >
> > -		if (c->x86_cache_max_rmid < cqm_max_rmid)
> > -			cqm_max_rmid = c->x86_cache_max_rmid;
> > +			if (c->x86_cache_max_rmid < cqm_max_rmid)
> > +				cqm_max_rmid = c->x86_cache_max_rmid;
> >
> > -		if (c->x86_cache_occ_scale != cqm_l3_scale) {
> > -			pr_err("Multiple LLC scale values, disabling\n");
> > -			ret = -EINVAL;
> > -			goto out;
> > +			if (c->x86_cache_occ_scale != cqm_l3_scale) {
> > +				pr_err("Multiple LLC scale values,
> disabling\n");
> > +				ret = -EINVAL;
> > +				goto out;
> > +			}
> >  		}
> > -	}
> >
> > -	/*
> > -	 * A reasonable upper limit on the max threshold is the number
> > -	 * of lines tagged per RMID if all RMIDs have the same number of
> > -	 * lines tagged in the LLC.
> > -	 *
> > -	 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
> > -	 */
> > -	__intel_cqm_max_threshold =
> > +		/*
> > +		 * A reasonable upper limit on the max threshold is the
> number
> > +		 * of lines tagged per RMID if all RMIDs have the same
> number of
> > +		 * lines tagged in the LLC.
> > +		 *
> > +		 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
> > +		 */
> > +		__intel_cqm_max_threshold =
> >  		boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid +
> 1);
> >
> > -	snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
> > -	str = kstrdup(scale, GFP_KERNEL);
> > -	if (!str) {
> > -		ret = -ENOMEM;
> > -		goto out;
> > -	}
> > +		snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
> > +		str = kstrdup(scale, GFP_KERNEL);
> > +		if (!str) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> >
> > -	event_attr_intel_cqm_llc_scale.event_str = str;
> > +		event_attr_intel_cqm_llc_scale.event_str = str;
> 
> Seperate function as well. And this splitout wants to be a separate patch.
> 
> > +	}
> >
> >  	ret = intel_cqm_setup_rmid_cache();
> >  	if (ret)
> >  		goto out;
> 
> Still leaks str and mbm_local
> 
> >  	for_each_online_cpu(i) {
> > -		intel_cqm_cpu_prepare(i);
> > +		ret = intel_cqm_cpu_prepare(i);
> > +		if (ret)
> > +			goto out_free;
> 
> Still leaks str and leaves more half initialized stuff around.
> 
> I.e., what happens if you prepared 5 cpus and the 6th fails?  You free
> mbm_local and mbm_total, and how is the other preparation of the cpus
> undone?
> 
> FYI, this is not the way a review cycle works. Either you address the
> comments or you reply to the comment and explain why you think that your
> code is correct. Just ignoring it does not work as you might have noticed.
> 
> I really don't care how much time you waste on this, but I pretty much care
> about the time I waste.
> 
> Please take your time and address all points before you post the next
> version. You have plenty of it as it's not going into 4.3 by any means.
> 
> Yours grumpy
> 
>       tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/