lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20191118203021.jvlerg4k645g43yn@rric.localdomain>
Date:   Mon, 18 Nov 2019 20:30:29 +0000
From:   Robert Richter <rrichter@...vell.com>
To:     Mauro Carvalho Chehab <mchehab@...nel.org>
CC:     Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
        "James Morse" <james.morse@....com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 14/20] EDAC, mc: Remove per layer counters

Hi Mauro,

On 09.11.19 08:40:56, Mauro Carvalho Chehab wrote:
> Em Wed, 6 Nov 2019 09:33:32 +0000
> Robert Richter <rrichter@...vell.com> escreveu:
> 
> > Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
> > turns out that only the leaves in the memory hierarchy are consumed
> > (in sysfs), but not the intermediate layers, e.g.:
> > 
> >  count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];

first of all, this is the only user, where ce_per_layer[][] is
accessed *readable*. Note that n_layers is a constant value per
mci. Thus we could also convert this without any change of
functionality to:

 count = dimm->mci->ce_counts[dimm->idx];

We can also remove the code that writes counter values to inner
layers, those values are never read.

The above is nothing else than storing the count per DIMM, which can
be converted to the following by just adding the count to struct
dimm_info:

 count = dimm->ce_count;

Same applies to ue_count.

As we have the counts in struct dimm_info now, we no longer need to
allocate {ue,ce}_counts arrays and can remove its allocation and
release code including everything around.

> > 
> > These unused counters only add complexity, remove them. The error
> > counter values are directly stored in struct dimm_info now.
> 
> I guess this patch will cause troubles with some memory controllers.
> 
> The problem is that, depending on the memory type and how many bits
> are wrong, it may not be technically possible to pinpoint an error
> to a single DIMM.

If a DIMM can not be identified, the MC has one or more of the pos
values (pos[0] to pos[mci->n_layers-1]) unset (negative values). The
count of the outer layer (mci->ce_per_layer[mci->n_layers][index]) is
not written then. See below in function edac_inc_ce_error().

> 
> I mean, the memory controller can be, for instance, grouping
> DIMM1 and DIMM2. If there's just one bit errored, it is possible to
> assign it to either DIMM1 or DIMM2, but if there are multiple bits
> wrong, most ECC codes won't allow to pinpoint if the error ocurred
> at DIMM1 or at DIMM2.

An error would not be counted for any DIMM then.

> 
> All we know is that the layer has an error.

Right, but this hasn't any effect on DIMM error counters.

This has only effect to csrow/channel counters. The code for this did
not change, see edac_mc_handle_error().

> 
> So, assigning the error to the dimm struct seems plain wrong to me.

I think this is the code in question for you:

> >  static void edac_inc_ce_error(struct mem_ctl_info *mci,
> > -			      bool enable_per_layer_report,
> >  			      const int pos[EDAC_MAX_LAYERS],
> >  			      const u16 count)
> >  {
> > -	int i, index = 0;
> > +	struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);
> >  
> >  	mci->ce_mc += count;
> >  
> > -	if (!enable_per_layer_report) {
> > +	if (dimm)
> > +		dimm->ce_count += count;
> > +	else
> >  		mci->ce_noinfo_count += count;
> > -		return;
> > -	}
> > -
> > -	for (i = 0; i < mci->n_layers; i++) {
> > -		if (pos[i] < 0)
> > -			break;
> > -		index += pos[i];
> > -		mci->ce_per_layer[i][index] += count;

No value written here if pos[] < 0.

> > -
> > -		if (i < mci->n_layers - 1)
> > -			index *= mci->layers[i + 1].size;
> > -	}

So in an intermediate step the for loop could be converted to:

	dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);

	if (dimm)
		mci->ce_per_layer[mci->n_layers - 1][dimm->idx] += count;

No change in functionality, right?

> >  }

I hope this explains what this patch does.

It looks sane to me, please review again. If you still think it is
broken, give me an example.

Thanks,

-Robert

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ