Message-ID: <4FBE7755.2080301@redhat.com>
Date:	Thu, 24 May 2012 15:00:53 -0300
From:	Mauro Carvalho Chehab <mchehab@...hat.com>
To:	Borislav Petkov <bp@...64.org>
CC:	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Aristeu Rozanski <arozansk@...hat.com>,
	Doug Thompson <norsk5@...oo.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] RAS: Add a tracepoint for reporting memory controller
 events

On 24-05-2012 13:45, Borislav Petkov wrote:
> On Thu, May 24, 2012 at 01:13:17PM -0300, Mauro Carvalho Chehab wrote:
>>> Why are we even exporting grain actually with each tracepoint
>>> invocation? This is the granularity of reported error in bytes, and it,
>>> as such, is statically assigned to a value in each driver. Userspace can
>>> certainly figure out that value in a different way.
>>
>> The API doesn't export the grain, except via the tracepoint/printk.
> 
> And this is exactly my question: if it is a static value which is set
> once per driver, why do we have to issue it with _every_ tracepoint
> invocation? Room in the per-cpu trace buffers is not for free.

On the current drivers, the grain is static. I'm not sure whether the grain is
really a per-memory-controller property, or whether this is yet another issue
with the way the EDAC core handles such information.

I suspect that, on sophisticated memory controllers that can do any type of
DIMM interleaving, including no interleave at all, the grain can vary from
one memory address range to the other.

If we change the API to have an explicit sysfs node to express the grain,
and later end up needing a per-address-range grain, we'll need to break
the kABI.

So, keeping the grain information at the tracepoint is more flexible, as it
can cover both situations.

> 
>>> But the more important question is: does the grain help us when handling
>>> the error info in userspace?
>>>
>>> It tells us that at this physical address with "grain" granularity we
>>> had an error. So?
>>
>> While a certain number of corrected errors that happened on different, sparsed,
>> addresses may not mean a damaged memory, the same number of corrected errors
>> happening at the same physical address/grain means that the DRAM chip that
>> contains such address is damaged, so the corresponding DIMM needs to be 
>> replaced.
>>
>> So, the address/grain can be used by userspace algorithms to increase the
>> probability that a DIMM is damaged.
> 
> I have no idea what you're saying here.
> 
> The DIMM can be pinpointed using the address only, why do you need the
> grain too?

You can pinpoint a DIMM, but in order to pinpoint the affected MOSFET transistors,
both the address and the address mask are needed, as most memory controllers can't
point to a single address, either because the register that stores the address
doesn't have enough bits to store the full content of the instruction pointer
register, or because of some other internal device issues.

So, two different "addresses" could actually point to the same group of transistors
inside a DIMM.
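
As a rough userspace sketch of that point (not code from any EDAC tool; the
function names and the sample events are made up for illustration), each
reported address can be rounded down to its grain before counting, so that
reports falling inside the same grain-sized block are treated as one location.
It assumes each event carries its own grain in bytes, as the tracepoint does:

```python
from collections import Counter


def grain_region(address, grain):
    """Return the start of the grain-sized block that contains address."""
    if grain < 1:
        raise ValueError("grain must be at least 1 byte")
    return address - (address % grain)


def count_by_region(events):
    """Count corrected errors per (region start, grain) pair.

    events is an iterable of (address, grain) tuples, one per reported
    error. The grain is taken from each event, not from a per-driver
    constant, so a grain that varies per address range still works.
    """
    return Counter((grain_region(addr, grain), grain) for addr, grain in events)


# Hypothetical sample events, for illustration only.
events = [
    (0x10000040, 64),  # same 64-byte block as the next entry
    (0x1000007F, 64),
    (0x20000000, 1),   # byte-exact report
]
counts = count_by_region(events)
```

Here the first two events land in the same bucket even though their raw
addresses differ, which is the case where comparing addresses alone would
undercount repeated errors.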

Also, higher grain values may affect the error statistics. For example, the
i3200_edac driver has a grain that can be 64 MB, while other devices have a
grain of 1.

If userspace uses a stochastic analysis to measure the error distribution,
the grain will affect the parameters of the Probability Distribution Function
used to estimate whether the errors were just due to random noise or to bad
memory.

Regards,
Mauro