Date:   Thu, 28 Feb 2019 20:41:53 +0100
From:   Gerhard Wiesinger <lists@...singer.com>
To:     Maxime Ripard <maxime.ripard@...tlin.com>
Cc:     arm@...ts.fedoraproject.org, Chen-Yu Tsai <wens@...e.org>,
        LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
        Florian Fainelli <f.fainelli@...il.com>, filbar@...trum.cz
Subject: Re: Banana Pi-R1 stabil

On 28.02.2019 10:35, Maxime Ripard wrote:
> On Wed, Feb 27, 2019 at 07:58:14PM +0100, Gerhard Wiesinger wrote:
>> On 27.02.2019 10:20, Maxime Ripard wrote:
>>> On Sun, Feb 24, 2019 at 09:04:57AM +0100, Gerhard Wiesinger wrote:
>>>> Hello,
>>>>
>>>> I have 3 Banana Pi R1 boards: one runs a self-compiled kernel
>>>> 4.7.4-200.BPiR1.fc24.armv7hl on an old Fedora 25 and is VERY STABLE; the 2
>>>> others run the latest Fedora 29 with kernel 4.20.10-200.fc29.armv7hl. I
>>>> tried a lot of kernels from around 4.11
>>>> (kernel-4.11.10-200.fc25.armv7hl) up to 4.20.10, but all of them either
>>>> crashed without any output on the serial console or hit kernel panics
>>>> after a short period of time (minutes, hours, at most days).
>>>>
>>>> Latest known working and stable self compiled kernel: kernel
>>>> 4.7.4-200.BPiR1.fc24.armv7hl:
>>>>
>>>> https://www.wiesinger.com/opensource/fedora/kernel/BananaPi-R1/
>>>>
>>>> With 4.8.x the DSA b53 switch infrastructure has been introduced which
>>>> didn't work (until ca8931948344c485569b04821d1f6bcebccd376b and kernel
>>>> 4.18.x):
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/dsa/b53?h=v4.20.12
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/dsa/b53?h=v4.20.12
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/dsa/b53?h=v4.20.12&id=ca8931948344c485569b04821d1f6bcebccd376b
>>>>
>>>> It has been fixed with kernel 4.18.x:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/dsa/b53?h=linux-4.18.y
>>>>
>>>>
>>>> So the current status is that the kernel crashes regularly; see some samples
>>>> below. It is typically an "Unable to handle kernel paging request at virtual addres"
>>>>
>>>> Another interesting thing: a Banana Pro (which also has an
>>>> Allwinner A20 in the same revision) works well running the same Fedora 29
>>>> and latest kernels (e.g. kernel 4.20.10-200.fc29.armv7hl).
>>>>
>>>> Since it happens on 2 different devices and with different power supplies
>>>> (all with enough power, and of the same type that works well on the
>>>> working old kernel), a hardware issue is very unlikely.
>>>>
>>>> I guess it has something to do with virtual memory.
>>>>
>>>> Any ideas?
>>>> [47322.960193] Unable to handle kernel paging request at virtual addres 5675d0
>>> That line is a bit suspicious
>>>
>>> Anyway, cpufreq is known to cause this kind of error when the
>>> voltage / frequency association is not correct.
>>>
>>> Given the stack trace and that the BananaPro doesn't have cpufreq
>>> enabled, my first guess would be that it's what's happening. Could you
>>> try using the performance governor and see if it's more stable?
>>>
>>> If it is, then using this:
>>> https://github.com/ssvb/cpuburn-arm/blob/master/cpufreq-ljt-stress-test
>>>
>>> will help you find the offending voltage-frequency couple.
>> To me it looks like they all have the same config regarding the CPU governor
>> (Banana Pro, old kernel stable one, new kernel unstable ones)
> The Banana Pro doesn't have a regulator set up, so it will only change
> the frequency, not the voltage.
>
>> They all have the ondemand governor set:
>>
>> I set on the 2 unstable "new kernel Banana Pi R1":
>>
>> # Set to max performance
>> echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
> What are the results?


Stable for more than 1.5 days now. Normally they would have crashed
within that much uptime. So it looks like the performance governor fixes it.
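
For reference, a minimal sketch of how the governor can be set on every
core in one go instead of one echo per CPU. The function name set_governor
and its optional second argument (an alternative sysfs root, only useful
for a dry run against a scratch directory) are my own assumptions, not
part of any existing tool:

```shell
# Hypothetical helper: write one governor name to every cpufreq policy.
# $1 = governor name (e.g. performance)
# $2 = cpu sysfs root (assumed default: /sys/devices/system/cpu)
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue          # glob matched nothing, skip
        echo "$gov" > "$f" || return 1   # needs root on the real sysfs
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Called as "set_governor performance" (as root) it prints the governor it
reads back for each CPU, so a rejected write is visible right away.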

I guess the crashes occur because CPU voltage and clock changes lead to
invalid data (e.g. invalid RAM contents might be read, register
problems, etc.).

Any ideas how to fix it for ondemand mode, too?

But that doesn't explain why it works with kernel 4.7.4 without any
problems.
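
One way to compare the old and the new kernel would be to look at which
frequencies each of them actually uses. A minimal sketch, assuming the
kernel is built with cpufreq statistics (CONFIG_CPU_FREQ_STAT) so that
stats/time_in_state exists with one "<freq in kHz> <time>" pair per line;
the function name freq_residency is hypothetical:

```shell
# Hypothetical helper: print the share of time spent at each frequency.
# $1 = path to a time_in_state file (assumed default: cpu0's stats file)
freq_residency() {
    file="${1:-/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state}"
    awk '{ freq[NR] = $1; t[NR] = $2; total += $2 }
         END {
             for (i = 1; i <= NR; i++)
                 printf "%s kHz %.1f%%\n", freq[i],
                        (total ? 100 * t[i] / total : 0)
         }' "$file"
}
```

If the two kernels show clearly different residency at some frequency,
that operating point would be the first candidate to look at.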


>
>> Running some stress tests is ok (I did that already in the past, but
>> without setting the performance governor).
> Which stress tests have you been running?


Now:

while true; do
  echo "========================================"
  echo -n "TEMP     : "; cat /sys/devices/virtual/thermal/thermal_zone0/temp
  echo -n "VOLTAGE  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/voltage_now
  echo -n "CURRENT  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/current_now
  echo -n "CPU_FREQ0: "; cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
  echo -n "CPU_FREQ1: "; cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
  sleep 1
done &
stress -c 4 -t 900s

In the past also:

while true; do
  echo "========================================"
  echo -n "TEMP     : "; cat /sys/devices/virtual/thermal/thermal_zone0/temp
  echo -n "VOLTAGE  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/voltage_now
  echo -n "CURRENT  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/current_now
  echo -n "CPU_FREQ0: "; cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
  echo -n "CPU_FREQ1: "; cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
  sleep 1
done &
stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 1G --timeout 900s --metrics-brief

while true; do
  echo "========================================"
  echo -n "TEMP     : "; cat /sys/devices/virtual/thermal/thermal_zone0/temp
  echo -n "VOLTAGE  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/voltage_now
  echo -n "CURRENT  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/current_now
  echo -n "CPU_FREQ0: "; cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
  echo -n "CPU_FREQ1: "; cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
  sleep 1
done &
./cpuburn-a7

https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
while true; do
  echo "========================================"
  echo -n "TEMP     : "; cat /sys/devices/virtual/thermal/thermal_zone0/temp
  echo -n "VOLTAGE  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/voltage_now
  echo -n "CURRENT  : "; cat /sys/devices/platform/soc@...0000/1c2ac00.i2c/i2c-0/0-0034/axp20x-ac-power-supply/power_supply/axp20x-ac/current_now
  echo -n "CPU_FREQ0: "; cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
  echo -n "CPU_FREQ1: "; cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
  sleep 1
done &
stress -c 2 -i 1 -m 1 --vm-bytes 128M -t 900s


But I guess that the problems occur not under full load but under
dynamically changing load (when CPU voltage and clock change), because
the Bananas are typically really idle when they crash (with the ondemand
governor).
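
To provoke that on purpose, the load would have to alternate between busy
and idle so that ondemand keeps switching operating points. A rough
sketch of such a test, assuming GNU coreutils "timeout" is available; the
function name switch_load and its three arguments are made up for
illustration, and the busy phase only spins a single core:

```shell
# Hypothetical helper: alternate between a busy and an idle phase.
# $1 = number of cycles, $2 = busy seconds, $3 = idle seconds
switch_load() {
    cycles="$1"; busy="$2"; idle="$3"
    i=0
    while [ "$i" -lt "$cycles" ]; do
        # busy phase: spin in a subshell until timeout kills it
        timeout "$busy" sh -c 'while :; do :; done' || true
        # idle phase: give the governor time to scale back down
        sleep "$idle"
        i=$((i + 1))
    done
    echo "completed $i cycles"
}
```

Something like "switch_load 1000 2 3" would run for well over an hour
with the ondemand governor active; the spin loop could be swapped for
"stress -c 2 -t ..." to load both cores.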


Thanks.

Ciao,

Gerhard
