lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 11 Apr 2022 10:59:07 -0500
From:   Alex Elder <elder@...aro.org>
To:     Stephen Boyd <swboyd@...omium.org>, djakov@...nel.org,
        okukatla@...eaurora.org, quic_mdtipton@...cinc.com
Cc:     linux-arm-msm@...r.kernel.org, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        Mike Tipton <mdtipton@...eaurora.org>, mka@...omium.org,
        dianders@...omium.org
Subject: Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list
 in pre_aggregate

On 4/5/22 6:00 PM, Stephen Boyd wrote:
> Quoting Georgi Djakov (2021-11-25 09:47:51)
>> From: Mike Tipton <mdtipton@...eaurora.org>
>>
>> We're only adding BCMs to the commit list in aggregate(), but there are
>> cases where pre_aggregate() is called without subsequently calling
>> aggregate(). In particular, in icc_sync_state() when a node with initial
>> BW has zero requests. Since BCMs aren't added to the commit list in
>> these cases, we don't actually send the zero BW request to HW. So the
>> resources remain on unnecessarily.
>>
>> Add BCMs to the commit list in pre_aggregate() instead, which is always
>> called even when there are no requests.
>>
>> Signed-off-by: Mike Tipton <mdtipton@...eaurora.org>
>> [georgi: remove icc_sync_state for platforms with incomplete support]
>> Signed-off-by: Georgi Djakov <djakov@...nel.org>

I'm back from vacation and am finally giving proper attention to
this.  I want to make sure I understand the problem, because there
are (at least) two parts to it.

- The first problem you observe is that you are not seeing XO
   shutdown on suspend on a Lazor device.
- You didn't say this directly but I think you are seeing this
   on Linux v5.15.y (the 5.15 LTS branch), or perhaps on something
   derived from that branch.
- You find that if you back-port (or cherry-pick?) the commit
   that landed upstream as b95b668eaaa2 ("interconnect: qcom:
   icc-rpmh: Add BCMs to commit list in pre_aggregate
"), you
   *do* see XO shutdown on suspend, as desired.

Here's what I understand that commit to do:
- In some cases, the bus clock managers (BCMs) are configured
   by the boot loader so that some interconnects have non-zero
   initial bandwidth.
- There is no sense in keeping an interconnect active if Linux
   has nothing that requires its use.  So we would like Linux to
   ensure the configured bandwidth for an *unused* interconnect
   is zero.
- Prior to that commit, BCM-managed hardware was only queued
   to update its configuration when the ->aggregate interconnect
   provider function was called.  After that commit, updates were
   queued by the ->pre_aggregate provider function.
- Unlike the ->aggregate callback, the ->pre_aggregate provider
   function queues updates to the hardware configuration whether
   or not they have active users.
- The result of this commit is that the hardware configuration
   for all defined BCM-managed interconnects is updated, and in
   particular, the configured bandwidth for unused interconnects
   is set to zero.

When unused interconnects are configured for zero bandwidth, they
do not require an active main XO clock, and so with this commit
it becomes possible for the XO clock to be shut down.

And that's why this commit addresses your XO shutdown problem on
the Linux 5.15 LTS branch.

Is the above an accurate description?

Looking at that branch, I see this commit:  f753067494c27
("Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit
list in pre_aggregate"
").  Which shows that an attempt was made
to include this commit in the 5.15 LTS branch, but it caused
some *other* regressions.  That suggests this might not be
easy to fix.

---

The second problem you have is exhibited by the IPA driver if
the "fix" commit (upstream b95b668eaaa2) is back-ported to the
Linux 5.10.y LTS branch (along with some other prerequisite
commits).  We can conclude that applying the above commit
makes the bandwidth for an unused interconnect (or perhaps
the rate for the IPA core clock) get set to zero.  And in that
case, an attempt to access IPA hardware leads to the crash you
observed.

The IPA driver does not implement runtime power management
until Linux v5.15.  You later said you thought enabling that
might ensure the clock and interconnects were active when
needed by the IPA driver, and I concur (but there could be a
little more to it).

In any case, based on the time stamp in your log, it seems
this problem is likely occurring upon the first access to IPA
hardware.

I have a hunch about what might be happening here.  There is
some synchronization that must occur between the AP and modem
when IPA is starting up.  Until that synchronization step has
completed, we can't allow the IPA network device to be opened.
In later kernels I think this is precluded, but perhaps in
Linux v5.10 it isn't.  Until I look a little more closely I'm
not sure what would happen, but it *could* be this.

I'm going to look a little how the particular access that
caused the crash is prevented in newer kernels.  It could
be that back-porting that (or re-implementing it for the
older kernel) will address the crash you're seeing.

					-Alex

> This patch fixes suspend/resume for me on sc7180-trogdor-lazor. Without
> it I can't achieve XO shutdown. It seems that it fixes the sync_state
> support that was added in commit b1d681d8d324 ("interconnect: Add sync
> state support"). Before that commit suspend worked because the
> interconnect wasn't maxed out at boot. After that commit we started
> maxing out the interconnect state and never dropping it.
> 
> It would be good to pick this back to stable kernels so we have a
> working suspend/resume on LTS kernels. I tried picking it back to
> 5.10.109 (latest 5.10 LTS) and booting it on my Lazor w/ LTE device but
> it crashes at boot pretty reliably in the IPA driver. Interestingly I
> can't get it to crash on 5.15.32 when I pick it back, so maybe something
> has changed between 5.10 and 5.15 for IPA? I'll try to bisect it.
> 
> [   23.708432] Internal error: synchronous external abort: 96000010
> [#1] PREEMPT SMP
> [   23.708451] Modules linked in: veth rfcomm algif_hash
> algif_skcipher af_alg uinput xt_MASQUERADE uvcvideo videobuf2_vmalloc
> venus_enc venus_dec videobuf2_dma_sg videobuf2_memops venus_core
> v4l2_mem2mem videobuf2_v4l2 cros_ec_typec videobuf2_common hci_uart
> typec btqca qcom_q6v5_mss ipa qcom_pil_info qcom_q6v5 qcom_common
> rmtfs_mem ip6table_nat fuse 8021q bluetooth ecdh_generic ecc
> ath10k_snoc ath10k_core ath lzo_rle lzo_compress mac80211 zram
> cfg80211 r8152 mii joydev
> [   23.708565] CPU: 5 PID: 3706 Comm: mmdata_mgr Not tainted 5.10.109+ #61
> [   23.708571] Hardware name: Google Lazor (rev1+) with LTE (DT)
> [   23.708578] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> [   23.708597] pc : gsi_channel_start+0x78/0x1dc [ipa]
> [   23.708609] lr : gsi_channel_start+0x4c/0x1dc [ipa]
> [   23.708614] sp : ffffffc013d9ba20
> [   23.708619] x29: ffffffc013d9ba20 x28: 0000000000000000
> [   23.708628] x27: 0000000000000000 x26: ffffffc013d9bc20
> [   23.708637] x25: 000000000001c000 x24: 0000000000000000
> [   23.708646] x23: ffffffab00cb9410 x22: 00000000712dcf80
> [   23.708654] x21: ffffffab486bc148 x20: ffffffab486b8a18
> [   23.708663] x19: ffffffab486b8000 x18: 00000000ffff0a00
> [   23.708671] x17: 000000002f7254f1 x16: ffffffeb3db6f344
> [   23.708680] x15: 00000000ffee6094 x14: ffffffffffffffff
> [   23.708689] x13: 0000000000000010 x12: 0101010101010101
> [   23.708697] x11: 000000000000013c x10: 0000000000000000
> [   23.708706] x9 : 000000000001c000 x8 : ffffffc018f1c000
> [   23.708715] x7 : fefefefefeff2f60 x6 : 0000808080808080
> [   23.708724] x5 : 0000000000000000 x4 : 8080808080800000
> [   23.708732] x3 : 0000000000000000 x2 : ffffffab5089eac0
> [   23.708741] x1 : 0000000000000000 x0 : 0000000000000000
> [   23.708750] Call trace:
> [   23.708762]  gsi_channel_start+0x78/0x1dc [ipa]
> [   23.708773]  ipa_endpoint_enable_one+0x34/0xc0 [ipa]
> [   23.708785]  ipa_open+0x30/0x98 [ipa]
> [   23.708795]  __dev_open+0xd8/0x190
> [   23.708803]  __dev_change_flags+0xbc/0x1c8
> [   23.708810]  dev_change_flags+0x30/0x70
> [   23.708818]  devinet_ioctl+0x274/0x500
> [   23.708824]  inet_ioctl+0x10c/0x394
> [   23.708831]  sock_do_ioctl+0x58/0x324
> [   23.708837]  compat_sock_ioctl+0x238/0xdb0
> [   23.708845]  __arm64_compat_sys_ioctl+0xcc/0x104
> [   23.708854]  el0_svc_common+0xec/0x1dc
> [   23.708860]  do_el0_svc_compat+0x28/0x54
> [   23.708868]  el0_svc_compat+0x10/0x1c
> [   23.708874]  el0_sync_compat_handler+0xc0/0xf0
> [   23.708880]  el0_sync_compat+0x184/0x1c0
> [   23.708890] Code: 51286129 53037d29 1b166529 8b090108 (b9400108)
> 
> Note I had to pick a handful of other patches for nvmem to get normal
> boot on 5.10.109. I'll send those over to stable maintainers shortly.
> 
> 	fd3bb8f54a88 ("nvmem: core: Add support for keepout regions")
> 	de0534df9347 ("nvmem: core: fix error handling while validating
> keepout regions")
> 	044ee8f85267 ("nvmem: qfprom: Don't touch certain fuses")
> 	437145dbcdee ("arm64: dts: qcom: sc7180: Add soc-specific qfprom
> compat string")
> 	437cdef515e2 ("arm64: dts: qcom: sc7180:: modified qfprom CORR size
> as per RAW size")

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ