Message-ID: <CY4PR18MB1576506048A5A1B57D177C6AB7F59@CY4PR18MB1576.namprd18.prod.outlook.com>
Date:   Wed, 4 Jan 2023 09:19:55 +0000
From:   Igor Russkikh <irusskikh@...vell.com>
To:     Jesse <pianohacker@...il.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "Chia-Lin Kao (AceLan)" <acelan.kao@...onical.com>
Subject: RE: [EXT] Bad page after suspend with Innodisk EGPL-T101 [1d6a:14c0]
Hi Jesse,
Adding Chia-Lin Kao, who recently fixed S3 related issues in the driver.
The stack trace indicates that aq_ring_alloc (memory allocation for the rings) failed.
In its failure path it calls aq_ring_free, which invokes kfree without a pointer check.
Blind guess – 
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 25129e723b57..27ecef6cec28 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -917,7 +917,8 @@ void aq_ring_free(struct aq_ring_s *self)
        if (!self)
                return;
 
-       kfree(self->buff_ring);
+       if (self->buff_ring)
+               kfree(self->buff_ring);
 
        if (self->dx_ring)
                dma_free_coherent(aq_nic_get_dev(self->aq_nic),
may help here.
But the question is why the allocation fails on resume. Maybe a memory leak...
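
For illustration only, here is a rough userspace sketch of a failure path that is safe to run on a partially initialised ring. Everything in it (struct ring, ring_alloc(), ring_free(), the fail_second switch) is a hypothetical stand-in, with calloc/free in place of the kernel allocators. It is not the atlantic driver code and it does not explain why the allocation fails on resume. The -ENOMEM return corresponds to the -12 in the log below.

```c
/* Userspace sketch only: struct ring, ring_alloc() and ring_free() are
 * hypothetical stand-ins, not the atlantic driver's code. */
#include <errno.h>
#include <stdlib.h>

struct ring {
	void *buff_ring;	/* stand-in for self->buff_ring */
	void *dx_ring;		/* stand-in for self->dx_ring */
};

/* Release whatever was allocated and reset the pointers, so calling this
 * twice, or on a half-initialised ring, is harmless. */
void ring_free(struct ring *r)
{
	if (!r)
		return;

	/* Explicit checks mirror the guess in the diff above. */
	if (r->buff_ring) {
		free(r->buff_ring);
		r->buff_ring = NULL;
	}

	if (r->dx_ring) {
		free(r->dx_ring);
		r->dx_ring = NULL;
	}
}

/* Allocate both buffers; on any failure unwind and return -ENOMEM.
 * fail_second is a test-only switch (an assumption of this sketch) that
 * forces the second allocation to fail so the error path is exercised. */
int ring_alloc(struct ring *r, size_t n, int fail_second)
{
	/* Start from a known state so the error path never sees garbage. */
	r->buff_ring = NULL;
	r->dx_ring = NULL;

	r->buff_ring = calloc(n, sizeof(long));
	if (!r->buff_ring)
		goto err;

	r->dx_ring = fail_second ? NULL : calloc(n, sizeof(long));
	if (!r->dx_ring)
		goto err;

	return 0;

err:
	ring_free(r);
	return -ENOMEM;
}
```

The point of the sketch is that the pointers are set to NULL both before allocation and after freeing, so the unwind never touches a stale pointer.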
Regards,
   Igor
From: Jesse <pianohacker@...il.com> 
Sent: Wednesday, January 4, 2023 07:58
To: Igor Russkikh <irusskikh@...vell.com>
Cc: netdev@...r.kernel.org
Subject: [EXT] Bad page after suspend with Innodisk EGPL-T101 [1d6a:14c0]
After resume, I sometimes see the following error and the device hangs:
[36257.935269] BUG: Bad page state in process kworker/u64:33  pfn:10e400
[36257.935269] page:00000000597be4f0 refcount:0 mapcount:0 mapping:00000000eeb38d16 index:0x0 pfn:0x10e400
[36257.935270] aops:anon_aops.1 ino:63a9
[36257.935271] flags: 0x17ffffc0000800(arch_1|node=0|zone=2|lastcpupid=0x1fffff)
[36257.935271] raw: 0017ffffc0000800 0000000000000000 dead000000000122 ffff970d81f08178
[36257.935272] raw: 0000000000000000 0000000000000003 00000000ffffffff 0000000000000000
[36257.935272] page dumped because: non-NULL mapping
[36257.935272] Modules linked in: i2c_dev xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables br_netfilter bridge stp llc wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel ctr ccm snd_seq_dummy snd_hrtimer snd_seq nfnetlink tun rfcomm cmac algif_hash algif_skcipher af_alg qrtr overlay bnep binfmt_misc nls_ascii nls_cp437 vfat fat ext4 squashfs mbcache jbd2 loop btusb intel_rapl_msr intel_rapl_common iwlmvm btrtl btbcm btintel btmtk snd_hda_codec_realtek edac_mce_amd bluetooth mac80211 snd_hda_codec_generic uvcvideo snd_hda_codec_hdmi videobuf2_vmalloc snd_hda_intel kvm_amd videobuf2_memops snd_usb_audio snd_intel_dspcfg videobuf2_v4l2 eeepc_wmi snd_intel_sdw_acpi jitterentropy_rng libarc4 asus_wmi videobuf2_common asus_ec_sensors snd_hda_codec drbg snd_usbmidi_lib platform_profile kvm iwlwifi
[36257.935286]  videodev ansi_cprng battery snd_rawmidi snd_hda_core irqbypass sparse_keymap snd_seq_device ecdh_generic rapl ledtrig_audio wmi_bmof pcspkr mc snd_hwdep zenpower(OE) ecc cfg80211 crc16 joydev snd_pcm razermouse(OE) snd_timer cdc_acm snd ccp sp5100_tco soundcore rfkill rng_core watchdog acpi_cpufreq evdev nfsd auth_rpcgss nfs_acl lockd lm92 grace nct6775 nct6775_core hwmon_vid sunrpc msr drivetemp parport_pc ppdev lp parport fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic dm_crypt dm_mod hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid amdgpu gpu_sched drm_buddy video drm_display_helper cec crc32_pclmul crc32c_intel rc_core ghash_clmulni_intel ahci drm_ttm_helper sha512_ssse3 ttm libahci sha512_generic xhci_pci drm_kms_helper nvme libata xhci_hcd nvme_core atlantic aesni_intel drm t10_pi crypto_simd igb scsi_mod usbcore crc64_rocksoft_generic cryptd macsec dca crc64_rocksoft
[36257.935303]  i2c_piix4 crc_t10dif ptp crct10dif_generic i2c_algo_bit crct10dif_pclmul scsi_common usb_common crc64 crct10dif_common pps_core wmi button
[36257.935305] CPU: 8 PID: 610626 Comm: kworker/u64:33 Tainted: G    B      OE      6.1.0-0-amd64 #1  Debian 6.1.1-1~exp2
[36257.935306] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4408 10/28/2022
[36257.935306] Workqueue: events_unbound async_run_entry_fn
[36257.935307] Call Trace:
[36257.935307]  <TASK>
[36257.935307]  dump_stack_lvl+0x44/0x5c
[36257.935308]  bad_page.cold+0x63/0x8f
[36257.935309]  __free_pages_ok+0x139/0x4f0
[36257.935310]  ? force_dma_unencrypted+0x27/0xa0
[36257.935311]  aq_ring_alloc+0xa4/0xb0 [atlantic]
[36257.935315]  aq_vec_ring_alloc+0xea/0x1a0 [atlantic]
[36257.935320]  aq_nic_init+0x114/0x1d0 [atlantic]
[36257.935324]  atl_resume_common+0x40/0xd0 [atlantic]
[36257.935328]  ? pci_legacy_resume+0x80/0x80
[36257.935329]  dpm_run_callback+0x4a/0x150
[36257.935330]  device_resume+0x88/0x190
[36257.935331]  async_resume+0x19/0x30
[36257.935331]  async_run_entry_fn+0x30/0x130
[36257.935332]  process_one_work+0x1c7/0x380
[36257.935333]  worker_thread+0x4d/0x380
[36257.935335]  ? rescuer_thread+0x3a0/0x3a0
[36257.935336]  kthread+0xe9/0x110
[36257.935336]  ? kthread_complete_and_exit+0x20/0x20
[36257.935337]  ret_from_fork+0x22/0x30
[36257.935339]  </TASK>
[36257.935445] atlantic 0000:01:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -12
[36257.935447] atlantic 0000:01:00.0: PM: failed to resume async: error -12
This error occurs inconsistently; sometimes after a single sleep/wake cycle, sometimes after multiple. I have tried all of the random kernel flags I can find from the most reputable stackexchange posts, including pci=nommconf.
Note that this is with iommu=pt. Without this flag there are iommu errors before a crash with a similar traceback.
On kernel 6.1.1 (not latest, but don't see relevant changes in Git since). Apologies if this is the wrong path for reporting bugs.
-- 
Jesse Weaver