Message-ID: <20250704101233.347506-1-guoqing.zhang@amd.com>
Date: Fri, 4 Jul 2025 18:12:28 +0800
From: Samuel Zhang <guoqing.zhang@....com>
To: <alexander.deucher@....com>, <christian.koenig@....com>,
	<rafael@...nel.org>, <len.brown@...el.com>, <pavel@...nel.org>,
	<gregkh@...uxfoundation.org>, <dakr@...nel.org>, <airlied@...il.com>,
	<simona@...ll.ch>, <ray.huang@....com>, <matthew.auld@...el.com>,
	<matthew.brost@...el.com>, <maarten.lankhorst@...ux.intel.com>,
	<mripard@...nel.org>, <tzimmermann@...e.de>
CC: <mario.limonciello@....com>, <lijo.lazar@....com>, <victor.zhao@....com>,
	<haijun.chang@....com>, <Qing.Ma@....com>, <linux-pm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <amd-gfx@...ts.freedesktop.org>,
	<dri-devel@...ts.freedesktop.org>, Samuel Zhang <guoqing.zhang@....com>
Subject: [PATCH v2 0/5] reduce system memory requirement for hibernation

Modern data center dGPUs are usually equipped with very large VRAM. On a
server with such dGPUs (192GB VRAM * 8) and 2TB of system memory,
hibernation fails because there is not enough free memory.

The root cause is that during hibernation all VRAM contents get evicted to
GTT or shmem. In both cases they end up in system memory, and the kernel
tries to copy those pages to the hibernation image. In the worst case this
results in 2 copies of the VRAM contents in system memory, and 2TB is not
enough to hold them: 192GB * 8 * 2 = 3TB > 2TB.
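
The worst-case arithmetic above can be checked with a short sketch. The
function names and the choice of GB as the unit are illustrative
assumptions, not code from this series; the sketch only doubles the total
VRAM (one evicted copy plus one copy in the hibernation image) and compares
the result with system memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Worst case: every evicted VRAM page sits in system memory (GTT or shmem)
 * and is copied once more into the hibernation image. Sizes are in GB. */
static uint64_t worst_case_sysmem_gb(uint64_t vram_per_gpu_gb,
				     unsigned int num_gpus)
{
	uint64_t evicted = vram_per_gpu_gb * num_gpus; /* first copy: evicted BOs */

	return evicted * 2; /* second copy: pages in the hibernation image */
}

/* True if the worst-case demand fits in the given system memory. */
static bool hibernation_fits(uint64_t vram_per_gpu_gb, unsigned int num_gpus,
			     uint64_t sysmem_gb)
{
	return worst_case_sysmem_gb(vram_per_gpu_gb, num_gpus) <= sysmem_gb;
}
```

For the configuration described here, worst_case_sysmem_gb(192, 8) is 3072
(3TB), so hibernation_fits(192, 8, 2048) is false.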

The fix includes the following changes. With these changes, far fewer
pages need to be copied to the hibernation image and hibernation can
succeed.
* patch 1 and 2: move GTT to shmem after evicting VRAM, so that the GTT
  pages can be freed.
* patch 3: force-write shmem pages to the swap disk and free them.
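
The effect of these steps on the amount of data the image has to cover can
be illustrated with a toy accounting model. Everything in it (the struct,
the function names, the GB unit) is hypothetical and is not a kernel data
structure; it also assumes that all shmem pages can be written to swap,
which is the best case rather than a guarantee.

```c
#include <stdint.h>

/* Toy accounting model, sizes in GB. Not kernel data structures. */
struct mem_model {
	uint64_t vram;  /* buffer contents still in VRAM */
	uint64_t gtt;   /* system memory pages backing GTT */
	uint64_t shmem; /* shmem pages in system memory */
	uint64_t swap;  /* pages written out to the swap disk */
};

/* Eviction: VRAM contents land in GTT-backed system memory. */
static void evict_vram(struct mem_model *m)
{
	m->gtt += m->vram;
	m->vram = 0;
}

/* Patches 1 and 2: move GTT contents to shmem so GTT pages can be freed. */
static void move_gtt_to_shmem(struct mem_model *m)
{
	m->shmem += m->gtt;
	m->gtt = 0;
}

/* Patch 3: write shmem pages to swap and free them (best case: all). */
static void swap_out_shmem(struct mem_model *m)
{
	m->swap += m->shmem;
	m->shmem = 0;
}

/* Pages still resident in system memory must be copied to the image. */
static uint64_t image_input_gb(const struct mem_model *m)
{
	return m->gtt + m->shmem;
}
```

Starting from 1536GB in VRAM, image_input_gb() stays at 1536 after
evict_vram() and move_gtt_to_shmem(), and only drops (to 0 in this best
case) after swap_out_shmem().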

After GTT is swapped out to shmem in the hibernation prepare stage, the GPU
is resumed again in the thaw stage. The swap-in and restore of BOs during
that resume take a long time (50 minutes observed for 8 dGPUs), and they
are unnecessary because writing the hibernation image does not need the GPU
when hibernation succeeds.
* patch 4 and 5: skip resuming the device in the thaw stage when
  hibernation succeeds, to reduce the hibernation time.
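
The decision made in thaw can be sketched as a small user-space model. The
enum and function below are hypothetical stand-ins, not the kernel's
definitions and not the code in patch 5; the sketch assumes the driver can
tell a successful image creation apart from a failed one by inspecting the
exported transition state.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PM transition events seen in thaw();
 * these are not the kernel's definitions. */
enum model_pm_event {
	MODEL_EVENT_THAW,    /* image was created successfully */
	MODEL_EVENT_RECOVER, /* image creation failed, system keeps running */
};

/* Assumed rule: resume the GPU in thaw() only if hibernation failed,
 * because writing the image does not need the GPU. */
static bool thaw_needs_device_resume(enum model_pm_event transition)
{
	return transition == MODEL_EVENT_RECOVER;
}
```

With this rule the expensive swap-in and BO restore are paid only on the
failure path, where the system continues to run and needs a working GPU.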

v2:
* split the first patch into 2 patches, 1 for ttm, 1 for amdgpu
* refined the new ttm api
* add more comments for shrink_shmem_memory() and its callsite
* export variable pm_transition in kernel
* skip resume in thaw() for successful hibernation case

Samuel Zhang (5):
1. drm/ttm: add ttm_device_prepare_hibernation() api
2. drm/amdgpu: move GTT to shmem after eviction for hibernation
3. PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
4. PM: hibernate: export variable pm_transition
5. drm/amdgpu: do not resume device in thaw for normal hibernation

 drivers/base/power/main.c                    |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c      | 10 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      | 13 ++++++++-
 drivers/gpu/drm/amd/dkms/config/config.h     |  3 ++
 drivers/gpu/drm/amd/dkms/m4/pm_transition.m4 | 15 ++++++++++
 drivers/gpu/drm/ttm/ttm_device.c             | 29 ++++++++++++++++++++
 include/drm/ttm/ttm_device.h                 |  1 +
 include/linux/pm.h                           |  2 ++
 kernel/power/hibernate.c                     | 26 ++++++++++++++++++
 9 files changed, 100 insertions(+), 2 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/dkms/m4/pm_transition.m4

-- 
2.43.5

