[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20221117161951.845454-18-ogabbay@kernel.org>
Date: Thu, 17 Nov 2022 18:19:49 +0200
From: Oded Gabbay <ogabbay@...nel.org>
To: linux-kernel@...r.kernel.org
Cc: Tomer Tayar <ttayar@...ana.ai>
Subject: [PATCH 18/20] habanalabs: extend process wait timeout in device fine
Processes that use our device are likely to use at the same time other
devices such as remote storage.
In case our device is removed and a user process is still using the
device, we need to kill the user process. However, if that process
has a thread waiting for i/o to complete on remote storage, for example,
the process won't terminate.
Let's give it enough time to terminate before giving up.
Signed-off-by: Oded Gabbay <ogabbay@...nel.org>
Reviewed-by: Tomer Tayar <ttayar@...ana.ai>
---
drivers/misc/habanalabs/common/device.c | 6 ++++--
drivers/misc/habanalabs/common/habanalabs.h | 11 ++++++++---
2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c
index 0650e511a0f5..63d0cb7087e8 100644
--- a/drivers/misc/habanalabs/common/device.c
+++ b/drivers/misc/habanalabs/common/device.c
@@ -2300,14 +2300,16 @@ void hl_device_fini(struct hl_device *hdev)
*/
dev_info(hdev->dev,
"Waiting for all processes to exit (timeout of %u seconds)",
- HL_PENDING_RESET_LONG_SEC);
+ HL_WAIT_PROCESS_KILL_ON_DEVICE_FINI);
- rc = device_kill_open_processes(hdev, HL_PENDING_RESET_LONG_SEC, false);
+ hdev->process_kill_trial_cnt = 0;
+ rc = device_kill_open_processes(hdev, HL_WAIT_PROCESS_KILL_ON_DEVICE_FINI, false);
if (rc) {
dev_crit(hdev->dev, "Failed to kill all open processes\n");
device_disable_open_processes(hdev, false);
}
+ hdev->process_kill_trial_cnt = 0;
rc = device_kill_open_processes(hdev, 0, true);
if (rc) {
dev_crit(hdev->dev, "Failed to kill all control device open processes\n");
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 0781b8698f74..e7f89868428d 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -50,9 +50,14 @@ struct hl_fpriv;
#define HL_MMAP_OFFSET_VALUE_MASK (0x1FFFFFFFFFFFull >> PAGE_SHIFT)
#define HL_MMAP_OFFSET_VALUE_GET(off) (off & HL_MMAP_OFFSET_VALUE_MASK)
-#define HL_PENDING_RESET_PER_SEC 10
-#define HL_PENDING_RESET_MAX_TRIALS 60 /* 10 minutes */
-#define HL_PENDING_RESET_LONG_SEC 60
+#define HL_PENDING_RESET_PER_SEC 10
+#define HL_PENDING_RESET_MAX_TRIALS 60 /* 10 minutes */
+#define HL_PENDING_RESET_LONG_SEC 60
+/*
+ * In device fini, wait 10 minutes for user processes to be terminated after we kill them.
+ * This is needed to prevent situation of clearing resources while user processes are still alive.
+ */
+#define HL_WAIT_PROCESS_KILL_ON_DEVICE_FINI 600
#define HL_HARD_RESET_MAX_TIMEOUT 120
#define HL_PLDM_HARD_RESET_MAX_TIMEOUT (HL_HARD_RESET_MAX_TIMEOUT * 3)
--
2.25.1
Powered by blists - more mailing lists