[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20200328085238.3428-6-oded.gabbay@gmail.com>
Date: Sat, 28 Mar 2020 11:52:38 +0300
From: Oded Gabbay <oded.gabbay@...il.com>
To: linux-kernel@...r.kernel.org, oshpigelman@...ana.ai,
ttayar@...ana.ai
Cc: gregkh@...uxfoundation.org
Subject: [PATCH 6/6] habanalabs: increase timeout during reset
When doing training, the DL framework (e.g. tensorflow) performs hundreds
of thousands of memory allocations and mappings. In case the driver needs
to perform hard-reset during training, the driver kills the application and
unmaps all those memory allocations. Unfortunately, because of that large
amount of mappings, the driver isn't able to do that in the current timeout
(5 seconds). Therefore, increase the timeout significantly to 30 seconds
to avoid situation where the driver resets the device with active mappings,
which sometime can cause a kernel bug.
BTW, it doesn't mean we will spend all the 30 seconds because the reset
thread checks every one second if the unmap operation is done.
Signed-off-by: Oded Gabbay <oded.gabbay@...il.com>
---
drivers/misc/habanalabs/habanalabs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 199f7835ae46..6c54d0ba0a1d 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,7 +23,7 @@
#define HL_MMAP_CB_MASK (0x8000000000000000ull >> PAGE_SHIFT)
-#define HL_PENDING_RESET_PER_SEC 5
+#define HL_PENDING_RESET_PER_SEC 30
#define HL_DEVICE_TIMEOUT_USEC 1000000 /* 1 s */
--
2.17.1
Powered by blists - more mailing lists