Message-ID: <AM0PR02MB5523AC0E528B32AD791AC9A9B8CB0@AM0PR02MB5523.eurprd02.prod.outlook.com>
Date: Mon, 30 Mar 2020 06:14:07 +0000
From: Omer Shpigelman <oshpigelman@...ana.ai>
To: Oded Gabbay <oded.gabbay@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Tomer Tayar <ttayar@...ana.ai>
CC: "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>
Subject: RE: [PATCH 6/6] habanalabs: increase timeout during reset
On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <oded.gabbay@...il.com> wrote:
> During training, the DL framework (e.g. TensorFlow) performs hundreds of
> thousands of memory allocations and mappings. If the driver needs to
> perform a hard reset during training, it kills the application and unmaps all
> those memory allocations. Unfortunately, because of the large number of
> mappings, the driver can't finish unmapping within the current timeout (5 seconds).
> Therefore, increase the timeout to 30 seconds to avoid a situation
> where the driver resets the device with active mappings, which can sometimes
> cause a kernel bug.
>
> Note that the reset will not necessarily take the full 30 seconds, because the
> reset thread checks once per second whether the unmap operation is done.
>
> Signed-off-by: Oded Gabbay <oded.gabbay@...il.com>
Reviewed-by: Omer Shpigelman <oshpigelman@...ana.ai>
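
For readers following along, the wait described in the commit message (poll once per second, give up after the timeout) could be sketched roughly as below. This is only a user-space illustration, not the driver's code: wait_for_unmap, fake_unmap_done, fake_sleep and both macros are hypothetical names, and the sleep is stubbed out so the example runs instantly.

```c
#include <stdbool.h>

/* Hypothetical constants: 30 is the new timeout from the patch,
 * 1 is the polling interval mentioned in the commit message. */
#define RESET_WAIT_TIMEOUT_SEC  30
#define RESET_POLL_INTERVAL_SEC 1

typedef bool (*done_fn)(void *ctx);
typedef void (*sleep_fn)(unsigned int seconds);

/*
 * Poll done() once per interval until it reports completion or the
 * timeout has elapsed. Returns the number of seconds waited, or -1 if
 * the timeout expired while work was still pending.
 */
static int wait_for_unmap(done_fn done, void *ctx, sleep_fn sleep_sec,
                          unsigned int timeout_sec)
{
    unsigned int waited = 0;

    while (!done(ctx)) {
        if (waited >= timeout_sec)
            return -1;
        sleep_sec(RESET_POLL_INTERVAL_SEC);
        waited += RESET_POLL_INTERVAL_SEC;
    }
    return (int)waited;
}

/* Simulated workload: the unmap "finishes" after a fixed number of polls. */
struct fake_unmap {
    unsigned int polls_left;
};

static bool fake_unmap_done(void *ctx)
{
    struct fake_unmap *u = ctx;

    if (u->polls_left == 0)
        return true;
    u->polls_left--;
    return false;
}

/* Stub sleep so the sketch does not actually block. */
static void fake_sleep(unsigned int seconds)
{
    (void)seconds;
}
```

In this sketch, a workload that needs 7 polls completes under the 30-second limit after 7 simulated seconds, while the same workload under the old 5-second limit times out. That is the situation the patch is meant to avoid.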