linux-kernel - Re: [PATCH v4] PM: Support aborting sleep during filesystem sync

Message-Id: <20251020021228.2336745-1-tuhaowen@uniontech.com>
Date: Mon, 20 Oct 2025 10:12:28 +0800
From: tuhaowen <tuhaowen@...ontech.com>
To: rafael@...nel.org
Cc: dakr@...nel.org,
	gregkh@...uxfoundation.org,
	kernel-team@...roid.com,
	lenb@...nel.org,
	linux-kernel@...r.kernel.org,
	linux-pm@...r.kernel.org,
	pavel@...nel.org,
	saravanak@...gle.com,
	wusamuel@...gle.com
Subject: Re: [PATCH v4] PM: Support aborting sleep during filesystem sync

Hi Rafael,

Thank you for your attention to this matter. I'd like to clarify the 
difference between our approach and the Google team's solution, as they 
address fundamentally different use cases and environments.

On Mon, Oct 13, 2025 at 8:02 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
> > No, it's different. We don't mind a long filesystem sync if we don't
> > have a need to abort a suspend. If it takes 25 seconds to sync the
> > filesystem but there's no need to abort it, that's totally fine. So,
> > this patch is just about allowing abort to happen without waiting for
> > file system sync to finish.

Saravana is correct that our problems are different. Let me explain the 
key distinction:

**Google's approach (Mobile/Android focus)**:
- Problem: Unnecessary wake-ups during sync operations waste battery
- Solution: Abort sync only when wakeup events occur (user interaction)
- Philosophy: Wait indefinitely for sync if no user action is required
- Use case: Mobile devices where users expect to press the power button to wake

**Our approach (Desktop/PC focus)**:
- Problem: Indefinite sync hangs leave users with an unresponsive black screen
- Solution: Proactive timeout to prevent the system from appearing frozen
- Philosophy: Provide user feedback and system recovery within a reasonable time
- Use case: Desktops/laptops where users expect an immediate system response

> > The other patch's requirement is to always abort if suspend takes 25
> > seconds (or whatever the timeout is). IIRC, in his case, it's because
> > of a bad disk or say a USB disk getting unplugged. I'm not convinced a
> > suspend timeout is the right thing to do, but I'm not going to nack
> > it. But to implement his requirement, he can put a patch on top of
> > ours where he sets a timer and then aborts suspends if it fires.

The key difference is **when** we need to abort:
- Google: Abort when user wants to wake up (reactive)
- UnionTech: Abort when sync becomes pathologically slow (proactive)

For desktop users, a 25-second black screen with no feedback creates the 
impression of a system freeze, especially when caused by removed USB 
devices or failing storage. Users cannot distinguish between "system is 
syncing" and "system has crashed" without feedback.

**Question about integration**:

Since both approaches serve legitimate but different needs, could we 
implement a unified solution that supports both mechanisms? For example:

1. **Combined approach**: Implement both wakeup-based abort (Google's patch) 
   and timeout-based abort (our patch) in the same framework

2. **Configuration via sysfs**: Add a node to control the behavior:
   - `/sys/power/sync_abort_mode`:
     - "wakeup-only": Use Google's approach (abort only on wakeup events)
     - "timeout": Use our approach (abort on timeout)
     - "both": Use both mechanisms (abort on either condition)

3. **Default behavior**: Could default to "wakeup-only" for mobile/embedded 
   systems and "timeout" for desktop systems, or let distributions choose

This would allow different systems to choose appropriate behavior based 
on their needs.
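
To make the proposed modes concrete, here is a minimal userspace sketch of how 
the three strings could map to an abort decision. It is only an illustration: 
`enum sync_abort_mode`, `parse_sync_abort_mode()` and `sync_should_abort()` are 
hypothetical names that appear in neither patch, and a real sysfs store handler 
would more likely compare strings with the kernel's `sysfs_streq()`.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical modes mirroring the strings proposed for
 * /sys/power/sync_abort_mode. None of these names exist in either patch. */
enum sync_abort_mode {
	SYNC_ABORT_WAKEUP_ONLY,
	SYNC_ABORT_TIMEOUT,
	SYNC_ABORT_BOTH,
	SYNC_ABORT_INVALID,	/* also serves as the number of valid modes */
};

static const char *const sync_abort_mode_names[] = {
	[SYNC_ABORT_WAKEUP_ONLY] = "wakeup-only",
	[SYNC_ABORT_TIMEOUT]     = "timeout",
	[SYNC_ABORT_BOTH]        = "both",
};

/* Parse a string as written to the sysfs node; one trailing newline
 * (as produced by "echo both > ...") is tolerated. */
enum sync_abort_mode parse_sync_abort_mode(const char *buf)
{
	size_t len = strcspn(buf, "\n");

	for (int i = 0; i < SYNC_ABORT_INVALID; i++) {
		const char *name = sync_abort_mode_names[i];

		if (strlen(name) == len && strncmp(buf, name, len) == 0)
			return (enum sync_abort_mode)i;
	}
	return SYNC_ABORT_INVALID;
}

/* Decide whether an in-progress filesystem sync should be abandoned. */
bool sync_should_abort(enum sync_abort_mode mode, bool wakeup_pending,
		       bool timed_out)
{
	switch (mode) {
	case SYNC_ABORT_WAKEUP_ONLY:
		return wakeup_pending;
	case SYNC_ABORT_TIMEOUT:
		return timed_out;
	case SYNC_ABORT_BOTH:
		return wakeup_pending || timed_out;
	default:
		return false;	/* unknown mode: never abort */
	}
}
```

In this sketch an unrecognized string maps to `SYNC_ABORT_INVALID`, which a 
store handler could reject with an error instead of changing the mode.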

Would this unified approach be acceptable? We're happy to work on 
implementation details with the Google team to ensure both use cases 
are properly addressed.

**Additional concern about integration timing**:

I noticed that Samuel Wu previously mentioned in his response to you 
(Sep 30, 2025) that our approaches could be "decoupled" and that I could 
"build changes on top of theirs." However, after reviewing their v4 patch 
implementation, I'm concerned that if their approach lands first, it may 
make our timeout-based solution significantly more difficult to integrate.

Their current implementation:
- Uses workqueue + completion for sync operations
- Introduces pm_sleep_fs_sync() as the main interface
- Adds complex state management for back-to-back sleep attempts

This architecture makes it challenging to integrate our timeout approach 
and add mode switching functionality. If their patch lands first, adding 
our timeout mechanism would require:
- Modifying their workqueue-based sync mechanism to support timeout
- Adding logic to coordinate between workqueue completion and timeout
- Implementing mode switching between wakeup-abort and timeout-abort
- Ensuring proper interaction between the two abort mechanisms

The main challenge is: how do we add timeout functionality to their 
workqueue + completion design? And how do we implement clean switching 
between "abort on wakeup events" mode versus "abort on timeout" mode? 
Their current design focuses solely on wakeup-based abort, so retrofitting 
timeout support and mode selection would require significant changes to 
their implementation.
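
As a starting point for that discussion, the following userspace model shows 
one way a single wait loop could serve both abort conditions. It is a sketch 
under stated assumptions, not a description of the v4 patch: `wait_for_sync()`, 
`struct sync_wait_ops` and the simulated environment are hypothetical, and the 
loop polls in slices where a kernel version might instead call 
`wait_for_completion_timeout()` repeatedly and check `pm_wakeup_pending()` 
between slices.

```c
#include <stdbool.h>
#include <stdint.h>

enum sync_wait_result {
	SYNC_WAIT_DONE,		/* sync finished; suspend can continue */
	SYNC_WAIT_WAKEUP,	/* aborted because a wakeup event arrived */
	SYNC_WAIT_TIMEOUT,	/* aborted because the sync took too long */
};

/* Hypothetical hooks standing in for the completion, the wakeup check
 * and the clock. */
struct sync_wait_ops {
	bool (*sync_done)(void *ctx);
	bool (*wakeup_pending)(void *ctx);
	uint64_t (*now_ms)(void *ctx);
	void (*sleep_ms)(void *ctx, uint64_t ms);
};

/* Wait for the sync in slices. timeout_ms == 0 disables the timeout;
 * abort_on_wakeup == false ignores wakeup events. */
enum sync_wait_result wait_for_sync(const struct sync_wait_ops *ops, void *ctx,
				    bool abort_on_wakeup, uint64_t timeout_ms,
				    uint64_t slice_ms)
{
	uint64_t start = ops->now_ms(ctx);

	for (;;) {
		if (ops->sync_done(ctx))
			return SYNC_WAIT_DONE;
		if (abort_on_wakeup && ops->wakeup_pending(ctx))
			return SYNC_WAIT_WAKEUP;
		if (timeout_ms != 0 && ops->now_ms(ctx) - start >= timeout_ms)
			return SYNC_WAIT_TIMEOUT;
		ops->sleep_ms(ctx, slice_ms);
	}
}

/* Simulated environment with a fake clock, so the model is deterministic. */
struct sim {
	uint64_t now_ms;
	uint64_t sync_done_at_ms;	/* time at which the sync completes */
	uint64_t wakeup_at_ms;		/* UINT64_MAX means "never" */
};

static bool sim_sync_done(void *ctx)
{
	const struct sim *s = ctx;
	return s->now_ms >= s->sync_done_at_ms;
}

static bool sim_wakeup_pending(void *ctx)
{
	const struct sim *s = ctx;
	return s->now_ms >= s->wakeup_at_ms;
}

static uint64_t sim_now_ms(void *ctx)
{
	const struct sim *s = ctx;
	return s->now_ms;
}

static void sim_sleep_ms(void *ctx, uint64_t ms)
{
	struct sim *s = ctx;
	s->now_ms += ms;	/* advance the fake clock instead of sleeping */
}

/* Run one simulated suspend attempt, polling every 100 ms. */
enum sync_wait_result simulate(uint64_t sync_done_at_ms, uint64_t wakeup_at_ms,
			       bool abort_on_wakeup, uint64_t timeout_ms)
{
	struct sim s = { 0, sync_done_at_ms, wakeup_at_ms };
	const struct sync_wait_ops ops = {
		sim_sync_done, sim_wakeup_pending, sim_now_ms, sim_sleep_ms,
	};

	return wait_for_sync(&ops, &s, abort_on_wakeup, timeout_ms, 100);
}
```

For example, `simulate(60000, 1000, true, 25000)` models a 60-second sync with 
a wakeup event after 1 second and both mechanisms enabled, and returns 
`SYNC_WAIT_WAKEUP`; with `abort_on_wakeup` set to false the same call returns 
`SYNC_WAIT_TIMEOUT` once 25 seconds of simulated time have passed. Whether the 
real workqueue + completion code can be restructured this way is exactly the 
open question.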

Would it be possible to consider both approaches simultaneously to ensure 
a clean integration path? This might result in a better unified solution 
than trying to retrofit timeout functionality into their workqueue-based 
implementation.

Best regards,
Haowen Tu