Commit Graph

858 Commits

Author SHA1 Message Date
Rafael J. Wysocki
8d770de299 sched: idle: Make skipping governor callbacks more consistent
[ Upstream commit d557640e4c ]

If the cpuidle governor .select() callback is skipped because there
is only one idle state in the cpuidle driver, the .reflect() callback
should be skipped as well, at least for consistency (if not for
correctness), so do it.

Fixes: e5c9ffc6ae ("cpuidle: Skip governor when only one idle state is available")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25 11:08:29 +01:00
Aboorva Devarajan
8f6833d919 cpuidle: Skip governor when only one idle state is available
[ Upstream commit e5c9ffc6ae ]

On certain platforms (PowerNV systems without a power-mgt DT node),
cpuidle may register only a single idle state. In cases where that
single state is a polling state (state 0), the ladder governor may
incorrectly treat state 1 as the first usable state and pass an
out-of-bounds index. This can lead to a NULL enter callback being
invoked, ultimately resulting in a system crash.

[   13.342636] cpuidle-powernv : Only Snooze is available
[   13.351854] Faulting instruction address: 0x00000000
[   13.376489] NIP [0000000000000000] 0x0
[   13.378351] LR  [c000000001e01974] cpuidle_enter_state+0x2c4/0x668

Fix this by adding a bail-out in cpuidle_select() that returns state 0
directly when state_count <= 1, bypassing the governor and keeping the
tick running.

Fixes: dc2251bf98 ("cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol")
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260216185005.1131593-2-aboorvad@linux.ibm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:35 -05:00
Rafael J. Wysocki
101cc8e550 cpuidle: governors: menu: Always check timers with tick stopped
[ Upstream commit 80606f4eb8 ]

After commit 5484e31bbb ("cpuidle: menu: Skip tick_nohz_get_sleep_length()
call in some cases"), if the return value of get_typical_interval()
multiplied by NSEC_PER_USEC is not greater than RESIDENCY_THRESHOLD_NS,
the menu governor will skip computing the time till the closest timer.
If that happens when the tick has been stopped already, the selected
idle state may be too deep due to the subsequent check comparing
predicted_ns with TICK_NSEC and causing its value to be replaced with
the expected time till the closest timer, which is KTIME_MAX in that
case.  That will cause the deepest enabled idle state to be selected,
but the time till the closest timer very well may be shorter than the
target residency of that state, in which case a shallower state should
be used.

Address this by making menu_select() always compute the time till the
closest timer when the tick has been stopped.

Also move the predicted_ns check mentioned above into the branch in
which the time till the closest timer is determined because it only
needs to be done in that case.

Fixes: 5484e31bbb ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5959091.DvuYhMxLoT@rafael.j.wysocki
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:19:28 -05:00
Rafael J. Wysocki
6e606c2673 cpuidle: governors: teo: Drop misguided target residency check
commit a03b201180 upstream.

When the target residency of the current candidate idle state is
greater than the expected time till the closest timer (the sleep
length), it does not matter whether or not the tick has already been
stopped or if it is going to be stopped.  The closest timer will
trigger anyway at its due time, so if an idle state with target
residency above the sleep length is selected, energy will be wasted
and there may be excess latency.

Of course, if the closest timer were canceled before it could trigger,
a deeper idle state would be more suitable, but this is not expected
to happen (generally speaking, hrtimers are not expected to be
canceled as a rule).

Accordingly, the teo_state_ok() check done in that case causes energy to
be wasted more often than it allows any energy to be saved (if it allows
any energy to be saved at all), so drop it and let the governor use the
teo_find_shallower_state() return value as the new candidate idle state
index.

Fixes: 21d28cd2fa ("cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5955081.DvuYhMxLoT@rafael.j.wysocki
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:19 +01:00
Aboorva Devarajan
93735b3a72 cpuidle: menu: Use residency threshold in polling state override decisions
[ Upstream commit 07d8157012 ]

On virtualized PowerPC (pseries) systems, where only one polling state
(Snooze) and one deep state (CEDE) are available, selecting CEDE when
the predicted idle duration is less than the target residency of CEDE
state can hurt performance. In such cases, the entry/exit overhead of
CEDE outweighs the power savings, leading to unnecessary state
transitions and higher latency.

Menu governor currently contains a special-case rule that prioritizes
the first non-polling state over polling, even when its target residency
is much longer than the predicted idle duration. On PowerPC/pseries,
where the gap between the polling state (Snooze) and the first non-polling
state (CEDE) is large, this behavior causes performance regressions.

Refine that special case by adding an extra requirement: the first
non-polling state can only be chosen if its target residency is below
the defined RESIDENCY_THRESHOLD_NS. If this condition is not satisfied,
polling is allowed instead, avoiding suboptimal non-polling state
entries.

This change is limited to the single special-case rule for the first
non-polling state. The general non-polling state selection logic in the
menu governor remains unchanged.

Performance improvement observed with pgbench on PowerPC (pseries)
system:
+---------------------------+------------+------------+------------+
| Metric                    | Baseline   | Patched    | Change (%) |
+---------------------------+------------+------------+------------+
| Transactions/sec (TPS)    | 495,210    | 536,982    | +8.45%     |
| Avg latency (ms)          | 0.163      | 0.150      | -7.98%     |
+---------------------------+------------+------------+------------+

CPUIdle state usage:
+--------------+--------------+-------------+
| Metric       | Baseline     | Patched     |
+--------------+--------------+-------------+
| Total usage  | 12,735,820   | 13,918,442  |
| Above usage  | 11,401,520   | 1,598,210   |
| Below usage  | 20,145       | 702,395     |
+--------------+--------------+-------------+

Above/Total and Below/Total usage percentages:
+------------------------+-----------+---------+
| Metric                 | Baseline  | Patched |
+------------------------+-----------+---------+
| Above % (Above/Total)  | 89.56%    | 11.49%  |
| Below % (Below/Total)  | 0.16%     | 5.05%   |
| Total cpuidle miss (%) | 89.72%    | 16.54%  |
+------------------------+-----------+---------+

The results indicate that restricting CEDE selection to cases where
its residency matches the predicted idle time reduces mispredictions,
lowers unnecessary state transitions, and improves overall throughput.

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
[ rjw: Changelog edits, rebase ]
Link: https://patch.msgid.link/20251006013954.17972-1-aboorvad@linux.ibm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08 10:13:56 +01:00
Rafael J. Wysocki
c876a729a4 cpuidle: Fail cpuidle device registration if there is one already
[ Upstream commit 7b1b796117 ]

Refuse to register a cpuidle device if the given CPU has a cpuidle
device already and print a message regarding it.

Without this, an attempt to register a new cpuidle device without
unregistering the existing one leads to the removal of the existing
cpuidle device without removing its sysfs interface.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:06 -05:00
Rafael J. Wysocki
acbbd683b3 cpuidle: governors: menu: Select polling state in some more cases
[ Upstream commit db86f55bf8 ]

A throughput regression of 11% introduced by commit 779b1a1cb1 ("cpuidle:
governors: menu: Avoid selecting states with too much latency") has been
reported and it is related to the case when the menu governor checks if
selecting a proper idle state instead of a polling one makes sense.

In particular, it is questionable to do so if the exit latency of the
idle state in question exceeds the predicted idle duration, so add a
check for that, which is sufficient to make the reported regression go
away, and update the related code comment accordingly.

Fixes: 779b1a1cb1 ("cpuidle: governors: menu: Avoid selecting states with too much latency")
Closes: https://lore.kernel.org/linux-pm/004501dc43c9$ec8aa930$c59ffb90$@telus.net/
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/12786727.O9o76ZdvQC@rafael.j.wysocki
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13 15:34:01 -05:00
Rafael J. Wysocki
e3853abbba cpuidle: governors: menu: Rearrange main loop in menu_select()
[ Upstream commit 17224c1d25 ]

Reduce the indentation level in the main loop of menu_select() by
rearranging some checks and assignments in it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2389215.ElGaqSPkdT@rafael.j.wysocki
Stable-dep-of: db86f55bf8 ("cpuidle: governors: menu: Select polling state in some more cases")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13 15:34:01 -05:00
Rafael J. Wysocki
19f4e86b2c Revert "cpuidle: menu: Avoid discarding useful information"
commit 10fad40122 upstream.

It is reported that commit 85975daeaa ("cpuidle: menu: Avoid discarding
useful information") led to a performance regression on Intel Jasper Lake
systems because it reduced the time spent by CPUs in idle state C7 which
is correlated to the maximum frequency the CPUs can get to because of an
average running power limit [1].

Before that commit, get_typical_interval() would have returned UINT_MAX
whenever it had been unable to make a high-confidence prediction which
had led to selecting the deepest available idle state too often and
both power and performance had been inadequate as a result of that on
some systems.  However, this had not been a problem on systems with
relatively aggressive average running power limits, like the Jasper Lake
systems in question, because on those systems it was compensated by the
ability to run CPUs faster.

It was addressed by causing get_typical_interval() to return a number
based on the recent idle duration information available to it even if it
could not make a high-confidence prediction, but that clearly did not
take the possible correlation between idle power and available CPU
capacity into account.

For this reason, revert most of the changes made by commit 85975daeaa,
except for one cosmetic cleanup, and add a comment explaining the
rationale for returning UINT_MAX from get_typical_interval() when it
is unable to make a high-confidence prediction.

Fixes: 85975daeaa ("cpuidle: menu: Avoid discarding useful information")
Closes: https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/ [1]
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3663603.iIbC2pHGDl@rafael.j.wysocki
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29 14:08:56 +01:00
Johan Hovold
7dc3b7ac8e cpuidle: qcom-spm: fix device and OF node leaks at probe
[ Upstream commit cdc06f9126 ]

Make sure to drop the reference to the saw device taken by
of_find_device_by_node() after retrieving its driver data during
probe().

Also drop the reference to the CPU node sooner to avoid leaking it in
case there is no saw node or device.

Fixes: 60f3692b5f ("cpuidle: qcom_spm: Detach state machine from main SPM handling")
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15 12:00:00 +02:00
Rafael J. Wysocki
87dca4c6a6 cpuidle: governors: menu: Avoid selecting states with too much latency
[ Upstream commit 779b1a1cb1 ]

Occasionally, the exit latency of the idle state selected by the menu
governor may exceed the PM QoS CPU wakeup latency limit.  Namely, if the
scheduler tick has been stopped already and predicted_ns is greater than
the tick period length, the governor may return an idle state whose exit
latency exceeds latency_req because that decision is made before
checking the current idle state's exit latency.

For instance, say that there are 3 idle states, 0, 1, and 2.  For idle
states 0 and 1, the exit latency is equal to the target residency and
the values are 0 and 5 us, respectively.  State 2 is deeper and has the
exit latency and target residency of 200 us and 2 ms (which is greater
than the tick period length), respectively.

Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
limit is 20 us.  After the first two iterations of the main loop in
menu_select(), idx becomes 1 and in the third iteration of it the target
residency of the current state (state 2) is greater than predicted_ns.
State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
so the check on whether or not the tick has been stopped is done.  Say
that the tick has been stopped already and there are no imminent timers
(that is, delta_tick is greater than the target residency of state 2).
In that case, idx becomes 2 and it is returned immediately, but the exit
latency of state 2 exceeds the latency limit.

Address this issue by modifying the code to compare the exit latency of
the current idle state (idle state i) with the latency limit before
comparing its target residency with predicted_ns, which allows one
more exit_latency_ns check that becomes redundant to be dropped.

However, after the above change, latency_req cannot take the predicted_ns
value any more, which takes place after commit 38f83090f5 ("cpuidle:
menu: Remove iowait influence"), because it may cause a polling state
to be returned prematurely.

In the context of the previous example say that predicted_ns is 3000 and
the PM QoS latency limit is still 20 us.  Additionally, say that idle
state 0 is a polling one.  Moving the exit_latency_ns check before the
target_residency_ns one causes the loop to terminate in the second
iteration, before the target_residency_ns check, so idle state 0 will be
returned even though previously state 1 would be returned if there were
no imminent timers.

For this reason, remove the assignment of the predicted_ns value to
latency_req from the code.

Fixes: 5ef499cd57 ("cpuidle: menu: Handle stopped tick more aggressively")
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28 16:31:08 +02:00
Christian Loehle
b39b62075a cpuidle: menu: Remove iowait influence
[ Upstream commit 38f83090f5 ]

Remove CPU iowaiters influence on idle state selection.

Remove the menu notion of performance multiplier which increased with
the number of tasks that went to iowait sleep on this CPU and haven't
woken up yet.

Relying on iowait for cpuidle is problematic for a few reasons:

 1. There is no guarantee that an iowaiting task will wake up on the
    same CPU.

 2. The task being in iowait says nothing about the idle duration, we
    could be selecting shallower states for a long time.

 3. The task being in iowait doesn't always imply a performance hit
    with increased latency.

 4. If there is such a performance hit, the number of iowaiting tasks
    doesn't directly correlate.

 5. The definition of iowait altogether is vague at best, it is
    sprinkled across kernel code.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com
[ rjw: Minor edits in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Stable-dep-of: 779b1a1cb1 ("cpuidle: governors: menu: Avoid selecting states with too much latency")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28 16:31:08 +02:00
Rafael J. Wysocki
1e25d8051b cpuidle: governors: menu: Avoid using invalid recent intervals data
[ Upstream commit fa3fa55de0 ]

Marc has reported that commit 85975daeaa ("cpuidle: menu: Avoid
discarding useful information") caused the number of wakeup interrupts
to increase on an idle system [1], which was not expected to happen
after merely allowing shallower idle states to be selected by the
governor in some cases.

However, on the system in question, all of the idle states deeper than
WFI are rejected by the driver due to a firmware issue [2].  This causes
the governor to only consider the recent interval duriation data
corresponding to attempts to enter WFI that are successful and the
recent invervals table is filled with values lower than the scheduler
tick period.  Consequently, the governor predicts an idle duration
below the scheduler tick period length and avoids stopping the tick
more often which leads to the observed symptom.

Address it by modifying the governor to update the recent intervals
table also when entering the previously selected idle state fails, so
it knows that the short idle intervals might have been the minority
had the selected idle states been actually entered every time.

Fixes: 85975daeaa ("cpuidle: menu: Avoid discarding useful information")
Link: https://lore.kernel.org/linux-pm/86o6sv6n94.wl-maz@kernel.org/ [1]
Link: https://lore.kernel.org/linux-pm/7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com/ [2]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2793874.mvXUDI8C0e@rafael.j.wysocki
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20 18:30:18 +02:00
Daniel Lezcano
573f1e5902 cpuidle: psci: Fix cpuhotplug routine with PREEMPT_RT=y
commit 621a88dbfe upstream.

Currently cpu hotplug with the PREEMPT_RT option set in the kernel is
not supported because the underlying generic power domain functions
used in the cpu hotplug callbacks are incompatible from a lock point
of view. This situation prevents the suspend to idle to reach the
deepest idle state for the "cluster" as identified in the
undermentioned commit.

Use the compatible ones when PREEMPT_RT is enabled and remove the
boolean disabling the hotplug callbacks with this option.

With this change the platform can reach the deepest idle state
allowing at suspend time to consume less power.

Tested-on Lenovo T14s with the following script:

echo 0 > /sys/devices/system/cpu/cpu3/online
BEFORE=$(cat /sys/kernel/debug/pm_genpd/power-domain-cpu-cluster0/idle_states | grep S0 | awk '{ print $3 }') ;
rtcwake -s 1 -m mem;
AFTER=$(cat /sys/kernel/debug/pm_genpd/power-domain-cpu-cluster0/idle_states | grep S0 | awk '{ print $3 }');
if [ $BEFORE -lt $AFTER ]; then
    echo "Test successful"
else
    echo "Test failed"
fi
echo 1 > /sys/devices/system/cpu/cpu3/online

Fixes: 1c4b2932bd ("cpuidle: psci: Enable the hierarchical topology for s2idle on PREEMPT_RT")
Cc: Raghavendra Kakarla <quic_rkakarla@quicinc.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250709154728.733920-1-daniel.lezcano@linaro.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-24 08:56:25 +02:00
Rafael J. Wysocki
083383aba0 cpuidle: menu: Avoid discarding useful information
[ Upstream commit 85975daeaa ]

When giving up on making a high-confidence prediction,
get_typical_interval() always returns UINT_MAX which means that the
next idle interval prediction will be based entirely on the time till
the next timer.  However, the information represented by the most
recent intervals may not be completely useless in those cases.

Namely, the largest recent idle interval is an upper bound on the
recently observed idle duration, so it is reasonable to assume that
the next idle duration is unlikely to exceed it.  Moreover, this is
still true after eliminating the suspected outliers if the sample
set still under consideration is at least as large as 50% of the
maximum sample set size.

Accordingly, make get_typical_interval() return the current maximum
recent interval value in that case instead of UINT_MAX.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29 11:02:38 +02:00
Rafael J. Wysocki
1654578a3b cpuidle: teo: Update documentation after previous changes
[ Upstream commit 5a597a19a2 ]

After previous changes, the description of the teo governor in the
documentation comment does not match the code any more, so update it
as appropriate.

Fixes: 4499143980 ("cpuidle: teo: Remove recent intercepts metric")
Fixes: 2662342079 ("cpuidle: teo: Gather statistics regarding whether or not to stop the tick")
Fixes: 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6120335.lOV4Wx5bFT@rjwysocki.net
[ rjw: Corrected 3 typos found by Christian ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-23 17:22:49 +01:00
Javier Carrasco
0ea7a3e442 cpuidle: riscv-sbi: fix device node release in early exit of for_each_possible_cpu
[ Upstream commit 7e25044b80 ]

The 'np' device_node is initialized via of_cpu_device_node_get(), which
requires explicit calls to of_node_put() when it is no longer required
to avoid leaking the resource.

Instead of adding the missing calls to of_node_put() in all execution
paths, use the cleanup attribute for 'np' by means of the __free()
macro, which automatically calls of_node_put() when the variable goes
out of scope. Given that 'np' is only used within the
for_each_possible_cpu(), reduce its scope to release the nood after
every iteration of the loop.

Fixes: 6abf32f1d9 ("cpuidle: Add RISC-V SBI CPU idle driver")
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Link: https://lore.kernel.org/r/20241116-cpuidle-riscv-sbi-cleanup-v3-1-a3a46372ce08@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-17 13:40:44 +01:00
Linus Torvalds
200289db26 Merge tag 'pmdomain-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain updates from Ulf Hansson:
 "pmdomain core:
   - Add support for s2idle for CPU PM domains on PREEMPT_RT
   - Add device managed version of dev_pm_domain_attach|detach_list()
   - Improve layout of the debugfs summary table

  pmdomain providers:
   - amlogic: Remove obsolete vpu domain driver
   - bcm: raspberrypi: Add support for devices used as wakeup-sources
   - imx: Fixup clock handling for imx93 at driver remove
   - rockchip: Add gating support for RK3576
   - rockchip: Add support for RK3576 SoC
   - Some OF parsing simplifications
   - Some simplifications by using dev_err_probe() and guard()

  pmdomain consumers:
   - qcom/media/venus: Convert to the device managed APIs for PM domains

  cpuidle-psci:
   - Add support for s2idle/s2ram for the hierarchical topology on
     PREEMPT_RT
   - Some OF parsing simplifications"

* tag 'pmdomain-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm: (39 commits)
  pmdomain: core: Reduce debug summary table width
  pmdomain: core: Move mode_status_str()
  pmdomain: core: Fix "managed by" alignment in debug summary
  pmdomain: core: Harden inter-column space in debug summary
  pmdomain: rockchip: Add gating masks for rk3576
  pmdomain: rockchip: Add gating support
  pmdomain: rockchip: Simplify dropping OF node reference
  pmdomain: mediatek: make use of dev_err_cast_probe()
  pmdomain: imx93-pd: drop the context variable "init_off"
  pmdomain: imx93-pd: don't unprepare clocks on driver remove
  pmdomain: imx93-pd: replace dev_err() with dev_err_probe()
  pmdomain: qcom: rpmpd: Simplify locking with guard()
  pmdomain: qcom: rpmhpd: Simplify locking with guard()
  pmdomain: qcom: cpr: Simplify locking with guard()
  pmdomain: qcom: cpr: Simplify with dev_err_probe()
  pmdomain: imx: gpcv2: Simplify with scoped for each OF child loop
  pmdomain: imx: gpc: Simplify with scoped for each OF child loop
  pmdomain: rockchip: SimplUlf Hanssonify locking with guard()
  pmdomain: rockchip: Simplify with scoped for each OF child loop
  pmdomain: qcom-cpr: Use scope based of_node_put() to simplify code.
  ...
2024-09-18 10:49:45 +02:00
Dhruva Gole
6baacf9391 cpuidle: remove dead code from cpuidle_enter_state()
Checking for index < 0 is useless because the find_deepest_state()
function never really returns a negative value.

Since this hasn't been reported in over 9 years it's dead code, so
remove it.

Signed-off-by: Dhruva Gole <d-gole@ti.com>
Link: https://patch.msgid.link/20240821114250.1416421-1-d-gole@ti.com
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-22 21:03:52 +02:00
Krzysztof Kozlowski
d5c667e049 cpuidle: riscv-sbi: Simplify with scoped for each OF child loop
Use scoped for_each_child_of_node_scoped() when iterating over device
nodes to make code a bit simpler.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://patch.msgid.link/20240820094023.61155-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-20 21:46:43 +02:00
Krzysztof Kozlowski
a309320ddb cpuidle: riscv-sbi: Use scoped device node handling to fix missing of_node_put
Two return statements in sbi_cpuidle_dt_init_states() did not drop the
OF node reference count.  Solve the issue and simplify entire error
handling with scoped/cleanup.h.

Fixes: 6abf32f1d9 ("cpuidle: Add RISC-V SBI CPU idle driver")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://patch.msgid.link/20240820094023.61155-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-20 21:46:43 +02:00
Krzysztof Kozlowski
157519c026 cpuidle: dt_idle_genpd: Simplify with scoped for each OF child loop
Use scoped for_each_child_of_node_scoped() when iterating over device
nodes to make code a bit simpler.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240816150931.142208-4-krzysztof.kozlowski@linaro.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-08-20 11:30:39 +02:00
Krzysztof Kozlowski
7aa1204d08 cpuidle: psci: Simplify with scoped for each OF child loop
Use scoped for_each_child_of_node_scoped() when iterating over device
nodes to make code a bit simpler.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20240816150931.142208-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-08-20 11:30:39 +02:00
Ulf Hansson
1c4b2932bd cpuidle: psci: Enable the hierarchical topology for s2idle on PREEMPT_RT
To enable the domain-idle-states to be used during s2idle on a PREEMPT_RT
based configuration, let's allow the re-assignment of the ->enter_s2idle()
callback to psci_enter_s2idle_domain_idle_state().

Similar to s2ram, let's leave the support for CPU hotplug outside
PREEMPT_RT, as it's depending on using runtime PM. For s2idle, this means
that an offline CPU's PM domain will remain powered-on. In practise this
may lead to that a shallower idle-state than necessary gets selected, which
shouldn't be an issue (besides wasting power).

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com>  # qcm6490 with PREEMPT_RT set
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240527142557.321610-8-ulf.hansson@linaro.org
2024-08-05 13:25:45 +02:00
Ulf Hansson
4517b1c383 cpuidle: psci: Enable the hierarchical topology for s2ram on PREEMPT_RT
The hierarchical PM domain topology are currently disabled on a PREEMPT_RT
based configuration. As a first step to enable it to be used, let's try to
attach the CPU devices to their PM domains on PREEMPT_RT. In this way the
syscore ops becomes available, allowing the PM domain topology to be
managed during s2ram.

For the moment let's leave the support for CPU hotplug outside PREEMPT_RT,
as it's depending on using runtime PM. For s2ram, this isn't a problem as
all CPUs are managed via the syscore ops.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com>  # qcm6490 with PREEMPT_RT set
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240527142557.321610-7-ulf.hansson@linaro.org
2024-08-05 13:24:47 +02:00
Ulf Hansson
88bf68b766 cpuidle: psci: Drop redundant assignment of CPUIDLE_FLAG_RCU_IDLE
When using the hierarchical topology and PSCI OSI-mode we may end up
overriding the deepest idle-state's ->enter|enter_s2idle() callbacks, but
there is no point to also re-assign the CPUIDLE_FLAG_RCU_IDLE for the
idle-state in question, as that has already been set when parsing the
states from DT. See init_state_node().

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com>  # qcm6490 with PREEMPT_RT set
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240527142557.321610-6-ulf.hansson@linaro.org
2024-08-05 13:24:28 +02:00
Ulf Hansson
c7b45284ab cpuidle: psci-domain: Enable system-wide suspend on PREEMPT_RT
The domain-idle-states are currently disabled on a PREEMPT_RT based
configuration for the cpuidle-psci-domain. To enable them to be used for
system-wide suspend and in particular during s2idle, let's set the
GENPD_FLAG_RPM_ALWAYS_ON instead of GENPD_FLAG_ALWAYS_ON for the
corresponding genpd provider.

In this way, the runtime PM path remains disabled in genpd for its attached
devices, while powering-on/off the PM domain during system-wide suspend
becomes allowed.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com>  # qcm6490 with PREEMPT_RT set
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240527142557.321610-5-ulf.hansson@linaro.org
2024-08-05 13:23:34 +02:00
Christian Loehle
4b20b07ce7 cpuidle: teo: Don't count non-existent intercepts
When bailing out early, teo will not query the sleep length anymore
since commit 6da8f9ba5a ("cpuidle: teo:
Skip tick_nohz_get_sleep_length() call in some cases") with an
expected sleep_length_ns value of KTIME_MAX.
This lead to state0 accumulating lots of 'intercepts' because
the actually measured sleep length was < KTIME_MAX, so query the sleep
length instead for teo to recognize if it still is in an
intercept-likely scenario without alternating between the two modes.

Fundamentally we can only do one of the two:
 1. Skip sleep_length_ns query when we think intercept is likely.
 2. Have accurate data if sleep_length_ns is actually intercepted when
we believe it is currently intercepted.

Previously teo did the former while this patch chooses the latter as
the additional time it takes to query the sleep length was found to be
negligible and the variants of option 1 (count all unknowns as misses
or count all unknown as hits) had significant regressions (as misses
had lots of too shallow idle state selections and as hits had terrible
performance in intercept-heavy workloads).

Fixes: 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Link: https://patch.msgid.link/c40acf72-010f-4a8b-80e4-33f133ba266b@arm.com
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-01 18:58:55 +02:00
Christian Loehle
4499143980 cpuidle: teo: Remove recent intercepts metric
The logic for recent intercepts didn't work, there is an underflow
of the 'recent' value that can be observed during boot already, which
teo usually doesn't recover from, making the entire logic pointless.
Furthermore the recent intercepts also were never reset, thus not
actually being very 'recent'.

Having underflowed 'recent' values lead to teo always acting as if
we were in a scenario were expected sleep length based on timers is
too high and it therefore unnecessarily selecting shallower states.

Experiments show that the remaining 'intercept' logic is enough to
quickly react to scenarios in which teo cannot rely on the timer
expected sleep length.

See also here:
https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/

Fixes: 77577558f2 ("cpuidle: teo: Rework most recent idle duration values treatment")
Link: https://patch.msgid.link/20240628095955.34096-3-christian.loehle@arm.com
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 21:03:21 +02:00
Christian Loehle
0a2998fa48 Revert: "cpuidle: teo: Introduce util-awareness"
This reverts commit 9ce0f7c4bc.

Util-awareness was reported to be too aggressive in selecting shallower
states. Additionally a single threshold was found to not be suitable
for reasoning about sleep length as, for all practical purposes,
almost arbitrary sleep lengths are still possible for any load value.

Fixes: 9ce0f7c4bc ("cpuidle: teo: Introduce util-awareness")
Link: https://patch.msgid.link/20240628095955.34096-2-christian.loehle@arm.com
Reported-by: Qais Yousef <qyousef@layalina.io>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 21:03:21 +02:00
Atul Kumar Pant
63e6b02f57 cpuidle: governors: teo: Fix a typo in a comment
"terget" -> "target"

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Link: https://patch.msgid.link/20240616124025.16477-1-atulpant.linux@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-21 14:57:07 +02:00
Jeff Johnson
085aa9f58c cpuidle: haltpoll: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/cpuidle/cpuidle-haltpoll.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-14 18:15:38 +02:00
Christian Loehle
bf18311384 cpuidle: menu: Cleanup after loadavg removal
The performance impact of loadavg was removed with commit a7fe5190c0
("cpuidle: menu: Remove get_loadavg() from the performance multiplier")
With only iowait remaining the description can be simplified, remove
also the no longer needed includes.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-07 20:55:00 +02:00
Linus Torvalds
8b35a3bb33 Merge tag 'pmdomain-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain updates from Ulf Hansson:
 "pmdomain core:
   - Don't clear suspended_count at genpd_prepare()
   - Update the rejected/usage counters at system suspend too

  pmdomain providers:
   - ti-sci: Fix duplicate PD referrals
   - mediatek: Add MT8188 buck isolation setting
   - renesas: Add R-Car M3-W power-off delay quirk
   - renesas: Split R-Car M3-W and M3-W+ sub-drivers

  cpuidle-psci:
   - Update MAINTAINERS to set a git for DT IDLE PM DOMAIN/ARM PSCI PM
     DOMAIN
   - Update init level to core_initcall()
   - Drop superfluous wrappers psci_dt_attach|detach_cpu()"

* tag 'pmdomain-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
  pmdomain: ti-sci: Fix duplicate PD referrals
  pmdomain: core: Don't clear suspended_count at genpd_prepare()
  pmdomain: core: Update the rejected/usage counters at system suspend too
  pmdomain: renesas: rcar-sysc: Add R-Car M3-W power-off delay quirk
  pmdomain: renesas: rcar-sysc: Remove rcar_sysc_nullify() helper
  pmdomain: renesas: rcar-sysc: Split R-Car M3-W and M3-W+ sub-drivers
  pmdomain: renesas: rcar-sysc: Absorb rcar_sysc_ch into rcar_sysc_pd
  MAINTAINERS: Add a git for the DT IDLE PM DOMAIN
  MAINTAINERS: Add a git for the ARM PSCI PM DOMAIN
  cpuidle: psci: Update init level to core_initcall()
  cpuidle: psci: Drop superfluous wrappers psci_dt_attach|detach_cpu()
  pmdomain: mediatek: Add MT8188 buck isolation setting
  pmdomain: mediatek: scpsys: drop driver owner assignment
2024-05-16 08:50:32 -07:00
Linus Torvalds
0c181b1d97 Merge tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These are mostly cpufreq updates, including a significant intel-pstate
  driver update and several amd-pstate improvements plus some updates of
  ARM cpufreq drivers, general fixes and cleanups.

  Also included are changes related to system sleep, power capping
  updates adding support for a new platform and a new hardware feature
  (among other things), a Samsung exynos-asv driver update allowing it
  to change its Energy Model after adjusting voltage, minor cpuidle and
  devfreq updates and a small documentation cleanup.

  Specifics:

   - Rework the handling of disabled turbo in the intel_pstate driver
     and make it update the maximum CPU frequency consistently
     regardless of the reason on top of a number of cleanups (Rafael
     Wysocki)

   - Add missing checks for NULL .exit() cpufreq driver callback to the
     cpufreq core (Viresh Kumar)

   - Prevent pulicy->max from going above the frequency QoS maximum
     value when cpufreq_frequency_table_verify() is used (Xuewen Yan)

   - Prevent a negative CPU number or frequency value from being printed
     if they are really large (Joshua Yeong)

   - Update MAINTAINERS entry for amd-pstate to add two new
     submaintainers and a designated reviewer (Huang Rui)

   - Clean up the amd-pstate driver and update its documentation
     (Gautham Shenoy)

   - Fix the highest frequency issue in the amd-pstate driver which
     limits performance (Perry Yuan)

   - Enable CPPC v2 for certain processors in the family 17H, as
     requested by TR40 processor users who expect improved performance
     and lower system temperature (Perry Yuan)

   - Change latency and delay values to be read from platform firmware
     firstly for more accurate timing (Perry Yuan)

   - A new quirk is introduced for supporting amd-pstate on legacy
     processors which either lack CPPC capability, or only only have
     CPPC v2 capability (Perry Yuan)

   - Sun50i cpufreq: Add support for opp_supported_hw, H616 platform and
     general cleanups (Andre Przywara, Martin Botka, Brandon Cheo Fusi,
     Dan Carpenter, Viresh Kumar)

   - CPPC cpufreq: Fix possible null pointer dereference (Aleksandr
     Mishin)

   - Eliminate uses of of_node_put() from cpufreq (Javier Carrasco,
     Shivani Gupta)

   - brcmstb-avs: ISO C90 forbids mixed declarations (Portia Stephens)

   - mediatek cpufreq: Add support for MT7988A (Sam Shih)

   - cpufreq-qcom-hw: Add SM4450 compatibles in DT bindings (Tengfei
     Fan)

   - Fix struct cpudata::epp_cached kernel-doc in the intel_pstate
     cpufreq driver (Jeff Johnson)

   - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson)

   - Convert the cpuidle kirkwood driver to platform remove callback
     returning void (Yangtao Li)

   - Replace deprecated strncpy() with strscpy() in the hibernation core
     code (Justin Stitt)

   - Use %ps to simplify debug output in the core system-wide suspend
     and resume code (Len Brown)

   - Remove unnecessary else from device_init_wakeup() and make
     device_wakeup_disable() return void (Dhruva Gole)

   - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui)

   - Add support for ArrowLake-H platform to the Intel RAPL driver
     (Zhang Rui)

   - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li)

   - Make the Samsung exynos-asv driver update the Energy Model after
     adjusting voltage on top of some preliminary changes of the OPP and
     Enery Model generic code (Lukasz Luba)

   - Remove a reference to a function that has been dropped from the
     power management documentation (Bjorn Helgaas)

   - Convert the platfrom remove callback to .remove_new for the
     exyno-nocp, exynos-ppmu, mtk-cci-devfreq, sun8i-a33-mbus, and
     rk3399_dmc devfreq drivers (Uwe Kleine-König)

   - Use DEFINE_SIMPLE_PM_OPS for exyno-bus.c driver (Anand Moon)"

* tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (68 commits)
  PM / devfreq: exynos: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions
  PM / devfreq: rk3399_dmc: Convert to platform remove callback returning void
  PM / devfreq: sun8i-a33-mbus: Convert to platform remove callback returning void
  PM / devfreq: mtk-cci: Convert to platform remove callback returning void
  PM / devfreq: exynos-ppmu: Convert to platform remove callback returning void
  PM / devfreq: exynos-nocp: Convert to platform remove callback returning void
  cpufreq: amd-pstate: fix the highest frequency issue which limits performance
  cpufreq: intel_pstate: fix struct cpudata::epp_cached kernel-doc
  cpuidle: ladder: fix ladder_do_selection() kernel-doc
  powercap: intel_rapl_tpmi: Enable PMU support
  powercap: intel_rapl: Introduce APIs for PMU support
  PM: hibernate: replace deprecated strncpy() with strscpy()
  cpufreq: Fix up printing large CPU numbers and frequency values
  MAINTAINERS: cpufreq: amd-pstate: Add co-maintainers and reviewer
  cpufreq: amd-pstate: remove unused variable lowest_nonlinear_freq
  cpufreq: amd-pstate: fix code format problems
  cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missing
  cppc_acpi: print error message if CPPC is unsupported
  cpufreq: amd-pstate: get transition delay and latency value from ACPI tables
  cpufreq: amd-pstate: Bail out if min/max/nominal_freq is 0
  ...
2024-05-14 13:19:15 -07:00
Jeff Johnson
774459238f cpuidle: ladder: fix ladder_do_selection() kernel-doc
make C=1 reports:

warning: Function parameter or struct member 'dev' not described in 'ladder_do_selection'

Document 'dev' for this function.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-05-07 13:54:58 +02:00
Dawei Li
6f28c4a852 cpuidle: Avoid explicit cpumask allocation on stack
In general it's preferable to avoid placing cpumasks on the stack, as
for large values of NR_CPUS these can consume significant amounts of
stack space and make stack overflows more likely.

Use cpumask_first_and_and() and cpumask_weight_and() to avoid the need
for a temporary cpumask on the stack.

Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240416085454.3547175-8-dawei.li@shingroup.cn
2024-04-24 21:23:49 +02:00
Yangtao Li
f9059eb5d7 cpuidle: kirkwood: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20230712094014.41787-1-frank.li@vivo.com
2024-04-23 09:21:48 +02:00
Maulik Shah
02e2a4b363 cpuidle: psci: Update init level to core_initcall()
Clients like regulators, interconnects and clocks depend on rpmh-rsc to
vote on resources and rpmh-rsc depends on psci power-domains to complete
probe. All of them are in core_initcall().

Change psci domain init level to core_initcall() to avoid probe defer from
all of the above.

Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com>
Link: https://lore.kernel.org/r/20240217-init_level-v1-2-bde9e11f8317@quicinc.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-04-04 13:02:28 +02:00
Ulf Hansson
d72d7d6cf7 cpuidle: psci: Drop superfluous wrappers psci_dt_attach|detach_cpu()
To simplify the code, let's drop psci_dt_attach|detach_cpu() and use the
common dt_idle_attach|detach_cpu() directly instead.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20240228151139.2650258-1-ulf.hansson@linaro.org
2024-04-04 13:00:39 +02:00
Linus Torvalds
c150b809f7 Merge tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:

 - Support for various vector-accelerated crypto routines

 - Hibernation is now enabled for portable kernel builds

 - mmap_rnd_bits_max is larger on systems with larger VAs

 - Support for fast GUP

 - Support for membarrier-based instruction cache synchronization

 - Support for the Andes hart-level interrupt controller and PMU

 - Some cleanups around unaligned access speed probing and Kconfig
   settings

 - Support for ACPI LPI and CPPC

 - Various cleanus related to barriers

 - A handful of fixes

* tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits)
  riscv: Fix syscall wrapper for >word-size arguments
  crypto: riscv - add vector crypto accelerated AES-CBC-CTS
  crypto: riscv - parallelize AES-CBC decryption
  riscv: Only flush the mm icache when setting an exec pte
  riscv: Use kcalloc() instead of kzalloc()
  riscv/barrier: Add missing space after ','
  riscv/barrier: Consolidate fence definitions
  riscv/barrier: Define RISCV_FULL_BARRIER
  riscv/barrier: Define __{mb,rmb,wmb}
  RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ
  cpufreq: Move CPPC configs to common Kconfig and add RISC-V
  ACPI: RISC-V: Add CPPC driver
  ACPI: Enable ACPI_PROCESSOR for RISC-V
  ACPI: RISC-V: Add LPI driver
  cpuidle: RISC-V: Move few functions to arch/riscv
  riscv: Introduce set_compat_task() in asm/compat.h
  riscv: Introduce is_compat_thread() into compat.h
  riscv: add compile-time test into is_compat_task()
  riscv: Replace direct thread flag check with is_compat_task()
  riscv: Improve arch_get_mmap_end() macro
  ...
2024-03-22 10:41:13 -07:00
Sunil V L
6649182a38 cpuidle: RISC-V: Move few functions to arch/riscv
To support ACPI Low Power Idle (LPI), few functions are required which
are currently static functions in the DT based cpuidle driver. Hence,
move them under arch/riscv so that ACPI driver also can use them. Since
they are no longer static functions, append "riscv_" prefix to the
function name.

Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20240118062930.245937-2-sunilvl@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-03-19 17:51:38 -07:00
Linus Torvalds
902861e34c Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Yosry Ahmed
7dbbc8f57d x86/mm: delete unused cpu argument to leave_mm()
The argument is unused since commit 3d28ebceaf ("x86/mm: Rework lazy
TLB to track the actual loaded mm"), delete it.

Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:41 -08:00
C Cheng
88390dd788 cpuidle: Avoid potential overflow in integer multiplication
In detail:

In C language, when you perform a multiplication operation, if
both operands are of int type, the multiplication operation is
performed on the int type, and then the result is converted to
the target type. This means that if the product of int type
multiplication exceeds the range that int type can represent,
an overflow will occur even if you store the result in a
variable of int64_t type.

For a multiplication of two int values, it is better to use
mul_u32_u32() rather than s->exit_latency_ns = s->exit_latency *
NSEC_PER_USEC to avoid potential overflow happenning.

Signed-off-by: C Cheng <C.Cheng@mediatek.com>
Signed-off-by: Bo Ye <bo.ye@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
[ rjw: New subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-12 17:07:49 +01:00
Parshuram Sangle
496d0a6485 cpuidle: haltpoll: do not shrink guest poll_limit_ns below grow_start
While adjusting guest halt poll limit, grow block starts at
guest_halt_poll_grow_start without taking intermediate values.
Similar behavior is expected while shrinking the value. This
avoids short interval values which are really not required.

VCPU1 trace (guest_halt_poll_shrink equals 2):

VCPU1 grow 10000
VCPU1 shrink 5000
VCPU1 shrink 2500
VCPU1 shrink 1250
VCPU1 shrink 625
VCPU1 shrink 312
VCPU1 shrink 156
VCPU1 shrink 78
VCPU1 shrink 39
VCPU1 shrink 19
VCPU1 shrink 9
VCPU1 shrink 4

Similar change is done in KVM halt poll flow with below patch:
Link: https://lore.kernel.org/kvm/20211006133021.271905-3-sashal@kernel.org/

Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com>
Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com>
Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-12 17:02:51 +01:00
Borislav Petkov (AMD)
c8f5caec3d cpuidle: haltpoll: Do not enable interrupts when entering idle
The cpuidle drivers' ->enter() methods are supposed to be IRQ invariant:

  5e26aa9339 ("cpuidle/poll: Ensure IRQs stay disabled after cpuidle_state::enter() calls")
  bb7b112585 ("cpuidle: Move IRQ state validation")

Do that in the haltpoll driver too.

Fixes: 5e26aa9339 ("cpuidle/poll: Ensure IRQs stay disabled after cpuidle_state::enter() calls")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218245
Reported-by: <forza@tnonline.net>
Tested-by: <forza@tnonline.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-29 18:08:18 +01:00
Justin Stitt
b545465e22 cpuidle: dt: Replace deprecated strncpy() with strscpy()
`strncpy` is deprecated for use on NUL-terminated destination strings [1].

We should prefer more robust and less ambiguous string interfaces.

A suitable replacement is `strscpy` [2] due to the fact that it guarantees
NUL-termination on the destination buffer. With this, we can also drop
the now unnecessary `CPUIDLE_(NAME|DESC)_LEN - 1` pieces.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230913-strncpy-drivers-cpuidle-dt_idle_states-c-v1-1-d16a0dbe5658@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-29 14:48:31 -07:00
Linus Torvalds
4ad0a4c234 Merge tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:

 - Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the
   configured SMT state when hotplugging CPUs into the system

 - Combine final TLB flush and lazy TLB mm shootdown IPIs when using the
   Radix MMU to avoid a broadcast TLBIE flush on exit

 - Drop the exclusion between ptrace/perf watchpoints, and drop the now
   unused associated arch hooks

 - Add support for the "nohlt" command line option to disable CPU idle

 - Add support for -fpatchable-function-entry for ftrace, with GCC >=
   13.1

 - Rework memory block size determination, and support 256MB size on
   systems with GPUs that have hotpluggable memory

 - Various other small features and fixes

Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam
Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel
Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof
Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar,
Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor,
Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar
Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh
Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain,
Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai.

* tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits)
  macintosh/ams: linux/platform_device.h is needed
  powerpc/xmon: Reapply "Relax frame size for clang"
  powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached
  powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled
  powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
  powerpc/mpc5xxx: Add missing fwnode_handle_put()
  powerpc/config: Disable SLAB_DEBUG_ON in skiroot
  powerpc/pseries: Remove unused hcall tracing instruction
  powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n
  powerpc: dts: add missing space before {
  powerpc/eeh: Use pci_dev_id() to simplify the code
  powerpc/64s: Move CPU -mtune options into Kconfig
  powerpc/powermac: Fix unused function warning
  powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT
  powerpc: Don't include lppaca.h in paca.h
  powerpc/pseries: Move hcall_vphn() prototype into vphn.h
  powerpc/pseries: Move VPHN constants into vphn.h
  cxl: Drop unused detach_spa()
  powerpc: Drop zalloc_maybe_bootmem()
  powerpc/powernv: Use struct opal_prd_msg in more places
  ...
2023-08-31 12:43:10 -07:00
Rafael J. Wysocki
1201c50c1e Merge branches 'pm-cpuidle' and 'pm-cpufreq'
Merge CPU power management updates for 6.6-rc1:

 - Rework the menu and teo cpuidle governors to avoid calling
   tick_nohz_get_sleep_length(), which is likely to become quite
   expensive going forward, too often and improve making decisions
   regarding whether or not to stop the scheduler tick in the teo
   governor (Rafael Wysocki).

 - Improve the performance of cpufreq_stats_create_table() in some
   cases (Liao Chang).

 - Fix two issues in the amd-pstate-ut cpufreq driver (Swapnil Sapkal).

 - Use clamp() helper macro to improve the code readability in
   cpufreq_verify_within_limits() (Liao Chang).

 - Set stale CPU frequency to minimum in intel_pstate (Doug Smythies).

* pm-cpuidle:
  cpuidle: teo: Avoid unnecessary variable assignments
  cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases
  cpuidle: teo: Gather statistics regarding whether or not to stop the tick
  cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases
  cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront
  cpuidle: teo: Drop utilized from struct teo_cpu
  cpuidle: teo: Avoid stopping the tick unnecessarily when bailing out
  cpuidle: teo: Update idle duration estimate when choosing shallower state

* pm-cpufreq:
  cpufreq: amd-pstate-ut: Fix kernel panic when loading the driver
  cpufreq: amd-pstate-ut: Remove module parameter access
  cpufreq: Use clamp() helper macro to improve the code readability
  cpufreq: intel_pstate: set stale CPU frequency to minimum
  cpufreq: stats: Improve the performance of cpufreq_stats_create_table()
2023-08-25 21:15:56 +02:00