[ Upstream commit 5ec820fc28 ]
The 9FGV0841 has 8 outputs and registers 8 struct clk_hw, make sure
there are 8 slots for those newly registered clk_hw pointers, else
there is going to be out of bounds write when pointers 4..7 are set
into struct rs9_driver_data .clk_dif[4..7] field.
Since there are other structure members past this struct clk_hw
pointer array, writing to .clk_dif[4..7] fields corrupts both
the struct rs9_driver_data content and data around it, sometimes
without crashing the kernel. However, the kernel does surely
crash when the driver is unbound or during suspend.
Fix this, increase the struct clk_hw pointer array size to the
maximum output count of 9FGV0841, which is the biggest chip that
is supported by this driver.
Cc: stable@vger.kernel.org
Fixes: f0e5e18002 ("clk: rs9: Add support for 9FGV0841")
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Closes: https://lore.kernel.org/CAMuHMdVyQpOBT+Ho+mXY07fndFN9bKJdaaWGn91WOFnnYErLyg@mail.gmail.com
Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bb0c99e08a ]
The implementation of __READ_ONCE() under CONFIG_LTO=y incorrectly
qualified the fallback "once" access for types larger than 8 bytes,
which are not atomic but should still happen "once" and suppress common
compiler optimizations.
The cast `volatile typeof(__x)` applied the volatile qualifier to the
pointer type itself rather than the pointee. This created a volatile
pointer to a non-volatile type, which violated __READ_ONCE() semantics.
Fix this by casting to `volatile typeof(*__x) *`.
With a defconfig + LTO + debug options build, we see the following
functions to be affected:
xen_manage_runstate_time (884 -> 944 bytes)
xen_steal_clock (248 -> 340 bytes)
^-- use __READ_ONCE() to load vcpu_runstate_info structs
Fixes: e35123d83e ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y")
Cc: stable@vger.kernel.org
Reviewed-by: Boqun Feng <boqun@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a5338e365c ]
Commit 05703271c3 ("PCI/IOV: Add PCI rescan-remove locking when
enabling/disabling SR-IOV") tried to fix a race between the VF removal
inside sriov_del_vfs() and concurrent hot unplug by taking the PCI
rescan/remove lock in sriov_del_vfs(). Similarly the PCI rescan/remove lock
was also taken in sriov_add_vfs() to protect addition of VFs.
This approach however causes deadlock on trying to remove PFs with SR-IOV
enabled because PFs disable SR-IOV during removal and this removal happens
under the PCI rescan/remove lock. So the original fix had to be reverted.
Instead of taking the PCI rescan/remove lock in sriov_add_vfs() and
sriov_del_vfs(), fix the race that occurs with SR-IOV enable and disable vs
hotplug higher up in the callchain by taking the lock in
sriov_numvfs_store() before calling into the driver's sriov_configure()
callback.
Fixes: 05703271c3 ("PCI/IOV: Add PCI rescan-remove locking when enabling/disabling SR-IOV")
Reported-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251216-revert_sriov_lock-v3-2-dac4925a7621@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2fa119c0e5 ]
This reverts commit 05703271c3 ("PCI/IOV: Add PCI rescan-remove locking
when enabling/disabling SR-IOV"), which causes a deadlock by recursively
taking pci_rescan_remove_lock when sriov_del_vfs() is called as part of
pci_stop_and_remove_bus_device(). For example with the following sequence
of commands:
$ echo <NUM> > /sys/bus/pci/devices/<pf>/sriov_numvfs
$ echo 1 > /sys/bus/pci/devices/<pf>/remove
A trimmed trace of the deadlock on a mlx5 device is as below:
zsh/5715 is trying to acquire lock:
000002597926ef50 (pci_rescan_remove_lock){+.+.}-{3:3}, at: sriov_disable+0x34/0x140
but task is already holding lock:
000002597926ef50 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x24/0x80
...
Call Trace:
[<00000259778c4f90>] dump_stack_lvl+0xc0/0x110
[<00000259779c844e>] print_deadlock_bug+0x31e/0x330
[<00000259779c1908>] __lock_acquire+0x16c8/0x32f0
[<00000259779bffac>] lock_acquire+0x14c/0x350
[<00000259789643a6>] __mutex_lock_common+0xe6/0x1520
[<000002597896413c>] mutex_lock_nested+0x3c/0x50
[<00000259784a07e4>] sriov_disable+0x34/0x140
[<00000258f7d6dd80>] mlx5_sriov_disable+0x50/0x80 [mlx5_core]
[<00000258f7d5745e>] remove_one+0x5e/0xf0 [mlx5_core]
[<00000259784857fc>] pci_device_remove+0x3c/0xa0
[<000002597851012e>] device_release_driver_internal+0x18e/0x280
[<000002597847ae22>] pci_stop_bus_device+0x82/0xa0
[<000002597847afce>] pci_stop_and_remove_bus_device_locked+0x5e/0x80
[<00000259784972c2>] remove_store+0x72/0x90
[<0000025977e6661a>] kernfs_fop_write_iter+0x15a/0x200
[<0000025977d7241c>] vfs_write+0x24c/0x300
[<0000025977d72696>] ksys_write+0x86/0x110
[<000002597895b61c>] __do_syscall+0x14c/0x400
[<000002597896e0ee>] system_call+0x6e/0x90
This alone is not a complete fix as it restores the issue the cited commit
tried to solve. A new fix will be provided as a follow on.
Fixes: 05703271c3 ("PCI/IOV: Add PCI rescan-remove locking when enabling/disabling SR-IOV")
Reported-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Acked-by: Gerd Bayer <gbayer@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251216-revert_sriov_lock-v3-1-dac4925a7621@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b4af7d194d ]
Macronix serial NAND devices with continuous read support do not
clear the configuration register on soft reset and lack a hardware
reset pin. When continuous read is interrupted (e.g., during reboot),
the feature remains enabled at the device level.
With continuous read enabled, the OOB area becomes inaccessible and
all reads are instead directed to the main area. As a result, during
partition allocation as part of MTD device registration, the first two
bytes of the main area for the master block are read and indicate that
the block is bad. This process repeats for every subsequent block for
the partition.
All reads and writes that reference the BBT find no good blocks and
fail.
The only paths for recovery from this state are triggering the
continuous read feature by way of raw MTD reads or through a NAND
device power drain.
Disable continuous read explicitly during spinand probe to ensure
quiescent feature state.
Fixes: 631cfdd052 ("mtd: spi-nand: Add continuous read support")
Cc: stable@vger.kernel.org
Signed-off-by: David LaPorte <dalaport@amazon.com>
Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com>
Reviewed-by: Mikhail Kshevetskiy <mikhail.kshevetskiy@iopsys.eu>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b79b24f578 ]
The return value from itg3200_read_reg_s16() is stored in ret but
never checked. The function unconditionally returns IIO_VAL_INT,
ignoring potential I2C read failures. This causes garbage data to
be returned to userspace when the read fails, with no error reported.
Add proper error checking to propagate the failure to callers.
Fixes: 9dbf091da0 ("iio: gyro: Add itg3200")
Signed-off-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a537c0da16 ]
A perf build failure was reported by Thomas Voegtle on stable kernel
v6.6.120:
CC tests/sample-parsing.o
CC util/intel-pt-decoder/intel-pt-pkt-decoder.o
CC util/perf-regs-arch/perf_regs_csky.o
CC util/arm-spe-decoder/arm-spe-pkt-decoder.o
CC util/perf-regs-arch/perf_regs_loongarch.o
In file included from util/arm-spe-decoder/arm-spe-pkt-decoder.h:10,
from util/arm-spe-decoder/arm-spe-pkt-decoder.c:14:
/local/git/linux-stable-rc/tools/include/linux/bitfield.h: In function ‘le16_encode_bits’:
/local/git/linux-stable-rc/tools/include/linux/bitfield.h:166:31: error: implicit declaration of
function ‘cpu_to_le16’; did you mean ‘htole16’? [-Werror=implicit-function-declaration]
____MAKE_OP(le##size,u##size,cpu_to_le##size,le##size##_to_cpu) \
^~~~~~~~~
/local/git/linux-stable-rc/tools/include/linux/bitfield.h:149:9: note: in definition of macro
‘____MAKE_OP’
return to((v & field_mask(field)) * field_multiplier(field)); \
^~
/local/git/linux-stable-rc/tools/include/linux/bitfield.h:170:1: note: in expansion of macro
‘__MAKE_OP’
__MAKE_OP(16)
Fix this by including linux/kernel.h, which provides the required
definitions.
The issue was not found on the mainline due to the relevant C files have
included kernel.h. It'd be good to merge this change on mainline
as well for robustness.
Closes: https://lore.kernel.org/stable/3a44500b-d7c8-179f-61f6-e51cb50d3512@lio96.de/
Fixes: 64d86c03e1 ("perf arm-spe: Extend branch operations")
Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Reported-by: Thomas Voegtle <tv@lio96.de>
Signed-off-by: Leo Yan <leo.yan@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
To: Sasha Levin <sashal@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1f3b950492 ]
If a process wrote to POR_EL0 and then crashed before a context switch
happened, the coredump would contain an incorrect value for POR_EL0.
The value read in poe_get() would be a stale value left in thread.por_el0. Fix
this by reading the value from the system register, if the target thread is the
current thread.
This matches what gcs/fpsimd do.
Fixes: 1751981992 ("arm64/ptrace: add support for FEAT_POE")
Reported-by: David Spickett <david.spickett@arm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 218b16992a ]
"pg_init_delay_msecs X" can be passed as a feature in the multipath
table and is used to set m->pg_init_delay_msecs in parse_features().
However, alloc_multipath_stage2(), which is called after
parse_features(), resets m->pg_init_delay_msecs to its default value.
Instead, set m->pg_init_delay_msecs in alloc_multipath(), which is
called before parse_features(), to avoid overwriting a value passed in
by the table.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 806ae939c4 ]
If a send bundle has picked a bunch of buffers, then it needs to send
all of those to be complete. This may require poll arming, if the send
buffer ends up being full. Once a send bundle has been poll armed, no
further bundles should be attempted.
This allows a current bundle to complete even though it needs to go
through polling to do so, but it will not allow another bundle to be
started once that has happened. Ideally we would abort a bundle if it
was only partially sent, but as some parts of it already went out on the
wire, this obviously isn't feasible. Not continuing more bundle attempts
post encountering a full socket buffer is the second best thing.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1132e90840 ]
The COREPLL_PWRDN bit in the BLCG register must be set when the XUSB
device controller is powergated and cleared when it is unpowergated.
If this bit is not explicitly controlled, the core PLL may remain in an
incorrect power state across suspend/resume or ELPG transitions.
Therefore, update the driver to explicitly control this bit during
powergate transitions.
Fixes: 49db427232 ("usb: gadget: Add UDC driver for tegra XUSB device mode controller")
Cc: stable <stable@kernel.org>
Signed-off-by: Haotien Hsu <haotienh@nvidia.com>
Signed-off-by: Wayne Chang <waynec@nvidia.com>
Link: https://patch.msgid.link/20260123173121.4093902-1-waynec@nvidia.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5747435e0f ]
When CONFIG_PAGE_OWNER is enabled, freeing KASAN shadow pages during
vmalloc cleanup triggers expensive stack unwinding that acquires RCU read
locks. Processing a large purge_list without rescheduling can cause the
task to hold CPU for extended periods (10+ seconds), leading to RCU stalls
and potential OOM conditions.
The issue manifests in purge_vmap_node() -> kasan_release_vmalloc_node()
where iterating through hundreds or thousands of vmap_area entries and
freeing their associated shadow pages causes:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P6229/1:b..l
...
task:kworker/0:17 state:R running task stack:28840 pid:6229
...
kasan_release_vmalloc_node+0x1ba/0xad0 mm/vmalloc.c:2299
purge_vmap_node+0x1ba/0xad0 mm/vmalloc.c:2299
Each call to kasan_release_vmalloc() can free many pages, and with
page_owner tracking, each free triggers save_stack() which performs stack
unwinding under RCU read lock. Without yielding, this creates an
unbounded RCU critical section.
Add periodic cond_resched() calls within the loop to allow:
- RCU grace periods to complete
- Other tasks to run
- Scheduler to preempt when needed
The fix uses need_resched() for immediate response under load, with a
batch count of 32 as a guaranteed upper bound to prevent worst-case stalls
even under light load.
Link: https://lkml.kernel.org/r/20260112103612.627247-1-kartikey406@gmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+d8d4c31d40f868eaea30@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d8d4c31d40f868eaea30
Link: https://lore.kernel.org/all/20260112084723.622910-1-kartikey406@gmail.com/T/ [v1]
Suggested-by: Uladzislau Rezki <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 404d779466 ]
idmap lookups can time out while the cache is waiting for a userspace
upcall reply. In that case cache_check() returns -ETIMEDOUT to callers.
The nfsd_map_name_to_[ug]id functions currently proceed with attempting
to map the id to a kuid despite a potentially temporary failure to
perform the idmap lookup. This results in the code returning the error
NFSERR_BADOWNER which can cause client operations to return to userspace
with failure.
Fix this by returning the failure status before attempting kuid mapping.
This will return NFSERR_JUKEBOX on idmap lookup timeout so that clients
can retry the operation instead of aborting it.
Fixes: 65e10f6d0a ("nfsd: Convert idmap to use kuids and kgids")
Cc: stable@vger.kernel.org
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 55e03b8cbe ]
The free space and inode btree repair functions will rebuild both btrees
at the same time, after which it needs to evaluate both btrees to
confirm that the corruptions are gone.
However, Jiaming Zhang ran syzbot and produced a crash in the second
xchk_allocbt call. His root-cause analysis is as follows (with minor
corrections):
In xrep_revalidate_allocbt(), xchk_allocbt() is called twice (first
for BNOBT, second for CNTBT). The cause of this issue is that the
first call nullified the cursor required by the second call.
Let's first enter xrep_revalidate_allocbt() via following call chain:
xfs_file_ioctl() ->
xfs_ioc_scrubv_metadata() ->
xfs_scrub_metadata() ->
`sc->ops->repair_eval(sc)` ->
xrep_revalidate_allocbt()
xchk_allocbt() is called twice in this function. In the first call:
/* Note that sc->sm->sm_type is XFS_SCRUB_TYPE_BNOPT now */
xchk_allocbt() ->
xchk_btree() ->
`bs->scrub_rec(bs, recp)` ->
xchk_allocbt_rec() ->
xchk_allocbt_xref() ->
xchk_allocbt_xref_other()
since sm_type is XFS_SCRUB_TYPE_BNOBT, pur is set to &sc->sa.cnt_cur.
Kernel called xfs_alloc_get_rec() and returned -EFSCORRUPTED. Call
chain:
xfs_alloc_get_rec() ->
xfs_btree_get_rec() ->
xfs_btree_check_block() ->
(XFS_IS_CORRUPT || XFS_TEST_ERROR), the former is false and the latter
is true, return -EFSCORRUPTED. This should be caused by
ioctl$XFS_IOC_ERROR_INJECTION I guess.
Back to xchk_allocbt_xref_other(), after receiving -EFSCORRUPTED from
xfs_alloc_get_rec(), kernel called xchk_should_check_xref(). In this
function, *curpp (points to sc->sa.cnt_cur) is nullified.
Back to xrep_revalidate_allocbt(), since sc->sa.cnt_cur has been
nullified, it then triggered null-ptr-deref via xchk_allocbt() (second
call) -> xchk_btree().
So. The bnobt revalidation failed on a cross-reference attempt, so we
deleted the cntbt cursor, and then crashed when we tried to revalidate
the cntbt. Therefore, check for a null cntbt cursor before that
revalidation, and mark the repair incomplete. Also we can ignore the
second tree entirely if the first tree was rebuilt but is already
corrupt.
Apply the same fix to xrep_revalidate_iallocbt because it has the same
problem.
Cc: r772577952@gmail.com
Link: https://lore.kernel.org/linux-xfs/CANypQFYU5rRPkTy=iG5m1Lp4RWasSgrHXAh3p8YJojxV0X15dQ@mail.gmail.com/T/#m520c7835fad637eccf843c7936c200589427cc7e
Cc: <stable@vger.kernel.org> # v6.8
Fixes: dbfbf3bdf6 ("xfs: repair inode btrees")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jiaming Zhang <r772577952@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ca27313fb3 ]
Fix this function to return NULL instead of a mangled ENOMEM, then fix
the callers to actually check for a null pointer and return ENOMEM.
Most of the corrections here are for code merged between 6.2 and 6.10.
Cc: r772577952@gmail.com
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 1a5f6e08d4 ("xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jiaming Zhang <r772577952@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ba408d299a ]
Only call the xfarray and xfblob destructor if we have a valid pointer,
and be sure to null out that pointer afterwards. Note that this patch
fixes a large number of commits, most of which were merged between 6.9
and 6.10.
Cc: r772577952@gmail.com
Cc: <stable@vger.kernel.org> # v6.12
Fixes: ab97f4b1c0 ("xfs: repair AGI unlinked inode bucket lists")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jiaming Zhang <r772577952@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8754dd7639 ]
struct configfs_item_operations callbacks are defined like the following:
int (*allow_link)(struct config_item *src, struct config_item *target);
void (*drop_link)(struct config_item *src, struct config_item *target);
While pci_primary_epc_epf_link() and pci_secondary_epc_epf_link() specify
the parameters in the correct order, pci_primary_epc_epf_unlink() and
pci_secondary_epc_epf_unlink() specify the parameters in the wrong order,
leading to the below kernel crash when using the unlink command in
configfs:
Unable to handle kernel paging request at virtual address 0000000300000857
Mem abort info:
...
pc : string+0x54/0x14c
lr : vsnprintf+0x280/0x6e8
...
string+0x54/0x14c
vsnprintf+0x280/0x6e8
vprintk_default+0x38/0x4c
vprintk+0xc4/0xe0
pci_epf_unbind+0xdc/0x108
configfs_unlink+0xe0/0x208+0x44/0x74
vfs_unlink+0x120/0x29c
__arm64_sys_unlinkat+0x3c/0x90
invoke_syscall+0x48/0x134
do_el0_svc+0x1c/0x30prop.0+0xd0/0xf0
Fixes: e85a2d7837 ("PCI: endpoint: Add support in configfs to associate two EPCs with EPF")
Signed-off-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
[mani: cced stable, changed commit message as per https://lore.kernel.org/linux-pci/aV9joi3jF1R6ca02@ryzen]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260108062747.1870669-1-mmaddireddy@nvidia.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bd3138e891 ]
In debugging other problems with generic/753, it turns out that it's
possible for the system go to down in the middle of a remote xattr set
operation such that the leaf block entry is marked incomplete and
valueblk is set to zero. Make this no longer a failure.
Cc: <stable@vger.kernel.org> # v4.15
Fixes: 13791d3b83 ("xfs: scrub extended attribute leaf space")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6fed827044 ]
In the previous patches, we observed that it's possible for there to be
freemap entries with zero size but a nonzero base. This isn't an
inconsistency per se, but older kernels can get confused by this and
corrupt the block, leading to corruption.
If we see this, flag the xattr structure for optimization so that it
gets rebuilt.
Cc: <stable@vger.kernel.org> # v4.15
Fixes: 13791d3b83 ("xfs: scrub extended attribute leaf space")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3eefc0c2b7 ]
xfs/592 and xfs/794 both trip this assertion in the leaf block freemap
adjustment code after ~20 minutes of running on my test VMs:
ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t)
+ xfs_attr3_leaf_hdr_size(leaf));
Upon enabling quite a lot more debugging code, I narrowed this down to
fsstress trying to set a local extended attribute with namelen=3 and
valuelen=71. This results in an entry size of 80 bytes.
At the start of xfs_attr3_leaf_add_work, the freemap looks like this:
i 0 base 448 size 0 rhs 448 count 46
i 1 base 388 size 132 rhs 448 count 46
i 2 base 2120 size 4 rhs 448 count 46
firstused = 520
where "rhs" is the first byte past the end of the leaf entry array.
This is inconsistent -- the entries array ends at byte 448, but
freemap[1] says there's free space starting at byte 388!
By the end of the function, the freemap is in worse shape:
i 0 base 456 size 0 rhs 456 count 47
i 1 base 388 size 52 rhs 456 count 47
i 2 base 2120 size 4 rhs 456 count 47
firstused = 440
Important note: 388 is not aligned with the entries array element size
of 8 bytes.
Based on the incorrect freemap, the name area starts at byte 440, which
is below the end of the entries array! That's why the assertion
triggers and the filesystem shuts down.
How did we end up here? First, recall from the previous patch that the
freemap array in an xattr leaf block is not intended to be a
comprehensive map of all free space in the leaf block. In other words,
it's perfectly legal to have a leaf block with:
* 376 bytes in use by the entries array
* freemap[0] has [base = 376, size = 8]
* freemap[1] has [base = 388, size = 1500]
* the space between 376 and 388 is free, but the freemap stopped
tracking that some time ago
If we add one xattr, the entries array grows to 384 bytes, and
freemap[0] becomes [base = 384, size = 0]. So far, so good. But if we
add a second xattr, the entries array grows to 392 bytes, and freemap[0]
gets pushed up to [base = 392, size = 0]. This is bad, because
freemap[1] hasn't been updated, and now the entries array and the free
space claim the same space.
The fix here is to adjust all freemap entries so that none of them
collide with the entries array. Note that this fix relies on commit
2a2b5932db ("xfs: fix attr leaf header freemap.size underflow") and
the previous patch that resets zero length freemap entries to have
base = 0.
Cc: <stable@vger.kernel.org> # v2.6.12
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6f13c1d2a6 ]
Back in commit 2a2b5932db ("xfs: fix attr leaf header freemap.size
underflow"), Brian Foster observed that it's possible for a small
freemap at the end of the end of the xattr entries array to experience
a size underflow when subtracting the space consumed by an expansion of
the entries array. There are only three freemap entries, which means
that it is not a complete index of all free space in the leaf block.
This code can leave behind a zero-length freemap entry with a nonzero
base. Subsequent setxattr operations can increase the base up to the
point that it overlaps with another freemap entry. This isn't in and of
itself a problem because the code in _leaf_add that finds free space
ignores any freemap entry with zero size.
However, there's another bug in the freemap update code in _leaf_add,
which is that it fails to update a freemap entry that begins midway
through the xattr entry that was just appended to the array. That can
result in the freemap containing two entries with the same base but
different sizes (0 for the "pushed-up" entry, nonzero for the entry
that's actually tracking free space). A subsequent _leaf_add can then
allocate xattr namevalue entries on top of the entries array, leading to
data loss. But fixing that is for later.
For now, eliminate the possibility of confusion by zeroing out the base
of any freemap entry that has zero size. Because the freemap is not
intended to be a complete index of free space, a subsequent failure to
find any free space for a new xattr will trigger block compaction, which
regenerates the freemap.
It looks like this bug has been in the codebase for quite a long time.
Cc: <stable@vger.kernel.org> # v2.6.12
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c1b1401522 ]
The interrupt handler reads FIFO entries in batches of N samples, where N
is the number of scan elements that have been enabled. However, the sensor
fills the FIFO one sample at a time, even when more than one channel is
enabled. Therefore,the number of entries reported by the FIFO status
registers may not be a multiple of N; if this number is not a multiple, the
number of entries read from the FIFO may exceed the number of entries
actually present.
To fix the above issue, round down the number of FIFO entries read from the
status registers so that it is always a multiple of N.
Fixes: df36de1367 ("iio: accel: add ADXL380 driver")
Signed-off-by: Francesco Lavra <flavra@baylibre.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 24804ba508 ]
Since commit c6e126de43 ("of: Keep track of populated platform
devices") child devices will not be created by of_platform_populate()
if the devices had previously been deregistered individually so that the
OF_POPULATED flag is still set in the corresponding OF nodes.
Switch to using of_platform_depopulate() instead of open coding so that
the child devices are created if the driver is rebound.
Fixes: c6e126de43 ("of: Keep track of populated platform devices")
Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andreas Kemnade <andreas@kemnade.info>
Link: https://patch.msgid.link/20251219110714.23919-1-johan@kernel.org
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 10e60d8781 ]
Commit 4fc82cd907 ("iommu/vt-d: Don't issue ATS Invalidation
request when device is disconnected") relies on
pci_dev_is_disconnected() to skip ATS invalidation for
safely-removed devices, but it does not cover link-down caused
by faults, which can still hard-lock the system.
For example, if a VM fails to connect to the PCIe device,
"virsh destroy" is executed to release resources and isolate
the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
Although pci_device_is_present() is slower than
pci_dev_is_disconnected(), it still takes only ~70 µs on a
ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the
paths below, which are far less frequent than memory map/unmap.
1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup
The gain in system stability far outweighs the negligible cost
of using pci_device_is_present() instead of pci_dev_is_disconnected()
to decide when to skip ATS invalidation, especially under GDR
high-load conditions.
Fixes: 4fc82cd907 ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Link: https://lore.kernel.org/r/20251211035946.2071-3-guojinhui.liam@bytedance.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3a65ea768b ]
The calling convention of xfs_attr_leaf_hasname() is problematic, because
it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer
when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a
non-NULL buffer pointer for an already released buffer when
xfs_attr3_leaf_lookup_int fails with other error values.
Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so
that the buffer release code is done by each caller of
xfs_attr3_leaf_read.
Cc: stable@vger.kernel.org # v5.19+
Fixes: 07120f1abd ("xfs: Add xfs_has_attr and subroutines")
Reported-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f39854a3fb ]
I learned a few things this year: first, blk_status_to_errno can return
ENODATA for critical media errors; and second, the scrub code doesn't
mark data structures as corrupt on ENODATA or EIO.
Currently, scrub failing to capture these errors isn't all that
impactful -- the checking code will exit to userspace with EIO/ENODATA,
and xfs_scrub will log a complaint and exit with nonzero status. Most
people treat fsck tools failing as a sign that the fs is corrupt, but
online fsck should mark the metadata bad and keep moving.
Cc: stable@vger.kernel.org # v4.15
Fixes: 4700d22980 ("xfs: create helpers to record and deal with scrub problems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5488a29596 ]
When DRM_BUDDY_CONTIGUOUS_ALLOCATION is set, the requested size is
rounded up to the next power-of-two via roundup_pow_of_two().
Similarly, for non-contiguous allocations with large min_block_size,
the size is aligned up via round_up(). Both operations can produce a
rounded size that exceeds mm->size, which later triggers
BUG_ON(order > mm->max_order).
Example scenarios:
- 9G CONTIGUOUS allocation on 10G VRAM memory:
roundup_pow_of_two(9G) = 16G > 10G
- 9G allocation with 8G min_block_size on 10G VRAM memory:
round_up(9G, 8G) = 16G > 10G
Fix this by checking the rounded size against mm->size. For
non-contiguous or range allocations where size > mm->size is invalid,
return -EINVAL immediately. For contiguous allocations without range
restrictions, allow the request to fall through to the existing
__alloc_contig_try_harder() fallback.
This ensures invalid user input returns an error or uses the fallback
path instead of hitting BUG_ON.
v2: (Matt A)
- Add Fixes, Cc stable, and Closes tags for context
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6712
Fixes: 0a1844bf0b ("drm/buddy: Improve contiguous memory allocation")
Cc: <stable@vger.kernel.org> # v6.7+
Cc: Christian König <christian.koenig@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Link: https://patch.msgid.link/20260108113227.2101872-5-sanjay.kumar.yadav@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1aa1dd9cc5 ]
charge_reserved_hugetlb.sh mounts a hugetlbfs instance at /mnt/huge with a
fixed size of 256M. On systems with large base hugepages (e.g. 512MB),
this is smaller than a single hugepage, so the hugetlbfs mount ends up
with zero capacity (often visible as size=0 in mount output).
As a result, write_to_hugetlbfs fails with ENOMEM and the test can hang
waiting for progress.
=== Error log ===
# uname -r
6.12.0-xxx.el10.aarch64+64k
#./charge_reserved_hugetlb.sh -cgroup-v2
# -----------------------------------------
...
# nr hugepages = 10
# writing cgroup limit: 5368709120
# writing reseravation limit: 5368709120
...
# write_to_hugetlbfs: Error mapping the file: Cannot allocate memory
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
...
# mount |grep /mnt/huge
none on /mnt/huge type hugetlbfs (rw,relatime,seclabel,pagesize=512M,size=0)
# grep -i huge /proc/meminfo
...
HugePages_Total: 10
HugePages_Free: 10
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 524288 kB
Hugetlb: 5242880 kB
Drop the mount args with 'size=256M', so the filesystem capacity is sufficient
regardless of HugeTLB page size.
Link: https://lkml.kernel.org/r/20251221122639.3168038-3-liwang@redhat.com
Fixes: 29750f71a9 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests")
Signed-off-by: Li Wang <liwang@redhat.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9c9828d3ea ]
Since commit cc638f329e ("mm, thp: tweak reclaim/compaction effort of
local-only and all-node allocations"), THP page fault allocations have
settled on the following scheme (from the commit log):
1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
allocation from any node with effort determined by global defrag setting
and VMA madvise
3. fallback to base pages on any node
Recent customer reports however revealed we have a gap in step 1 above.
What we have seen is excessive reclaim due to THP page faults on a NUMA
node that's close to its high watermark, while other nodes have plenty of
free memory.
The problem with step 1 is that it promises no reclaim after the
compaction attempt, however reclaim is only avoided for certain compaction
outcomes (deferred, or skipped due to insufficient free base pages), and
not e.g. when compaction is actually performed but fails (we did see
compact_fail vmstat counter increasing).
THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
which is not the intention.
Thus add a check for __GFP_THISNODE that corresponds to this exact
situation and prevents continuing with reclaim/compaction once the initial
compaction attempt isn't successful in allocating the page.
Note that commit cc638f329e has not introduced this over-reclaim
possibility; it appears to exist in some form since commit 2f0799a0ff
("mm, thp: restore node-local hugepage allocations"). Followup commits
b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when compaction may
not succeed") and cc638f329e have moved in the right direction, but left
the abovementioned gap.
Link: https://lkml.kernel.org/r/20251219-costly-noretry-thisnode-fix-v1-1-e1085a4a0c34@suse.cz
Fixes: 2f0799a0ff ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 63c072e293 ]
On SM8250 (IRIS2) with firmware older than 1.0.087, the firmware could
not handle a dummy device address for EOS buffers, so a NULL device
address is sent instead. The existing check used IS_V6() alongside a
firmware version gate:
if (IS_V6(core) && is_fw_rev_or_older(core, 1, 0, 87))
fdata.device_addr = 0;
else
fdata.device_addr = 0xdeadb000;
However, SC7280 which is also V6, uses a firmware string of the form
"1.0.<commit-hash>", which the version parser translates to 1.0.0. This
unintentionally satisfies the `is_fw_rev_or_older(..., 1, 0, 87)`
condition on SC7280. Combined with IS_V6() matching there as well, the
quirk is incorrectly applied to SC7280, causing VP9 decode failures.
Constrain the check to IRIS2 (SM8250) only, which is the only platform
that needed this quirk, by replacing IS_V6() with IS_IRIS2(). This
restores correct behavior on SC7280 (no forced NULL EOS buffer address).
Fixes: 47f867cb1b ("media: venus: fix EOS handling in decoder stop command")
Cc: stable@vger.kernel.org
Reported-by: Mecid <mecid@mecomediagroup.de>
Closes: https://github.com/qualcomm-linux/kernel-topics/issues/222
Co-developed-by: Renjiang Han <renjiang.han@oss.qualcomm.com>
Signed-off-by: Renjiang Han <renjiang.han@oss.qualcomm.com>
Signed-off-by: Dikshita Agarwal <dikshita.agarwal@oss.qualcomm.com>
Tested-by: Renjiang Han <renjiang.han@oss.qualcomm.com>
Signed-off-by: Bryan O'Donoghue <bod@kernel.org>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 93ecd6ee95 ]
When hfi_session_flush is issued, all queued buffers are returned to
the V4L2 driver. Some of these buffers are not processed and have
bytesused = 0. Currently, the driver marks such buffers as error even
during drain operations, which can incorrectly flag EOS buffers.
Only capture buffers with zero payload (and not EOS) should be marked
with VB2_BUF_STATE_ERROR. The check is performed inside the non-EOS
branch to ensure correct handling.
Fixes: 51df3c81ba ("media: venus: vdec: Mark flushed buffers with error state")
Signed-off-by: Renjiang Han <renjiang.han@oss.qualcomm.com>
Reviewed-by: Vikash Garodia <vikash.garodia@oss.qualcomm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bod@kernel.org>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 83c10e8dd4 ]
The "unstriped" device-mapper target incorrectly calculates the sector
offset on the mapped device when the target's origin is not zero.
Take for example this hypothetical concatenation of the members of a
two-disk RAID0:
linearized: 0 2097152 unstriped 2 128 0 /dev/md/raid0 0
linearized: 2097152 2097152 unstriped 2 128 1 /dev/md/raid0 0
The intent in this example is to create a single device named
/dev/mapper/linearized that comprises all of the chunks of the first disk
of the RAID0 set, followed by all of the chunks of the second disk of the
RAID0 set.
This fails because dm-unstripe.c's map_to_core function does its
computations based on the sector number within the mapper device rather
than the sector number within the target. The bug turns invisible when
the target's origin is at sector zero of the mapper device, as is the
common case. In the example above, however, what happens is that the
first half of the mapper device gets mapped correctly to the first disk
of the RAID0, but the second half of the mapper device gets mapped past
the end of the RAID0 device, and accesses to any of those sectors return
errors.
Signed-off-by: Matt Whitlock <kernel@mattwhitlock.name>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 18a5bf2705 ("dm: add unstriped target")
Signed-off-by: Sasha Levin <sashal@kernel.org>