mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2026-04-29 12:28:27 +02:00
08a10002bed151f6df201715adb80c1c5e7fe7ca
25383 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
08a10002be |
Merge branch 'pm-cpufreq-sched'
* pm-cpufreq-sched: cpufreq: schedutil: Always process remote callback with slow switching cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily cpufreq: Return 0 from ->fast_switch() on errors cpufreq: Simplify cpufreq_can_do_remote_dvfs() cpufreq: Process remote callbacks from any CPU if the platform permits sched: cpufreq: Allow remote cpufreq callbacks cpufreq: schedutil: Use unsigned int for iowait boost cpufreq: schedutil: Make iowait boost more energy efficient |
||
|
|
bd87c8fb9d |
Merge branch 'pm-cpufreq'
* pm-cpufreq: (33 commits) cpufreq: imx6q: Fix imx6sx low frequency support cpufreq: speedstep-lib: make several arrays static, makes code smaller cpufreq: ti: Fix 'of_node_put' being called twice in error handling path cpufreq: dt-platdev: Drop few entries from whitelist cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2 ARM: ux500: don't select CPUFREQ_DT cpufreq: Convert to using %pOF instead of full_name cpufreq: Cap the default transition delay value to 10 ms cpufreq: dbx500: Delete obsolete driver mfd: db8500-prcmu: Get rid of cpufreq dependency cpufreq: enable the DT cpufreq driver on the Ux500 cpufreq: Loongson2: constify platform_device_id cpufreq: dt: Add r8a7796 support to to use generic cpufreq driver cpufreq: remove setting of policy->cpu in policy->cpus during init cpufreq: mediatek: add support of cpufreq to MT7622 SoC cpufreq: mediatek: add cleanups with the more generic naming cpufreq: rcar: Add support for R8A7795 SoC cpufreq: dt: Add rk3328 compatible to use generic cpufreq driver cpufreq: s5pv210: add missing of_node_put() cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency ... |
||
|
|
45a7953c83 |
Merge branches 'pm-core', 'pm-opp', 'pm-domains', 'pm-cpu' and 'pm-avs'
* pm-core: PM / wakeup: Set power.can_wakeup if wakeup_sysfs_add() fails * pm-opp: PM / OPP: Fix get sharing CPUs when hotplug is used PM / OPP: OF: Use pr_debug() instead of pr_err() while adding OPP table * pm-domains: PM / Domains: Convert to using %pOF instead of full_name PM / Domains: Extend generic power domain debugfs PM / Domains: Add time accounting to various genpd states * pm-cpu: PM / CPU: replace raw_notifier with atomic_notifier * pm-avs: PM / AVS: rockchip-io: add io selectors and supplies for RV1108 |
||
|
|
3b62dc6c38 |
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner: "A single fix for a thinko in the raw timekeeper update which causes clock MONOTONIC_RAW to run with erratically increased frequency" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Fix ktime_get_raw() incorrect base accumulation |
||
|
|
e92d51aff5 |
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner: - Prevent a potential inconistency in the perf user space access which might lead to evading sanity checks. - Prevent perf recording function trace entries twice * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/ftrace: Fix double traces of perf on ftrace:function perf/core: Fix potential double-fetch bug |
||
|
|
8cf9f2a29f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix handling of pinned BPF map nodes in hash of maps, from Daniel
Borkmann.
2) IPSEC ESP error paths leak memory, from Steffen Klassert.
3) We need an RCU grace period before freeing fib6_node objects, from
Wei Wang.
4) Must check skb_put_padto() return value in HSR driver, from FLorian
Fainelli.
5) Fix oops on PHY probe failure in ftgmac100 driver, from Andrew
Jeffery.
6) Fix infinite loop in UDP queue when using SO_PEEK_OFF, from Eric
Dumazet.
7) Use after free when tcf_chain_destroy() called multiple times, from
Jiri Pirko.
8) Fix KSZ DSA tag layer multiple free of SKBS, from Florian Fainelli.
9) Fix leak of uninitialized memory in sctp_get_sctp_info(),
inet_diag_msg_sctpladdrs_fill() and inet_diag_msg_sctpaddrs_fill().
From Stefano Brivio.
10) L2TP tunnel refcount fixes from Guillaume Nault.
11) Don't leak UDP secpath in udp_set_dev_scratch(), from Yossi
Kauperman.
12) Revert a PHY layer change wrt. handling of PHY_HALTED state in
phy_stop_machine(), it causes regressions for multiple people. From
Florian Fainelli.
13) When packets are sent out of br0 we have to clear the
offload_fwdq_mark value.
14) Several NULL pointer deref fixes in packet schedulers when their
->init() routine fails. From Nikolay Aleksandrov.
15) Aquantium devices cannot checksum offload correctly when the packet
is <= 60 bytes. From Pavel Belous.
16) Fix vnet header access past end of buffer in AF_PACKET, from
Benjamin Poirier.
17) Double free in probe error paths of nfp driver, from Dan Carpenter.
18) QOS capability not checked properly in DCB init paths of mlx5
driver, from Huy Nguyen.
19) Fix conflicts between firmware load failure and health_care timer in
mlx5, also from Huy Nguyen.
20) Fix dangling page pointer when DMA mapping errors occur in mlx5,
from Eran Ben ELisha.
21) ->ndo_setup_tc() in bnxt_en driver doesn't count rings properly,
from Michael Chan.
22) Missing MSIX vector free in bnxt_en, also from Michael Chan.
23) Refcount leak in xfrm layer when using sk_policy, from Lorenzo
Colitti.
24) Fix copy of uninitialized data in qlge driver, from Arnd Bergmann.
25) bpf_setsockopts() erroneously always returns -EINVAL even on
success. Fix from Yuchung Cheng.
26) tipc_rcv() needs to linearize the SKB before parsing the inner
headers, from Parthasarathy Bhuvaragan.
27) Fix deadlock between link status updates and link removal in netvsc
driver, from Stephen Hemminger.
28) Missed locking of page fragment handling in ESP output, from Steffen
Klassert.
29) Fix refcnt leak in ebpf congestion control code, from Sabrina
Dubroca.
30) sxgbe_probe_config_dt() doesn't check devm_kzalloc()'s return value,
from Christophe Jaillet.
31) Fix missing ipv6 rx_dst_cookie update when rx_dst is updated during
early demux, from Paolo Abeni.
32) Several info leaks in xfrm_user layer, from Mathias Krause.
33) Fix out of bounds read in cxgb4 driver, from Stefano Brivio.
34) Properly propagate obsolete state of route upwards in ipv6 so that
upper holders like xfrm can see it. From Xin Long.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (118 commits)
udp: fix secpath leak
bridge: switchdev: Clear forward mark when transmitting packet
mlxsw: spectrum: Forbid linking to devices that have uppers
wl1251: add a missing spin_lock_init()
Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278
kcm: do not attach PF_KCM sockets to avoid deadlock
sch_tbf: fix two null pointer dereferences on init failure
sch_sfq: fix null pointer dereference on init failure
sch_netem: avoid null pointer deref on init failure
sch_fq_codel: avoid double free on init failure
sch_cbq: fix null pointer dereferences on init failure
sch_hfsc: fix null pointer deref and double free on init failure
sch_hhf: fix null pointer dereference on init failure
sch_multiq: fix double free on init failure
sch_htb: fix crash on init failure
net/mlx5e: Fix CQ moderation mode not set properly
net/mlx5e: Fix inline header size for small packets
net/mlx5: E-Switch, Unload the representors in the correct order
net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address
...
|
||
|
|
355627f518 |
mm, uprobes: fix multiple free of ->uprobes_state.xol_area
Commit |
||
|
|
22cf8bc6cb |
kernel/kthread.c: kthread_worker: don't hog the cpu
If the worker thread continues getting work, it will hog the cpu and rcu stall complains. Make it a good citizen. This is triggered in a loop block device test. Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
36fde05f3f |
Merge branch 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo: "A late but obvious fix for cgroup. I broke the 'cpuset.memory_pressure' file a long time ago (v4.4) by accidentally deleting its file index, which made it a duplicate of the 'cpuset.memory_migrate' file. Spotted and fixed by Waiman" * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix incorrect memory_pressure control file mapping |
||
|
|
75e8387685 |
perf/ftrace: Fix double traces of perf on ftrace:function
When running perf on the ftrace:function tracepoint, there is a bug
which can be reproduced by:
perf record -e ftrace:function -a sleep 20 &
perf record -e ftrace:function ls
perf script
ls 10304 [005] 171.853235: ftrace:function:
perf_output_begin
ls 10304 [005] 171.853237: ftrace:function:
perf_output_begin
ls 10304 [005] 171.853239: ftrace:function:
task_tgid_nr_ns
ls 10304 [005] 171.853240: ftrace:function:
task_tgid_nr_ns
ls 10304 [005] 171.853242: ftrace:function:
__task_pid_nr_ns
ls 10304 [005] 171.853244: ftrace:function:
__task_pid_nr_ns
We can see that all the function traces are doubled.
The problem is caused by the inconsistency of the register
function perf_ftrace_event_register() with the probe function
perf_ftrace_function_call(). The former registers one probe
for every perf_event. And the latter handles all perf_events
on the current cpu. So when two perf_events on the current cpu,
the traces of them will be doubled.
So this patch adds an extra parameter "event" for perf_tp_event,
only send sample data to this event when it's not NULL.
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: huawei.libin@huawei.com
Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
f12f42acdb |
perf/core: Fix potential double-fetch bug
While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race condition bug) where the same
userspace memory region are fetched twice into kernel with sanity checks after
the first fetch while missing checks after the second fetch.
1. The first fetch happens in line 9573 get_user(size, &uattr->size).
2. Subsequently the 'size' variable undergoes a few sanity checks and
transformations (line 9577 to 9584).
3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size)
4. Given that 'uattr' can be fully controlled in userspace, an attacker can
race condition to override 'uattr->size' to arbitrary value (say, 0xFFFFFFFF)
after the first fetch but before the second fetch. The changed value will be
copied to 'attr->size'.
5. There is no further checks on 'attr->size' until the end of this function,
and once the function returns, we lose the context to verify that 'attr->size'
conforms to the sanity checks performed in step 2 (line 9577 to 9584).
6. My manual analysis shows that 'attr->size' is not used elsewhere later,
so, there is no working exploit against it right now. However, this could
easily turns to an exploitable one if careless developers start to use
'attr->size' later.
To fix this, override 'attr->size' from the second fetch to the one from the
first fetch, regardless of what is actually copied in.
In this way, it is assured that 'attr->size' is consistent with the checks
performed after the first fetch.
Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: meng.xu@gatech.edu
Cc: sanidhya@gatech.edu
Cc: taesoo@gatech.edu
Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
3510ca20ec |
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
0adb8f3d31 |
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar: "Fix a timer granularity handling race+bug, which would manifest itself by spuriously increasing timeouts of some timers (from 1 jiffy to ~500 jiffies in the worst case measured) in certain nohz states" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix excessive granularity of new timers after a nohz idle |
||
|
|
53ede64de3 |
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
"A single fix to not allow nonsensical event groups that result in
kernel warnings"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix group {cpu,task} validation
|
||
|
|
0bcdc0987c |
time: Fix ktime_get_raw() incorrect base accumulation
In comqit |
||
|
|
2b7e8665b4 |
fork: fix incorrect fput of ->exe_file causing use-after-free
Commit |
||
|
|
64aee2a965 |
perf/core: Fix group {cpu,task} validation
Regardless of which events form a group, it does not make sense for the
events to target different tasks and/or CPUs, as this leaves the group
inconsistent and impossible to schedule. The core perf code assumes that
these are consistent across (successfully intialised) groups.
Core perf code only verifies this when moving SW events into a HW
context. Thus, we can violate this requirement for pure SW groups and
pure HW groups, unless the relevant PMU driver happens to perform this
verification itself. These mismatched groups subsequently wreak havoc
elsewhere.
For example, we handle watchpoints as SW events, and reserve watchpoint
HW on a per-CPU basis at pmu::event_init() time to ensure that any event
that is initialised is guaranteed to have a slot at pmu::add() time.
However, the core code only checks the group leader's cpu filter (via
event_filter_match()), and can thus install follower events onto CPUs
violating thier (mismatched) CPU filters, potentially installing them
into a CPU without sufficient reserved slots.
This can be triggered with the below test case, resulting in warnings
from arch backends.
#define _GNU_SOURCE
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
int group_fd, unsigned long flags)
{
return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}
char watched_char;
struct perf_event_attr wp_attr = {
.type = PERF_TYPE_BREAKPOINT,
.bp_type = HW_BREAKPOINT_RW,
.bp_addr = (unsigned long)&watched_char,
.bp_len = 1,
.size = sizeof(wp_attr),
};
int main(int argc, char *argv[])
{
int leader, ret;
cpu_set_t cpus;
/*
* Force use of CPU0 to ensure our CPU0-bound events get scheduled.
*/
CPU_ZERO(&cpus);
CPU_SET(0, &cpus);
ret = sched_setaffinity(0, sizeof(cpus), &cpus);
if (ret) {
printf("Unable to set cpu affinity\n");
return 1;
}
/* open leader event, bound to this task, CPU0 only */
leader = perf_event_open(&wp_attr, 0, 0, -1, 0);
if (leader < 0) {
printf("Couldn't open leader: %d\n", leader);
return 1;
}
/*
* Open a follower event that is bound to the same task, but a
* different CPU. This means that the group should never be possible to
* schedule.
*/
ret = perf_event_open(&wp_attr, 0, 1, leader, 0);
if (ret < 0) {
printf("Couldn't open mismatched follower: %d\n", ret);
return 1;
} else {
printf("Opened leader/follower with mismastched CPUs\n");
}
/*
* Open as many independent events as we can, all bound to the same
* task, CPU0 only.
*/
do {
ret = perf_event_open(&wp_attr, 0, 0, -1, 0);
} while (ret >= 0);
/*
* Force enable/disble all events to trigger the erronoeous
* installation of the follower event.
*/
printf("Opened all events. Toggling..\n");
for (;;) {
prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
}
return 0;
}
Fix this by validating this requirement regardless of whether we're
moving events.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhou Chengming <zhouchengming1@huawei.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
415be6c256 |
Merge tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various bug fixes:
- Two small memory leaks in error paths.
- A missed return error code on an error path.
- A fix to check the tracing ring buffer CPU when it doesn't exist
(caused by setting maxcpus on the command line that is less than
the actual number of CPUs, and then onlining them manually).
- A fix to have the reset of boot tracers called by lateinit_sync()
instead of just lateinit(). As some of the tracers register via
lateinit(), and if the clear happens before the tracer is
registered, it will never start even though it was told to via the
kernel command line"
* tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix freeing of filter in create_filter() when set_str is false
tracing: Fix kmemleak in tracing_map_array_free()
ftrace: Check for null ret_stack on profile function graph entry function
ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU
tracing: Missing error code in tracer_alloc_buffers()
tracing: Call clear_boot_tracer() at lateinit_sync
|
||
|
|
1c08c22c87 |
cpuset: Fix incorrect memory_pressure control file mapping
The memory_pressure control file was incorrectly set up without
a private value (0, by default). As a result, this control
file was treated like memory_migrate on read. By adding back the
FILE_MEMORY_PRESSURE private value, the correct memory pressure value
will be returned.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes:
|
||
|
|
8b0db1a5bd |
tracing: Fix freeing of filter in create_filter() when set_str is false
Performing the following task with kmemleak enabled:
# cd /sys/kernel/tracing/events/irq/irq_handler_entry/
# echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
# echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800b9290308 (size 32):
comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
[<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
[<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
[<ffffffff812639c9>] create_filter+0xa9/0x160
[<ffffffff81263bdc>] create_event_filter+0xc/0x10
[<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
[<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
[<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
[<ffffffff8138cf87>] __vfs_write+0xd7/0x380
[<ffffffff8138f421>] vfs_write+0x101/0x260
[<ffffffff8139187b>] SyS_write+0xab/0x130
[<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
[<ffffffffffffffff>] 0xffffffffffffffff
The function create_filter() is passed a 'filterp' pointer that gets
allocated, and if "set_str" is true, it is up to the caller to free it, even
on error. The problem is that the pointer is not freed by create_filter()
when set_str is false. This is a bug, and it is not up to the caller to free
the filter on error if it doesn't care about the string.
Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
475bb3c69a |
tracing: Fix kmemleak in tracing_map_array_free()
kmemleak reported the below leak when I was doing clear of the hist
trigger. With this patch, the kmeamleak is gone.
unreferenced object 0xffff94322b63d760 (size 32):
comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
hex dump (first 32 bytes):
00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00 ................
10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff ..........z.1...
backtrace:
[<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
[<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0
[<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140
[<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
[<ffffffff9e38b935>] create_hist_data+0x535/0x750
[<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
[<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
[<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
[<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
[<ffffffff9e450b85>] SyS_write+0x55/0xc0
[<ffffffff9e203857>] do_syscall_64+0x67/0x150
[<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff9431f27aa880 (size 128):
comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
hex dump (first 32 bytes):
00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff ...*2......*2...
00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff ...*2......*2...
backtrace:
[<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
[<ffffffff9e425348>] __kmalloc+0xe8/0x220
[<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140
[<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
[<ffffffff9e38b935>] create_hist_data+0x535/0x750
[<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
[<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
[<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
[<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
[<ffffffff9e450b85>] SyS_write+0x55/0xc0
[<ffffffff9e203857>] do_syscall_64+0x67/0x150
[<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
a8f0f9e499 |
ftrace: Check for null ret_stack on profile function graph entry function
There's a small race when function graph shutsdown and the calling of the registered function graph entry callback. The callback must not reference the task's ret_stack without first checking that it is not NULL. Note, when a ret_stack is allocated for a task, it stays allocated until the task exits. The problem here, is that function_graph is shutdown, and a new task was created, which doesn't have its ret_stack allocated. But since some of the functions are still being traced, the callbacks can still be called. The normal function_graph code handles this, but starting with commit |
||
|
|
2fe59f507a |
timers: Fix excessive granularity of new timers after a nohz idle
When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.
However there are several problems:
- If an existing timer is modified, the base is forwarded only after
the index is calculated.
- The base is not forwarded by add_timer_on.
- There is a window after a timer is restarted from a nohz idle, after
it is marked not-idle and before the timer tick on this CPU, where a
timer may be added but the ancient base does not get forwarded.
These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.
Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.
There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.
This has been tested and found to fix the RCU softlockup messages.
Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:
max avg std
upstream 506.0 1.20 4.68
patched 2.0 1.08 0.15
The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:
max avg std
upstream 9.0 1.05 0.19
patched 2.0 1.04 0.11
Fixes: Fixes:
|
||
|
|
33ba43ed0a |
bpf: fix map value attribute for hash of maps
Currently, iproute2's BPF ELF loader works fine with array of maps
when retrieving the fd from a pinned node and doing a selfcheck
against the provided map attributes from the object file, but we
fail to do the same for hash of maps and thus refuse to get the
map from pinned node.
Reason is that when allocating hash of maps, fd_htab_map_alloc() will
set the value size to sizeof(void *), and any user space map creation
requests are forced to set 4 bytes as value size. Thus, selfcheck
will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
returns the value size used to create the map.
Fix it by handling it the same way as we do for array of maps, which
means that we leave value size at 4 bytes and in the allocation phase
round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
calling into htab_map_update_elem() with the pointer of the map
pointer as value. Unlike array of maps where we just xchg(), we're
using the generic htab_map_update_elem() callback also used from helper
calls, which published the key/value already on return, so we need
to ensure to memcpy() the right size.
Fixes:
|
||
|
|
dd1c1f2f20 |
pids: make task_tgid_nr_ns() safe
This was reported many times, and this was even mentioned in commit
|
||
|
|
e46db8d2ef |
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Two fixes for the perf subsystem:
- Fix an inconsistency of RDPMC mm struct tagging across exec() which
causes RDPMC to fault.
- Correct the timestamp mechanics across IOC_DISABLE/ENABLE which
causes incorrect timestamps and total time calculations"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix time on IOC_ENABLE
perf/x86: Fix RDPMC vs. mm_struct tracking
|
||
|
|
9dae41a238 |
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A pile of smallish changes all over the place:
- Add a missing ISB in the GIC V1 driver
- Remove an ACPI version check in the GIC V3 ITS driver
- Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid
spurious wakeups
- Remove the artifical limitation of ITS instances to the number of
NUMA nodes which prevents utilizing the ITS hardware correctly
- Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code
- Honour the force affinity argument in the GIC-V3 driver which is
required to make perf work correctly
- Correctly report allocation failures in GIC-V2/V3 to avoid using
half allocated and initialized interrupts.
- Fixup checks against nr_cpu_ids in the generic IPI code"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/ipi: Fixup checks against nr_cpu_ids
genirq: Restore trigger settings in irq_modify_status()
MAINTAINERS: Remove Jason Cooper's irqchip git tree
irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop
irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES
irqchip: brcmstb-l2: Define an irq_pm_shutdown function
irqchip/gic: Ensure we have an ISB between ack and ->handle_irq
irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA
irqchip/gic-v3: Honor forced affinity setting
irqchip/gic-v3: Report failures in gic_irq_domain_alloc
irqchip/gic-v2: Report failures in gic_irq_domain_alloc
irqchip/atmel-aic: Remove root argument from ->fixup() prototype
irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup()
irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
|
||
|
|
e18a5ebc2d |
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes |
||
|
|
8fbbe2d7cc |
genirq/ipi: Fixup checks against nr_cpu_ids
Valid CPU ids are [0, nr_cpu_ids-1] inclusive. Fixes: |
||
|
|
eb61b5911b |
signal: don't remove SIGNAL_UNKILLABLE for traced tasks.
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive faults, but this is undesirable when tracing. For example, debugging an init process (whether global or namespace), hitting a breakpoint and SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE. Everything continues fine, but then once debugging has finished, the init process is left killable which is unlikely what the user expects, resulting in either an accidentally killed init or an init that stops reaping zombies. Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com Signed-off-by: Jamie Iles <jamie.iles@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2ba293c9e7 |
kmod: fix wait on recursive loop
Recursive loops with module loading were previously handled in kmod by restricting the number of modprobe calls to 50 and if that limit was breached request_module() would return an error and a user would see the following on their kernel dmesg: request_module: runaway loop modprobe binfmt-464c Starting init:/sbin/init exists but couldn't execute it (error -8) This issue could happen for instance when a 64-bit kernel boots a 32-bit userspace on some architectures and has no 32-bit binary format hanlders. This is visible, for instance, when a CONFIG_MODULES enabled 64-bit MIPS kernel boots a into o32 root filesystem and the binfmt handler for o32 binaries is not built-in. After commit |
||
|
|
7edaeb6841 |
kernel/watchdog: Prevent false positives with turbo modes
The hardlockup detector on x86 uses a performance counter based on unhalted
CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
performance counter period, so the hrtimer should fire 2-3 times before the
performance counter NMI fires. The NMI code checks whether the hrtimer
fired since the last invocation. If not, it assumess a hard lockup.
The calculation of those periods is based on the nominal CPU
frequency. Turbo modes increase the CPU clock frequency and therefore
shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
nominal frequency) the perf/NMI period is shorter than the hrtimer period
which leads to false positives.
A simple fix would be to shorten the hrtimer period, but that comes with
the side effect of more frequent hrtimer and softlockup thread wakeups,
which is not desired.
Implement a low pass filter, which checks the perf/NMI period against
kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
elapsed then the event is ignored and postponed to the next perf/NMI.
That solves the problem and avoids the overhead of shorter hrtimer periods
and more frequent softlockup thread wakeups.
Fixes:
|
||
|
|
e8f241893d |
genirq: Restore trigger settings in irq_modify_status()
irq_modify_status starts by clearing the trigger settings from
irq_data before applying the new settings, but doesn't restore them,
leaving them to IRQ_TYPE_NONE.
That's pretty confusing to the potential request_irq() that could
follow. Instead, snapshot the settings before clearing them, and restore
them if the irq_modify_status() invocation was not changing the trigger.
Fixes:
|
||
|
|
c49cbc19b3 |
cpufreq: schedutil: Always process remote callback with slow switching
The frequency update from the utilization update handlers can be divided into two parts: (A) Finding the next frequency (B) Updating the frequency While any CPU can do (A), (B) can be restricted to a group of CPUs only, depending on the current platform. For platforms where fast cpufreq switching is possible, both (A) and (B) are always done from the same CPU and that CPU should be capable of changing the frequency of the target CPU. But for platforms where fast cpufreq switching isn't possible, after doing (A) we wake up a kthread which will eventually do (B). This kthread is already bound to the right set of CPUs, i.e. only those which can change the frequency of CPUs of a cpufreq policy. And so any CPU can actually do (A) in this case, as the frequency is updated from the right set of CPUs only. Check cpufreq_can_do_remote_dvfs() only for the fast switching case. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
e2cabe48c2 |
cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
Utilization update callbacks are now processed remotely, even on the CPUs that don't share cpufreq policy with the target CPU (if dvfs_possible_from_any_cpu flag is set). But in non-fast switch paths, the frequency is changed only from one of policy->related_cpus. This happens because the kthread which does the actual update is bound to a subset of CPUs (i.e. related_cpus). Allow frequency to be remotely updated as well (i.e. call __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
422ce075f9 |
Merge tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore: "Two small fixes to the audit code, both explained well in the respective patch descriptions, but the quick summary is one use-after-free fix, and one silly fanotify notification flag fix" * tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Receive unmount event audit: Fix use after free in audit_remove_watch_rule() |
||
|
|
510c8a899c |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel
Grumbach.
2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From
Vivien Didelot.
3) Fix two regressions with bonding on wireless, from Andreas Born.
4) Fix build when busypoll is disabled, from Daniel Borkmann.
5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet.
6) Set SKB cached route properly in inet_rtm_getroute(), from Florian
Westphal.
7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding
Tianhong.
8) Fix module refcnt leak in ULP code, from Sabrina Dubroca.
9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric
Dumazet.
10) Need to purge socket write queue in dccp_destroy_sock(), also from
Eric Dumazet.
11) Make bpf_trace_printk() work properly on 32-bit architectures, from
Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
bpf: fix bpf_trace_printk on 32 bit archs
PCI: fix oops when try to find Root Port for a PCI device
sfc: don't try and read ef10 data on non-ef10 NIC
net_sched: remove warning from qdisc_hash_add
net_sched/sfq: update hierarchical backlog when drop packet
net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
ipv4: fix NULL dereference in free_fib_info_rcu()
net: Fix a typo in comment about sock flags.
ipv6: fix NULL dereference in ip6_route_dev_notify()
tcp: fix possible deadlock in TCP stack vs BPF filter
dccp: purge write queue in dccp_destroy_sock()
udp: fix linear skb reception with PEEK_OFF
ipv6: release rt6->rt6i_idev properly during ifdown
af_key: do not use GFP_KERNEL in atomic contexts
tcp: ulp: avoid module refcnt leak in tcp_set_ulp
net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
PCI: Disable Relaxed Ordering Attributes for AMD A1100
PCI: Disable Relaxed Ordering for some Intel processors
PCI: Disable PCIe Relaxed Ordering if unsupported
...
|
||
|
|
88a5c690b6 |
bpf: fix bpf_trace_printk on 32 bit archs
James reported that on MIPS32 bpf_trace_printk() is currently
broken while MIPS64 works fine:
bpf_trace_printk() uses conditional operators to attempt to
pass different types to __trace_printk() depending on the
format operators. This doesn't work as intended on 32-bit
architectures where u32 and long are passed differently to
u64, since the result of C conditional operators follows the
"usual arithmetic conversions" rules, such that the values
passed to __trace_printk() will always be u64 [causing issues
later in the va_list handling for vscnprintf()].
For example the samples/bpf/tracex5 test printed lines like
below on MIPS32, where the fd and buf have come from the u64
fd argument, and the size from the buf argument:
[...] 1180.941542: 0x00000001: write(fd=1, buf= (null), size=6258688)
Instead of this:
[...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
One way to get it working is to expand various combinations
of argument types into 8 different combinations for 32 bit
and 64 bit kernels. Fix tested by James on MIPS32 and MIPS64
as well that it resolves the issue.
Fixes:
|
||
|
|
b5fed474b9 |
audit: Receive unmount event
Although audit_watch_handle_event() can handle FS_UNMOUNT event, it is not part of AUDIT_FS_WATCH mask and thus such event never gets to audit_watch_handle_event(). Thus fsnotify marks are deleted by fsnotify subsystem on unmount without audit being notified about that which leads to a strange state of existing audit rules with dead fsnotify marks. Add FS_UNMOUNT to the mask of events to be received so that audit can clean up its state accordingly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com> |
||
|
|
d76036ab47 |
audit: Fix use after free in audit_remove_watch_rule()
audit_remove_watch_rule() drops watch's reference to parent but then continues to work with it. That is not safe as parent can get freed once we drop our reference. The following is a trivial reproducer: mount -o loop image /mnt touch /mnt/file auditctl -w /mnt/file -p wax umount /mnt auditctl -D <crash in fsnotify_destroy_mark()> Grab our own reference in audit_remove_watch_rule() earlier to make sure mark does not get freed under us. CC: stable@vger.kernel.org Reported-by: Tony Jones <tonyj@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Tony Jones <tonyj@suse.de> Signed-off-by: Paul Moore <paul@paul-moore.com> |
||
|
|
16af97dc5a |
mm: migrate: prevent racy access to tlb_flush_pending
Patch series "fixes of TLB batching races", v6.
It turns out that Linux TLB batching mechanism suffers from various
races. Races that are caused due to batching during reclamation were
recently handled by Mel and this patch-set deals with others. The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables. This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.
This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.
Finally, there is also one memory barrier missing, which may affect
architectures with weak memory model.
This patch (of 7):
Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.
This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.
An actual data corruption was not observed, yet the race was was
confirmed by adding assertion to check tlb_flush_pending is not set by
two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes:
|
||
|
|
d507e2ebd2 |
mm: fix global NR_SLAB_.*CLAIMABLE counter reads
As Tetsuo points out: "Commit |
||
|
|
9b231d9f47 |
perf/core: Fix time on IOC_ENABLE
Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task is SIGSTOP'ed state the timestamps go wobbly. It turns out we indeed fail to correctly account time while in 'OFF' state and doing IOC_ENABLE without getting scheduled in exposes the problem. Further thinking about this problem, it occurred to me that we can suffer a similar fate when we migrate an uncore event between CPUs. The perf_event_install() on the 'new' CPU will do add_event_to_ctx() which will reset all the time stamp, resulting in a subsequent update_event_times() to overwrite the total_time_* fields with smaller values. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
bfe334924c |
perf/x86: Fix RDPMC vs. mm_struct tracking
Vince reported the following rdpmc() testcase failure:
> Failing test case:
>
> fd=perf_event_open();
> addr=mmap(fd);
> exec() // without closing or unmapping the event
> fd=perf_event_open();
> addr=mmap(fd);
> rdpmc() // GPFs due to rdpmc being disabled
The problem is of course that exec() plays tricks with what is
current->mm, only destroying the old mappings after having
installed the new mm.
Fix this confusion by passing along vma->vm_mm instead of relying on
current->mm.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
209887e6b9 |
cpufreq: Return 0 from ->fast_switch() on errors
CPUFREQ_ENTRY_INVALID is a special symbol which is used to specify that an entry in the cpufreq table is invalid. But using it outside of the scope of the cpufreq table looks a bit incorrect. We can represent an invalid frequency by writing it as 0 instead if we need. Note that it is already done that way for the return value of the ->get() callback. Lets do the same for ->fast_switch() and not use CPUFREQ_ENTRY_INVALID outside of the scope of cpufreq table. Also update the comment over cpufreq_driver_fast_switch() to clearly mention what this returns. None of the drivers return CPUFREQ_ENTRY_INVALID as of now from ->fast_switch() callback and so we don't need to update any of those. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
48fb6f4db9 |
futex: Remove unnecessary warning from get_futex_key
Commit |
||
|
|
fbb77611e9 |
Fix compat_sys_sigpending breakage
The latest change of compat_sys_sigpending in commit |
||
|
|
d1faa3e78a |
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner: "A single fix for a multiplication overflow in the timer code on 32bit systems" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix overflow in get_next_timer_interrupt |
||
|
|
89affbf5d9 |
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
smp_call_function+0x3b/0x70
on_each_cpu+0x2f/0x90
text_poke_bp+0x87/0xd0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad
...
CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
__slab_alloc.isra.80+0x54/0x90
copy_process.part.40+0xf7/0x1de0
copy_process.part.40+0xf7/0x1de0
kmem_cache_alloc_node+0x8a/0x280
copy_process.part.40+0xf7/0x1de0
_do_fork+0xe7/0x6c0
_raw_spin_unlock_irq+0x2d/0x60
trace_hardirqs_on_caller+0x136/0x1d0
entry_SYSCALL_64_fastpath+0x5/0xad
do_syscall_64+0x27/0x350
SyS_clone+0x19/0x20
do_syscall_64+0x60/0x350
entry_SYSCALL64_slow_path+0x25/0x25
Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes:
|
||
|
|
27e37d84e5 |
pid: kill pidhash_size in pidhash_init()
After commit
|