This change fix widevine playback issue for ION failing to secure mm heap
after flashing image into device or do factory reset.
Bug: 9051476
Reference from:
commit 15962c2598f71e0fed921ae441fede5df2bb9e1c
Author: Laura Abbott <lauraa@codeaurora.org>
Date: Fri Apr 5 12:39:18 2013 -0700
mm: Retry original migrate type if CMA failed
Currently, __rmqueue_cma will disregard the original migrate type
and only try MIGRATE_CMA for allocations. If the MIGRATE_CMA
allocation fails, the fallback types of the original migrate type
are used. Note that in this current path we never try to actually
allocate from the original migrate type. If the only pages left
in the system are the original migrate type, we will fail the
allocation since we never actually try the original migrate type.
This may lead to infinite looping since the system still (correctly)
calculates there are pages available for allocation and will keep
trying to allocate pages. Fix this degenerate case by allocating
from the original migrate type if the MIGRATE_CMA allocation fails.
Change-Id: I62ab293dc694955eaf88e790131a8565395ba8cb
CRs-Fixed: 470615
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: ted_lin <ted_lin@asus.com>
CMA pages are designed to be used as fallback for movable allocations
and cannot be used for non-movable allocations. If CMA pages are
utilized poorly, non-movable allocations may end up getting starved if
all regular movable pages are allocated and the only pages left are
CMA. Always using CMA pages first creates unacceptable performance
problems. As a midway alternative, use CMA pages for certain
userspace allocations. The userspace pages can be migrated or dropped
quickly which giving decent utilization.
Change-Id: I6165dda01b705309eebabc6dfa67146b7a95c174
CRs-Fixed: 452508
[lauraa@codeaurora.org: Missing CONFIG_CMA guards, add commit text]
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
This reverts commit b5662d64fa5ee483b985b351dec993402422fee3.
Using CMA pages first creates good utilization but has some
unfortunate side effects. Many movable allocations come from
the filesystem layer which can hold on to pages for long periods
of time which causes high allocation times (~200ms) and high
rates of failure. Revert this patch and use alternate allocation
strategies to get better utilization.
Change-Id: I917e137d5fb292c9f8282506f71a799a6451ccfa
CRs-Fixed: 452508
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
CMA allocations rely on being able to migrate pages out
quickly to fulfill the allocations. Most use cases for
movable allocations meet this requirement. File system
allocations may take an unaccpetably long time to
migrate, which creates delays from CMA. Prevent CMA
pages from ending up on the per-cpu lists to avoid
code paths grabbing CMA pages on the fast path. CMA
pages can still be allocated as a fallback under tight
memory pressure.
CRs-Fixed: 452508
Change-Id: I79a28f697275a2a1870caabae53c8ea345b4b47d
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
(from Minchan Kim).
* During watermark check decrease available free pages number by
free CMA pages number if necessary (unmovable allocations cannot
use pages from CMA areas).
CRs-Fixed: 446321
Change-Id: Ibd069b028eb80b70701c1b81cb28a503d8265be0
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lauraa@codeaurora.org: context fixups]
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
This is useful to diagnose the reason for page allocation failure for
cases where there appear to be several free pages.
Example, with this alloc_pages(GFP_ATOMIC) failure:
swapper/0: page allocation failure: order:0, mode:0x0
...
Mem-info:
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 48
CPU 1: hi: 90, btch: 15 usd: 21
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:84 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:4026 slab_reclaimable:75 slab_unreclaimable:484
mapped:0 shmem:0 pagetables:0 bounce:0
Normal free:16104kB min:2296kB low:2868kB high:3444kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:336kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:331776kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:300kB
slab_unreclaimable:1936kB kernel_stack:328kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Before the patch, it's hard (for me, at least) to say why all these free
chunks weren't considered for allocation:
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB
1*1024kB 1*2048kB 3*4096kB = 16128kB
After the patch, it's obvious that the reason is that all of these are
in the MIGRATE_CMA (C) freelist:
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB
(C) 1*1024kB (C) 1*2048kB (C) 3*4096kB (C) = 16128kB
Change-Id: Ic5fe77d762e0c03715bfb917774e7c4f03ac43f5
Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
It has been observed that system tends to keep a lot of CMA free pages
even in very high memory pressure use cases. The CMA fallback for
movable pages is used very rarely, only when system is completely
pruned from MOVABLE pages. This means that the out-of-memory is
triggered for unmovable allocations even when there are many CMA pages
available. This problem was not observed previously since movable
pages were used as a fallback for unmovable allocations.
To avoid such situation this commit changes the allocation order so
that on movable allocations the MIGRATE_CMA pageblocks are used first.
This change means that the MIGRATE_CMA can be removed from fallback
path of the MIGRATE_MOVABLE type. This means that the
__rmqueue_fallback() function will never deal with CMA pages and thus
all the checks around MIGRATE_CMA can be removed from that function.
Change-Id: Ie13312d62a6af12d7aa78b4283ed25535a6d49fd
CRs-Fixed: 435287
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
In certain memory configurations there can be a large number of
CMA pages which are not suitable to satisfy certain memory
requests.
This large number of unsuitable pages can cause the
lowmemorykiller to not kill any tasks because the
lowmemorykiller counts all free pages.
In order to ensure the lowmemorykiller properly evaluates the
free memory only count the free CMA pages if they are suitable
for satisfying the memory request.
Change-Id: I7f06d53e2d8cfe7439e5561fe6e5209ce73b1c90
CRs-fixed: 437016
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Both patches needed, second patch (among other things) fixes
a bug in the first.
commit 2139cbe627b8910ded55148f87ee10f7485408ed
Author: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Date: Mon Oct 8 16:32:00 2012 -0700
cma: fix counting of isolated pages
Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter. Fix
it by properly decreasing/increasing NR_FREE_PAGES counter in
set_migratetype_isolate()/unset_migratetype_isolate() and removing counter
adjustment for isolated pages from free_one_page() and split_free_page().
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lbassel@codeaurora.org: backport from 3.7, small changes needed]
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
commit d1ce749a0db12202b711d1aba1d29e823034648d
Author: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Date: Mon Oct 8 16:32:02 2012 -0700
cma: count free CMA pages
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
__zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
counter always available (not only when CONFIG_CMA=y).
[akpm@linux-foundation.org: use conventional migratetype naming]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lbassel@codeaurora.org: backport from 3.7, small changes needed]
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
Change-Id: I7d4f5fe0b6931192706337e0b730f43e7cccd031
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
Signed-off-by: Neha Pandey <nehap@codeaurora.org>
The current calculation in pfn_to_bitidx assumes that
(pfn - zone->zone_start_pfn) >> pageblock_order will return the
same bit for all pfn in a pageblock. If zone_start_pfn is not
aligned to pageblock_nr_pages, this may not always be correct.
Consider the following with pageblock order = 10, zone start 2MB:
pfn | pfn - zone start | (pfn - zone start) >> page block order
----------------------------------------------------------------
0x26000 | 0x25e00 | 0x97
0x26100 | 0x25f00 | 0x97
0x26200 | 0x26000 | 0x98
0x26300 | 0x26100 | 0x98
This means that calling {get,set}_pageblock_migratetype on a single
page will not set the migratetype for the full block. Fix this by
rounding down zone_start_pfn when doing the bitidx calculation.
Change-Id: I13e2f53f50db294f38ec86138c17c6fe29f0ee82
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
__alloc_contig_migrate_range() should use all possible ways to get all the
pages migrated from the given memory range, so pruning per-cpu lru lists
for all CPUs is required, regadless the cost of such operation. Otherwise
some pages which got stuck at per-cpu lru list might get missed by
migration procedure causing the contiguous allocation to fail.
Change-Id: I70cc0864c57dd49e89f57797122a3fd0f300647a
Signed-off-by: woojoong.lee <woojoong.lee@samsung.com>
Reviewed-on: http://165.213.202.130:8080/43063
Tested-by: System S/W SCM <scm.systemsw@samsung.com>
Reviewed-by: daeho jeong <daeho.jeong@samsung.com>
Reviewed-by: Jeong-Ho Kim <jammer@samsung.com>
Tested-by: Jeong-Ho Kim <jammer@samsung.com>
[lauraa@codeaurora.org: Applied to correct file]
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Bring back the is_cma_pageblock definition for determining if a
page is CMA or not.
Change-Id: I39fd546e22e240b752244832c79514f109c8e84b
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Memory watermarks were sometimes preventing CMA allocations
in low memory.
Change-Id: I550ec987cbd6bc6dadd72b4a764df20cd0758479
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
The filesystem layer expects pages in the block device's mapping to not
be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
currently replace lowmem pages with highmem pages, leading to crashes in
filesystem code such as the one below:
Unable to handle kernel NULL pointer dereference at virtual address 00000400
pgd = c0c98000
[00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
Internal error: Oops: 817 [#1] PREEMPT SMP ARM
CPU: 0 Not tainted (3.5.0-rc5+ #80)
PC is at __memzero+0x24/0x80
...
Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
Backtrace:
[<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
[<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
r4:c15337f0
[<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
[<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
[<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
r6:beccdcf0 r5:00074000 r4:beccdbbc
[<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
Fix this by replacing only highmem pages with highmem.
Change-Id: I6af2d509af48b5a586037be14bd3593b3f269d95
Reported-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
alloc_contig_range() performs memory allocation so it also should keep
track on keeping the correct level of memory watermarks. This commit adds
a call to *_slowpath style reclaim to grab enough pages to make sure that
the final collection of contiguous pages from freelists will not starve
the system.
Change-Id: I2d68d9ac2cfcd32ca6f515fc7e44e8d9d850dff1
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement. Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.
Change-Id: I31786592a8cc03e579ee01d99d7eba76e926263f
[a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.
This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).
It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory. Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.
To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types when requested.
Change-Id: I2bb0954de8be4f212b03dea0e5a508048684bda2
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
This commit adds a row for MIGRATE_ISOLATE type to the fallbacks array
which was missing from it. It also, changes the array traversal logic
a little making MIGRATE_RESERVE an end marker. The letter change,
removes the implicit MIGRATE_UNMOVABLE from the end of each row which
was read by __rmqueue_fallback() function.
Change-Id: Icdbbebb9eece2468c0b963964be9a4c579cbc775
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
This commit adds the alloc_contig_range() function which tries
to allocate given range of pages. It tries to migrate all
already allocated pages that fall in the range thus freeing them.
Once all pages in the range are freed they are removed from the
buddy system thus allocated for the caller to use.
Change-Id: I659b133b1c9991568bfb6bd09c7792e15f2a2bfb
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Fix compiler warning for a variable not initialized.
Change-Id: Ieedeb1cfb5a22eb5f671e6bfd1361315347a49af
Signed-off-by: Shashank Mittal <mittals@codeaurora.org>
Fix NR_IPI to be 7 instead of 6 because both googly and core add
an IPI.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Conflicts:
arch/arm/Kconfig
arch/arm/common/Makefile
arch/arm/include/asm/hardware/cache-l2x0.h
arch/arm/mm/cache-l2x0.c
arch/arm/mm/mmu.c
include/linux/wakelock.h
kernel/power/Kconfig
kernel/power/Makefile
kernel/power/main.c
kernel/power/power.h
If FIX_MOVABLE_ZONE is enabled, we want a specific size and
location of ZONE_MOVABLE.
Change-Id: I0b858c7310cd328e1118abc9d5fe6f364bb4ffad
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
(cherry picked from commit e6436864f939e77226c8e971610539649ed5f869)
Vmalloc will exit if the amount it needs to allocate is
greater than totalram_pages. Vmalloc cannot allocate
from the movable zone, so pages in the movable zone should
not be counted.
This change adds a new global variable: total_unmovable_pages.
It is calculated in init.c, based on totalram_pages minus
the pages in the movable zone. Vmalloc now looks at this new
global instead of totalram_pages.
total_unmovable_pages can be modified during memory_hotplug.
If the zone you are offlining/onlining is unmovable, then
you modify it similar to totalram_pages. If the zone is
movable, then no change is needed.
Change-Id: Ie55c41051e9ad4b921eb04ecbb4798a8bd2344d6
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
(cherry picked from commit 59f9f1c9ae463a3d4499cd9353619f8b1993371b)
Conflicts:
arch/arm/mm/init.c
mm/memory_hotplug.c
mm/page_alloc.c
mm/vmalloc.c
z->lowmem_reserve[classzone_idx] is an unsigned long but
free_pages and min are longs. If free_pages is
negative, the function will incorrectly return true
because it will treat the negative long as a large,
positive unsigned long.
This change casts z->lowmem_reserve to a long and
fixes a typo in the comment.
Change-Id: Icada1fa5ca650fbcdb0656f637adbb98f223eec5
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
(cherry picked from commit 9f41da81017657a194a4e145bab337f13a4d7fd9)
Why is there less MemFree than there used to be? It perturbed a test,
so I've just been bisecting linux-next, and now find the offender went
upstream yesterday.
Commit 93278814d3 "mm: fix division by 0 in percpu_pagelist_fraction()"
mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
us expect it to be left unset at 0 (and it's not then used as a divisor).
MemTotal: 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB
Repetitive test with percpu_pagelist_fraction 8:
MemFree: 6948420kB 6237172kB 6949696kB 6840692kB 6949048kB 6862984kB
Same test with percpu_pagelist_fraction back to 0:
MemFree: 7945000kB 7944908kB 7948568kB 7949060kB 7948796kB 7948812kB
Signed-off-by: Hugh Dickins <hughd@google.com>
[ We really should fix the crazy sysctl interface too, but that's a
separate thing - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as
a possible error from proc_dointvec_minmax().
If any other error is returned, it would proceed to divide by zero since
percpu_pagelist_fraction wasn't getting initialized at any point. For
example, writing 0 bytes into the proc file would trigger the issue.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By default the kernel tries to keep half as much memory free at each
order as it does for one order below. This can be too agressive when
running without swap.
Change-Id: I5efc1a0b50f41ff3ac71e92d2efd175dedd54ead
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
an IPI requesting CPUs to drain these pages to the buddy allocator if they
actually have pages when asked to flush.
This patch saves 85%+ of IPIs asking to drain per-cpu pages in case of
severe memory pressure that leads to OOM since in these cases multiple,
possibly concurrent, allocation requests end up in the direct reclaim code
path so when the per-cpu pages end up reclaimed on first allocation
failure for most of the proceeding allocation attempts until the memory
pressure is off (possibly via the OOM killer) there are no per-cpu pages
on most CPUs (and there can easily be hundreds of them).
This also has the side effect of shortening the average latency of direct
reclaim by 1 or more order of magnitude since waiting for all the CPUs to
ACK the IPI takes a long time.
Tested by running "hackbench 400" on a 8 CPU x86 VM and observing the
difference between the number of direct reclaim attempts that end up in
drain_all_pages() and those were more then 1/2 of the online CPU had any
per-cpu page in them, using the vmstat counters introduced in the next
patch in the series and using proc/interrupts.
In the test sceanrio, this was seen to save around 3600 global
IPIs after trigerring an OOM on a concurrent workload:
$ cat /proc/vmstat | tail -n 2
pcp_global_drain 0
pcp_global_ipi_saved 0
$ cat /proc/interrupts | grep CAL
CAL: 1 2 1 2
2 2 2 2 Function call interrupts
$ hackbench 400
[OOM messages snipped]
$ cat /proc/vmstat | tail -n 2
pcp_global_drain 3647
pcp_global_ipi_saved 3642
$ cat /proc/interrupts | grep CAL
CAL: 6 13 6 3
3 3 1 2 7 Function call interrupts
Please note that if the global drain is removed from the direct reclaim
path as a patch from Mel Gorman currently suggests this should be replaced
with an on_each_cpu_cond invocation.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size of coredump files is limited by RLIMIT_CORE, however, allocating
large amounts of memory results in three negative consequences:
- the coredumping process may be chosen for oom kill and quickly deplete
all memory reserves in oom conditions preventing further progress from
being made or tasks from exiting,
- the coredumping process may cause other processes to be oom killed
without fault of their own as the result of a SIGSEGV, for example, in
the coredumping process, or
- the coredumping process may result in a livelock while writing to the
dump file if it needs memory to allocate while other threads are in
the exit path waiting on the coredumper to complete.
This is fixed by implying __GFP_NORETRY in the page allocator for
coredumping processes when reclaim has failed so the allocations fail and
the process continues to exit.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer chooses not to kill a thread if:
- an eligible thread has already been oom killed and has yet to exit,
and
- an eligible thread is exiting but has yet to free all its memory and
is not the thread attempting to currently allocate memory.
SysRq+F manually invokes the global oom killer to kill a memory-hogging
task. This is normally done as a last resort to free memory when no
progress is being made or to test the oom killer itself.
For both uses, we always want to kill a thread and never defer. This
patch causes SysRq+F to always kill an eligible thread and can be used to
force a kill even if another oom killed thread has failed to exit.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone. Even if
we only need compaction for an order 2 allocation, eg. for jumbo frames
networking.
The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
page_zone() requires an online node otherwise we are accessing NULL
NODE_DATA. This is not an issue at the moment because node_zones are
located at the structure beginning but this might change in the future
so better be careful about that.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following NULL ptr dereference caused by
cat /sys/devices/system/memory/memory0/removable
Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
is_pageblock_removable_nolock+0x34/0x40
is_mem_section_removable+0x74/0xf0
show_mem_removable+0x41/0x70
sysfs_read_file+0xfe/0x1c0
vfs_read+0xc7/0x130
sys_read+0x53/0xa0
system_call_fastpath+0x16/0x1b
We are crashing because we are trying to dereference NULL zone which
came from pfn=0 (struct page ffffea0000000000). According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
and early_node_map confirms that:
early_node_map[3] active PFN ranges
1: 0x00000010 -> 0x0000009c
1: 0x00000100 -> 0x000bffa3
1: 0x00100000 -> 0x00240000
The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks so the reserved range sneaks into the the section as well. This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.
When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.
We have only one node in this particular case and it is marked as node=1
(rather than 0) and that made the problem visible because page_to_nid will
return 0 and there are no zones on the node.
Let's check that the zone is valid and that the given pfn falls into its
boundaries and mark the section not removable. This might cause some
false positives, probably, but we do not have any sane way to find out
whether the page is reserved by the platform or it is just not used for
whatever other reasons.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed. For small high-orders, this has a
reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.
Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even in
cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__free_pages_bootmem() used to special-case higher-order frees to save
individual page checking with free_pages_bulk().
Nowadays, both zero order and non-zero order frees use free_pages(), which
checks each individual page anyway, and so there is little point in making
the distinction anymore. The higher-order loop will work just fine for
zero order pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 88f5acf88a ("mm: page allocator: adjust the per-cpu counter
threshold when memory is low") changed the form how free_pages is
calculated but it forgot that we used to do free_pages - ((1 << order) -
1) so we ended up with off-by-two when calculating free_pages.
Reported-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.
This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.
But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones. And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list. This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.
Enter per-zone dirty limits. They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place. As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.
As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon. The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.
At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the amount of pages that
"spill over" are limited themselves by the lower zones' dirty constraints,
and thus unlikely to become a problem.
For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation. Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.
Test results
15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
seconds nr_vmscan_write
(stddev) min| median| max
xfs
vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
patched: 550.996( 3.802) 0.000| 0.000| 0.000
fuse-ntfs
vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
patched: 558.049(17.914) 0.000| 0.000| 43.000
btrfs
vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
patched: 563.365(11.368) 0.000| 0.000| 1362.000
ext4
vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
patched: 568.806(17.496) 0.000| 0.000| 0.000
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>