Memory management proposal
Overview
This is a proposal for improving memory management and I/O in Rockbox, aimed at addressing the current limitations with dynamic memory management and enabling its more widespread use. As a side effect it could also dramatically improve I/O performance for certain workloads on native targets. The proposed changes are somewhat interdependent, but are limited to three main areas: the native file I/O code, audio buffering, and buflib.
File caching on native targets
Rockbox's I/O performance leaves a lot to be desired. There are a number of performance traps for the unwary and common I/O patterns are penalized, mostly to do with small I/O. Requests of less than 512 bytes are guaranteed to have the worst throughput by doing single sector disk accesses, and if the in-memory buffer is not aligned with the
STORAGE_ALIGN
macros, or the file position is not sector aligned, the file code is forced to use bounce buffers and break longer transfers into multiple small ones, resulting in much lower throughput. RAM also has to be wasted on statically allocated bounce buffers.
These performance problems could be solved by caching file data in free RAM like general purpose OSs do, aka. page caching.
mmap
without an MMU
The crucial abstraction required to implement page caching effectively is
mmap
: a way to map a contiguous segment of a file to a contiguous buffer in memory. Most OSs use a hardware MMU for this purpose, so the mapping is only
virtually contiguous, but Rockbox supports platforms that don't have an MMU, so we need to take a different approach that will work with physically contiguous memory. Fortunately we already have buflib as an allocator for physically contiguous chunks of memory, so half of the
mmap
implementation already exists.
The other half of
mmap
is I/O. On systems with a hardware MMU, this is driven by page faults when the CPU does a load or store inside the memory-mapped region. For no-MMU systems we can't trap on memory accesses, so we have to make function calls instead to declare what type of access we'll be doing in the mapping, prior to performing the access.
There are a few other considerations to take into account. Because we're using physical memory and buflib needs to move allocations around to prevent memory fragmentation, the
mmap
API needs to work with handles instead of pointers, and callers need to be prepared to deal with the mapping address moving around when a buflib compaction occurs. In other words, memory mappings are subject to the same constraints as buflib allocations.
One of the design goals is to ensure that POSIX
mmap
API is a subset of the no-MMU
mmap
API, to allow for an efficient implementation on hosted targets. This will allow
mmap
to be used from application code.
Draft API
enum {
MMAP_READ,
MMAP_WRITE,
MMAP_READWRITE,
};
int mmap_create(int fd, int mode, off_t offset, size_t length);
void mmap_destroy(int md, size_t length);
void mmap_read(int md, size_t offset, size_t length);
void mmap_write(int md, size_t offset, size_t length);
void mmap_sync(int md, size_t offset, size_t length);
void* mmap_get(int md);
void mmap_put(int md);
size_t mmap_pagesize(void);
mmap_create
is used to create a new mmap object which reflects the contents of the file descriptor
fd
starting at
offset
from the beginning of the file and extending for
length
bytes. The offset has to be aligned to the system page size (returned by
mmap_pagesize
). The access mode is given by
mode
and must be compatible with the file's open mode. Once the mapping is opened the file descriptor is no longer needed and can be closed. The semantics of the mapping are similar to a POSIX
MAP_SHARED
mapping, ie. if the mapping is writable changes to the mapped buffer will be carried through to the underlying file, but there are a few important caveats to this (see caveats below).
mmap_destroy
destroys an mmap object and frees any associated memory. Any dirty pages are written back to the underlying file before the mapping is destroyed. The length of the mapping, as passed to
mmap_create
, needs to be given. (The length is required for an efficient POSIX implementation using
munmap
.)
mmap_get
and
mmap_put
are used to get access to the mapped buffer. The pointer returned by
mmap_get
is valid and guaranteed not to be moved by buflib, up until a matching call to
mmap_put
. Pairs of get/put calls nest, so it is possible to call
mmap_get
multiple times, and only the outermost
mmap_put
will finally release the pointer and allow it to be moved by buflib again.
Before reading or writing from the mapped buffer,
mmap_read
or
mmap_write
must be used to declare the range of bytes being accessed. The
offset
parameter is specified relative to the beginning of the mapping. Pages overlapping the byte range starting at
offset
and extending for
length
bytes will be populated with the corresponding file contents (if they have not previously been populated) and for
mmap_write
, the pages will also be marked dirty so that modifications to them can be written back to the file.
Writeback of a specific byte range can be triggered by calling
mmap_sync
, and
mmap_destroy
will also write back any dirty pages to the file. A successful writeback will mark pages as clean, so they will not be written back again unless they are dirtied again.
It's possible to perform get/put and read/write/sync calls at any time and from any thread. Multiple threads are allowed to use the same mmap concurrently; the mmap implementation will take care to ensure the results are well-defined, but competing accesses from threads can occur in any order (so they should ensure not to concurrently modify the same data, as with any shared state).
Users of mmap need to take care to declare all reads and writes properly,
before doing the memory access and not step outside the declared bounds. Failing to do this could cause strange bugs, especially if the mapping is being concurrently accessed and busy with I/O in those regions.
Implementation
The basic implementation of a native
mmap
would use a buflib allocation to allocate enough memory to back the mapping, plus some extra data for controlling the map, eg. state bits for tracking whether pages are dirty or valid.
The core of
mmap
would be pretty basic: each read/write/sync request is a state transition on the affected pages and doing any needed I/O. To prevent multiple threads from trying to perform conflicting updates on the same page, they need to lock the affected pages. This can be done by simply locking the range during the update operation, which keeps overhead low but still allows some level of concurrent access.
Because the mmap allocation can move, locks can't be stored inside the allocation. Instead we can use an array of locks shared between all mappings and assign one to each mapping at creation time (a form of hashed locking). This means unrelated maps may contend for the same lock and sometimes block unnecessarily, but since multithreaded access will probably be limited this should be acceptable.
On a native target mmap will have to interact with low-level filesystem code because the mapping has to outlive the file descriptor used to create it. In terms of the current code, this would mean using
struct filestr_base
to do uncached disk I/O.
On hosted targets, mmap will be a thin wrapper over the OS's own mmap functionality (either POSIX or Windows). Most of the API becomes a no-op on a hosted target which will give it very low overhead.
Caveats
Although mappings are intended to have
MAP_SHARED
semantics, this is not efficient on no-MMU systems because we have no way to re-use pages of one mapping in an overlapping mapping. In the case of read-only mappings this is harmless because they will all see the same file contents, though it wastes RAM. For writable mappings, changes made to one mapping will not propagate to other mappings, which may be a problem. Because overlapping mappings are not likely to be useful for Rockbox and not expected to occur, this is a limitation we can live with.
Using mmap
to implement a page cache
The page cache (which would be more accurately called a segment cache in this case, since we are not managing it at the page level) can be implemented by tracking all mmaps of a file.
read()
and
write()
may be implemented by walking the mappings of the open file and copying data between the mapping and the user-supplied buffer, creating new mappings as needed. File I/O would be driven indirectly via the
mmap_read()
and
mmap_write()
calls. This would solve all the current problems associated with small I/O, alignment, etc, in one fell swoop. It would also give us write caching for free, which could speed up write-heavy workloads dramatically (for example: building the database or saving artwork in pictureflow).
Mappings will never be explicitly destroyed; they'll just hang around until buflib needs to reclaim the memory for a new allocation. This ensures that free RAM is always used up, instead of sitting idle. For this to work properly, buflib will need to be smarter about how it compacts and frees allocations to make room for new ones; its current first-fit allocation and greedy compaction/shrinking strategies would lead to excessive thrashing when used to manage cached file segments.
One downside to this approach is that data is duplicated in RAM, once in the page cache and once in the user buffer. If the user buffer is large, the added RAM use could be significant; in these cases it might be better to use
mmap
directly, as this could avoid pressure on the page cache. This will probably only be a concern for "atomic audio" used in the buffering code (ie. audio formats that need to be allocated contiguous in memory, mainly chiptunes where the codec is an emulator that needs to perform random access on the file).
Improved audio buffering
Audio buffering is used to load data that will be needed by playback in advance, to avoid audio dropouts from I/O and optimize disk access on HDD-based targets. This is currently implemented by
buffering.c
. Data is buffered in a monolithic allocation (the audio buffer) which consumes all free RAM; this allocation can't shrink without stopping playback, throwing away all buffered data, and re-buffering. This is massively disruptive to the user experience and makes it effectively impossible to use
core_alloc
from code that may run during playback.
Breaking up this monolithic allocation would allow buflib to allocate without stopping playback. Although a similar effect could be achieved by adding a smarter shrink callback that shifts data around inside the allocation, page caching could replace almost all of the memory management done by
buffering.c
. Making use of the page cache would also be smart on hosted targets that don't have much RAM, although there we wouldn't have as much control over the page cache as on native targets. There would still be a need for a buffering thread to perform background loading, but this would be a much simpler task because a lot of the work will have been shifted to OS-level code.
Types of allocations
There are basically two types of allocations managed by
buffering.c
:
- Atomic allocations, which must be contiguous and which are accessed at random
- Streaming allocations, which are accessed in a file-like manner
Atomic allocations are used for tags, bitmaps, cuesheets, and certain audio formats. Most audio formats use streaming allocations. On native targets, codecs are also loaded using streaming allocations.
Sketch of a replacement system
Atomic allocations can be serviced using buflib allocations or an mmap'd file. An mmap is the best option for native targets when the allocation is meant to contain literal file data, since it avoids pressure on the page cache, while a plain buflib allocation is required to handle tag data or bitmaps (because those do not contain literal file data). Streaming allocations can be served using a normal file descriptor, since the memory never needs to be accessed directly.
Atomic allocations, whether buflib-backed or mmap-backed, are relatively easy to handle because they do not interact meaningfully with the page cache. Usually a buflib alloc needs to be filled in by running some operation in the background, for instance metadata parsing or JPEG decoding. An mmap can be populated by faulting in pages with
mmap_read()
in the background.
Streaming allocations require more care because they should interact closely with the page cache to avoid duplicating data on native targets. It's not clear yet what this interaction will look like. On hosted targets we have less control over the contents of the page cache, so to guarantee latency using a small buffer for the current file may still be necessary.
Improvements to buflib
Buflib is not suitable as an OS memory manager in its current form, since it assumes all allocations are equally important. The proposed improvements to file caching and audio buffering would result in excessive thrashing if used with buflib as-is, since buflib can't tell if the block it's shrinking contains some audio data that is going to be needed in the next few (milli)seconds, a piece of a settings file read at startup that is no longer needed, or a large chunk of dirty file data that needs to be written to disk.
Deciding what to free
Some more thought needs to be put into this, but for now, some general rules of thumb:
- Any data needed for immediate playback should not be freed because it'll just need to be read back in, causing thrashing and possibly audio dropouts.
- Dirty file mappings are very expensive to free because file data must be written to disk before the memory can be freed.
- Mappings which have seen recent I/O activity (say in the past few seconds) should generally not be freed, in case they need to be accessed again.
Using something like LRU eviction should generally work OK, combined with a few hacks to keep the most critical allocations from being freed. Doing periodic writeback of dirty mappings should prevent dirty data from building up to dangerous levels, so they are not likely to be a concern in practice, especially as Rockbox isn't write-heavy.
Replacing the shrink callback
The shrink callback seems to be of questionable value, since there are really only two users. One is in
talk.c
and handles shrinking by just freeing the allocation, and probably would be made redundant by page caching. The other two uses are related to
buffering.c
and would not be needed once the monolithic allocation is gone.
Using a simpler free callback that is called when buflib wants to free the allocation (can be blocked by the callback) would allow removing a few shrink-specific bits from buflib.
Handle table
Buflib currently expands the handle table on demand from the top of memory down, which can cause allocations to fail if there is an immovable allocation at the top of RAM. This generally won't happen with the current code because the buffering thread owns the top-of-RAM allocation, and can always shrink, but once that's gone it might be a bigger problem. One way to avoid issues is by always reserving some free space to grow the handle table and proactively moving stuff away from the end when space runs low.
--
AidanMacDonald - 16 Apr 2022
Copyright © by the contributing authors.