Nintendo 64 Part 11: Scheduling RCP Tasks
Better Performance
Conceptually, the main loop looks like this:
for (;;) {
update();
render();
swap_buffer();
wait_for_retrace();
}
However, writing out a loop this way is wasteful, since different processors in the system will spend a significant amount of time waiting for other parts of the system to complete tasks. Here is how drawing frames look using my simple code, right now:
This is wasteful. The CPU spends time waiting for the RSP and RDP to finish the graphics task and then waits for the retrace notification. The RSP and RDP spend time waiting for the CPU to issue more instructions.
What I want is to have multiple frames in flight at any given point, using both the CPU and RCP simultaneously:
This is recommended by the Nintendo 64 development manual, and it’s also how modern graphics cards work, more or less.
LibUltra comes with a way of doing this called the scheduler, but I’m not entirely happy with the scheduler API and how it structures events. Instead, I’m going to write my own scheduler. It turns out that this only takes about 130 lines of code, not counting the header file, and it is fairly easy to understand.
Scheduler
My scheduler accepts tasks over a queue.
The task consists of two parts: an RCP task (OSTask
structure)
and a framebuffer.
These are processed in sequence.
First, the scheduler runs the RCP task, then the scheduler swaps the
framebuffer pointer.
Both the RCP task and the framebuffer come with a message to send when they
finish.
// A framebuffer to be displayed on screen.
struct scheduler_framebuffer {
// Switch to this framebuffer.
void *ptr;
// Send this message to this queue after the framebuffer is no
// longer on screen, after another framebuffer has replaced it.
OSMesgQueue *done_queue;
OSMesg done_mesg;
};
// A task to be scheduled on the RCP.
struct scheduler_task {
// RCP task to run.
OSTask task;
// Send this message to this queue when the RCP task is done.
OSMesgQueue *done_queue;
OSMesg done_mesg;
// The framebuffer to be displayed on screen after the RCP task is
// done.
struct scheduler_framebuffer framebuffer;
};
The scheduler thread recieves messages from two queues. One queue is the event queue, which delivers events from the OS, like notifacitions that the RDP task is done or that the framebuffer has been swapped. The other queue is the task queue which recieves task pointers from the main thread.
The main scheduler loop looks like this:
struct scheduler *sc = ...;
for (;;) {
// Process the next event from the event queue.
OSMesg mesg;
osRecvMesg(&sc->evt_queue, &mesg, OS_MESG_BLOCK);
switch ((uintptr_t)mesg) {
case EVT_TASK:
// No-op, just wakes up the scheduler.
break;
case EVT_RDP:
...
case EVT_VSYNC:
...
}
// Read all tasks from the task queue.
read_tasks(sc);
// If we are idle and there is a task pending, run it.
if (sc->task_running == NULL && sc->pending_count > 0) {
struct scheduler_task *task = dequeue_task(sc);
osSpTaskLoad(&task->task);
osSpTaskStartGo(&task->task);
sc->task_running = task;
}
}
Note that rather than putting pointers to real objects in the event queue, each event is just a number which is cast to a pointer.
When the task finishes, the OS will send an EVT_RDP
to the
event queue:
case EVT_RDP: {
struct scheduler_task *restrict task = sc->task_running;
sc->task_running = NULL;
if (task == NULL) {
fatal_error("NULL task");
}
if (task->framebuffer.ptr != NULL) {
unsigned pending = sc->framebuffers_pending;
if (pending >= 2) {
fatal_error("Framebuffer overflow");
}
if (pending == 0) {
osViSwapBuffer(task->framebuffer.ptr);
// No previous frame, so the screen is black, and we need to
// disable that.
if (sc->framebuffers[0].ptr == NULL) {
osViBlack(false);
}
}
pending++;
sc->framebuffers[pending] = task->framebuffer;
sc->framebuffers_pending = pending;
}
if (task->done_queue != NULL) {
int r = osSendMesg(task->done_queue, task->done_mesg,
OS_MESG_NOBLOCK);
if (r != 0) {
fatal_error("Dropped RCP task done message");
}
}
} break;
When the RSP task finishes, it is possible that the previous
frame’s framebuffer is not on screen yet.
The code must either drop one of the frames (which means we rendered a frame
that won’t get shown to the user) or use a queue (which means some additional
latency).
I chose to hold the framebuffer in a queue, so the code only calls
osViSwapBuffer
if there are no other pending framebuffers.
The scheduler should only block waiting for OS events, so it’s a fatal error if sending an event back to the main thread. The main thread must create a queue large enough to hold all possible events in flight.
When the retrace happens, all of the framebuffers move up in the queue.
Note that the queue has three entries.
Entry 0 is the framebuffer currently on-screen, and it’s stored by the
scheduler so the scheduler can notify the main thread when the framebuffer
is no longer in use.
Entry 1 is the next framebuffer the VI will display, which has already
been passed to osViSwapBuffer()
.
Entry 2 is an additional queued buffer.
Here is the scheduler’s vertical retrace handler:
case EVT_VSYNC: {
unsigned pending = sc->framebuffers_pending;
if (pending > 0) {
struct scheduler_framebuffer *restrict fb = sc->framebuffers;
if (fb[0].done_queue != NULL) {
int r = osSendMesg(fb[0].done_queue, fb[0].done_mesg,
OS_MESG_NOBLOCK);
if (r != 0) {
fatal_error("Dropped VSYNC message");
}
}
fb[0] = fb[1];
fb[1] = fb[2];
fb[2] = (struct scheduler_framebuffer){0};
if (pending > 1) {
osViSwapBuffer(fb[1].ptr);
}
sc->framebuffers_pending = pending - 1;
}
} break;
Main Thread
The main thread now has to be modified to use the scheduler and to put two RCP tasks and framebuffers in flight. Here is the main thread state:
// State of the main thread.
struct main_state {
// Whether each task is running.
bool task_running[2];
// Whether each framebuffer is being used.
bool framebuffer_in_use[2];
// Whether a controller read is in progress.
bool controler_read_active;
// The tasks.
struct scheduler_task tasks[2];
// Whether we have a controller connected.
bool has_controller;
// Which controller is connected.
int controller_index;
// Queue receiving scheduler / controller events.
OSMesgQueue evt_queue;
OSMesg evt_buffer[16];
};
The event queue is used both for messages from the scheduler and messages from the SI when it reads controller state. Processing an event just involves updating the main thread’s state and is fairly simple, and it is designed to work in both blocking and non-blocking mode for reasons I’ll explain below:
// Read the next event sent to the main thread and process it.
static int process_event(struct main_state *restrict st,
int flags) {
OSMesg evt;
int ret = osRecvMesg(&st->evt_queue, &evt, flags);
if (ret != 0)
return ret;
int type = event_type(evt);
int value = event_value(evt);
switch (type) {
case EVT_CONTROL:;
OSContPad controller_state[MAXCONTROLLERS];
osContGetReadData(controller_state);
st->controler_read_active = false;
struct game_state *gs = &game_state;
if (st->has_controller) {
gs->controller = controller_state[st->controller_index];
}
break;
case EVT_TASKDONE:
st->task_running[value] = false;
break;
case EVT_FBDONE:
st->framebuffer_in_use[value] = false;
break;
}
return 0;
}
The main loop itself now just alternates between the two sets of buffers and tasks, and processes events until a task and framebuffer are free:
// Main loop.
for (int current_task = 0;; current_task ^= 1) {
// Wait until the task and framebuffer are both free to use.
while (st->task_running[current_task])
process_event(st, true);
while (st->framebuffer_in_use[current_task])
process_event(st, true);
while (process_event(st, false) == 0) {
}
// Read the controllers if we are not already doing so.
if (!st->controler_read_active) {
osContStartQuery(&st->evt_queue);
st->controler_read_active = true;
}
// Update game state, create display lists.
...
struct scheduler_task *task = &st->tasks[current_task];
*task = (struct scheduler_task){
.task = {{ ... /* RSP task */ }},
.done_queue = &st->evt_queue,
.done_mesg = make_event(EVT_TASKDONE, current_task),
.framebuffer =
{
.ptr = framebuffers[current_task],
.done_queue = &st->evt_queue,
.done_mesg = make_event(EVT_FBDONE, current_task),
},
};
scheduler_submit(&scheduler, task);
st->task_running[current_task] = true;
st->framebuffer_in_use[current_task] = true;
}
This code works!
Note that it is fairly trivial to modify this code to use triple-buffering
(Wikipedia: Multiple buffering).
All I have to do is add a third buffer, and keep track of an additional
variable, current_buffer
in addition to
current_task
.
The extra framebuffer would consume extra memory (and the Nintendo 64 does
not have much to begin with), but it would allow the main thread to issue
RCP tasks while two buffers are in use, increasing the framerate.
Printing to the Screen
I’ve been suffering from an inability to get information about the
program’s state to the screen for debugging and testing, so I’ll take the
text drawing code I wrote for fatal_error()
and adapt it
so text can be written on top of the game screen with some variation
of printf()
.
Crashes and Assembler Shenanigans
At this point, my code was working with optimizations disabled but enabling
optimizations resulted in a crash in the scheduler.
It looked like global variables were not zeroed, and I thought this was
caused by a careless buffer overrun somewhere else in the code.
However, when I printed out the value of global variables at the beginning
of main()
and saw that they were already corrupted, I realized
that something else was going on.
I started looking at the disassembly, looking for the values that had
appeared unexpectedly in RAM.
Maybe I could at least figure out where they were coming from.
However, I noticed something odd about the _start
routine:
$ mips32-elf-objdump -d bazel-bin/game/Thornmarked.elf 80000400 <_start>: 80000400: 3c1d8007 lui sp,0x8007 80000404: 27bdebb0 addiu sp,sp,-5200 80000408: 3c048001 lui a0,0x8001 8000040c: 2484fd90 addiu a0,a0,-624 80000410: 3c058007 lui a1,0x8007 80000414: 24a5c7a5 addiu a1,a1,-14427 80000418: 0c00144c jal 80005130 <bzero> 8000041c: 00000000 nop 80000420: 00a42823 subu a1,a1,a0 80000424: 080005b1 j 800016c4 <boot> 80000428: 00000000 nop 8000042c: 00000000 nop
There’s extra nop
instructions added! Why?
Shouldn’t disassembly correspond 1:1 with my original assembly?
Here is the original code, for comparison, from part 5:
# Entry point for the ROM image.
.section .text.entry
.global _start
_start:
# Set up the stack
la $sp, _boot_stack
# Zero BSS.
la $a0, _bss_start
la $a1, _bss_end
jal bzero
subu $a1, $a1, $a0
# Jump to C entry point.
j boot
nop
It turns out that the MIPS assembler reorders instructions to use branch
delay slots, avoid load hazards, and that sort of thing.
I guess it’s a “smart” assembler.
I could turn this feature off with the directive
.set noreorder
, which is what GCC uses, or I could just move
my code out of the delay slots.
I chose to just move my code out of the delay slots.
On one hand, this might
be confusing because any MIPS I write will look different from the MIPS
I see in disassembly, but this is already true for pseudo-instructions
like la
.
On the other hand, the assembler will probably do a better job than me.
Here is the updated start routine:
# Entry point for the ROM image.
.section .text.entry
.global _start
_start:
# Set up the stack
la $sp, _boot_stack
# Zero BSS.
la $a0, _bss_start
la $a1, _bss_end
subu $a1, $a1, $a0
jal bzero
# Jump to C entry point.
j boot
And after checking, I will note that this is not documented
anywhere in the GNU assembler manual!
You can find it in other manuals for other MIPS
assemblers, ask someone, or figure it out from GCC’s output
which uses .set noreorder
.
The other lesson is that I may want to test both with and without optimizations enabled.