Code Slow: Writing a 4K intro in Rust

Some weeks ago I started looking at using Rust for making a 64K intro. I started out by making a minimal window app that takes up only 3.5 Kbytes without any compression. Some of the feedback I got encouraged me to have a look at Crinkler for compressing the executable. Given that Crinkler is really targeted at 4K intros, I decided to try creating a minimal modern OpenGL app that can be squeezed into 4K or less.

Rust OpenGL in 2 Kbytes

Debugging

Before working on new features I had to address debugging.

My minimal application was a nightmare to debug. If something went badly wrong it would just freeze without any indication of what had happened. I couldn't print any debug messages to the console because the std library was not included. The only method of debugging I had was to pop up Windows MessageBoxes at certain points, but even that was limited because I could not do any string formatting.
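Basic formatting is still possible without std if the output goes into a fixed buffer. The following is only a minimal sketch and not code from the project: `format_u32` is a hypothetical helper that writes a number as NUL-terminated decimal text, which could then be passed to a MessageBox call.

```rust
/// Write `value` as decimal ASCII into `dest`, followed by a NUL byte.
/// Returns the number of digits written, or None if `dest` is too small.
pub fn format_u32(mut value: u32, dest: &mut [u8]) -> Option<usize> {
    // Collect the digits in reverse order; u32::MAX has 10 digits.
    let mut digits = [0u8; 10];
    let mut count = 0;
    loop {
        digits[count] = b'0' + (value % 10) as u8;
        count += 1;
        value /= 10;
        if value == 0 {
            break;
        }
    }
    // One extra byte is needed for the terminating NUL.
    if dest.len() < count + 1 {
        return None;
    }
    for i in 0..count {
        dest[i] = digits[count - 1 - i];
    }
    dest[count] = 0;
    Some(count)
}
```

The helper uses no heap and no std, so it fits the same constraints as the rest of the program.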

Most crucially, I could not step through the code because the optimizations meant that the debugger could not tell what the variable values were.

I thought I could just create a new debug profile in the toml file which would not use any optimizations. This way both versions would have the same capabilities, but I could easily step through the non-optimized dev build. I added this profile to the toml.

[profile.dev]
lto = true 
opt-level = 0
panic = "abort"

This did not work. The linker complained about unresolved symbols for memset and memcpy. It was surprising to find out that the optimization level changed the set of required external functions. My suspicion is that the higher optimization level just happened to make these memory functions unnecessary for my particular program.

Initially I started writing my versions of memset and memcpy but then I realized that I could just copy the std versions into my program. (Not something I would normally recommend).

Once I had created the replacements for memset and memcpy I was able to build a dev version that I could now properly debug and step through the code.
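For illustration, hand-written replacements can be as small as a byte loop. This is a sketch under assumptions, not the copied std versions used in the project: the names `tiny_memset` and `tiny_memcpy` are hypothetical, and in a real no_std build the functions would have to be exported under the C names memset and memcpy (with `#[no_mangle]` and `extern "C"`) for the linker to find them.

```rust
/// Fill `n` bytes starting at `dest` with the low byte of `c`, like C's memset.
/// Safety: `dest` must be valid for writes of `n` bytes.
pub unsafe fn tiny_memset(dest: *mut u8, c: i32, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        unsafe { *dest.add(i) = c as u8 };
        i += 1;
    }
    dest
}

/// Copy `n` bytes from `src` to `dest`, like C's memcpy.
/// Safety: both regions must be valid for `n` bytes and must not overlap.
pub unsafe fn tiny_memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        unsafe { *dest.add(i) = *src.add(i) };
        i += 1;
    }
    dest
}
```

Both functions return the destination pointer, matching the C signatures.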

Setting up OpenGL

There are several parts to getting an application ready to use OpenGL:

Creating an OpenGL context.
Loading all the OpenGL functions.
Setting up all the data required to draw something.

There are several good resources that go through the above steps. The Ironic blog has an excellent set of tutorials for setting up OpenGL with Rust. Additionally, I looked at the learnopengl articles for more in-depth explanations of OpenGL. Finally, the official OpenGL website has a lot of information, but it is not the best place for easy-to-understand introductory explanations.

Setting up the context

Setting up the GL context is pretty straightforward and just uses functions and enums imported by winapi. The context setup effectively boils down to the following bit of code (with all error checking removed):

        // Describe the pixel format we want: double-buffered RGBA with a depth buffer.
        let mut PFD : PIXELFORMATDESCRIPTOR = core::mem::zeroed();
        PFD.nSize = core::mem::size_of::<PIXELFORMATDESCRIPTOR>() as u16;
        PFD.nVersion = 1;
        PFD.dwFlags = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
        PFD.iPixelType = PFD_TYPE_RGBA;
        PFD.cColorBits = 32;
        PFD.cAlphaBits = 8;
        PFD.cDepthBits = 24;

        // Pick the closest matching format, apply it, then create and activate the context.
        let pFDID : i32 = ChoosePixelFormat(hDC, &PFD );
        SetPixelFormat(hDC, pFDID, &PFD);
        let fakeRC : HGLRC = wglCreateContext(hDC);
        wglMakeCurrent(hDC, fakeRC);

Importing GL functions

None of the modern OpenGL functions are present in winapi; they must be loaded into memory before they can be used. This has nothing to do with winapi but is just how OpenGL functionality is accessed. The gl crate can load these functions and provides bindings to all of them. Unfortunately the generated bindings require std, so I need to pull out the required functionality manually. Also, the bindings generator creates bindings for functions that I won't be using, thus bloating the executable size.

My OpenGL function loader is very much based on jonil's project on GitHub.

The GL function loader needs to locate the address of each OpenGL function and provide a wrapper for it. The init function below is the heart of the function loader. It goes through the LOAD_DESC array, looks up the address of each function named in it and stores the address at the corresponding index in the GL_API array.

The type of the GL_API array where all the function addresses are stored is usize which is guaranteed to be large enough to store a function pointer. There is no general function pointer type in Rust which is why they are stored in an array of usize.

static mut GL_API: [usize; 696] = [0; 696]; // large enough to store all function pointers  
//..
const CreateProgramIdx: u16 = 96;
//..
static LOAD_DESC: &'static [(u16, &'static str)] = &[
    (CreateProgramIdx, "glCreateProgram\0"),
    //...
];

pub fn init() {
    let handle : HMODULE;
    unsafe { handle = LoadLibraryA( "Opengl32.dll\0".as_ptr() as *const i8);  }
    for &(index, name) in LOAD_DESC {
        unsafe {
            let mut prc = wglGetProcAddress(name.as_ptr() as *const i8) as usize;
            if prc == 0 {
                prc = GetProcAddress( handle, name.as_ptr() as *const i8 ) as usize;
            }
            GL_API[ index as usize] =  prc;
        }
    }
}

The wrapper functions do the address look-up and call into OpenGL. Each wrapper function wraps one of the underlying GL functions and is responsible for converting the usize value into the actual function pointer type and then calling it with the given arguments. Each wrapper function looks like the following:

pub unsafe fn GenBuffers(n: GLsizei, buffers: *mut GLuint) -> () {
    mem::transmute::<_, extern "system" fn(GLsizei, *mut GLuint) -> ()>(*GL_API.get_unchecked(GenBuffersIdx as usize))(n, buffers)
}

The wrapper function GenBuffers wraps the OpenGL function of the same name. The mem::transmute call converts the usize into an external function pointer without doing any checks. This is clearly an unsafe operation (as is calling the external function).
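The table-plus-wrapper pattern can be tried out without a GL driver. The sketch below is an assumption-laden stand-in rather than the project's loader: `fake_get_proc_address`, `add_impl` and `negate_impl` are hypothetical replacements for wglGetProcAddress and the driver functions, and the table lives in a struct instead of a `static mut` to keep the example simple.

```rust
use core::mem;

pub const ADD_IDX: usize = 0;
pub const NEGATE_IDX: usize = 1;
const SLOT_COUNT: usize = 2;

// Stand-ins for functions that would normally live inside the GL driver.
extern "system" fn add_impl(a: i32, b: i32) -> i32 { a + b }
extern "system" fn negate_impl(a: i32) -> i32 { -a }

// Hypothetical replacement for wglGetProcAddress: returns 0 for unknown names.
fn fake_get_proc_address(name: &str) -> usize {
    match name {
        "add\0" => add_impl as *const () as usize,
        "negate\0" => negate_impl as *const () as usize,
        _ => 0,
    }
}

// (slot index, NUL-terminated name) pairs, mirroring LOAD_DESC.
static LOAD_DESC: &[(usize, &str)] = &[(ADD_IDX, "add\0"), (NEGATE_IDX, "negate\0")];

pub struct Api {
    slots: [usize; SLOT_COUNT],
}

impl Api {
    /// Resolve every listed name and store its address in the matching slot.
    pub fn load() -> Api {
        let mut slots = [0usize; SLOT_COUNT];
        for &(index, name) in LOAD_DESC {
            slots[index] = fake_get_proc_address(name);
        }
        Api { slots }
    }

    pub fn is_loaded(&self, index: usize) -> bool {
        self.slots[index] != 0
    }

    /// Safety: the slot must hold the address of a function with this signature.
    pub unsafe fn add(&self, a: i32, b: i32) -> i32 {
        let f = unsafe {
            mem::transmute::<usize, extern "system" fn(i32, i32) -> i32>(self.slots[ADD_IDX])
        };
        f(a, b)
    }

    /// Safety: the slot must hold the address of a function with this signature.
    pub unsafe fn negate(&self, a: i32) -> i32 {
        let f = unsafe {
            mem::transmute::<usize, extern "system" fn(i32) -> i32>(self.slots[NEGATE_IDX])
        };
        f(a)
    }
}
```

Checking `is_loaded` before calling a wrapper avoids transmuting a zero address, which would not be a valid function pointer.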

This is enough to get OpenGL functions loaded but nothing is being shown on the screen yet. For that we need to set up vertex and fragment shaders.

Setting up Shaders

Modern OpenGL requires that at least one fragment shader and one vertex shader are set up before anything can be rendered. The process of setting up a shader is identical regardless of its type, so I created the helper function shader_from_source that creates both kinds of shader objects.

pub fn shader_from_source( shader_source : &str, kind: gl::GLenum, error_dest : &mut [i8] ) -> Option<gl::GLuint> {
    let id;
    let mut success: gl::GLint = 1;
    unsafe {
        id = gl::CreateShader(kind);
        gl::ShaderSource(id, 1, &shader_source.as_ptr(), 0 as *const _);
        gl::CompileShader(id);
        gl::GetShaderiv(id, gl::COMPILE_STATUS, &mut success);
    }
 
    if success == 0 {
        unsafe{ gl::GetShaderInfoLog( id, error_dest.len() as i32,  0 as *mut _, error_dest.as_mut_ptr() as *mut gl::GLchar ); }
        return None;
    }
    return Some( id );
}

I have tried to keep it as Rust-like as possible despite not having std available. The one odd thing about the code above is that the error text is stored in a buffer given to the function as an argument rather than allocated by the function itself. (There is no std that would bring in String, and there is not even a heap allocator.) I considered using a local static buffer for errors, but that would have broken too many ideas about Rust memory safety for comfort.
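The caller-owned error buffer can be illustrated in isolation. This is a hypothetical sketch, not project code: `copy_message` and `check_version` are made-up names, and the truncate-and-terminate behaviour is modelled on how glGetShaderInfoLog fills a buffer of limited size.

```rust
/// Copy as much of `msg` as fits into `dest`, always leaving room for a
/// terminating NUL. Returns the number of message bytes copied.
pub fn copy_message(msg: &[u8], dest: &mut [u8]) -> usize {
    if dest.is_empty() {
        return 0;
    }
    let count = core::cmp::min(msg.len(), dest.len() - 1);
    dest[..count].copy_from_slice(&msg[..count]);
    dest[count] = 0;
    count
}

/// Hypothetical fallible step that reports failure through the caller's buffer
/// instead of allocating an error string.
pub fn check_version(version: u32, error_dest: &mut [u8]) -> Option<u32> {
    if version < 330 {
        copy_message(b"GLSL version too old", error_dest);
        return None;
    }
    Some(version)
}
```

The caller decides how much stack space an error message is worth, and an overly long message is truncated rather than overflowing the buffer.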

The shaders themselves are pretty simple. The vertex position is passed through to the fragment shader, which in turn does some colorful calculation based on the screen location and time. To vary its output over time the fragment shader needs to be told the time through the iTime uniform. (The full code is on GitHub.)

    let vtx_shader_src : &'static str = "#version 330 core
    layout (location = 0) in vec3 Position; 
    void main()
    {
     gl_Position = vec4(Position, 1.0);
    }\0";

    let frag_shader_src : &'static str = "#version 330 core
    in vec4 gl_FragCoord;
    out vec4 Color;
    uniform float iTime;
    void main()
    {
        // Do interesting stuff
    }\0";

Passing these shaders into the shader_from_source function creates the required shaders.
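Because the shader sources and function names are passed to C APIs as raw pointers, each string literal must end in "\0". A small check along the following lines could catch a forgotten terminator in a debug build. It is a sketch only: `is_c_string` and `c_ptr` are hypothetical helpers, and it assumes a toolchain recent enough to provide core::ffi::CStr.

```rust
use core::ffi::{c_char, CStr};

/// True when `text` ends with a NUL byte and contains no other NUL bytes.
pub fn is_c_string(text: &str) -> bool {
    CStr::from_bytes_with_nul(text.as_bytes()).is_ok()
}

/// Return a C-style pointer to `text`, checking the terminator in debug builds.
pub fn c_ptr(text: &str) -> *const c_char {
    debug_assert!(is_c_string(text));
    text.as_ptr() as *const c_char
}
```

In a release build the debug_assert! compiles away, so the check should not add to the size of the final executable.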

    let vtx_shader = match gl_util::shader_from_source( vtx_shader_src, gl::VERTEX_SHADER, &mut error_message ) {
        Some( shader ) => shader,
        None => { show_error( error_message.as_ptr()  ); 0 }
    };

    let frag_shader  = match gl_util::shader_from_source( frag_shader_src, gl::FRAGMENT_SHADER,  &mut error_message ) {
        Some( shader ) => shader,
        None => { show_error( error_message.as_ptr() ); 0 }
    };

These shaders are finally combined into a shader program that describes the entire graphics pipeline from vertices into screen pixels. program_from_shaders is a utility function that puts all the given shaders into the new shader program.

    let shader_prog = match gl_util::program_from_shaders(vtx_shader, frag_shader, &mut error_message ) {
        Some( prog ) => prog,
        None => { show_error( error_message.as_ptr() ); 0 }
    };

Setting up vertices

Shaders need vertices to operate on. Modern OpenGL requires a buffer object to hold the actual vertex data.

    gl::GenBuffers(1, &mut vertex_buffer_id);

The buffer holding the vertex data is just some data in memory, but OpenGL also needs to know how the data is laid out so it can map it to the vertex shader inputs. To avoid having to fully define the vertex layout every time the selected vertices are changed, OpenGL uses vertex arrays. These combine the vertex buffers and their configuration into a single object that can be selected with one bind call.

        gl::GenVertexArrays(1, &mut vertex_array_id );
        gl::BindVertexArray(vertex_array_id);

After the gl::BindVertexArray call all vertex configuration is stored on the bound vertex array.

The vertex data is loaded using gl::BufferData, which copies the given block of memory into the buffer currently bound to gl::ARRAY_BUFFER. The code loads 4 vertices, each holding 3 floats.

    gl::BindBuffer(gl::ARRAY_BUFFER, vertex_buffer_id);
    gl::BufferData( gl::ARRAY_BUFFER, size_of::<gl::GLfloat>() as isize * 3 * 4, vtx_coords.as_ptr() as *const gl::CVoid, gl::STATIC_DRAW);

The vertices are configured using a series of gl::EnableVertexAttribArray and gl::VertexAttribPointer calls that first enable an attribute location and then define its layout. The attribute location has to match the layout location used in the vertex shader. The vertex shader only uses location 0 for the Position input.

    gl::EnableVertexAttribArray(0);     // enable location 0
    gl::VertexAttribPointer( 0, 3, gl::FLOAT, gl::FALSE, 3 * size_of::<f32>() as gl::GLint, 0 as *const gl::CVoid );    // 3 floats per vertex, tightly packed
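The size and stride arguments above are plain arithmetic on the vertex layout, which can be checked on its own. The sketch below assumes a full-screen quad for a triangle strip; the article does not show the actual contents of vtx_coords, so `QUAD_COORDS` and the helper names are illustrative.

```rust
use core::mem::{size_of, size_of_val};

pub const FLOATS_PER_VERTEX: usize = 3;
pub const VERTEX_COUNT: usize = 4;

// Assumed data: a quad covering the whole viewport, ordered for TRIANGLE_STRIP.
pub static QUAD_COORDS: [f32; FLOATS_PER_VERTEX * VERTEX_COUNT] = [
    -1.0, -1.0, 0.0, // bottom left
     1.0, -1.0, 0.0, // bottom right
    -1.0,  1.0, 0.0, // top left
     1.0,  1.0, 0.0, // top right
];

/// Total size in bytes, as expected by the size argument of BufferData.
pub fn buffer_size_bytes() -> isize {
    size_of_val(&QUAD_COORDS) as isize
}

/// Distance in bytes between consecutive vertices, as expected by the
/// stride argument of VertexAttribPointer.
pub fn stride_bytes() -> i32 {
    (FLOATS_PER_VERTEX * size_of::<f32>()) as i32
}
```

Deriving both values from the array itself keeps them in sync if the vertex layout changes later.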

Now we are finally ready to draw the scene in the main loop:

    gl::UseProgram(shader_prog);
    gl::BindVertexArray(vertex_array_id);
    gl::DrawArrays( gl::TRIANGLE_STRIP, 0, 4 );

Passing uniform data

The fragment shader uses the uniform iTime for its animation. Uniforms are effectively global variables that all shader stages can read, and every fragment or vertex sees the same values. They are ideal for passing global parameters such as transformation matrices and time.

The following code finds the OpenGL location of the uniform iTime in the program and sets its value from time.

    let time_loc : i32 = gl::GetUniformLocation(shader_prog, "iTime\0".as_ptr());
    gl::Uniform1f(time_loc, time );
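The snippet above does not show where time comes from. One possibility, assumed here purely for illustration, is a millisecond tick counter such as the value returned by the Win32 GetTickCount function. The hypothetical helper below converts two such readings into elapsed seconds and tolerates the counter wrapping around.

```rust
/// Seconds elapsed between two readings of a 32-bit millisecond counter.
/// wrapping_sub keeps the result correct when the counter overflows
/// between `start_ms` and `now_ms`.
pub fn elapsed_seconds(start_ms: u32, now_ms: u32) -> f32 {
    now_ms.wrapping_sub(start_ms) as f32 / 1000.0
}
```

The result can be passed directly as the second argument of gl::Uniform1f.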

Syncing the refresh

By default OpenGL will render frames as fast as it can. This means that the application will max out the GPU and do a lot of unnecessary work, because many of the rendered frames will never be shown to the user if frames are produced faster than the display refreshes.

OpenGL on Windows has an extension function wglSwapIntervalEXT that can be used for setting the minimum number of display refreshes between displaying frames. If it is set to one OpenGL will show a new frame for every display refresh if a new frame is available.

This means that the program will block on the gl::SwapBuffers call if a frame is already waiting to be shown. But it also means that GPU utilization on my computer goes from 99% to somewhere around 10% (depending on what is being rendered).

This only works on Windows, but that is not really a concern here given that the program is already tightly coupled with the Windows API.
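Since wglSwapIntervalEXT is an extension, its address may not be available on every driver. The sketch below shows one way to fall back gracefully. The helper names are hypothetical, `stub_swap_interval` only exists to exercise the code without a driver, and the list of failure values reflects the commonly documented behaviour that wglGetProcAddress may return 1, 2, 3 or -1 instead of NULL on some implementations.

```rust
use core::mem;

/// Signature of wglSwapIntervalEXT: BOOL wglSwapIntervalEXT(int interval).
pub type SwapIntervalFn = extern "system" fn(i32) -> i32;

/// Reject the values that signal a failed look-up.
pub fn is_valid_proc_address(address: usize) -> bool {
    !matches!(address, 0 | 1 | 2 | 3 | usize::MAX)
}

/// Convert a raw address into a callable function, or None if it is missing.
pub fn swap_interval_from_address(address: usize) -> Option<SwapIntervalFn> {
    if !is_valid_proc_address(address) {
        return None;
    }
    // The caller is trusted to pass an address obtained for this exact function.
    Some(unsafe { mem::transmute::<usize, SwapIntervalFn>(address) })
}

/// Request one display refresh per frame; returns false if vsync is unavailable.
pub fn enable_vsync(address: usize) -> bool {
    match swap_interval_from_address(address) {
        Some(set_interval) => set_interval(1) != 0,
        None => false,
    }
}

/// Stand-in for the driver function, used only for testing this sketch.
pub extern "system" fn stub_swap_interval(interval: i32) -> i32 {
    (interval >= 0) as i32
}
```

If `enable_vsync` returns false the program still runs, only without the reduced GPU load.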

Using Crinkler

Crinkler performs the real magic in compressing the executable into an incredibly small size. Crinkler is a bit more complicated to use than standard exe packers such as UPX because it operates on object files, not the final executable. In effect Crinkler is a compressing linker, and this is one of the reasons why it achieves such impressive compression rates.

It took me a while to figure out how to coax xargo into outputting the right kind of object file, but the correct command is:

xargo rustc --release --target i686-pc-windows-msvc -- --emit=obj

The cargo rustc docs explain that the rustc command compiles the package and passes on the extra options after the --. In this case the only required extra option is --emit=obj, which tells the compiler to emit an object file (as explained in the rustc documentation).

Once the object file is available Crinkler can be run with:

crinkler /OUT:test.exe /SUBSYSTEM:WINDOWS miniwin.o  /ENTRY:mainCRTStartup "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\Lib\10.0.17763.0\um\x86" gdi32.lib user32.lib opengl32.lib kernel32.lib

The command tells Crinkler:

to create a file called test.exe (/OUT:test.exe)
that the program is a Windows program (/SUBSYSTEM:WINDOWS) rather than a console program
to use the object file miniwin.o (object files are simply listed in the command)
that the program entry point is mainCRTStartup (/ENTRY:mainCRTStartup)
which path the libraries can be found in (/LIBPATH)
which libraries to link to (gdi32.lib, user32.lib, opengl32.lib and kernel32.lib)

This produces a compressed executable that is only 1911 bytes long, which leaves over 2000 compressed bytes for doing something interesting. (The fragment shader is not the smallest possible and is probably costing a couple of hundred bytes.)

Wrapping up

All the code is available on GitHub at https://github.com/janiorca/tinywin/tree/master/miniwinGL

There is still some functionality missing (sound and fullscreen mode) and the invocation of Crinkler is still very manual, but I think this provides a good starting point for anyone thinking about using Rust for writing a 4K intro. The code is not particularly aggressively structured, so it could probably be quite a bit smaller. A shader minifier would also help save some extra bytes.

Saturday, 11 January 2020