Building Builder

59 Days Until I Can Walk

Before I say anything, can we just appreciate for a moment that the little counter until I can walk again is now in the 50s. Cause for celebration, to be sure!

I spent today working through the first problem in David Tolnay’s Procedural Macros Workshop, and I am a little more than halfway through the task.

The first task is to define a Builder derive macro which generates a builder struct for any struct that derives Builder. See the code sample below for an example:

use derive_builder::Builder;

#[derive(Builder)]
pub struct Command {
    executable: String,
    args: Vec<String>,
    env: Vec<String>,
    current_dir: String,
}

fn main() {
    let command = Command::builder()
        .executable("cargo".to_owned())
        .args(vec!["build".to_owned(), "--release".to_owned()])
        .env(vec![])
        .current_dir("..".to_owned())
        .build()
        .unwrap();

    // snip
}

Tolnay provides a series of unit tests of increasing complexity (nine in total) which add extra requirements and features to the Builder macro as the project develops. The initial unit test simply checks if the macro is defined. The fifth checks if it is correctly generating setter functions that can be chained. The last checks if your macro will break when prelude types are redefined.

I want to start this post by talking about some of the things I learned over the course of the day, ranging from small discoveries to large. I will reproduce my solution in full afterwards and will discuss some of the implementation details, design decisions, and flaws as I understand them.

All Learnings Great And Small

Let’s start with the most obvious. You can define a derive macro by declaring a function which takes a TokenStream as input and returns a TokenStream as output, and annotating that function with the proc_macro_derive attribute as shown below:

#[proc_macro_derive(Builder)]
pub fn derive(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
    // snip
}

Now, a question: why do I fully qualify the TokenStream? Because two similarly named crates are commonly used when writing macros: proc_macro and proc_macro2. My understanding is that proc_macro2 is a wrapper around the types and functions defined in proc_macro. However, proc_macro can only be used in procedural macros, which means it cannot be used in, for example, build.rs files or unit tests. proc_macro2 can be used everywhere. Functions annotated with proc_macro_derive must take and return a proc_macro::TokenStream, but the rest of our code can use proc_macro2. So we fully qualify the one struct we need from proc_macro and import everything else from proc_macro2.

Useful Crates

Two very popular crates for writing macros (aside from proc_macro2) seem to be syn and quote.

syn parses the input TokenStream into a meaningful data structure which can be traversed and examined. A TokenStream is literally a raw stream of tokens extracted from your source code. syn can parse this into a tree structure, allowing you to perform tasks such as checking the type of a variable or identifying the fields of a struct. Rather than working with the raw TokenStream, adding syn to your crate gives you a much more powerful interface for inspecting the inputs to your macro.

quote is the inverse of syn and is used to generate an output TokenStream. The syntax here is extremely simple. You can basically write normal Rust code, but inject variables into your code by marking them with the # symbol. For example:

use proc_macro2::{ Span, Ident };
use quote::{ quote };
use syn::{ parse_macro_input, DeriveInput };

#[proc_macro_derive(Builder)]
pub fn derive(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
    let input = parse_macro_input!(input as DeriveInput);

    let ident = input.ident;
    let builder_ident = Ident::new(&format!("{}Builder", ident), Span::call_site());

    let expanded = quote! {
        pub struct #builder_ident {}
    };

    proc_macro::TokenStream::from(expanded)
}

Given the Command struct used as an example earlier, the code above within the quote! macro will expand to:

pub struct CommandBuilder {}

So quote makes it very easy to generate TokenStreams using Rust-like syntax as input.

Confusing Syntax

One feature of generating TokenStreams using quote! which has absolutely broken my brain is the syntax for iterating over collections. Expanding a little more on the previous example, let’s assume that we also want to iterate over all the fields of the input struct and add them as Option fields in the builder. We could do that using the code below:

use proc_macro2::{ Span, Ident };
use quote::{ quote };
use syn::{ parse_macro_input, Data, DeriveInput };

#[proc_macro_derive(Builder)]
pub fn derive(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
    let input = parse_macro_input!(input as DeriveInput);

    let ident = input.ident;
    let builder_ident = Ident::new(&format!("{}Builder", ident), Span::call_site());

    let fields = match input.data {
        Data::Struct(s) => s.fields,
        Data::Enum(_) => panic!("Builder not supported on enum type"),
        Data::Union(_) => panic!("Builder not supported on union type"),
    };

    let field_name = fields.iter().map(|field| &field.ident);
    let field_type  = fields.iter().map(|field| &field.ty);
    
    let expanded = quote! {
        pub struct #builder_ident {
            #(#field_name: Option<#field_type>,)*
        }
    };

    proc_macro::TokenStream::from(expanded)
}

Admittedly this looks like a large expansion, but the line I want to draw attention to is:

    #(#field_name: Option<#field_type>,)*

This is what iteration looks like inside quote!. We have defined two iterators, field_name and field_type, and interpolating them inside the #()* syntax steps through them together. In Rust macro terminology, these are called “repetitions”. Quoting from the README for quote:

Repetition is done using #(…)* or #(…),* similar to macro_rules!. This iterates through the elements of any variable interpolated within the repetition and inserts a copy of the repetition body for each one. The variables in an interpolation may be anything that implements IntoIterator, including Vec or a pre-existing iterator.

  - #(#var)* – no separators
  - #(#var),* – the character before the asterisk is used as a separator
  - #( struct #var; )* – the repetition can contain other things
  - #( #k => println!("{}", #v), )* – even multiple interpolations

Note that there is a difference between #(#var ,)* and #(#var),*—the latter does not produce a trailing comma. This matches the behavior of delimiters in macro_rules!.

This took some getting used to, and I would particularly like to dive into how multiple interpolations work. What happens, for example, if one collection is longer than the other?
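On that last question, my understanding is that a repetition with several interpolated collections advances them in lockstep and stops as soon as any one of them runs out, much like Iterator::zip. I have not checked this against every version of quote, so treat the following as a plain-Rust analogy rather than a description of quote's internals. The render_fields helper is hypothetical and builds strings instead of tokens:

```rust
/// Hypothetical stand-in for `#(#field_name: Option<#field_type>,)*`:
/// pairs names with types in lockstep and renders one line per pair.
fn render_fields(names: &[&str], types: &[&str]) -> Vec<String> {
    names
        .iter()
        .zip(types.iter()) // zip stops at the shorter of the two inputs
        .map(|(name, ty)| format!("{}: Option<{}>,", name, ty))
        .collect()
}

fn main() {
    let names = ["executable", "args", "env"];
    let types = ["String", "Vec<String>"]; // deliberately one element short

    // Only two lines are printed: `env` has no matching type and is dropped.
    for line in render_fields(&names, &types) {
        println!("{}", line);
    }
}
```

If the real macro behaves the same way, a mismatch in lengths would silently drop fields rather than fail, which is a good reason to derive both collections from the same fields value as I do here.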

stringify! Exists

This might seem like a small thing, but let’s say I want to generate a string in a macro which contains the name of a field. For example, rather than have an error message say "All fields must be initialized", I can specifically say "args must be set". stringify! accepts an arbitrary sequence of tokens, which can include interpolated variables, and returns those tokens as a string literal.

let field_name = input.ident;

quote! {
    println!(stringify!(#field_name must be set))
};
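The same behaviour can be tried in plain Rust, outside of any macro crate. The sketch below uses a hypothetical require_args helper (not part of my solution) to show what stringify! produces and how the resulting string can be turned into an error with ok_or and the ? operator:

```rust
use std::error::Error;

// Hypothetical helper standing in for one field of a generated `build` method.
fn require_args(args: Option<Vec<String>>) -> Result<Vec<String>, Box<dyn Error>> {
    // stringify! turns the bare tokens into a &'static str at compile time,
    // and `?` converts that &str into a Box<dyn Error> on the None path.
    let args = args.ok_or(stringify!(args must be set))?;
    Ok(args)
}

fn main() {
    match require_args(None) {
        Ok(args) => println!("got {} args", args.len()),
        Err(e) => println!("error: {}", e),
    }
}
```

In the real macro the word args is not typed by hand, it comes from interpolating #field_name before stringify! runs.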

Solving The Builder Problem

Ok, let’s have a look at what I actually did to solve the first of David’s problems. My full solution for the first five unit tests is printed below.

use proc_macro2::{ Span, Ident };
use quote::{ quote };
use syn::{ parse_macro_input, Data, DeriveInput, Type };

#[proc_macro_derive(Builder)]
pub fn derive(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
    let input = parse_macro_input!(input as DeriveInput);

    let ident = input.ident;
    let builder_ident = Ident::new(&format!("{}Builder", ident), Span::call_site());

    let fields = match input.data {
        Data::Struct(s) => s.fields,
        Data::Enum(_) => panic!("Builder not supported on enum type"),
        Data::Union(_) => panic!("Builder not supported on union type"),
    };

    let field_name: Vec<Ident> = fields.iter().map(|field| field.ident.clone().unwrap()).collect();
    let field_type: Vec<&Type>  = fields.iter().map(|field| &field.ty).collect();
    
    let expanded = quote! {
        pub struct #builder_ident {
            #(#field_name: Option<#field_type>,)*
        }

        impl #builder_ident {
            #(
                fn #field_name(&mut self, #field_name: #field_type) -> &mut Self {
                    self.#field_name = Some(#field_name);
                    self
                }
            )*

            pub fn build(&mut self) -> Result<#ident, Box<dyn std::error::Error>> {
                Ok(#ident {
                    #(
                        #field_name: self.#field_name
                            .clone()
                            .ok_or(stringify!(#field_name must be set))?,
                    )*
                })
            }
        }

        impl #ident {
            pub fn builder() -> #builder_ident {
                #builder_ident {
                    #(#field_name: None,)*
                }
            }
        }
    };

    proc_macro::TokenStream::from(expanded)
}

Given what I have explained above, I think it should be reasonably clear what has been done here. The challenges completed in order are:

  1. Define the Builder macro
  2. Define the builder() function on the input struct that allows us to initialize a builder for that struct
  3. Generate setters for the builder which match the fields of the input struct
  4. Call a build function and return a populated instance of the struct. There should be some error handling to ensure all fields have been initialized
  5. Demonstrate that chaining works in the setters

Solving the first problem is extremely easy. We simply create a function derive which takes a TokenStream as input, returns a TokenStream as output, and is annotated with proc_macro_derive.

#[proc_macro_derive(Builder)]
pub fn derive(input: proc_macro::TokenStream) -> proc_macro::TokenStream {

The second challenge requires us to generate an identifier for the builder based on the name of the input struct. We parse the input TokenStream into a DeriveInput and extract the identifier of the struct. We then create a new identifier by appending the word “Builder” to it.

We also need to define the builder struct itself, with all of its fields and their types. We therefore extract the fields and their types from the input struct and will use these to generate the code that defines and initializes the builder. Before that, we do a little error handling to ensure that we have been passed something we can build: the macro panics unless the input is a struct.

let fields = match input.data {
    Data::Struct(s) => s.fields,
    Data::Enum(_) => panic!("Builder not supported on enum type"),
    Data::Union(_) => panic!("Builder not supported on union type"),
};

let field_name: Vec<Ident> = fields.iter().map(|field| field.ident.clone().unwrap()).collect();
let field_type: Vec<&Type>  = fields.iter().map(|field| &field.ty).collect();

let expanded = quote! {
    pub struct #builder_ident {
        #(#field_name: Option<#field_type>,)*
    }
    // snip

    impl #ident {
        pub fn builder() -> #builder_ident {
            #builder_ident {
                #(#field_name: None,)*
            }
        }
    }
};

Note in the above call to quote!, #ident will expand to the name of the input struct (e.g. Command) and #builder_ident will expand to the name of the builder (e.g. CommandBuilder). Here we can also see how iteration works, looping through the field_name and field_type collections both to define the builder struct and populate it in the builder function.

Generating setters for the builder is a relatively straightforward iteration on defining the builder itself. Simply iterate over all fields and their types and generate a setter that returns a reference to the builder so that we can do chaining:

impl #builder_ident {
    #(
        fn #field_name(&mut self, #field_name: #field_type) -> &mut Self {
            self.#field_name = Some(#field_name);
            self
        }
    )*
}

Note that the fields on the builder struct are all of type Option, hence the call to Some.

Because of the requirement for error handling, implementing build is a little more challenging, but not by much. We clone each Option field on the builder, convert it into a Result with ok_or, and return an appropriate error if the field is None. Note the use of stringify! so that the error message names the specific field that was not set.

impl #builder_ident {
    // snip

    pub fn build(&mut self) -> Result<#ident, Box<dyn std::error::Error>> {
        Ok(#ident {
            #(
                #field_name: self.#field_name
                    .clone()
                    .ok_or(stringify!(#field_name must be set))?,
            )*
        })
    }
}

Because of how I have written my code, I get the fifth unit test for free. All my setters return a mutable reference to self, so the chained calls in that test compile and pass.
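To make the chaining concrete, here is a hand-written sketch of roughly what I would expect the macro to generate. It is an approximation written by hand, not the output of cargo expand, and I have trimmed Command to two fields to keep it short:

```rust
use std::error::Error;

pub struct Command {
    executable: String,
    args: Vec<String>,
}

// Hand-written approximation of what the derive would emit for `Command`.
pub struct CommandBuilder {
    executable: Option<String>,
    args: Option<Vec<String>>,
}

impl CommandBuilder {
    fn executable(&mut self, executable: String) -> &mut Self {
        self.executable = Some(executable);
        self // returning &mut Self is what lets the next setter be chained
    }

    fn args(&mut self, args: Vec<String>) -> &mut Self {
        self.args = Some(args);
        self
    }

    pub fn build(&mut self) -> Result<Command, Box<dyn Error>> {
        // Each field is cloned out of the builder; a missing field
        // short-circuits with an error naming that field.
        Ok(Command {
            executable: self.executable.clone().ok_or("executable must be set")?,
            args: self.args.clone().ok_or("args must be set")?,
        })
    }
}

impl Command {
    pub fn builder() -> CommandBuilder {
        CommandBuilder { executable: None, args: None }
    }
}

fn main() {
    let command = Command::builder()
        .executable("cargo".to_owned())
        .args(vec!["build".to_owned()])
        .build()
        .unwrap();
    println!("{} {:?}", command.executable, command.args);
}
```

Because build also takes &mut self and clones the fields, the whole chain can run on the temporary returned by builder() and still hand back an owned Command.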

Conclusion

Work on the engine has paused for a little bit while I try to get to grips with macros. I’ll be continuing this exercise tomorrow as it has been incredibly useful. I am still hoping to have finished The Great Refactor by the end of the week, but the more I dig into this, the more getting properly to grips with macros seems like a priority. So there will be more updates on this tomorrow.