Code Generation and Optimization

Deadlines

Milestone: April 30th, 11:59pm
Final Virtual Machine and Writeup: May 16th @ 11am
Derby: May 17th @ 10am (normal class time)

Description

In this final assignment, your goal is to reduce the execution time of a given application when executed on your virtual machine. This assignment is completely opened-ended; you are not required to implement any specific optimization except for simple code generation (see Milestone section). Your grade will be based on 1) your overall design (with heavy emphasis on the writeup) and 2) how well your virtual machine performs as compared to the submissions of the other groups. Namely, we will hold a Virtual Machine Derby where your virtual machine implementation will compete against the implementations of your fellow classmates on a hidden program (the derby program).

In the final month of class, we will cover various performance improving optimizations. A substantial portion of this project phase is to determine which optimizations will provide a benefit given the programs of the provided benchmark suite and the target architecture.

Some optimizations that we will cover in class include:

Optimized Runtime Implementation Techniques: Our main focus up until now has been to write a correct implementation perhaps even at the expense of the performance of your implementation. There are certainly opportunities to rethink the design of your runtime structures to produce a more efficient representation. For example, Tagged Pointers (https://en.wikipedia.org/wiki/Tagged_pointer) are a very common runtime representation that enables more efficient execution for integer-heavy code at the expense of a more complicated implementation.
Register Allocation: Your virtual machine can implement a register allocator. See Chapter 16 of the Whale book, Chapter 8.8 of the Dragon book, and the Reference Materials section below
Intermediate Representations Design: For evaluation of your final implementation, we will only give you MITScript source language programs. You are therefore free to change the design of the MITScript bytecode language, replace it with an entirely different representation (e.g., a low-level register-based representation), or even remove all intermediate representations if they do not suit your implementation strategy.
Peephole Optimizations: Peephole optimizations replace small instruction sequences with faster counterparts based on local information. For example, replacing multiplications/divisions by 2 with shifts. See Chaper 9.9 of the Dragon Book for additional information.
Function Inlining: Dynamic control flow – such as calling a function – is expensive. Function Inlining is an optimization that inlines a callee into the body of the caller. This optimization is most easily performed when it is possible to statically (at initial code generation time) determine the function to be called. There may be functions within the program for which you can determine this via Abstract Interpretation. However, there are many helper functions that your virtual machine uses for which you know the function to be called and, therefore, some or all of the body of those methods may be good candidates for optimization. See Chapter 15 of the Whale book for more information.
Abstract Interpretation/Dataflow-based Optimizations: When you generate naive code in the intermediate representation, you may notice the occurrence of multiple inefficient code sequences. For example, assigning a temporary value into memory and then immediately reloading the value from memory for the next computation. Or, adding multiple constants even the result of the operation is known statically (e.g., 1 + 1). Or, checking that a value is an integer even though you know it is an integer constant. Abstraction Intepretation/Dataflow gives a principled framework for identifying these opportunities and optimizing the code with optimizations such as constant folding, constant propagation, copy propagation, and dead code elimination. See Chapter 12 and of the Whale book and Chapter 9 of the Dragon Book for additional information.
Specialization: Dynamic languages are flexible in that they give great flexibility to the organization of code around the structure of the types it operates on (“duck typing”). However, this flexibility comes at significant performance cost. For example, your virtual machine interpreter needs to check that the inputs to each arithmetic operation (e.g., subtraction) are in fact integers before performing the operation. In practice, however, a given operation may only operate on objects from a small set of types – or even a single type. In cases where this set of types can be identified (either statically or after profiling), these operations can be specialized to execute more quickly for that set of types. For example, if virtual machine can determine that a given add operation only receives integers as operations, it can generate more efficient assembly code that need not consider the various implementations of add
Advanced Garbage Collection: In assignment 4 you implemented a mark and sweep garbage collector. However, you may instead find a performance benefit from implementing a more complicated collector (e.g., copy collector or generational collector). See these books and papers for additional information:
- Garbage Collection: Algorithms for Dynamic Memory Management by Jones (Book)
- The Garbage Collection Handbook by Jones, Hosking, and Moss (Book)
- Uniprocessor Garbage Collection Techniques by Paul Wilson (Survey Paper)

In order to identify and prioritize optimizations, we will provide you with a benchmark suite of programs. These programs are more complex than the code that has been provided during the previous phases, so your first priority is to produce correct unoptimized assembly code for each benchmark.

After your language implementation produces correct unoptimized code, you should begin to study the applications of the benchmark suite. You will use these programs (and any other programs provided during previous phases) to determine which optimizations you will implement. Specifically, you are expected to analyze the performance of your virtual machine to classify the effectiveness of each proposed optimization.

You are not limited to the optimizations covered in class. You are free to implement any optimization that you come across in your reading or any optimization that you devise. However, your writeup must demonstrate that each optimization is effective and semantics preserving. Furthermore, you must argue that each optimization is general, meaning it is not a special-case optimization that works only for the derby program or a specific application in the benchmark suite.

Milestone

For the milestone, you are expected to deliver 1) a virtual machine that implements code generation and 2) a preliminary writeup of the optimizations you plan to implement for your final virtual machine.

Code Generation: For the milestone, you are not expected to produce great (generated) code. In fact, even horrendous code is acceptable. When considering tradeoffs, always choose simplicity of implementation over performance. You are not constrained as to how you go about generating your assembly However, the general approach presented in lecture and recitation is a reasonable start.

Your generated code must include all runtime checks listed in the MITScript semantics given Assignment 2. These may be optimized (or removed in some cases) as long as the optimized programs reports a runtime error whenever the unoptimized program reports an error. Note that because we will only give you MITScript and because you are free to choose your intermediate representation, the runtime errors from Assignment 3 correspond to your own internal implementation errors (e.g., InsufficientStack) and have no relation to the MITScript source semantics. While you may consider removing these checks, it is good development practice to support runtime checking for such errors in, for example, a debug build.

For the milestone you should focus your creative energies on 1) designing your intermediate representation, 2) familiarizing yourself with our target ISA, 3) designing your machine-code representations of the runtime structures, and 4) generating correct assembly code.

The X86-64 Architecture Guide, provided as a supplement to this handout on the course reference materials page, presents a not-so-gentle introduction to the x86-64 ISA. It presents a subset of the x86-64 architecture that should be sufficient for the purposes of this project. You should read this handout before you start to write the code that traverses your IR and generates x86-64 instructions.

Requirements: For a given function to be executed, you should generate a sequence of assembly instructions that implements its semantics. Your generated code must use the machine code stack and/or registers to compute intermediate results (See Code Generation Lecture #2 slides for an illustration). Specifically, you must use machine code (and machine code only) to load constants (LoadConst), access local variables (LoadLocal, StoreLocal, LoadReference, StoreReference), compute arithmetic operations that are not overloaded by type (Sub, Mul, Div, Neg, Gt, Geq, And, Or, Not), and implement control flow (if and while). Exceptions to these requirements are as follows:

Exceptional Behavior: you are permitted to use helper functions implemented in C++ to aid with checking runtime errors. This includes calling such functions when, for example, certain types are required for an otherwise simple arithmetic operation (e.g., consider the implementation of Sub from lecture).
Globals and Records: You are permitted to use helper functions implemented in C++ to aid with accessing globals and records.
Allocation: You are permitted to use helper functions to allocate values.
Function calls: You are permitted to use helper functions implemented in C++ to aid with dispatching function calls.
Add and Equality: these instructions have a more complicated semantics for which it is acceptable to use helper functions implemented in C++ to support your implementation (including providing its entire implementation).

Note that while we are using MITScript bytecode instructions to refer to these operations, you need not necessarily translate MITScript source to bytecode to perform code generation. You may find, however, that it simplifies the task of naive code generation for the Milestone.

Writeup: As you are implementing code generation in your virtual machine, you will also need to begin to assess where you may be able to achieve speedup in your virtual machine implementation. For your writeup, you need to deliver a preliminary assessment of the optimizations that you plan to implement. See the following section on the final write up to understand how to proceed.

Writeup

The writeup for this project is extremely important. Although it explicitly accounts for only 20% of the grade, it will also be used to determine your score for the Implementation aspect of the project (50%).

Your written documentation must discuss each optimization that you considered. The thoroughness of your exploration of the optimization space will be an important aspect of your grade. For each optimization that you implement, your writeup must convince the reader that the optimization was beneficial, general, and correct.

For runtime representation and garbage collection optimizations, your write up should include diagrams that illustrate your represenation changes and algorithms. For code generation optimizations, you should include assembly or IR code examples for each optimization. Show how your generated code is transformed by the optimization (might be hand-applied for optimizations you did not implement). Highlight the benefit of the optimization on the assembly or IR code. Discuss exactly how the benefit was achieved given the characteristics of our target architecture. Include empirical evidence that proves your conclusion for the optimization.

Your writeup should include the following:

Command-Line Arguments: Your virtual machine should include a “full optimizations” command-line option (see below). Your written documentation should present a detailed discussion of this option including how you determined the order in which your optimizations are performed and how many times you apply the optimizations.

Division of Work: A brief description of how your group divided the work. This will not affect your grade; it will be used to alert the instructors to any possible problems.

Assumptions: A list of any clarifications, assumptions, or additions to the problem assigned. The project specifications are fairly broad and leave many of the decisions to you. This is an aspect of real software engineering. If you think major clarifications are necessary, feel free consult the instructors.

Overview: An overview of your design, an analysis of design alternatives you considered, and key design decisions. Be sure to document and justify all design decisions you make. Any decision accompanied by a convincing argument will be accepted. If you realize there are flaws or deficiencies in your design late in the implementation process, discuss those flaws and how you would have done things differently. Also include any changes you made to your implementations from previous assignments and explain why they were necessary.

Implementation: A brief description of interesting implementation issues. This should include any non-trivial algorithms, techniques, and data structures. It should also include any insights you discovered during this phase of the project.

Derby

We will hold a virtual machine derby. We will race your virtual machine against the submissions from the other groups in the class. The derby benchmark will be revealed one day prior to the derby. 30% of the grade will be determined by your ranking in the derby. At this time, we will not reveal the exact formula for determining this portion of the project grade.

On derby day, your group will also give a short presentation on the design of your virtual machine as well as the optimizations that you implemented.

Reference Materials

Beyond the normal course materials, we have provided links on the course website to useful and well-known papers on register allocation. Reading these before designing and writing your register allocator may be very useful. We will also post links to papers about other useful optimizations to the course website as we discover them. You can find these links under the “Reference Materials” section.

Known Issues

The x64asm library we have included for your use has a known issue where the assembler does not reserve enough memory for the code generation of long functions. Specifically, there is no logic to dynamically reserve more space in the instruction buffer as instructions are added. It is therefore your responsibility to call the reserve() method of each x64asm::Function as needed to ensure there is enough space in the buffer. We recommend doing this once per function, at the start. You can reserve irlist.size() * MAX * 15 bytes per function, where irlist is a vector containing your IR to be compiled, and MAX is a static upper bound on the number of assembly instructions that can result from a single IR instruction.

Tips

The examples in asm/x64asm/examples are a good starting point for code generation.
Watch subtle differences such as jmp vs jmp_1
Use gcc -S to see the assembly generated by a standard compiler if you need inspiration

Deliverables

Your virtual machine must provide a command-line option for each code generation optimization and/or garbage collector algorithm. Your project writeup should include documentation for each command-line option. For example:

--opt=regalloc: turns on register allocation
--opt=inline: turns on function inlining
--opt=constantprop: turns on constant propagation
--opt=gc-generational: turns on generational garbage collector

Required Command-line Options: You must provide a --opt=machine-code-only flag that only turns on machine code generation. We will use this for the milestone. You must provide a --opt=all flag to turn on all optimizations (“full optimizations”). We will use this command-line for the derby. For code generation optimizations, this option should consider how many times each code generation optimization is applied and the application order of the optimizations.

Grading

This assignment will be graded as follows:

(20%) Writeup. with particular attention given to your description of the optimization selection process. Overall, we will limit the length of the supporting documentation to 8 pages (single-column, 11 pt font).
(50%) Implementation. As each group implements different optimizations, the only public test is the generation of correct results for the benchmark suite and the Derby program (25%). The hidden tests will check for overtly optimistic optimizations and incorrect handling of programs 25%.
(30%) Derby Performance. The formula for translating the running time of the program by your virtual machine into a grade will be announced later.