Статьи

Написание компилятора на C #: генерация кода C, часть 1

После подробного обсуждения этапов лексического анализа и синтаксического анализа, пришло время запачкать руки фактической генерацией кода. Теоретически, наш синтаксический анализатор генерирует промежуточное представление анализируемой программы — интерфейс генератора кода, показанный ниже, может использоваться для построения фактического дерева, отображающего структуру программы.

Для практической цели перевода программы Jack на C или ассемблер нет необходимости хранить в памяти реальное дерево разбора. Используя состояние символа и небольшой набор вспомогательных структур данных, мы можем реализовать генератор кода, который испускает допустимый C-код. Этот код C может быть скомпилирован компилятором C для получения исполняемой программы. (Если идея компилировать Jack в C и затем полагаться на компилятор C кажется вам обманом, обратитесь к истории языка программирования C ++. Самый первый компилятор C ++, Cfront by Bjarne Stroustrup, преобразовал C ++ в C.)

Давайте посмотрим на интерфейс, который должен реализовать генератор кода. (Очевидно, что здесь есть несколько удобных вариантов, облегчающих разработку генератора кода специально для Джека.)

internal interface ICodeGenerator
{

void SetOptions(CodeGeneratorOptions options);
void InitSymbolTables(
SymbolTable classSymTable,
SymbolTable methodSymTable);
void EmitEnvironment();
void EmitBootstrapper();

void StaticDeclaration(Symbol variable);
void FieldDeclaration(Symbol variable);
void ConstructorDeclaration(Subroutine subroutine);
void FunctionDeclaration(Subroutine subroutine);
void MethodDeclaration(Subroutine subroutine);
void EndSubroutine();
void Return();
void BeginWhile();
void WhileCondition();
void EndWhile();
void BeginIf();
void PossibleElse();
void EndIf();
void Assignment(Token varName, bool withArrayIndex);
void Add();
void Sub();
void Mul();
void Div();
void Mod();
void And();
void Or();
void Less();
void Greater();
void Equal();
void LessOrEqual();
void GreaterOrEqual();
void NotEqual();
void IntConst(int value);
void StrConst(string value);
void True();
void False();
void Null();
void This();
void Negate();
void Not();
void VariableRead(Token varName, bool withArrayIndex);
void Call(string className, string subroutineName);
void DiscardReturnValueFromLastCall();
void BeginClass(string className);
void EndClass();
}

Если вы помните анализатор выражений Jack, который мы разработали несколько недель назад, он эффективно преобразует выражение Jack в постфиксную форму, вызывая генератор кода при обходе дерева синтаксического разбора по порядку. Например, если x и y являются терминалами (например, локальными переменными), то синтаксический анализатор обрабатывает выражение x + y , выполняя следующие вызовы генератора кода:

VariableRead(x, withArrayIndex:false)
VariableRead(y, withArrayIndex:false)
Add()

Another example to drive the point home—the expression true|this==null&(x[3]-y)<13 results in the following calls, indented for easier understanding:

True()
This()
Null()
Equal()
IntConst(3)
VariableRead(x, withArrayIndex:true)
VariableRead(y, withArrayIndex:false)
Sub()
IntConst(13)
Less()
And()
Or()

Evaluating this expression at runtime is best modeled by a stack. Operations like IntConst, VariableRead, This push a value onto the stack; operations like Sub, Less, And pop two operands off the stack and push the result of the operation onto the stack; and so on.

When we’ll develop the x86 assembly language code generator for Jack, we’ll use the assembly PUSH and POP instructions to manipulate the thread’s explicit stack. To emulate a similar process for C code generation, we’ll use a global array of words, called __STACK, and emit a couple of intrinsic macros, __PUSH and __POP, that manipulate the stack.

Assuming that the local variables x and y in the Jack program are represented by the local variables x and y in the resulting C function, compiling x + y to C boils down to:

__PUSH(x);
__PUSH(y);
__SCRATCH1 = __POP();
__PUSH(__POP() + __SCRATCH1);

(Note that I’m using __SCRATCH1 as a scratch register—it’s simply a global variable of type __WORD, the only type we’ll be dealing with unless we want to address type checking.)

This is really wasteful, you say—we could compile the whole thing to __PUSH(x+y) and save a bunch of instructions! That’s true, but we’re going to leave optimization off the table for now, especially considering that we’re using the C compiler in the next phase—and it’s already pretty good at optimizing things. (Again, if the idea of not doing optimization at the intermediate compilation phase strikes you as lazy, consider what the C# compiler does—it emits IL instructions that manipulate the IL evaluation stack, and leaves it up to the JIT to decide whether some stack operations can be optimized or even replaced by register manipulation altogether.)

With that said, we’re ready for the part of the code generator that deals with expressions and the assignment statement—let. (Control constructs, subroutines, and top-level program structure will be dealt with next.)

There are a couple of translation decisions that you need to be aware of prior to reading this code:

  • __WORD is the single type we use for everything. In this version of the compiler, it’s simply int.
  • A Jack class is mapped to a C struct.
  • Jack class fields are mapped to C struct fields.
  • Jack class statics are mapped to C global variables.
  • Jack subroutines are mapped to C functions.
  • Jack subroutines that return void are mapped to C functions that return __WORD. This return value is 0 by convention, and is ignored by the caller.
  • Within a method or constructor, THIS is a local variable that holds the value of this.

public override void Assignment(
Token varName, bool withArrayIndex)
{
if (MethodSymTable.HasSymbol(varName.Value))
{
Symbol symbol =
MethodSymTable.GetSymbol(varName.Value);

Output.WriteLine(
"__PUSH((__WORD)&{0});", symbol.Name);
}
else
{
Symbol symbol =
ClassSymTable.GetSymbol(varName.Value);

if (symbol.Kind == SymbolKind.Static)
{
Output.WriteLine("__PUSH((__WORD)&{0});",
FormatStaticName(symbol.Name));
}
else if (symbol.Kind == SymbolKind.Field)
{
Output.WriteLine(
"__PUSH((__WORD)&((({0}*)THIS)->{1}));",
_currentClassName, symbol.Name);
}
}

//If it's an array, obtain the address of the right
//element by adding the index which is on the stack.
//We need a scratch location because issuing two __POP()
//calls in the same statement does not guarantee
//left-to-right evaluation.
if (withArrayIndex)
{
//The array address is now on the stack, but we
//really need the address of the first element.
//Hence the dereference:
Output.WriteLine("__SCRATCH2 = *(__WORD*)__POP();");
//This is the RHS value that we ought to put in
//the array element:
Output.WriteLine("__SCRATCH1 = __POP();");
//Finally, the top of the stack contains the value
//of the array indexing expression, i.e. the
//element index:
Output.WriteLine(
"*(((__WORD*)__SCRATCH2) + __POP()) = __SCRATCH1;");
}
else
{
Output.WriteLine(
"__SCRATCH1 = __POP();"); //This is the LHS
Output.WriteLine(
"*((__WORD*)__SCRATCH1) = __POP();");
}
}

public override void Add()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__SCRATCH1 + __POP());");
}
public override void Sub()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() - __SCRATCH1);");
}
public override void Mul()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() * __SCRATCH1);");
}
public override void Div()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() / __SCRATCH1);");
}

public override void Mod()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() % __SCRATCH1);");
}
public override void And()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() & __SCRATCH1);");
}

public override void Or()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() | __SCRATCH1);");
}
public override void Less()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() < __SCRATCH1 ? -1 : 0);");
}

public override void Greater()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() > __SCRATCH1 ? -1 : 0);");
}

public override void Equal()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() == __SCRATCH1 ? -1 : 0);");
}

public override void LessOrEqual()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() <= __SCRATCH1 ? -1 : 0);");
}

public override void GreaterOrEqual()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() >= __SCRATCH1 ? -1 : 0);");
}

public override void NotEqual()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() != __SCRATCH1 ? -1 : 0);");
}

public override void IntConst(int value)
{
Output.WriteLine("__PUSH({0});", value);
}

public override void StrConst(string value)
{
Output.WriteLine("__PUSH((__WORD)\"{0}\");", value);
}

public override void True()
{
IntConst(-1);
}
public override void False()
{
IntConst(0);
}

public override void Null()
{
IntConst(0);
}
public override void This()
{
Output.WriteLine("__PUSH(THIS);");
}

public override void Negate()
{
Output.WriteLine("__PUSH(-__POP());");
}

public override void Not()
{
Output.WriteLine("__PUSH(~__POP());");
}

public override void VariableRead(
Token varName, bool withArrayIndex)
{
//Put the value of the variable on the top of
//the stack. If it's an array, the value is the
//address of the array's first element.
if (MethodSymTable.HasSymbol(varName.Value))
{
Symbol symbol =
MethodSymTable.GetSymbol(varName.Value);
Output.WriteLine("__PUSH({0});", symbol.Name);
}
else
{
Symbol symbol =
ClassSymTable.GetSymbol(varName.Value);
if (symbol.Kind == SymbolKind.Static)
{
Output.WriteLine("__PUSH({0});",
FormatStaticName(symbol.Name));
}
else if (symbol.Kind == SymbolKind.Field)
{
Output.WriteLine(
"__PUSH((({0}*)THIS)->{1});",
_currentClassName, symbol.Name);
}
}
//If it's an array, dereference it using [].
if (withArrayIndex)
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH( ((__WORD*)__SCRATCH1)[ __POP() ] );");
}
}

As always, most of the code (other than the part that deals with variables and assignments) writes itself. This gives us the foundation upon which we can build control statements and the general program structure, up next.