Wednesday, March 31, 2010

The DeadEnds Interpreter, Part Three

This (rather long) post describes the main Objective-C classes used in the implementation of the DeadEnds programming language interpreter. When program files are parsed the interpreter builds up a semantic representation of the modules (functions and procedures) defined in the program. This representation is a tree of objects; most of the objects are instances of different Expression and Statement classes implemented in Objective-C.

There are three main classes used to build the internal representations of programs -- Module, Expression and Statement. The sections that follow describe these three classes and many of their sub-classes.

Modules
Each function or procedure parsed from a DeadEnds program results in a Module object being created that is added to a table of Modules. As execution of the program proceeds, expressions in the module begin interpreted may call other modules. The Objective-C interface for the Module class is:

@interface Module : NSObject
{
InterpType type;
NSString* name;
NSArray* parameters;
BlockStatement* body;
}
- (id) initWithType: (NSNumber*) type name: (NSString*) name parameters: (NSArray*) parameters block: (BlockStatement*) block;
- (ReturnCode) interpretArguments: (NSArray*) arguments symbolTable: (SymbolTable*) symbolTable;
@end

A Module has a return type, a name, a list of parameters and a BlockStatement. During execution Modules are called by CallExpressions. When a CallExpression is evaluated, the name in the Expression is used to lookup the Module object in the Module table. The argument Expressions in the CallExpression are evaluated and assigned to the Module's parameters and used to initialize the symbol table that provides the context for interpreting the Module. The Module's BlockStatement is then interpreted.

The important method of the Module class is the interpArguments:symbolTable method that interprets the Module. The method requires two arguments, the list of argument Expressions to pass to the Module, and the SymbolTable at the base of the active SymbolTable chain that provides the execution context.

Expressions
Expressions are semantic objects created during parsing and they make up a substantial portion of the semantic trees. Generally each expression in the programs will be represented by an Expression object in the semantic representation. There is one abstract class that encompass all Expression classes, another abstract sub-class for numeric constants, and a number of concrete Expression sub-classes for the different types of expressions found in the programming language. The simplified Objective-C interface to the Expression classes is:

@interface Expression : NSObject
- (RValue*) evaluate: (SymbolTable*) symbolTable;
@end

The Expression class is the base of all Expression sub-classes. It handles line numbers (part of error handling, not shown), and defines the primary method that all Expression sub-classes must implement. This is the evaluate: method, that evaluates an Expression in the context of a chain of SymbolTables. Every evaluate method returns an RValue object, an object that wraps one of the 13 (very likely more later) types supported by the language. Evaluation of Expressions does not modify the Expressions; Expressions are semantic objects that are part of the internal semantic representation of Modules. Evaluation of an Expression creates a dynamic RValue object that represents the run-time value of the Expression in the context of the current programming state, which is encoded in the states of the accessible SymbolTables.

@interface Constant : Expression
{
NSNumber* number;
}
- (id) initWithNumber: (NSNumber*) number;
@end

The Constant class is the base of all numeric constants. A Constant is a container for an NSNumber, a foundation object that can hold any kind of basic numeric value.

@interface IntegerConstant : Constant
@end

@interface FloatConstant : Constant
@end

@interface BooleanConstant : Constant
@end

@interface CharacterConstant : Constant
@end

@interface StringConstant : Expression
{
NSString* string;
}
- (id) initWithString: (NSString*) string;
@end

A StringConstant holds a single string as an NSString, another foundation object. Evaluation of the numeric Constants and StringConstants consists of simply returning a copy of their values wrapped in RValue objects.

@interface Variable : Expression
{
NSString* name;
}
- (id) initWithName: (NSString*) name;
@end

A Variable object holds the name of a program variable. Evaluating a Variable object consists of looking the variable's name up in the current set of SymbolTables and returning the variable's current RValue. As we will see later SymbolTables map variable names to their LValues, and LValues refer to their corresponding RValues. There should be a little more written somewhere to discuss the differences between LValues and RValues and how the concepts are during evaluation and interpretation.

@interface BinaryExpression : Expression
{
Expression* leftExpression;
Expression* rightExpression;
Operator operator;
}
- (id) initLeft: (Expression*) left right: (Expression*) right operator: (NSNumber*) operator;
- (RValue*) evaluateDotExpression: (SymbolTable*) symbolTable;
- (RValue*) evaluateAssignExpression: (SymbolTable*) symbolTable;
- (RValue*) evaluateAndOrExpression: (SymbolTable*) symbolTable;
- (RValue*) evaluateRelationalExpression: (SymbolTable*) symbolTable;
- (RValue*) evaluateArithmeticExpression: (SymbolTable*) symbolTable;
@end

A BinaryExpression holds an expressions with left side and right side sub-Expressions and an operator. Evaluating a BinaryExpression involves recursively evaluating the two sub-expressions and then using the operator to compute the final RValue. Five methods are used for the five major types of BinaryExpressions. Dot-expressions implement method calls and structure field access; assign-expressions are assignment statements; and-or-expressions are logical expressions (their evaluation uses the usual short-circuiting rules which may avoid evaluation the right sub-expression); relational-expressions compare the sub-expressions; and arithmetic expressions perform arithmetic operations. Evaluation of BinaryExpressions may involve implicit type coercion rules.

@interface UnaryExpression : Expression
{
Expression* expression;
Operator operator;
}
- (id) initExpression: (Expression*) expression operator: (NSNumber*) operator;
@end

UnaryExpressions holds unary expressions, those with an operator (+, -, !) and a sub-Expression. Evaluating a UnaryExpression involves recursively evaluating the sub-Expression and then applying the operator to that value.

@interface CallExpression : Expression
{
NSString* name;
NSArray* arguments;
}
- (id) initWithName: (NSString*) name arguments: (NSArray*) arguments;
@end

CallExpressions are function and procedure calls (Method calls are handled by BinaryExpressions). Each CallExpression consists of the name of the Module to call and a list of arguments to pass in, where each is an Expression in its own right. Evaluation of a CallExpression involves looking up the name in a table to get a Module object, evaluating the arguments and binding them to the Module's parameters in a new symbol table, and then interpreting the Module's BlockStatement. If the BlockStatement terminates with a ReturnStatement with an Expression, the ReturnStatement evaluates that Expression into an RValue, and that RValue becomes the final value of the CallExpression. Not as complicated as it sounds.

Statements
Statements are semantic objects created during parsing and, along with Expressions, are the other main object found in semantic trees. Each language statement parsed from a program is represented as a Statement object in a semantic tree. Statements are recursive because many Statements (e.g., if-statements, while-statements, block statements) contain other statements. The simplified Objective-C interface for the Statement classes is:

@interface Statement : NSObject
- (ReturnCode) interpret: (SymbolTable*) symbolTable;
@end

The Statement class is the abstract base class for all Statement sub-classes. It handles line numbers (part of error handling, not shown) and defines the key method that all Statement sub-classes must implement. That is the interpret: method, that interprets (runs, executes) a Statement in the context of a set of symbol tables. Every interpret: method returns a return code used to direct follow-on actions. Normally the return code indicates the normal case and interpretation continues to the next Statement in the program. Describing the other return codes is a little advanced for this introduction.

@interface ExpressionStatement: Statement
{
Expression* expression;
}
- (id) initWithExpression: (Expression*) expression;
@end

An ExpressionStatement is syntactically an Expression followed by a semicolon. For example, an assignment statement is an ExpressionStatement where the Expression is a BinaryExpression with '=' as its operator. Interpretation of an ExpressionStatement consists of evaluating the Expression and accumulating whatever side effects that it causes (e.g., the change of a value in a symbol table).

@interface IfStatement: Statement
{
Expression* conditional;
Statement* thenStatement;
Statement* elseStatement;
}
- (id) initWithConditional: (Expression*) conditional thenStatement: (Statement*) thenStatement elseStatement: (Statement*) elseStatement;
@end

IfStatements implement conventional if-then-else Statements where the else Statement is optional. Interpretation of an IfStatement begins by evaluating the conditional Expression that is coerced to a Boolean value. If that value is true the then Statement is recursively interpreted; if that value is false and the else Statement exists, the else Statement is recursively interpreted.

@interface WhileStatement: Statement
{
Expression* conditional;
Statement* statement;
}
- (id) initWithConditional: (Expression*) conditional statement: (Statement*) statement;
@end

WhileStatements implement conventional while statement semantics. Interpretation of a WhileStatement begins by evaluating the conditional Expression and coercing its value to a Boolean. If the value is true interpretation continues with the interpretation of the body Statement; after the Statement is interpreted the Expression is reevaluated in the new context and the Statement will be interpreted again if the Expression's value is still true. This continues until the value of the Expression is false, at which point interpretation of the WhileStatement ends. Note that ReturnStatements, ContinueStatements and BreakStatements found within the body Statement implement the conventional rules for such statements.

@interface DoStatement: Statement
{
Statement* statement;
Expression* conditional;
}
- (id) initWithStatement: (Statement*) statement conditional: (Expression*) conditional;
@end

DoStatements implement conventional do-while statement semantics. Interpretation of DoStatements is similar to that of WhileStatements except that the order of evaluating the conditional Expression and interpreting the body Statement are different. As in WhileStatements, any ReturnStatement, ContinueStatement or BreakStatement found in the body Statement is handled in the conventional manner.

@interface BlockStatement : Statement
{
NSArray* statements;
}
- (id) initWithStatements: (NSArray*) statements;
@end

A BlockStatement is simply an array of Statements. BlockStatements are interpreted by interpreting each of its sub-Statements, from first to last, in sequence.

@interface Declaration : Statement
{
InterpType type;
NSArray* initializers;
}
- (id) initWithType: (NSNumber*) type initializers: (NSArray*) initializers;
@end

Declarations are Statements that declare new Variables and optionally assign them their initial values. In the DeadEnds programming language Declarations are treated exactly as other Statements and can appear anywhere any other Statement can occur. Interpretation of a Declaration causes the Variables to be added to the SymbolTable associated with the BlockStatement the Declarations in (or to the global SymbolTable if the Declaration is outside of any Module definition). If a Variable has an initializing Expression, the Expression is evaluated normally, that is, in the context of the current SymbolTable, and the new Variable is assigned given that initial RValue in the SymbolTable. Initializing Expressions can refer to Variables that were declared earlier in the same BlockStatement. This is exactly what one would hope would happen.

@interface ReturnStatement: Statement
{
Expression* returnExpression;
}
- (id) initWithExpression: (Expression*) expression;
@end

ReturnStatements represent return statements in DeadEnds programs. When a ReturnStatement is interpreted control returns from the current Module to the CallExpression in the Module that called it. If the ReturnStatement has an Expression (it is optional), the Expression is first evaluated in the context of the current set of SymbolTables, and the resulting RValue is returned to the calling Module as the value of the CallExpression.

@interface ContinueStatement: Statement
@end

What you'd expect, the semantic object representing a continue statement in a DeadEnds program. Interpreting a ContinueStatement causes interpretation control to move to the next iteration of the WhileStatement or DoStatement the ContinueStatement is embedded in.

@interface BreakStatement: Statement
@end

What you'd expect, the semantic object representing a break statement in a DeadEnds program. Interpreting a BreakStatement causes interpretation control to break out of the current control structure using the conventional rules for break statements.

Values
Values are values created at run-time when program Modules are being interpreted. Values are always the result of evaluating Expressions within the context of a chain of symbol tables. There are two types of values, RValues and LValues, that are described in more detail below. The simplified Objective-C interface to the two value classes is:

@interface RValue : NSObject
{
InterpType type;
id value;
BOOL hasLValue;
}
- (id) initWithType: (InterpType) type value: (id) value isLValue: (BOOL) isLValue;
- (id) initWithInterpValue: (RValue*) interpValue;
- (BOOL) boolValue;
- (RValue*) coerceType: (InterpType) toType;
@end

RValue objects are computed by the interpreter at run-time by evaluating Expressions. An RValue can hold any one of the many (currently 15) valur types supported by the DeadEnds programming language (Boolean, Character, Integer, Float, String, List, Set, Table, Node, Record, Person, Family, Void, Any and Error). The type field indicates the specific type, and the value field holds a foundation object (usually an NSNumber or NSString) that holds the actual value. The hasLValue field is true if the RValue is referenced by an LValue, which generally means it is the value of a Variable. The RValue class has two methods to initialize an RValue, a method to return the coercion of the value to a Boolean value, and the coerceType: method that attempts to coerce the RValue from one type to another. If the coercion is not possible the method return an RValue with the Error type.

@interface LValue : NSObject
{
RValue* rValue;
}
- (id) initWithRValue: (RValue*) anRValue;
@end

LValues are objects that hold references to RValues. LValues are used in SymbolTables. A SymbolTable is a map (implemented by an NSMutableDictionary, another foundation class) from the names of Variables to their LValues. In a normal compiler an l-value is the actual location in memory where an r-value is located. In the object-oriented DeadEnds interpreter, an LValue becomes a reference (effectively a location) to another object, the RValue that holds the Variable's actual value. When an assignment BinaryExpression is evaluated, the RValue of the Variable on the left side of the assignment operator is changed to hold the RValue computed by evaluating the Expression on the right side of the operator. Since the Variable's RValue is being pointed to by the its LValue, this change to the RValue effectively changes the value of the Variable itself through the LValue mechanism. Note that the evaluation logic does not need to know if an RValue is the target to an LValue or not. The only reason why the evaluation logic cares about the existence of LValues is for issuing warnings if an assignment statement would fail to change the program's state because the left hand side is not a Variable or a component of a structured Variable. Again, it's not nearly as complicated as it sounds.

Symbol Tables
SymbolTables are run-time objects that map Variable names to their LValues and the LValues refer to their RValues. At any given time there is a chain of active SymbolTables. At the top of the chain is the global SymbolTable that holds any global Variables that were defined outside of any Module. At the bottom of the chain is the SymbolTable holding the Variables declared in the deepest BlockStatement currently being interpreted. Between the top SymbolTable and the bottom SymbolTables are intermediate SymbolTables holding the Variables defined in any open and pending BlockStatements between the current location in the program and the BlockStatement at the Module level. Note that if a BlockStatement does not have any DeclarationStatements the interpreter does not create a SymbolTable for that BlockStatement. When looking up a Variable in a SymbolTable, the SymbolTable at the bottom of the chain is searched first. If the Variable cannot be found in the bottom SymbolTables, the SymbolTables are searched one by one, up the chain, until the global SymbolTable is reached and searched. If the Variable cannot be found after all SymbolTables have been checked the Variable is undefined and an Error object is returned.

@interface SymbolTable : NSObject
{
NSMutableDictionary* nameValueTable;
SymbolTable* parent;
}
- (id) initWithParent: (SymbolTable*) parent;
- (BOOL) contains: (NSString*) name;
- (LValue*) valueOfName: (NSString*) name;
- (void) addDeclaration: (Declaration*) declaraction;
- (void) setValue: (LValue*) lValue ofName: (NSString*) name;
- (RValue*) lookup: (NSString*) name;
@end

Tuesday, March 30, 2010

The DeadEnds Interpreter, Part Two

Structure of the DeadEnds Interpreter
The DeadEnds interpreter has two phases. The first parses DeadEnds program files and converts their contents into an internal format that can be interpreted by the second phase. A DeadEnds program file is a sequence of program modules (functions and procedures) and global variables written in the DeadEnds programming language. DeadEnds programs can have include statements that indicate other DeadEnds program files that must be processed. The included files are located and processed also. This phase is similar to a language compiler, except that a compiler converts programs into machine instructions that can run directly on a processor.

The second phase runs (executes, interprets) the programs that were prepared by the first phase. The internal form of programs consists of a global symbol table, and a list of program modules in internal form. The internal modules are composed of statement and expression objects. This phase runs DeadEnds programs by interpreting program modules by evaluating the expressions and interpreting the statements within a context provided by a chain of symbol tables that track the state of the running programs.

The DeadEnds language parser is generated by Bison (a public domain version of Yacc), a program that reads a file that defines both the grammar of the DeadEnds language and the semantic actions that convert DeadEnds programs into trees of internal objects that are interpreted later by the second phase. Bison generates the code that parses the DeadEnds language and performs the actions. Normally Yacc and Bison generates C code, but the version of Bison on Mac OS X systems can generate Objective-C code. DeadEnds uses this feature because the actions are written in Objective-C, and they create trees of Objective-C objects.

DeadEnds Interpreter Grammar
Here is the context free grammar of the DeadEnds programming language taken from the DeadEnds Yacc file. Words in lower case are nonterminals, and those in upper case are the terminals or tokens. Characters in single quotes are literal terminals.
  • definitions : definition | definitions definition
  • definition : declaration | module | INCLUDE SCONS
  • declaration : TYPE identifiers ';'
  • module : TYPE IDEN '(' parameterso ')' block
  • identifiers : IDEN initializero | identifiers ',' IDEN initializero
  • initializero: initializer | /* empty */
  • initializer : ASGNOP initialvalue
  • initialvalue: expression | '{' expressionso '}'
  • parameterso : parameters | /* empty */
  • parameters : parameter | parameters ',' parameter
  • parameter : TYPE IDEN
  • statementso : statements | /* empty */
  • statements : statement | statements statement
  • statement : expression ';' | declaration | IF '(' expression ')' statement | IF '(' expression ')' statement ELSE statement | WHILE '(' expression ')' statement | DO statement WHILE '(' expression ')' ';' | block | RETURN expressiono ';' | CONTINUE ';' | BREAK ';'
  • block : '{' statementso '}'
  • expression : IDEN | ICONS | FCONS | SCONS | boolconstant | '(' expression ')' | IDEN '(' expressionso ')' | expression ADDOP expression | expression MULOP expression | expression RELOP expression | expression EQOP expression | expression ANDOP expression | expression OROP expression | expression '=' expression | expression '.' member | ADDOP expression | NOTOP expression
  • member : IDEN | IDEN '(' expressionso ')'
  • expressiono : expression | /* empty */
  • expressionso: expressions | /* empty */
  • expressions : expression | expressions ',' expression
  • boolconstant : TRUECONS | FALSECONS
Those programmers who have used the LifeLines programming language will see that I haven't included any of the specialized for-statements. These will be included in an update soon.

Monday, March 29, 2010

The DeadEnds Interpreter, Part One

During March, 2010, I designed the DeadEnds programming language and wrote most of the interpreter that executes programs in the language. The language lets DeadEnds users write their own programs that access records in DeadEnds databases. The language has a full set of types and operations for general purpose computing, and specialized types and operations for handling common genealogical entities such as persons, events and families.

The DeadEnds programming language is akin to the report interpreter language I wrote for LifeLines, but has a more conventional syntax. It uses declarations, statements and module definitions similar to those in C and Java. The LifeLines language was originally intended for describing genealogical reports so users could generate custom and precisely formatted reports. Many existing LifeLines programs do just this, generating complex HTML, postscript, and other text-based outputs that are used to generate reports. But it also became clear that the LifeLines programming language could be used for many other purposes, and it evolved quickly into a general purpose programming language with a number of data types making processing of genealogical data particularly simple. The DeadEnds language begins with the functionality of the LifeLines language and extends it in various directions.

Data Types
The DeadEnds programming language supports these data types:
  • Boolean - with the constants true and false.
  • Character - unicode characters.
  • Integer - signed integers.
  • Float - double-precision floating-point numbers.
  • String - strings of unicode characters.
  • List - mutable list of values that provide stack, queue and array access.
  • Set - mutable set of values.
  • Table - mutable table (map, dictionary) that maps strings to values.
  • Node - nodes in a genealogical (and other) records (e.g., lines in a Gedcom record or elements in an XML record).
  • Record - a record in a database, where the contents of the record can be treated as a tree of Nodes.
  • Any - any of the above.
  • Void - type with no values.
  • Error - used when run-time evaluation is impossible.
The language does not support user defined types though this feature could be provided if it seems useful.

Declarations as Statements
The original syntax for declarations had the form:

Integer a, b, c;

a type followed by a list of variables. The variables were added to the symbol table with default values. Declarations were not treated like statements and declarations had to precede statements in program blocks.

I have relaxed these restrictions in two ways. First, declarations can have initializers, expressions that give them initial values; for example:

Integer a = 1, b = 2, c = a + b;

Initializers are treated like any other expressions so are evaluated at run time. They do not have to be constants. Because they are processed left to right, the initializer for c in the example above can use the values of a and b and will have an initial value of 3. Initializers can also be used to initialize structured values, for example:

Set s = { "one", "two", "three"};

initializes Set s with three string elements.

I also removed the restriction that declarations had to occur before statements. They are now treated as any other statement type and can occur anywhere in a block of statements. They are interpreted as assignment statements that have the side effect of adding an entries to the block's symbol table.

Washing Bottles at DeadEnds Software

Introduction
This blog is for discussion of the DeadEnds genealogy software programs I am writing. As a software geek and practicing genealogist the posts will range from technical topics about implementing the software to thoughts on how genealogical research should be supported by software. I am writing DeadEnds for Apple's Mac OS X using Objective-C, the Cocoa frameworks and the Xcode development environment.

DeadEnds is a follow-on to the LifeLines program I wrote in the late 1980's early 1990's, now an open source project at Source Forge. LifeLines uses a Btree database to store genealogical and indexing records. The genealogical records are in a Gedcom format that is compatible with Gedcom standards. LifeLines is Unix-based and uses the curses/ncurses libraries to provide a primitive windows-based user interface. DeadEnds uses a more generic format for its genealogical records, one that generalizes both Gedcom and XML, so that data can be imported and exported in Gedcom, XML, or other similarly structured formats for holding hierarchical information. DeadEnds also provides more record types in order to support more of the genealogical research process.

A Note on Record Types
The Gedcom standard, such as it is, defines a number of record types, but only two, the INDI (individual or person) and FAM (family) records, hold genealogical data. These two types are suitable for representing the conclusions of genealogical projects, but are not adequate for representing the research data that are collected and that conclusions are based upon.

DeadEnds supports a larger set of genealogical records. A list of records supported by DeadEnds includes
  • Source - record describing the source of some genealogical evidence.
  • Evidence - record describing an item of information used to infer genealogical data.
  • Event - an event found in evidence, including the type, date and place of the event, and references to the person records of the role players from the event.
  • Person - the record of a single role player in a single event, including the person's name and any other information available (e.g., age, relationships to other persons).
  • PersonGroup - a record that groups together a tree of Person records inferred to refer to the same real person. DeadEnds software supports the building of PersonGroup records from Person records and other PersonGroup records in three ways: Manually; by algorithms built into the DeadEnds software; and by algorithms written in the DeadEnds programming language.
  • EventGroup - a record that groups together a tree of Event records...
  • Family - a conclusion record that brings together PersonGroup and EventGroup records.
  • Others