DeadEnds Genealogical Software: April 2010

Thursday, April 15, 2010

DeadEnds Interpreter, Expression Values

The most common operation that occurs when the DeadEnds Interpreter is running a program is expression evaluation. Within the context of the interpreter an expression is a semantic entity, an object of a sub-class of TWInterpExpression, created when a DeadEnds program was parsed. Expressions are evaluated by their evaluate: methods that are passed a symbol table (an object of class TWInterpSymbolTable) that holds the values the variables at the time of the call.

Here are the forms that a value can take.

As a TWInterpExpression

A TWInterpExpression object is the semantic representation of a bit of programming language syntax, part of the internal form of a parsed module. These expressions are recursive. For example, the expression found in a DeadEnds program:

i*alpha + b*21

is represented as a tree of seven TWInterpExpressions, three for the variables, one for the integer constant, two for the multiplication operators, and one for the plus operator. It is better to think of this structure as the potential for computing a value, rather than as the value itself. The process of evaluation, implemented by the evaluate: methods, is to convert these trees of semantic objects in a single real value. The evaluate: method for an integer constant expression (an object of type TWInterpIntegerConstant, a subclass of TWInterpExpression), returns the integer constant as an R-Value (see below). The evaluate: method for a variable (an object of type TWInterpVariable, also a subclass of TWInterpExpression) looks up the variable name in the sequence of symbol tables that are currently active and returns the variable's value as an R-Value. The evaluate: method for a binary expression (an object of type TWInterpBinaryExpression, also a subclass of TWInterpExpression) recursively calls the evaluate: methods on its left and right sub-expressions to get two R-Values, combines those values using the mathematical operator, and returns the overall result as a new R-Value.

As final, evaluated, values. These values are Booleans, Unicode characters, integers, floating point numbers, Unicode strings, lists, sets, tables, nodes and records. These are the types that a DeadEnds programmer thinks in. The DeadEnds Interpreter allows the user to write programs and think in terms of values of these types.

Behind the scenes the DeadEnds Interpreter must represent these values in a consistent form. There are four of these forms:

Primitive values

Values at this level are the lowest form possible. For example an integer is a 32 or 64 bit word on the computer. A list is an NSArray from the Foundation framework.

Boxed Values

Values at the next level up are Objective-C objects that represent the primitive values. Boxing is the term used to mean creating an object type for a primitive type in order to allow values of the primitive type to have all the benefits of objecthood. In the case of the DeadEnds Interpreter, of the Boolean, Character, Integer, and Float type are all boxed as NSNumber objects from the Cocoa Foundation. The String type is represented at the primitive level as an NSString object, which is an object, so have the same value at the both the primitive and boxed levels. The List, Set, and Table values at this level are the same as at the primitive level, objects of the Foundation classes NSArray, NSSet and NSDictionary. The Primitive and Boxed values of Node and Record objects are TWGedcomNode and TWGedcomRecord objects, which are classes defined in the TWGedcom Objective-C library.

R-Values

An R-Value is nothing more than a Boxed Value and a type code to indicate what kind of value is boxed. The values returned by all evaluate: methods are R-Values.

L-Values

L-Values are references to R-Values, and that's all that they are. A DeadEnds symbol table (an object of class TWInterpSymbolTable) contains a table (NSMutableDictionary) that maps variable names (NSStrings) to L-Values (objects of class TWInterpLValue). That L-Value refers to the R-Value that holds the boxed value of the variable. Consider the following code from a DeadEnds Program

Integer i = 0;

i = i + 1;

After the first line, the declaration of i, is interpreted there will be a new mapping added to the current symbol table. The variable name i maps to a newly created L-Value that refers to a newly created R-Value that contains a boxed NSNumber object that contains the Integer 0. When the assignment expression (the assignment expression is a top level binary expression whose left expression is the variable i and whose right expression is the binary expression i + 1. To evaluate an assignment expression first the right hand side is evaluated into an R-Value. In this case that R-Value is computed by looking up the R-Value of the variable i in the symbol table and adding 1 to it, so in this case a new value of 1. That value is boxed into an NSNumber and put in a new R-Value. Now the actual assignment occurs. Here the evaluator looks up the variable in the symbol table, this time interested in its L-Value and the R-Value that the L-Value refers to. The boxed value of the R-Value computed from the right hand side is used to replace the boxed value of the R-Value to the L-Value refers to. This has the desired effect of changing the value of the variable i. The next time an expression includes this variable, the variable's L-Value refers to an R-Value that contains a boxed NSNumber object with a primitive value of 1. It's not as complicated as this description probable makes it seem.

Tuesday, April 13, 2010

DeadEnds Interpreter, Memory Management

The DeadEnds interpreter is written in Objective-C using Apple's Foundation framework, part of the Cocoa platform. Two dynamic memory management models are supported in this environment, retain/release and garbage collection. I opted to use the retain/release model because it forces a more careful analysis of dynamic memory usage. In the retain/release model each allocated object has a retain count associated with it. As long as the retain count is greater than zero the object remains in existence. When the retain count drops to zero the system will reclaim the object. Any software that needs to maintain access to an object should increment its retain count (though see description of weak references below), and is responsible for decrementing (releasing) the retain count when access is no longer required. Simple rules are used for managing retain/release objects:

Allocation methods (alloc) and copy methods (name starts with copy) create objects that come into existence with a retain count of one. Code that calls these methods are responsible for releasing the object.
If software creates an object on behalf of a caller, it calls autorelease on the object before returning the object. This causes a release to occur in the future, giving the caller a chance to retain the object before it disappears.

When one object refers to a second object, and the first object retains a reference to the second object, the reference is called a strong reference. This is the usual convention. When the first object is released, it is responsible for first releasing its strongly referenced objects. If an object refers to a second object, but does not retain the reference to second object, the reference is called a weak reference. Weak references are used in situations where there are specific conventions in use for memory management that can assure that the second object will not be released while the first object still requires access to it. As we will see below, this convention is used by one of the object sets employed by the DeadEnds interpreter.

During interpretation there are two sets of objects created. The first are the semantic values that make up the internal format of the program modules. These are the objects created by the parser when reading the program files. Object creation occurs within semantic actions, snippets of Objective-C code that runs whenever a grammar rule is recognized and reduced. These objects are allocated and initialized by the semantic actions so they have a retain count of one. As parsing proceeds trees of these semantic objects are built up and ultimately each program module is represented by a single semantic object (a TWInterpModule) that has a tree of semantic objects below it representing the body of the module. Because each semantic object is created by the parser, and because there is no need for any other object to retain these objects, the references between all semantic objects and the objects they refer to are weak references. Therefore the initializer methods for semantic objects never retain references to other semantic objects. In fact, in the current version of the interpreter, once a module has been parse and converted to semantic tree form it is never released.

The second set of objects are the values created by evaluating expressions while methods are being actively interpreted (executed/run). There are a number of expression classes, all sub-classes of TWInterpExpression. Each of these classes implements an evaluate: method that takes a symbol table (TWInterpSymbolTable) as an argument. Each evaluate: methods returns an r-value (TWInterpRValue) object. The memory management rule used in this case is r-values returned by evaluate: methods are already retained with a count of one. So evaluate: methods use the same rules as alloc and copy methods do: it is the responsibility of the callers of evaluate: methods (which are often evaluate: methods higher in a recursive evaluation of a complex expression) to release all TWInterpRValue objects it receives from the evaluate: methods it calls.

To give an example of this second approach in action, here is psuedo code for the evaluate: method for a binary expression.

Call evaluate: on the left sub-expression resulting in a retained left r-value object.
Call evaluate: on the right sub-expression resulting in a retained right r-value object.
Create a new r-value (using alloc) with value computed from the values of the two just computed r-values. This new r-value has a retain count of one.
Release the two r-values computed in steps 1 and 2.
Return the resultant r-value computed in step 3.

The details are more complex that indicated here, but this gives a good introduction to the convention used.

Thursday, April 8, 2010

Objective-C Gedcom Library, Part One

As a precursor to writing DeadEnds software I implemented an Objective-C library that handles Gedcom. I have used this library to build a Gedcom validation program and plan to use it to build a replacement for the LifeLines program. In this post I'll describe parts of the Gedcom library.

TWGedcomSource and TWGedcomFile

TWGedcomSource is an abstract class representing a source of Gedcom records and and the errors encountered while processing them. TWGedcomFile is a sub-class of TWGedcomSource that uses a text file as a Gedcom source. It has an initializer method that reads a file containing Gedcom data and creates the TWGedcomRecordSet declared in TWGedcomSource. If any errors are encountered they are added to the TWGedcomErrors object.

@interface TWGedcomSource : NSObject

{

TWGedcomRecordSet* recordSet;

TWGedcomErrors* errors;

}

@interface TWGedcomFile : TWGedcomSource

- (id) initWithContentsOfFile: (NSString*) path;

@end

TWGedcomFile's initWithContentsOfFile: method is the method that reads a Gedcom file into sets of Gedcom records. A Gedcom file can be in a number of different character set encodings (e.g, ASCII, ANSEL, UTF-8), and this initializer method calls a number of other methods to help determine that set. After determining the character set, the initializer reads the file into an NSString using that encoding. Since ANSEL is a character set that is not supported by NSString, code in this class converts ANSEL to Unicode when needed. At the end of initialization, if all went well, the recordSet member variable, defined in TWGedcomSource, is initialized to the set of Gedcom records found in the file. If there were errors, the errors member variable will hold the list of errors encountered.

TWGedcomRecordSet

A TWGedcomRecordSet contains sets of Gedcom records, one for each of the officially defined (Gedcom 5.5) record types, one for any other custom, records that are found in the source, and one for records that were found to have errors too serious to be put in one of the other records sets. The records are created by calling the class's initializer method, initWithString:errors:, which takes an NSString containing Gedcom data, and converts that string into the sets of Gedcom records. The records are indexed by their cross reference keys. The records are also validated for a wide variety of error conditions, and any errors found are recorded in the TWGedcomErrors object passed to the initializer. The TWGedcomRecordSet initializer is called by the TWGedcomFile initializer after that initializer has analyzed the character set of the original file and read it into an NSString, which is always in Unicode format.

- (id) initWithString: (NSString*) string errors: (TWGedcomErrors*) errors;

This is the initializer that converts an NSString into sets of Gedcom records and validates those records for errors.

- (TWGedcomPersonRecord*) personWithKey: (NSString*) key;

- (TWGedcomFamilyRecord*) familyWithKey: (NSString*) key;

- (TWGedcomSourceRecord*) sourceWithKey: (NSString*) key;

These three methods retrieve the records with the given key.

- (NSArray*) personsWithName: (NSString*) name;

- (NSArray*) personsWithNameKey: (NSString*) nameKey;

These two methods retrieve the set of persons who match a given name or Soundex name key. Person records are indexed by name as well as by key, and the indexing is done based on Soundex. When using the method personsWithName:, the name string argument can be loosely formatted. There are two requirements on the string; first, that the Soundex code of the person can be computed from it; and two, that the characters found in the string are a subset of the characters in the actual name, and appear in the same order as in the name. Both of these methods return an NSArray of all person records that have NAME lines matching the name argument.

- (TWGedcomPersonRecord*) fatherOfPerson: (TWGedcomPersonRecord*) person;

- (TWGedcomPersonRecord*) motherOfPerson: (TWGedcomPersonRecord*) person;

- (NSArray*) childrenOfPerson: (TWGedcomPersonRecord*) person;

- (TWGedcomFamilyRecord*) natalFamilyOfPerson: (TWGedcomPersonRecord*) person;

- (NSArray*) natalFamiliesOfPerson: (TWGedcomPersonRecord*) person;

- (NSArray*) spousalFamiliesOfPerson: (TWGedcomPersonRecord*) person;

- (NSArray*) spousesOfPerson: (TWGedcomPersonRecord*) person;

These methods retrieve persons related to a given person, or families that the given person belongs to in some role.

- (TWGedcomPersonRecord*) husbandOfFamily: (TWGedcomFamilyRecord*) family;

- (TWGedcomPersonRecord*) wifeOfFamily: (TWGedcomFamilyRecord*) family;

- (NSArray*) childrenOfFamily: (TWGedcomFamilyRecord*) family;

- (NSArray*) husbandsOfFamily: (TWGedcomFamilyRecord*) family;

- (NSArray*) wivesOfFamily: (TWGedcomFamilyRecord*) family;

These methods return persons who play the indicated roles in families.

TWGedcomRecord

The TWGedcomRecord class is both the superclass for record classes that have distinct behavior and content (person, family and source records), and the concrete class for the other record types. This is perhaps unconventional and may change.

The class has a key class method:

+ (NSArray*) gedcomRecordsFromStrings: (NSArray*) strings maxCount: (NSInteger) maxCount errors: (TWGedcomErrors*) errors;

This method reads an array of NSStrings, each containing a Gedcom line, and converts the strings into an array of TWGedcomRecords (using the subclass types for persons, families and sources). This method does the character-level parsing of the Gedcom data, converting it first into trees of TWGedcomNodes and then creating a TWGedcomRecord for each tree. When done, the method returns the set Gedcom records as an NSArray of TWGedcomRecords. This method adds any errors encountered to the TWGedcomErrors object. This class method is called by the TWGedcomRecordSet initializer. Once that initializer gets the array of records, it places each record in its specific subset and indexes it. The TWGedcomRecordSet initializer also does more extensive checks on the set of records as a whole, which may add more errors to the TWGedcomErrors object.

TWGedcomPersonRecord, TWGedcomFamilyRecord, TWGedcomSourceRecord

Three of the Gedcom record types have their own classes in the Objective-C library, these being the person (INDI), family (FAM) and source (SOUR) records. This allows them to have methods specific to their own type.

TWGedcomNode

A TWGedcomNode holds the data of a single line of Gedcom. A TWGedcomNode is a subclass of the TWNode class, which is the class that DeadEnds software uses to hold XML-based data. Because of this, Gedcom data within a DeadEnds program can be treated as if it were XML-based, including being written or transmitted in XML format.

DeadEnds Genealogical Software