Name: ttm.txt
Created: July 22, 1998
Author: George J. Carrette
Subject: Text Template Manipulation Library (Documentation)
Revision: $Id: ttm.txt,v 1.17 1998/08/06 23:50:32 gjc Exp $

(C) COPYRIGHT 1998, George J. Carrette, Concord Massachusetts.
All Rights Reserved.

This document describes commercial software available for licensed
use. For information email GJC@DELPHI.COM

The Text Template Manipulation Library

TTM is an ANSI C language implementation of a set of functions for
reading, writing, linking, iterating over and making substitutions
into text templates. Its purpose is to let a C programmer write the
same kinds of syntactically pleasing expressions of purpose that are
possible in scripting languages, without suffering much runtime
overhead for it and without the danger that comes from making
template customization features available to users.

****************************
* Overview of Architecture *
****************************

The manipulation of text templates in this system is broken down into
four major phases:

 1. parsing.
 2. linking (optional).
 3. loading (typical).
 4. interpretation (e.g. generation of output).

In usual practice phases 1 and 2 take place at application
compile-time, and phases 3 and 4 take place at application runtime.

Consider a template file to be initially unstructured except for the
fact that it is composed of a sequence of characters. In TTM a
structure may be imposed on the text by declaring that special
sequences represent symbols or bracketing markers. These declarations
are commonly kept in an interface definition file. A simple parsing
mechanism is then used to recognize these symbols and brackets and
return an object structure for further manipulation.

The command line utility ttmparse generates a CPU architecture-neutral
representation of the nested object structure derived from a text file
when it is parsed with respect to an interface definition file.

This parsing mechanism is general but not rich in syntax.
It is flexible enough to allow your template files to appear to be as
complex as required for many applications; but beware that creating
too much of an appearance of a rich syntax can make it difficult for
the person editing templates to understand how to correct parsing
errors that result from deviations from the syntax.

The interface definition is made up of two kinds of declarations:

 1. A symbol must be declared to have a name and an identifying string.
 2. A bracket must have a type, name, and strings to identify the
    start and the end of the bracketed text.

Example: The interface definition file ttm-test1.int contains,

 symbol DATA_LIST $DATUM$
 symbol COLNAMES $COLNAME$
 symbol standard_copyright $COPYRIGHT$
 bracket FOR NCOLS ""
 bracket WHILE ROWS ""

The template file ttm-test1.html contains,

 EXAMPLE $COPYRIGHT$
 $COLNAME$
 $DATUM$

The resulting parse tree is:

 {OBJECT,, "EXAMPLE\n", standard_copyright, "\n\n"
  {FOR, NCOLS, " "}, " \n",
  {WHILE, ROWS, "\n ",
   {FOR, NCOLS, " "}, " \n"}, "\n
 ", COLNAMES, "
 ", DATA_LIST, "
 \n\n"}

Note that the entire template is represented as a bracketed object
with no name.

A linking capability, provided by the command line utility ttmlink, is
useful when complex application templates are to be built up from
collections of smaller components. For example:

 bracket OBJECT standard_copyright $COPYRIGHT$ $/COPYRIGHT$

 COPYRIGHT
 $COPYRIGHT$© 1994-1998 Foobar Inc.$/COPYRIGHT$

Results in:

 {OBJECT,"", "COPYRIGHT\n\n",
  {OBJECT,"standard_copyright", "© 1994-1998 Foobar Inc."},
  "\n\n"}

The ttmlink command used with the above examples will replace any
{SYMBOL,"standard_copyright"} in its input argument with the
{OBJECT,"standard_copyright",...} found in its library argument.

During the running of an application, especially during application
startup, a loading phase is typically performed using input from a
file or possibly from an embedded string compiled into the program.
This loading phase converts the architecture-neutral character code
representation of a compiled template into a nested C data structure
which is easier to interpret quickly.

The final interpretation of templates depends very much on the ability
to associate values or sequences of values with the symbols present in
the templates. Therefore a set of symbol-table support procedures
implementing an associative array of lists of strings is included
with the TTM library.

During interpretation different types of objects and certain types of
brackets have special meaning.

 STRING: invoke the output callback on the string.
 SYMBOL: invoke the lookup callback on the name, output the return
         value.
 OBJECT: output the object once, the name is ignored.
 FOR:    lookup the value of the name once, use it as a repeat count.
 WHILE:  lookup the value of the name each time, use it as a boolean.
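
The interpretation rules above can be sketched in a few lines of plain
C. This is a minimal illustration only: the node layout, the callback
typedefs, the truth rule in truthy(), the 0/-1 status convention and
the demo callbacks are all assumptions made for the example, not the
TTM_obj structure or the TTM_interpret implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout for illustration; not the TTM_obj definition. */
enum kind { K_STRING, K_SYMBOL, K_OBJECT, K_FOR, K_WHILE };

struct node {
    enum kind kind;
    const char *text;          /* string data, or the symbol/bracket name */
    const struct node *items;  /* bracket contents, in order */
    int nitems;
};

typedef const char *(*lookup_fn)(const char *name, void *arg);
typedef int (*output_fn)(const char *s, int len, void *arg);

/* Assumed truth rule: NULL, "", "0" and "false" count as false. */
static int truthy(const char *v)
{
    return v && *v && strcmp(v, "0") != 0 && strcmp(v, "false") != 0;
}

static int interpret(const struct node *n, lookup_fn lookup, void *la,
                     output_fn out, void *oa);

static int run_items(const struct node *n, lookup_fn lookup, void *la,
                     output_fn out, void *oa)
{
    int i;
    for (i = 0; i < n->nitems; i++)
        if (interpret(&n->items[i], lookup, la, out, oa) != 0)
            return -1;
    return 0;
}

/* Returns 0 on success, -1 as soon as the output callback fails. */
static int interpret(const struct node *n, lookup_fn lookup, void *la,
                     output_fn out, void *oa)
{
    const char *v;
    long count, i;

    switch (n->kind) {
    case K_STRING:
        return out(n->text, (int)strlen(n->text), oa);
    case K_SYMBOL:
        v = lookup(n->text, la);
        return v ? out(v, (int)strlen(v), oa) : 0;
    case K_OBJECT:                 /* emitted once, name ignored */
        return run_items(n, lookup, la, out, oa);
    case K_FOR:                    /* name looked up once: repeat count */
        v = lookup(n->text, la);
        count = v ? strtol(v, NULL, 10) : 0;
        for (i = 0; i < count; i++)
            if (run_items(n, lookup, la, out, oa) != 0)
                return -1;
        return 0;
    case K_WHILE:                  /* name looked up before every pass */
        while (truthy(lookup(n->text, la)))
            if (run_items(n, lookup, la, out, oa) != 0)
                return -1;
        return 0;
    }
    return -1;
}

/* Demo callbacks: ROWS is true twice, NCOLS is 3, any other name is "x". */
struct demo { int rows_left; char buf[64]; };

static const char *demo_lookup(const char *name, void *arg)
{
    struct demo *d = arg;
    if (strcmp(name, "NCOLS") == 0) return "3";
    if (strcmp(name, "ROWS") == 0) return d->rows_left-- > 0 ? "true" : "false";
    return "x";
}

static int demo_output(const char *s, int len, void *arg)
{
    struct demo *d = arg;
    size_t used = strlen(d->buf);
    if (used + (size_t)len >= sizeof d->buf) return -1;   /* size guard */
    memcpy(d->buf + used, s, (size_t)len);
    d->buf[used + (size_t)len] = '\0';
    return 0;
}

/* Renders {OBJECT, {WHILE ROWS, {FOR NCOLS, DATUM}, "\n"}} into d->buf. */
static const char *render_demo(struct demo *d)
{
    static const struct node cell[] = { { K_SYMBOL, "DATUM", NULL, 0 } };
    static const struct node row[] = { { K_FOR, "NCOLS", cell, 1 },
                                       { K_STRING, "\n", NULL, 0 } };
    static const struct node top[] = { { K_WHILE, "ROWS", row, 2 } };
    static const struct node root = { K_OBJECT, "", top, 1 };
    d->rows_left = 2;
    d->buf[0] = '\0';
    return interpret(&root, demo_lookup, d, demo_output, d) == 0 ? d->buf : NULL;
}
```

Calling render_demo() on a struct demo yields two rows of three "x"
values each. The size check in demo_output also shows one way to guard
against a WHILE whose condition never becomes false.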
*************************************************
* Architecture-neutral compiled template format *
*************************************************

This format is used for the output file of the parser command, the
input and output of the linker command, and the input to the loading
phase of the application runtime support.

It is a text format, in so far as it is made up of a sequence of
characters. However, since it depends on precisely counted string
lengths for proper operation, it will not function properly if file
transfer, source code revision management or other manipulation
corrupts the end of line sequences, the special opcodes or the length
representations. Therefore files should be opened and/or transferred
in an appropriate text mode (which could in fact be binary as far as
the operating system is concerned).

The general object representation is made up of a character opcode,
followed by a count encoded in the conventional Hindu-Arabic way using
decimal characters, followed by a space character, and finally
followed by the characters or objects which make up the object being
represented.

 OPCODE  meaning
 \n      a NOOP.
 #       comment. All data up to and including a newline are ignored.
 C       a string. length follows, then the data characters.
 S       a symbol. same encoding as for strings.
 B       a bracket. length follows, then the data which is read
         recursively. The first element of data is the bracket type,
         and the second is the name. The rest of the data follows.

************************
* Command line options *
************************

ttmparse input-filename [-i interface-filename] [-o output-filename]
         [-v verbose-level]

Reads the input file, parsing it with respect to the syntax specified
in the interface file. The interface-filename defaults to
input-filename-int.

The interface specification is a sequence of command lines.
The arguments to commands are delimited by whitespace, and may be
enclosed in double quote marks if the argument contains embedded
spaces. The commands are:

 # comment
 symbol SYMBOL_NAME SYMBOL_MARKER
 bracket BRACKET_TYPE BRACKET_NAME STARTING_MARKER ENDING_MARKER

--------------

ttmisplit input-filename [-istart istart] [-iend iend]
          [-i output-interface] [-t output-template]
          [-cit character-trans-permutation] [-v verbose-level]

Extracts and splits an interface specification and a template from a
file containing both. The interface section is the first text which
appears between the istart and iend sequences. In the output of the
interface any character found at index K in the -cit string is
replaced with the character found at index (K + L/2) mod L where L is
the length of the string. This allows a fairly arbitrary interface
specification to be embedded in a template file.

istart defaults to ""
The character-trans-permutation defaults to "{}<>"

--------------

ttmlink input-template-filename [-o output-filename]
        [input-object-filenames] [-v verbose-level]

The input-template-filename and all of the input-object-filenames are
loaded. Every symbol in the input template which has a bracket OBJECT
of the same name in one of the input-object-filenames is replaced by
the corresponding object, which is itself processed recursively for
potential replacements. Symbols which are undefined are left alone.

Note: This command will be implemented only if time permits.

--------------

ttmdebug input-filename

Display the parse-tree object representation of the data in the input
file in a way that is useful for debugging.
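
The -cit translation rule given for ttmisplit above, index K maps to
index (K + L/2) mod L, can be sketched as follows. The function name
cit_translate and the {ROW} markers in the usage note are made up for
the illustration; only the rule itself and the default "{}<>" string
come from the description above.

```c
#include <assert.h>
#include <string.h>

/* Replace every character of text that occurs at index K of perm with
   the character at index (K + L/2) mod L, where L = strlen(perm).
   Characters not present in perm are left unchanged. */
static void cit_translate(char *text, const char *perm)
{
    size_t len = strlen(perm);
    char *p;

    if (len == 0)
        return;
    for (p = text; *p != '\0'; p++) {
        const char *hit = strchr(perm, *p);
        if (hit != NULL)
            *p = perm[((size_t)(hit - perm) + len / 2) % len];
    }
}
```

With the default string "{}<>" the two halves swap, so a hypothetical
embedded line such as

 bracket FOR NCOLS {ROW} {/ROW}

comes out with its braces turned into angle brackets, which is how a
marker written with braces in the embedded interface can end up as an
angle-bracket marker in the extracted one.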
--------------

ttm2c input-filename [-o output-filename] [-t template] [-p procname]

This command produces a C language source file (output-filename
defaults to input-filename.c) which defines the function procname
(defaults to my_ttm), with these call semantics:

 int my_ttm(TTM_obj *ptr);

The *ptr is assigned to point to a structure which is the same as what
would be created by loading the input-filename using the library
function TTM_load("input-filename"). The return status is one of the
following:

 TTM_OK ................. ok to use the ptr, then free later with
                          TTM_obj_free();
 TTM_CANNOT_ALLOCATE .... cannot allocate storage required.
 TTM_RETURN_IS_STATIC ... ok to use the ptr, but do not free.
 TTM_ERROR .............. some other error.

The C code is generated by interpreting the template file with the
symbol BCODE_DATA_AS_C_STRING bound to a C syntax string constant.
That is a string starting and ending with double quotes, and with
internal backslashes, double quotes, tabs, newlines and other
characters properly escaped.

The default template for the C source code (the -t argument) is
ttm2c-template.c-bin (generated from ttm2c-template.c and
ttm2c-template.c-int).

 int $PROCNAME$(TTM_obj *obj)
 {return(TTM_load_string($BCODE_DATA_AS_C_STRING$, obj));}

This can be customized of course. A symbol named BCODE_DATA_AS_C_INTS
is also available to the template. This might be used to initialize a
constant string.

********************
* Library Routines *
********************

A TTM_Char datatype is used throughout, and string and character
constants are defined in special files so that the code works with
8 bit or large character code sets.

--------------

TTM_syntax TTM_syntax_new(void);
int TTM_syntax_free(TTM_syntax table);

These routines create and free a parser syntax table structure. The
temporary storage used by the TTM_parse procedure is attached to the
TTM_syntax object and is also freed.
--------------

int TTM_set_syntax(int argc,TTM_Char **argv,TTM_syntax table)

Used internally by the interface definition file parser.

--------------

int TTM_cmdparse(Const TTM_Char *str,TTM_Char *buffer,int *argc,
                 TTM_Char **argv, int *flags)

Used internally by the interface definition file parser and the
TTM_aeval procedure. Provides the conventional argc/argv style command
line parsing. Set argc to be the available dimension of the argv
array. The argv array will receive pointers into buffer, which should
be the same length as the input str. Arguments are broken up by
whitespace. Double quoted strings may contain spaces and a limited
number of special sequences such as \\ and \" and \n. The flags array
is set during parsing. If an argument was collected by double quote
parsing then its flag will be 1, else it will be 0.

--------------

int TTM_parse(TTM_syntax table,
              int (*gets)(TTM_Char *,int,void *),void *cb_arg,
              TTM_obj *result);

Parses a text template. The gets callback routine should return
TTM_OK, TTM_EOF, or TTM_ERROR, and fill its TTM_Char * argument with
characters from the input stream. It should read the precise number of
characters specified by its second argument. If it cannot, as in the
case of a partial read, it should return TTM_ERROR. The parse tree is
returned in the *result pointer.

--------------

int TTM_obj_free(TTM_obj result);

Free the text template object, which was either parsed or loaded from
someplace.

--------------

int TTM_load(int (*gets)(TTM_Char *,int,void *),void *cb_arg,
             TTM_obj *result);

Same callback and return semantics as for TTM_parse, but the sequence
of characters should be in the architecture-neutral compiled template
format.

--------------

int TTM_load_string(TTM_Char *str,TTM_obj *result)

The str contains the architecture-neutral compiled template format.
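
As a rough illustration of the counted-read contract and of the record
layout described for the compiled format (opcode, decimal count,
space, data), here is a sketch of a string-backed gets callback and a
reader for C and S records. The typedef and the status values are
stand-ins so that the fragment compiles on its own, and str_source,
str_gets and read_record are hypothetical names. Returning TTM_EOF
only when no input is left at all is also an assumption. Bracket (B)
records are left out.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins so the sketch is self-contained; the real declarations and
   values come from the TTM headers. */
typedef char TTM_Char;
enum { TTM_OK = 0, TTM_EOF = 1, TTM_ERROR = 2 };   /* assumed values */

struct str_source { const TTM_Char *data; size_t pos, len; };

/* Counted read: deliver exactly n characters or fail. */
static int str_gets(TTM_Char *buf, int n, void *arg)
{
    struct str_source *s = arg;
    size_t left = s->len - s->pos;

    if (n < 0) return TTM_ERROR;
    if (left == 0 && n > 0) return TTM_EOF;
    if ((size_t)n > left) return TTM_ERROR;        /* partial read */
    memcpy(buf, s->data + s->pos, (size_t)n);
    s->pos += (size_t)n;
    return TTM_OK;
}

/* Read one C (string) or S (symbol) record into out, skipping NOOP
   newlines and # comments. out must hold cap characters. */
static int read_record(int (*gets)(TTM_Char *, int, void *), void *arg,
                       TTM_Char *opcode, TTM_Char *out, int cap)
{
    TTM_Char c;
    int status, len = 0, digits = 0;

    if (cap < 1) return TTM_ERROR;
    for (;;) {
        if ((status = gets(&c, 1, arg)) != TTM_OK) return status;
        if (c == '\n') continue;                   /* NOOP */
        if (c != '#') break;
        do {                                       /* comment to end of line */
            if ((status = gets(&c, 1, arg)) != TTM_OK) return status;
        } while (c != '\n');
    }
    if (c != 'C' && c != 'S') return TTM_ERROR;
    *opcode = c;
    for (;;) {                                     /* decimal count, then space */
        int d;
        if (gets(&c, 1, arg) != TTM_OK) return TTM_ERROR;
        if (c == ' ') break;
        if (c < '0' || c > '9') return TTM_ERROR;
        d = c - '0';
        if (len > (cap - 1) / 10) return TTM_ERROR;
        len *= 10;
        if (d > cap - 1 - len) return TTM_ERROR;   /* would not fit in out */
        len += d;
        digits++;
    }
    if (digits == 0) return TTM_ERROR;
    if (gets(out, len, arg) != TTM_OK) return TTM_ERROR;
    out[len] = '\0';
    return TTM_OK;
}
```

Feeding it the text "C5 helloS4 NAME" yields the string "hello"
followed by the symbol "NAME". Because the data is taken by count and
not by delimiter, any change to the byte count of a record (such as an
end of line conversion) breaks every record after it.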
--------------

int TTM_unload(int (*puts)(TTM_Char *,int,void *),void *cb_arg,
               TTM_obj data);

This procedure converts the TTM_obj into the architecture-neutral
compiled template format. The puts callback should write the precise
number of characters specified in its second argument. If it cannot
then it should return TTM_ERROR, otherwise TTM_OK.

--------------

TTM_obj TTM_obj_lookup(TTM_obj data,TTM_Char *type,TTM_Char *name);

Find and return the first (depth first search) bracket of the
specified type with the specified name, or NULL if none is found.

--------------

int TTM_interpret(TTM_obj data,
                  TTM_Char * (*sym_lookup)(TTM_Char *,void *),void *cb1,
                  int (*output)(TTM_Char *,int,void *), void *cb2);

Recursively descend the TTM_obj tree, using the sym_lookup callback on
symbols which need to be evaluated and the output callback on the
output generated. The OBJECT, FOR, and WHILE bracket types are
specially handled as described in the architectural overview. The
output callback should return TTM_OK or TTM_ERROR. Upon seeing
TTM_ERROR the interpreter will immediately return. The action routines
for the special bracket types are defined in the interpret_table array
in ttmruntime.c which could be judiciously extended.

--------------

int TTM_interpret_size(TTM_obj obj,
                       TTM_Char *(*lookup)(TTM_Char *,void *),
                       void *cb_arg, long *limit)

Does interpretation but only computes the size of the output. Set
*limit to the maximum size wanted, or -1 if no limit. When the return
value is TTM_OK then *limit will be set to the actual output size.
Note that because WHILE constructs can easily result in infinite
output it might be very important to call TTM_interpret_size first in
an application, or to otherwise guard against infinite output by
putting size checks in the output callback.

--------------

TTM_array TTM_anew(int size_hint)
int TTM_afree(TTM_array table);

These procedures create and free an associative-array string values
list object.
The size_hint should be a reasonable approximation of how many keys
are expected to be stored in the array.

--------------

TTM_Char *TTM_aref(TTM_array a,TTM_Char *name);

Look up the name in the associative array and return the first item in
its list of values, if any. If the list is empty because it has been
popped to the end then instead of returning NULL return the previous
value. In effect this causes all lists to be infinitely long, circular
in their last element. An initially counterintuitive but useful
effect.

--------------

TTM_Char *TTM_apop(TTM_array a,TTM_Char *name);

Like TTM_aref, but modify the current list pointer to point to the
next element.

TTM_Char *TTM_apop_cb(TTM_Char *name,void *array)

Argument pattern suitable for use as a callback to TTM_interpret.

--------------

TTM_list *TTM_alookup(TTM_array a,TTM_Char *name);

Return the current list of values associated with the given name.

--------------

int TTM_aconc(TTM_array a,TTM_Char *name,TTM_Char *vstart,
              TTM_Char *vend);
int TTM_aconc_l(TTM_array a,TTM_Char *name,long value);

Insert a value at the end of the list associated with the specified
name. Create a new list if needed. If vend is NULL then the entire
zero-terminated string at vstart is used as the value.
Return status: TTM_OK, TTM_CANNOT_ALLOCATE

--------------

int TTM_aset(TTM_array a,TTM_Char *name,TTM_Char *vstart,
             TTM_Char *vend);
int TTM_aset_l(TTM_array a,TTM_Char *name,long value);

Change the current (first) value in the list associated with the
specified name, create a new list if needed.
Return status: TTM_OK, TTM_CANNOT_ALLOCATE

--------------

int TTM_alength(TTM_array a,TTM_Char *name)

Return the current length of the list of values associated with the
specified name.

--------------

int TTM_arewind(TTM_array a,TTM_Char *name);

Reset all the popped list pointers (caused by apop calls) for all
symbols in the table or for only the specified name.
Returns: TTM_OK, TTM_ERROR.
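
The "circular in their last element" behavior of TTM_aref and
TTM_apop, together with the effect of a rewind, can be modeled with a
list and a cursor. The struct and the vl_* functions below are
hypothetical stand-ins written for this illustration; they are not the
TTM_list or TTM_array implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical value list with a pop cursor; not the TTM_list layout. */
struct vlist {
    const char **values;
    int count;
    int cursor;            /* index of the next value to pop */
};

/* Current value, or the last value once the list has been popped to
   the end. Only a list that never had values yields NULL. */
static const char *vl_ref(const struct vlist *l)
{
    if (l->count == 0)
        return NULL;
    return l->values[l->cursor < l->count ? l->cursor : l->count - 1];
}

/* Like vl_ref, then advance the cursor. */
static const char *vl_pop(struct vlist *l)
{
    const char *v = vl_ref(l);
    if (l->cursor < l->count)
        l->cursor++;
    return v;
}

/* Undo the effect of earlier pops. */
static void vl_rewind(struct vlist *l)
{
    l->cursor = 0;
}
```

Popping a two element list {"a", "b"} three times gives "a", "b", "b".
This is why a single value stored under a name can be referenced any
number of times in a template without being re-added.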
--------------

int TTM_astream(TTM_array,TTM_Char *name,
                TTM_Char *(*cb)(int,TTM_Char *,TTM_Char *,TTM_Char *,
                                void *),
                void *callback_arg);

Registers a callback to handle the following array operations for the
specified name: TTM_AFREE, TTM_AREF, TTM_APOP, TTM_ACONC, TTM_ASET,
TTM_AREWIND. The first argument to the callback is the operation. The
second is the name being operated on. The third and fourth are the
start and end

--------------

int TTM_acircular(TTM_array,const TTM_Char *name,int mode)

Sets the circularity mode of the named string list. The mode controls
the behavior of the TTM_aref and TTM_apop operations:

 TTM_CIRCULAR_NONE. no circularity.
 TTM_CIRCULAR_LAST. last element repeats forever (the default mode).
 TTM_CIRCULAR_WRAP. wraps around to start of list.

--------------

TTM_Char *TTM_abuffer(TTM_array a,TTM_Char *name,size_t len);

Returns an internally allocated buffer (to be freed with the array) of
at least the specified size. This is used internally by TTM_aeval.

--------------

TTM_Char *TTM_aeval(TTM_Char *expression,TTM_array a)

This is the most common callback to use for symbol lookup in calls to
the TTM_interpret procedure.

If the expression does not start with any special characters ("." or
"'" or "@") then it is passed to TTM_apop.

If the expression starts with a dot, e.g. ".NCOLS" then the TTM_aref
procedure is called.

If the expression starts with a single quote, e.g. "'HELLO GUYS" then
the result is the rest of the string, unprocessed.

If the expression starts with an at sign, e.g. "@length COLS" then the
rest of the string is parsed with TTM_cmdparse and an action procedure
is called. The action routine table is defined in ttmaeval.c and
points to procedures of the type

 TTM_Char *(*)(int argc,TTM_Char **argv,TTM_array env)

In the evaluation of the arguments to "@" procedures the general rule
is that if an argument starts with "." then TTM_aref is used to get
the value for further processing, if "'" then no evaluation is done,
otherwise TTM_apop is used. There is no other processing such as
recursive evaluation.

The predefined action procedures are:

"@length NAME"
  Return the current length of the list which NAME is bound to. Useful
  in a FOR construct. As an exception to the argument evaluation rule
  the state of NAME is not modified, and if it is given as .NAME the
  original (such as after a rewind) length of the value list is
  returned.

"@not_empty NAME"
  Return true if NAME is bound to a non-empty list, false if empty.
  Useful in a WHILE construct.

"@not NAME" or "@not .NAME"
  Either TTM_apop or TTM_aref on NAME. Then, logically negate, such
  that "true" -> "false", "false" -> "true", (N!=0) -> 0, 0 -> 1.

"@select N value0 value1 value2 ..."
  Fetch the value of N then use it to select from the rest of the
  arguments. Fetch the value of the selected argument.

"@if_equal key1 key2 value1 value2"
  If the value of key1 is equal to the value of key2 then return the
  value of value1 else use value2.

"@and x1 x2 x3 ..."
  Logical and operation.

"@or x1 x2 x3 ..."
  Logical or operation.

"@default key default_string"
  If the key is bound then return its value, otherwise return the
  value of the default_string.

"@debug_print n"
  Return a string (limit to size n, default 2048) containing a record
  of the current values of all the keys in the associative array.

"@url value"
  Return the url-encoded string for value (actually it gets html
  encoded too after being url coded, for obvious reasons).

"@html value"
  Return the html-encoded string for value.

"@query_string a b c"
  Return a url style query string, e.g. a=value_of_a&b=value&c=value.
  Properly url encoded and then html encoded.

TTM_Char *TTM_aeval_cb(TTM_Char *expression,void *a)

This version is suitable for use as an argument in a call to
TTM_interpret without having to use a cast.
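
The prefix rules of TTM_aeval and the negation table of "@not" can be
summarized in a small sketch. The enum, classify() and logical_not()
are illustrative names only and do not appear in ttmaeval.c. Treating
a value that is neither "true", "false" nor a number as 0 is an
assumption, since the text above does not say what happens in that
case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* What an expression's first character asks for (illustrative names). */
enum action { ACT_POP, ACT_REF, ACT_LITERAL, ACT_COMMAND };

/* Classify an expression and point *rest past any prefix character. */
static enum action classify(const char *expr, const char **rest)
{
    switch (expr[0]) {
    case '.':  *rest = expr + 1; return ACT_REF;      /* TTM_aref        */
    case '\'': *rest = expr + 1; return ACT_LITERAL;  /* rest, as is     */
    case '@':  *rest = expr + 1; return ACT_COMMAND;  /* cmdparse + call */
    default:   *rest = expr;     return ACT_POP;      /* TTM_apop        */
    }
}

/* "true" -> "false", "false" -> "true", (N != 0) -> "0", 0 -> "1".
   Assumption: any other text parses as 0 via strtol, so it gives "1". */
static const char *logical_not(const char *value)
{
    if (strcmp(value, "true") == 0)  return "false";
    if (strcmp(value, "false") == 0) return "true";
    return strtol(value, NULL, 10) != 0 ? "0" : "1";
}
```

For example, classify(".NCOLS", &rest) reports ACT_REF with rest
pointing at "NCOLS", while a plain "DATA_LIST" reports ACT_POP, which
is what makes an unadorned symbol step through its list of values each
time it is interpreted.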
*****************
* Example Usage *
*****************

There is no need to call the syntax and parse related procedures in
normal usage because the functionality will be taken care of by the
command line utilities.

The vast majority of use is probably covered by:

 1. a call to TTM_load
 2. a call to TTM_anew
 3. multiple calls to TTM_aconc to establish values for symbols.
 4. possibly calls to TTM_astream to register callbacks for symbol
    lookup.
 5. a call to TTM_interpret_size to determine the size of the output.
 6. a call to TTM_arewind to reset the string value list and stream
    pointers.
 7. a call to TTM_interpret to generate actual output.
 8. calls to TTM_afree and TTM_obj_free.

The source file ttm-test1.c illustrates all of these issues, including
nested repeating objects and the FOR and WHILE constructs, when used
with the ttm-test1.html template file and ttm-test1.int interface
file. It may be used to "test-drive" other templates as well. Its
first argument is the name of the template, and the rest of the
arguments are in pairs of NAME VALUE.

**************
* Test Cases *
**************

These are from the makefile.

 test1 ... tests the parser. Verbose output shows how a file is broken
           up into tokens and the resulting parse tree.
 test2 ... tests the interface-section splitter.
 test3 ... against split output from test2.
 test4 ... tests the compiled template runtime loader. The parse trees
           shown should be the same as from test1 and test3.
 test5 ... tests the parser against known bad input. A "best guess"
           parse tree is shown, along with the error trace from the
           parser.
 test6 ... tests the runtime loader and interpreter.
 test7 ... tests the C source static data structure feature, using the
           ttm2c command and the C compiler.

************************
* Implementation Notes *
************************

The most natural and efficient way to code the tokenizer phase of the
parser would be with a good regular expression recognizer. However an
existing library with a suitable call interface (one which would avoid
the need for a string buffer, which rules out the unmodified Henry
Spencer) was not at hand. So, in the interest of providing a low cost
implementation in a context where having the ultimate efficiency at
compile time was not a big issue, an algorithm involving multiple
string-searches and recursive substring partitioning was used instead.
The bracket syntax parsing itself is of course the simplest possible.

The parsed object representation is convenient for both linking and
interpretation, because it is a straightforward C data structure
involving C pointers. It is possible but not likely that memory
fragmentation and locality of reference issues may be important in
some applications. There is of course storage overhead involved in
each call to malloc used to create a parse tree, and wasted
most-significant bit overhead in pointers which are relatively near
the object from which a pointer radiates. Unfortunately you cannot
expect that straightforward C or C++ will be as efficient as a good
LISP implementation would be when dealing with tree structures. But
avoiding the use of casts and overlaid data structures is important
for code maintainability. An alternative is to give up most C provided
structure and write low-level routines to manipulate a bytecoded
representation (such as what is saved to disk) directly.

The associative array and linked list of values implementation is
conventional.

Thread reentrancy issues: There are no global or static variables in
the entire library. However, multiple simultaneous calls to functions
operating on the same TTM_syntax or TTM_array structures should not be
made, because these structures have internal pointers which are
updated by most of the functions which operate on them. The TTM_obj
structure on the other hand is not modified by any procedure.

String constants: These are defined in ttmprivate.h and are the
minimum required to implement the functionality.

A text encoding of objects in the compiled templates is only slightly
slower to read and internalize than a more binary oriented format
would be, because most of the time is probably spent in the
counted-read operations, which would be the same in a binary format.

Under systems with Unix inspired operating system call semantics
beware of infecting the callback routines used by procedures such as
TTM_load and TTM_interpret with partial read or write artifacts of
implementation.

In extending the special functions table in ttmaeval.c beware not to
make it too long.
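
On the partial read or write caution above: with the POSIX write()
call a single request may transfer fewer bytes than asked for, so an
output callback built on it has to loop until the whole count is
delivered. The helper below is a sketch of that loop. write_all is a
hypothetical name, and its 0/-1 return convention would still have to
be mapped onto TTM_OK and TTM_ERROR by the callback that uses it.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write exactly n bytes to fd, retrying after short writes and after
   interruption by a signal. Returns 0 on success, -1 on failure. */
static int write_all(int fd, const char *buf, int n)
{
    while (n > 0) {
        ssize_t done = write(fd, buf, (size_t)n);
        if (done < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;
        }
        if (done == 0)
            return -1;             /* no progress; give up */
        buf += done;               /* skip what was already written */
        n -= (int)done;
    }
    return 0;
}
```

The same pattern applies in the other direction: a gets callback built
on read() should keep reading until it has the full count, and report
an error only when the input really ends early.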