Here is the outline of the class:
Class to handle SAX parsing functions =
class LiterateDocumentHandler(saxlib.DocumentHandler):
    {Class-wide constants}
    def __init__(self):
        {Initialize object variables}
    {Overrided document handling methods}
    {Auxillary document methods}
|
Since our program is based on processing instructions, the
SAX processing instruction handler is the key function in the
program.
Overrided document handling methods =
def processingInstruction(self, target, data):
    {Initialize processing instruction variables}
    {Parse processing instruction into attribute-value pairs}
    {Call appropriate method for processing instruction target}
|
Processing instructions present a parsing problem. Although we
want to structure our processing instructions like elements, with
attribute-value pairs, XML does not specify anything about
how they are formatted, so the parser just hands you the entire
content in one string. Therefore, we
need to write code to parse the data string into attribute-value
pairs. To keep this simple, we will use regular expressions,
and we will also require that attribute values be in
double quotes, not single quotes.[1]
The following code will parse the variable data into the
dictionary pi_attrs.
Parse processing instruction into attribute-value pairs =
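while 1:
    try:
        match = self.PIRegex.search(data, regex_start)
        pi_attrs[match.group(1)] = match.group(2)
        # match.end() is already an offset into data, so resume there.
        regex_start = match.end()
    except AttributeError:
        # search() returned None: no more attribute-value pairs.
        break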
|
The variables used here are initialized in
Initialize processing instruction variables. Here is what each variable does:
- data
This is the character string after the processing instruction
target. It is passed in as a parameter.
- match
This object holds all of the information about the match made.
- regex_start
This is the position in the data string from which we are currently searching.
It starts at 0, so we have to initialize it at the beginning
of the function.
- pi_attrs
This is the dictionary that holds the result of our parsing.
It has to be initialized at the beginning of the function.
- self.PIRegex
This is a precompiled regular expression object.
It is created once, when the handler object is
initialized.[2] It is initialized
as follows:
Initialize object variables +=
self.PIRegex = sre.compile('([a-z-]+)="([^"]*)"')
|
As you can see, it matches any string of lowercase letters (which can include dashes as well), followed by an equals sign and a double-quoted value. It would be nice to get this closer to the actual parsing of element attributes, but I don't have the XML spec handy.
The parsing section is wrapped in a try/except block. This could
be avoided with boundary checking and "no-match" checking, but
simply doing it this way meant I could avoid dealing with these
issues. The one drawback to this method is that errors within
the processing instructions are neither caught nor reported. This
could be improved.
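The boundary checking and "no-match" checking mentioned above could look roughly like the following sketch. It uses the modern re module in place of sre, and the function name parse_pi_data and the idea of returning the unparsed leftover text are assumptions made for illustration; they are not part of the program.

```python
import re

# Same pattern as self.PIRegex, compiled with the modern re module.
PI_REGEX = re.compile(r'([a-z-]+)="([^"]*)"')

def parse_pi_data(data):
    """Return (attrs, unparsed) for a processing instruction data string.

    attrs maps attribute names to values. unparsed holds any
    non-whitespace text that did not match, so the caller can report it.
    """
    attrs = {}
    leftover = []
    pos = 0
    while pos < len(data):
        match = PI_REGEX.search(data, pos)
        if match is None:
            # No more pairs: whatever remains was not parsed.
            leftover.append(data[pos:])
            break
        # Keep the text skipped between the previous pair and this one.
        leftover.append(data[pos:match.start()])
        attrs[match.group(1)] = match.group(2)
        pos = match.end()
    return attrs, ''.join(leftover).strip()
```

With this shape, a caller can treat a non-empty second value as an error in the processing instruction, for example a single-quoted attribute, instead of silently ignoring it.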
Finally, after the processing instruction is parsed, a
method is dispatched based on the processing instruction
target. Right now, this is just a sequence of ifs. I
think I will move it to a target-method dictionary in a
future version. Also, I need to move the string constants
to the constants section of the program.
Call appropriate method for processing instruction target =
if target == 'lp-section-id':
    self.start_section_id(pi_attrs, data)
elif target == 'lp-section-id-end':
    self.end_section_id(pi_attrs, data)
elif target == 'lp-code':
    self.start_code(pi_attrs, data)
elif target == 'lp-code-end':
    self.end_code(pi_attrs, data)
elif target == 'lp-ref':
    self.start_ref(pi_attrs, data)
elif target == 'lp-ref-end':
    self.end_ref(pi_attrs, data)
elif target == 'lp-file':
    self.match_filename_to_section(pi_attrs, data)
|
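The target-method dictionary mentioned before this chunk might be sketched as follows. The class name Dispatcher, the pi_handlers attribute, the dispatch method, and the calls list are all hypothetical and exist only to make the example self-contained; only two of the targets are shown.

```python
class Dispatcher:
    """Stand-in for the document handler; only the dispatch is shown."""

    def __init__(self):
        self.calls = []  # records which handlers ran, for illustration
        # Map each processing instruction target to a bound method.
        self.pi_handlers = {
            'lp-section-id': self.start_section_id,
            'lp-section-id-end': self.end_section_id,
        }

    def start_section_id(self, attrs, data):
        self.calls.append('start_section_id')

    def end_section_id(self, attrs, data):
        self.calls.append('end_section_id')

    def dispatch(self, target, attrs, data):
        handler = self.pi_handlers.get(target)
        if handler is None:
            # Unknown targets are ignored, as in the if/elif chain.
            return
        handler(attrs, data)


d = Dispatcher()
d.dispatch('lp-section-id', {}, '')
d.dispatch('xml-stylesheet', {}, '')   # not one of ours, so skipped
d.dispatch('lp-section-id-end', {}, '')
```

Adding a new target then means adding one dictionary entry rather than another elif branch.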
This section will concentrate on how the sections of code
are read and stored. The basic data structure for storage
consists of lists of code fragments, which can also contain
lists. Then, there is a dictionary matching each section
id to the appropriate code fragment list for that section.
This list is later walked to produce the actual code for output.
Therefore, we need to initialize the dictionary that maps section ids
to code fragment lists at object-creation time.
Initialize object variables +=
self.sections = {}
|
However, not only do we need to be able to find sections, we
also need to find out what section id should be the top-level
section of each file. Therefore, we have the declaration
Initialize object variables +=
self.files = {}
|
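To make the two dictionaries concrete, here is a small sketch of what they might hold after a document has been read. The section ids, the file name, and the code strings are made up for illustration.

```python
# Section id -> list of fragments. A fragment is either a string of
# code or another section's fragment list.
sections = {
    'helpers': ['def helper():\n', '    return 1\n'],
}
sections['main'] = ['import sys\n', sections['helpers'], 'helper()\n']

# Output file name -> id of the section that forms the top of that file.
files = {'program.py': 'main'}

# Finding the fragment list to write for a file is two lookups.
top_level = sections[files['program.py']]
```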
Now, when a programmer specifies the name of a section, they
will probably do it in a nice, human-readable form. However,
we need to normalize that into a form that can be used as a
dictionary key. The human-readable form can't be used directly
because of problems with spacing, capitalization, and
potential symbols within the text. Therefore, we have the
following method to normalize the data.[3]
Overrided document handling methods +=
def normalize_id(self, id):
    id = self.matchNonLetterRegex.sub('', id)
    #sre.gsub('[^a-zA-Z]+', '', id)
    id = string.lower(id)
    return id
|
Initialize object variables +=
self.matchNonLetterRegex = sre.compile('[^a-zA-Z]+')
|
This first removes any non-alphabetic character, and then
converts it all to lower-case, thus giving the normalized
version of the id.
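As a quick illustration of the effect, here is a standalone sketch of the same normalization written against the modern re module and str.lower; the module-level function and the constant name NON_LETTER are assumptions for the example.

```python
import re

# Same pattern as self.matchNonLetterRegex.
NON_LETTER = re.compile('[^a-zA-Z]+')

def normalize_id(section_id):
    """Strip everything except letters, then lower-case the result."""
    return NON_LETTER.sub('', section_id).lower()

# Spacing and capitalization no longer matter.
same = normalize_id('Initialize  Object variables') == normalize_id('initialize object variables')

# Digits are stripped too, so names that differ only by a number collide.
collide = normalize_id('Step 1') == normalize_id('Step 2')
```

The second case is worth keeping in mind when naming sections: because digits count as non-letters, two names that differ only in a number end up with the same id.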
In order for a section to contain code, it has to be able to
read in both a section ID and the code that goes with it. In
addition, it has to be able to append multiple code fragments
and references to other sections within its text. Therefore,
we need to modify what the characters callback function is doing
based on what the last processing instruction was. The way that
we modify the characters callback is simply by having our standard
characters callback only be a dispatch method. It is simply this:
Overrided document handling methods +=
def characters(self, ch, start, length):
    func = self.characters_cb
    # Pass only the slice of the buffer that this event covers.
    func(ch[start:start + length])
|
The instance variable characters_cb holds the function that handles the callback (usually
a bound method). It takes one parameter, the character string. However, this means
that we need a default callback initialized when the document handler object is created.
Initialize object variables +=
self.characters_cb = self.default_ch_cb
|
The default characters callback does nothing.
Overrided document handling methods +=
def default_ch_cb(self, ch):
    pass
|
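The callback-swapping idea can be seen in isolation in the sketch below. The class name Handler, the collect_ch_cb method, and the collected attribute are hypothetical, and characters takes a single argument here only to keep the example short.

```python
class Handler:
    def __init__(self):
        self.collected = ''
        # Start with the callback that ignores character data.
        self.characters_cb = self.default_ch_cb

    def characters(self, ch):
        # Always forward to whichever callback is currently installed.
        self.characters_cb(ch)

    def default_ch_cb(self, ch):
        pass

    def collect_ch_cb(self, ch):
        self.collected = self.collected + ch


h = Handler()
h.characters('ignored')
h.characters_cb = h.collect_ch_cb    # what a start_* method would do
h.characters('kept')
h.characters_cb = h.default_ch_cb    # what an end_* method would do
h.characters('ignored again')
```

Only the text delivered while the collecting callback is installed ends up in h.collected.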
Now, when the processing instruction method gets an lp-section-id processing instruction,
it dispatches to the function start_section_id, which sets up the characters callback to
read in the current section id.
Auxillary document methods +=
def start_section_id(self, attrs, data):
    self.current_section_id = ''
    self.characters_cb = self.read_section_id_ch_cb
|
The current_section_id instance variable is where read_section_id_ch_cb accumulates the
section name.
Auxillary document methods +=
def read_section_id_ch_cb(self, ch):
    self.current_section_id = self.current_section_id + ch
|
Finally, the lp-section-id-end processing instruction turns
off the section id reader.
Auxillary document methods +=
def end_section_id(self, attrs, data):
    self.characters_cb = self.default_ch_cb
    self.current_section_id = self.normalize_id(self.current_section_id)
|
Notice that it sets the characters callback back to the default and normalizes the section
id. However, this is worthless if no code sections are ever placed here. The lp-code
processing instruction is used for that. It dispatches to the following function:
Auxillary document methods +=
def start_code(self, attrs, data):
    id = self.current_section_id
    if self.sections.has_key(id):
        self.current_section = self.sections[id]
    else:
        self.current_section = []
        self.sections[id] = self.current_section
    self.characters_cb = self.read_section_data_ch_cb
|
This checks to see if the current id is already in the sections instance dictionary. If it
isn't, it creates a new list to hold the data, and then stores that list in the dictionary
for that id. If it is in the dictionary, it simply pulls that list into the current_section
instance variable. It then sets the characters callback function to read section data. Note
that the current_section variable is never used except after we have assigned it a value.
Therefore, we don't need to initialize it at object creation time. Anyway, the section
reader function looks like this:
Auxillary document methods +=
def read_section_data_ch_cb(self, ch):
    self.current_section.append(ch)
|
Finally, when the lp-code-end processing instruction is found, it simply resets the
characters callback to the default.
Auxillary document methods +=
def end_code(self, attrs, data):
    self.characters_cb = self.default_ch_cb
|
The current section list and section id are maintained in case the user wants to add
additional lp-code sections later under the same id.
Within the code sections, there can also be references to other code sections. This
is accomplished by switching the characters callback to read in the section id. If the
section id does not yet exist, it is created as empty, and it is included as an object
reference in the current list.
Auxillary document methods +=
def start_ref(self, attrs, data):
    self.current_reference = ''
    self.characters_cb = self.read_ref_ch_cb
def read_ref_ch_cb(self, ch):
    self.current_reference = self.current_reference + ch
def end_ref(self, attrs, data):
    ref = self.current_reference
    ref = self.normalize_id(ref)
    self.characters_cb = self.read_section_data_ch_cb
    if not self.sections.has_key(ref):
        self.sections[ref] = []
    self.current_section.append(self.sections[ref])
|
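Because a reference is stored as the referenced list object itself, a section can be referenced before any of its code has been seen. The sketch below illustrates this with plain dictionaries and lists; the helper name add_reference and the section ids are made up for the example.

```python
sections = {}
current_section = sections.setdefault('main', [])
current_section.append('start\n')

def add_reference(sections, current_section, ref):
    """Append the list for ref to current_section, creating it if needed."""
    if ref not in sections:
        sections[ref] = []
    current_section.append(sections[ref])

# 'body' has no code yet, so an empty list is created and referenced.
add_reference(sections, current_section, 'body')
current_section.append('end\n')

# A later code block for 'body' fills in the very same list object.
sections['body'].append('work\n')
```

After the last line, the list stored under 'main' already contains the new code through the nested list, with no extra bookkeeping.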
Since this is only allowed to be called from lp-code sections, after we're done
we simply reset the characters callback to read_section_data_ch_cb.
I decided to match sections to files with the lp-file processing instruction, which has
a file attribute for the filename and an id attribute for the section id to put in the file.
The processing instruction is dispatched to this function:
Auxillary document methods +=
def match_filename_to_section(self, attrs, data):
    # Check for both attributes before touching them.
    if attrs.has_key('id') and attrs.has_key('file'):
        # Store the normalized id, since self.sections is keyed by it.
        real_id = self.normalize_id(attrs['id'])
        self.files[attrs['file']] = real_id
|
This verifies that all the parameters are in place, normalizes the id, and then
makes the dictionary mapping.
At the end of the program, we have to write out all of the files to disk. Therefore, we
have this handy-dandy method:
Auxillary document methods +=
def write_files(self):
    for file in self.files.keys():
        ostream = open(file, "w")
        ostream.write(self.flatten_array_to_string(self.sections[self.files[file]]))
        # Close the file so the output is flushed to disk.
        ostream.close()
|
This looks up the section associated with each file, flattens the code fragment list to
a single string, and then writes it to the given file. In the future, I plan to reduce
the memory this takes, for example with a list walker that goes through
each element and executes a callback function. The flattening function iterates through
each element and checks whether it is a string or a list. If it is a string, it adds it
onto the string it is building; otherwise it calls itself with the sublist and adds the
result onto the string.
Auxillary document methods +=
def flatten_array_to_string(self, array):
    new_str = ''
    for item in array:
        if type(item) == type([]):
            new_str = new_str + self.flatten_array_to_string(item)
        else:
            new_str = new_str + item
    return new_str
|
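The list walker mentioned above as a future improvement might look something like this sketch. The function name walk_fragments is an assumption, and an in-memory io.StringIO stands in for the output file so the example is self-contained.

```python
import io

def walk_fragments(fragments, callback):
    """Call callback(text) for every string in a nested fragment list."""
    for item in fragments:
        if isinstance(item, list):
            # A nested list is a referenced section: walk it in place.
            walk_fragments(item, callback)
        else:
            callback(item)

out = io.StringIO()
walk_fragments(['a\n', ['b\n', ['c\n']], 'd\n'], out.write)
```

Passing a file object's write method as the callback would write each fragment as it is reached, so the full text never has to be built as one string. Like the flattening method, this sketch would recurse without end if a section referenced itself.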