VI. Appendices
[ Previous: F. CSS Examples | Next: H. Tag Parameters ]
G. How the Parser Works
This section is intended primarily for experienced software engineers and
computer scientists. It's heavy on theory, light on practice, and if you're
uncomfortable around concepts like computational orders, finite-state machines,
and push-down automata, you're going to be lost. If you need an introduction to
these concepts, we recommend the classic Dragon Book
and Hopcroft and Ullman's Little Book of Evil,
but be forewarned that compilers are a dense subject, and each one of those books
could readily be the subject of a year or two of study at a decent university.
For what it's worth, you don't need to know any of the stuff in this section
to be able to use NBBC; this section is primarily of academic interest.
NBBC is a compiler, and is broken down into two main layers: There's the lexical analyzer,
which is responsible for breaking characters sequences into tokens, and is implemented in
nbbc_lex.php; and there's the parser, which is responsible for analyzing the token
sequences and generating proper output. This appendix is not intended to be a complete
analysis of the structure of NBBC: Rather, it's intended to be a broad overview that will
help you to understand the rationale embodied by the source code, which is well-commented,
so should be relatively easy to follow once you've read this short discussion.
The Lexer
Most of the discussion and most of the complexity is focussed in the parser, so let's
quickly get the lexical analyzer (lexer) out of the way. Like most lexers, NBBC's is
built on regular expressions. The general principle used by the lexer is to
use preg_split to initially divide the input into an array of strings that
alternate between tags/whitespace/newlines in the odd array indices and non-whitespace
text strings in the even array indices. A simple flag is used to track whether the
next item in the array is a tag/whitespace/newline or a text string.
For this discussion, we'll use the following input as our primary example:
Before.
[center]Hello, [i]World[/i]![/center]
After.
The lexer breaks this input down into the following stream of tokens (wrapped here for
clarity's sake; note that the empty array elements are omitted in this discussion for
clarity):
"Before." WS NL [center] "Hello," WS [i] "World" [/i] "!" [/center] NL "After."
Each token is of type BBCODE_TEXT (like "Before."), BBCODE_NL
(which is a newline matching either Un*x, Windows, or Mac formats), BBCODE_WS (any
sequence of non-newline whitespace characters), BBCODE_TAG (a start tag), or
BBCODE_ENDTAG (an end tag). These are retrieved by the parser one at a time
by calling BBCodeLexer::NextToken().
Overview of the Parser
The parser is a push-down automaton that recognizes an LL(1) grammar.
Its design is inspired by that of simple push-down mathematical expression parsers, where
the state of the stack and a single token of lookahead is used to determine what operation
to perform next. This means that the parser is implicitly constructing a document tree
based on the tokens and the tags' class contraints, but because it's an implict instead
of explicit construction, it's much faster than a genuine tree-builder: We don't need to
allocate, store, and free tree nodes for this to work.
When provided with a string, the parser gives it to the lexer, which breaks it down
into tokens; then the parser reads each successive token, one at a time, and processes
the token according to its current state and the contents of the stack. This is an
approximate description of what happens for each incoming token, ignoring the many edge
cases, error-detection and error-correction, and whitespace cleanups:
- BBCODE_TEXT: Plain text is filtered, and then pushed onto the top
of the stack to await further processing.
- BBCODE_WS: Non-newline whitespace is pushed just like text.
- BBCODE_NL: Newlines are pushed onto the stack mostly like text.
- BBCODE_TAG: One of several things can happen, depending on what kind of tag this is:
- If the tag is a verbatim tag, run a BBCODE_CHECK on it, then collect successive
tokens until we reach its matching end tag. Pack the intervening tokens together,
and pass the result to the tag's processing function. The output of the processing
function is then pushed back onto the stack as text.
- If the tag is an isolated tag (it has no end tag), run a BBCODE_CHECK on it,
and then immediately pass the tag to the tag's processing function. The output of
the processing function is then pushed back onto the stack as text.
- For all other start tags, including those with both required and optional end
tags, we simply push the tag onto the stack, since we won't know how to fully
process it until its body has been processed.
- BBCODE_ENDTAG: Walk backwards up the stack, searching for a matching
start tag. If an ending-optional tag is found in between, a suitable end tag is generated
for it on the spot and processed, until there is nothing between this end tag and its
start tag except for processed HTML text. The processed HTML text is collected as a string,
and passed to the tag's processing function; the output of
the processing function is then pushed back onto the stack as text.
At the end of processing the input, the stack will contain the output as multiple stack
entries; these are collected to a single string, which is returned as the output.
Performance analysis
This algorithm runs in O(n) time except when we encounter an end tag and have to go back
up the stack; then it's O(n) to walk back to the start tag, and O(n) to collect tokens between
the start tag and end tag and collapse them into a string. So this is effectively O(n+n*m),
where "n" is the number of input tokens and "m" is the number of end tags; however, that's
a worst-case analysis, assuming that each end tag contains an entire document's worth of tokens
between it and its start tag. In practice, the number of tokens between a start and an end tag is
fairly small, approaching some constant k (which is around 20 or 30 tokens, in practice), so in
typical, average cases, this algorithm runs in O(n+m) time.
(As a future optimization, a constant-time speedup could be achieved by using
hash tables to track start tag locations, instead of searching for them, but the
intervening tokens must be collected anyway when the start tag is found, so the resulting
search performance would still be O(n).)
Error-correction
While BBCODE_TEXT, BBCODE_WS, and BBCODE_NL tokens are always
legal, BBCODE_TAG and BBCODE_ENDTAG tokens may be illegal in some cases,
and the parser needs to cope with this. There are several possibilities:
- BBCODE_TAG with no matching BBCODE_ENDTAG: When text is being collected
at the end, the initial tag will have an invisible BBCODE_ENDTAG added to the end of the input.
- BBCODE_ENDTAG with no matching BBCODE_TAG: This is caught during the
processing of BBCODE_ENDTAG: If the search up through the stack finds no
matching BBCODE_TAG, this end tag is simply converted to plain text and pushed
as a new BBCODE_TEXT entry.
- Tag inside a class that doesn't allow it: If, for example, a user places a
[center] inside an [i] tag, that's illegal, and NBBC has a special
routine to handle this case. When a BBCODE_TAG is encountered, it is checked
against the current containing class, which can be determined from the top of the stack;
if the containing class is disallowed, this BBCODE_TAG will be converted to
plain text and pushed as a BBCODE_TEXT token.
- Mis-ordered [i][b]tag nesting[/i][/b]: This is detected during
BBCODE_ENDTAG processing: If the walk back up the stack crosses over any
non-matching start tags, NBBC generates a matching end tag for each of those start tags
just before the current end tag, and sets a flag to indicate that any single following
unmatched end tag with the same name should be silently removed from the input. We can
be sure that any start tags that are crossed are not start tags of a higher class,
because they would have been turned into plain text (by the previous error-handling rule)
if they were. So in the example of [i][b]tag nesting[/i][/b], upon reaching the [/i], the
parser would search for the [i], find the [b] first, and generate an
additional [/b] before the [/i]; then, to help "fix" the output, NBBC
will silently remove any one unmatched [/b] it finds later in the input stream.
So the output would be <i><b>tag nesting</b></i>,
which is about what the user might expect to get. Note that for some really ugly mis-nestings,
this rule will produce weird results, but they'll still at least be valid XHTML 1.0 Strict.
Example processing
Let's look at how this would then process our example input. First, this is again our
incoming token stream:
"Before." WS NL [center] "Hello," WS [i] "World" [/i] "!" [/center] NL "After."
This is what happens during the parse of this stream:
- Initial class is "block".
- "Before." read, and pushed onto the stack.
- WS read, and pushed onto the stack.
- NL read. WS on the stack is popped (NL consumes nearby WS), and the NL is pushed onto the stack.
- [center] read. Class is checked, and [center] is allowed within "block". Pop the NL
from the stack, since [center] consumes newlines. [center] is
then pushed onto the stack, and class is changed to "block".
- "Hello," read, and pushed onto the stack.
- WS read, and pushed onto the stack.
- [i] read. Class is checked, and [i] is allowed within "block". [i]
is pushed onto the stack, and class is changed to "inline".
- "World" read, and pushed onto the stack.
- [/i] read. Search down the stack for a [i]. Remove all tokens in between and
convert them to a string. Pass string to [i]'s processing function, and pop [i] from stack.
Push the resulting HTML, "<i>World</i>", onto the stack.
- "!" read, and pushed onto the stack.
- [/center] read. Search down the stack for a [center]. Remove all tokens in between and
convert them to a string. Pass string to [center]'s processing function, and pop [center] from stack.
Push the resulting HTML, "<div style="text-align:center;">Hello, <i>World</i>!</div>", onto the stack.
Remove any trailing NL or WS tokens, since [/center] consumes newlines.
- (The next NL was skipped by [/center], so it's not seen by the main parsing loop.)
- "After." read, and pushed onto the stack.
- End-of-input. Main parsing loop ends. Everything still on the stack is collected as a string,
"Before.<div style="text-align:center;">Hello, <i>World</i>!</div>After.",
which is then returned to the caller.
For what it's worth, you can test this yourself by enabling debug mode:
In debug mode, the parser prints to the browser each action it performs, separating each token with horizontal
rules. In addition to being useful for debugging new tags, this mode can also be useful for divining the
basic workings of the parser itself.
[ Previous: F. CSS Examples | Next: H. Tag Parameters ]
Copyright © 2010, the Phantom Inker. All rights reserved.