Parsing/Validating HTML and Other Tricks - NBBC: The New BBCode Parser

III. Using NBBC

[ Previous: H. Supporting Local Images | Next: J. Limited-Length and Plain-Output Modes ]

I. Parsing/Validating HTML and Other Tricks

Overview

So you say that you don't really want BBCode, and just want a safer version of HTML instead? Maybe you just prefer the syntax and structure of HTML, or maybe you want to use a JavaScript WYSIWYG HTML editor on your site? Believe it or not, NBBC can do HTML syntax too. In this section, we'll discuss what it takes to implement a "safe HTML" using NBBC, and look at some of the issues and caveats involved.

BBCode is not, in general, that different from HTML. They both use "tags" to represent document structure, and although their tags aren't the same, there is some overlap. BBCode tends to focus on presentation, while HTML tends to focus on structure, but some tags, like [b] and <b>, are nearly identical in behavior. These are the major differences between BBCode and HTML:

1. BBCode uses [brackets] while HTML uses <angle brackets>.
2. BBCode treats a newline as a paragraph break; HTML ignores it.
3. HTML allows the use of entities, character codes produced by the "&" character, such as &lt; and &amp; and &eacute;.
4. BBCode has mostly presentation tags, while HTML has mostly structural tags, leaving presentation to CSS.

NBBC has specific features to address points 1, 2, and 3 above; point 4, the issue of translating input pseudo-HTML tags into valid output HTML elements (i.e., replacing the Standard BBCode Library), is left up to you (although in future versions of NBBC, we may add a Standard HTML Library if enough people demand it).

Let's tackle each of points 1, 2, and 3 separately, and then put them all together at the end.

Switching Tag Markers

NBBC lets you use any of [brackets], <angle brackets>, {curly braces}, or (parentheses) to delineate your tags. (Most likely, you'll want either [brackets] or <angle brackets>, but the other two are offered in case you need them.)

Switching from using [brackets] to using <angle brackets> is easy:

Code:

    $htmlparser = new BBCode;
    $htmlparser->SetTagMarker('<');
    ...
    $output = $htmlparser->Parse($input);

The SetTagMarker function changes the current tag marker to your desired marker. Note that NBBC still behaves otherwise the same: It simply uses a different character for marking the start and end of tags. [[Wiki-links]] are fully supported no matter what tag marker you use, and always use the current tag marker; for example, if the tag marker is '<', a valid wiki-link might look like this: <<keyword>>

The default tag marker is '[', and you can determine the current tag marker by calling GetTagMarker.
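
To make the marker pairs and the wiki-link form concrete, here is a minimal sketch in plain PHP. The helpers `closing_marker` and `wiki_link` are hypothetical names invented for this illustration; they are not part of NBBC and do not show how NBBC works internally. The sketch assumes only what is described above: four supported opening markers, each with a matching closing marker, and wiki-links written with the marker doubled.

```php
<?php
// Hypothetical helper (not part of NBBC): maps each opening tag marker
// listed above to its closing counterpart.
function closing_marker(string $open): string
{
    $pairs = ['[' => ']', '<' => '>', '{' => '}', '(' => ')'];
    if (!isset($pairs[$open])) {
        throw new InvalidArgumentException("Unsupported tag marker: $open");
    }
    return $pairs[$open];
}

// Hypothetical helper: builds the wiki-link form for a keyword by
// doubling the current tag marker on each side.
function wiki_link(string $open, string $keyword): string
{
    $close = closing_marker($open);
    return $open . $open . $keyword . $close . $close;
}

echo wiki_link('[', 'keyword'), "\n"; // [[keyword]]
echo wiki_link('<', 'keyword'), "\n"; // <<keyword>>
```

In real code you would not need either helper: you pass only the opening character to SetTagMarker, as in the example above.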

Disabling Newline Breaks

Normally, NBBC treats a newline as the end of a paragraph: An HTML <br /> tag is inserted anywhere a newline appears, except when it's close to a tag that prohibits newlines near it. While this is very convenient for the user, this is very much un-HTML-like, as HTML is a fully free-formatted language: Newlines mean nothing special in HTML.

NBBC can be told to treat newlines as plain whitespace, just like HTML does. To do this, you use the SetIgnoreNewlines function:

Code:

    $htmlparser = new BBCode;
    $htmlparser->SetIgnoreNewlines(true);
    ...
    $output = $htmlparser->Parse($input);

When "ignore-newlines" is true, NBBC will treat newlines almost exactly the same as it treats whitespace, and will not generate <br /> tags in the output. (In fact, the only difference between newlines and other whitespace is that newlines are regularized to Un*x format: Whether they're "\r\n" or "\n" or "\r" in the input, they'll always be "\n" in the output.)

By default, "ignore-newlines" is false, and you can determine the current state by calling GetIgnoreNewlines.
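
The newline regularization described above can be sketched in a few lines of plain PHP. This is only an illustration of the behavior, not NBBC's actual code, and `normalize_newlines` is a hypothetical name chosen for the example.

```php
<?php
// Illustration only (hypothetical helper, not NBBC's implementation):
// converts Windows ("\r\n") and old Mac ("\r") line endings to Un*x "\n".
function normalize_newlines(string $text): string
{
    // "\r\n" is replaced first so it does not become two newlines.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$input = "one\r\ntwo\rthree\nfour";
var_dump(normalize_newlines($input) === "one\ntwo\nthree\nfour"); // bool(true)
```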

Allowing HTML Entities

Normally, NBBC takes all input characters and makes them safe for HTML output: For example, a < symbol in the input will be turned into a &lt; entity in the output. Usually, this is desirable; however, when you have set the tag marker to '<', you probably want HTML behavior, and want to be able to type &lt; in the input to get a < symbol in the output.

To allow entities, you need to allow the ampersand character ('&') to be passed through unchanged to the output. Normally, NBBC, upon seeing a & symbol in the input, will turn it into &amp; in the output, which means that if you type &lt; in the input, you'll see &lt; in the output (which is actually &amp;lt; if you look at the HTML source). But this isn't what you want when you're trying to process HTML: You want a & in the input to be an & in the output.

NBBC includes a convenient pair of functions to control how the ampersand character is processed, whether it's translated to safe HTML or whether it's passed through unchanged. You can allow & to be passed through unchanged like this:

Code:

    $htmlparser = new BBCode;
    $htmlparser->SetAllowAmpersand(true);
    ...
    $output = $htmlparser->Parse($input);

When "allow-ampersand" is true, the ampersand will be passed to the output entirely unchanged, which is exactly what you want when processing HTML. The default is false, and you can determine the current state by calling GetAllowAmpersand.
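
To see the difference between the two settings, consider this small sketch in plain PHP. It is not NBBC's implementation; `escape_text` and its `$allowAmpersand` parameter are hypothetical names that only mimic the behavior described above for ordinary text.

```php
<?php
// Hypothetical helper (not NBBC's code): makes text safe for HTML output,
// optionally leaving '&' untouched so that typed entities survive.
function escape_text(string $text, bool $allowAmpersand): string
{
    if (!$allowAmpersand) {
        // Escape '&' first so the entities produced below are not re-escaped.
        $text = str_replace('&', '&amp;', $text);
    }
    return str_replace(['<', '>', '"'], ['&lt;', '&gt;', '&quot;'], $text);
}

echo escape_text('1 &lt; 2', false), "\n"; // 1 &amp;lt; 2
echo escape_text('1 &lt; 2', true), "\n";  // 1 &lt; 2
```

With the ampersand escaped, a browser shows the literal text "&lt;"; with it passed through, the browser shows a "<" symbol.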

Putting It All Together

So now let's assemble all these pieces into a single short script that can parse HTML. Our HTML tags will match the BBCode tags, but if you need HTML-specific tags, you can always add support for them with AddRule. (That said, in a future version of NBBC, there may be a Standard HTML Library added if enough people want it.) So this code is generally what you'll want if you're implementing HTML:

Code:

    $htmlparser = new BBCode;
    $htmlparser->SetTagMarker('<');         // HTML uses <angle brackets>.
    $htmlparser->SetIgnoreNewlines(true);   // HTML is free-formatted.
    $htmlparser->SetAllowAmpersand(true);   // HTML uses & for escaping entities.
    $htmlparser->DisableSmileys();          // HTML doesn't have built-in smileys.
    ...
    $htmlparser->ClearRules();              // No BBCode rules.
    $htmlparser->AddRule("p", ...);         // Allow the <p> element.
    $htmlparser->AddRule("b", ...);         // Allow the <b> element.
    $htmlparser->AddRule("i", ...);         // Allow the <i> element.
    $htmlparser->AddRule("a", ...);         // Allow the <a> element.
    $htmlparser->AddRule("pre", ...);       // Allow the <pre> element.
    ...
    ...more rules...
    ...
    $output = $htmlparser->Parse($input);

First, we switch to <HTML> tag markers, treat newlines as plain whitespace, allow ampersands to be used to provide entities, and turn off smileys. Then we remove all the BBCode rules, and add rules specific to HTML. And that's all it takes.
