Sunday, March 25, 2007

RFC-Compliant URI Validation

Recently, as part of another project, I needed some code to validate a URI string based on RFC-2396. The goal here was the ability to ensure that a URI was RFC compliant. As such, I decided to use a set of regular expressions which were directly modelled from the ABNF definitions in the RFC. ABNF is by it's nature a very close match for regular expressions in terms of usage, syntax and purpose, and so using them seemed like a logical method of building the URI validation code.

I started by creating an expression for the simplest (and first) definitions in the RFC. 'lowalpha' is defined by the ABNF as being one of the characters a-z inclusive, while 'upalpha' is defined as A-Z inclusive. 'alpha' is defined as either a 'lowalpha' or an 'upalpha' character. 'digit' is defined as one of the characters 0-9 inclusive. Lastly, 'alphanum' is defined as being either an 'alpha' or a 'digit' character. Based on these five definitions, I could create five matching regular expressions which would serve the purpose of indicating whether an arbitrary string matches one of these definitions or not.

<?php

define('LOWALPHA', '[a-z]');
define('UPALPHA', '[A-Z]');

define('ALPHA', '(?:'.LOWALPHA.'|'.UPALPHA.')');
/// (?:[a-z]|[A-Z])

define('ALPHA_OPT', '[a-zA-Z]');

define('DIGIT', '[0-9]');

define('ALPHANUM', '(?:'.ALPHA.'|'.DIGIT.')');
/// (?:(?:[a-z]|[A-Z])|[0-9])

define('ALPHANUM_OPT', '[a-zA-Z0-9]');

?>
The defined expressions ending in _OPT are optimized versions of the regular expression - ie: it's much more efficient to execute a single expression which is a range like [a-zA-Z] than it is to execute two adjacent ranges such as [a-z]|[A-Z].

Within the final implementation, expressions have been optimized where possible but for the most part they mirror the ABNF in the document more or less directly. Almost all the optimization that is present occurs at the lowest level, ie: in the simplest, base expressions from which the further, more complicated expressions are constructed. This approach seems to work since any optimization can loosely be thought of as having an exponential benefit, relative to how low of a level the optimization is performed at.

UriValidator

No comments:

Post a Comment