Dangerous Programmer: regex

Thursday, April 19, 2007

Command-Line Highlighter

I was grepping through some logs the other day at home and I figured "wouldn't it be nice if I could pipe this through something that would highlight lines matching a regex instead of just having grep pull those lines out?" Wouldn't you know it, such a tool doesn't exist, as far as I can tell. Which is very weird, since I've already found it VERY useful...

grep *will* give you context lines if you ask for them explicitly (e.g. grep -B 2 -A 4 gives you 2 lines before and 4 after each matching line), but I wanted to see ALL the output, with the matches highlighted for easy spotting.

Anyway, I wrote a little Perl script to do the work. It just inserts 'standard' shell color-escape sequences before and after the matching word/line to highlight it, in bright green on black by default. If you're a Matrix fan/l33t hax0r and use a green-on-black setup anyway, there's an option that lets you change the highlighting colors.
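To make the escape-sequence idea concrete, here is a minimal sketch of the technique in PHP (chosen only to match the other code on this page; it is not the HighLite source, which is Perl). The function names highlight_line and highlight_stream and the default color string are assumptions made up for illustration.

```php
<?php
// Hypothetical helper, not the HighLite source: wrap every regex match in a
// line with ANSI SGR color escapes and leave everything else untouched.
function highlight_line(string $line, string $pattern, string $color = '1;32;40'): string
{
    $start = "\033[" . $color . "m"; // 1 = bold/bright, 32 = green text, 40 = black background
    $reset = "\033[0m";              // 0 = reset all attributes

    // A callback avoids any special handling of '$' or '\' in the matched text.
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($start, $reset): string {
            // Do not emit escape codes around empty matches.
            return $m[0] === '' ? '' : $start . $m[0] . $reset;
        },
        $line
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $line;
}

// Hypothetical filter loop: copy every input line to the output, matching or
// not, so the full log stays visible with only the matches colored.
function highlight_stream($in, $out, string $pattern): void
{
    while (($line = fgets($in)) !== false) {
        fwrite($out, highlight_line($line, $pattern));
    }
}
```

Calling highlight_stream(STDIN, STDOUT, '/ERROR/') from a command-line script would turn this into a pipe filter, along the lines of tail -f some.log | php highlight.php (the script name is again just an example).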

It's fairly basic at the moment, but now that it's up on SourceForge I plan on porting it to C as soon as I have time (possibly tonight), and modifying it a bit so that the options/usage syntax matches grep's wherever possible/appropriate.

HighLite

Sunday, March 25, 2007

RFC-Compliant URI Validation

Recently, as part of another project, I needed some code to validate a URI string against RFC 2396, i.e. to ensure that a URI was RFC compliant. I decided to use a set of regular expressions modelled directly on the ABNF definitions in the RFC. ABNF is by its nature a very close match for regular expressions in terms of usage, syntax and purpose, so regular expressions seemed like a logical way to build the URI validation code.

I started by creating an expression for the simplest (and first) definitions in the RFC. 'lowalpha' is defined by the ABNF as being one of the characters a-z inclusive, while 'upalpha' is defined as A-Z inclusive. 'alpha' is defined as either a 'lowalpha' or an 'upalpha' character. 'digit' is defined as one of the characters 0-9 inclusive. Lastly, 'alphanum' is defined as being either an 'alpha' or a 'digit' character. Based on these five definitions, I could create five matching regular expressions which would serve the purpose of indicating whether an arbitrary string matches one of these definitions or not.

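<?php

// Base character classes, translated one-to-one from the RFC 2396 ABNF.
define('LOWALPHA', '[a-z]');
define('UPALPHA', '[A-Z]');

// alpha = lowalpha | upalpha
define('ALPHA', '(?:'.LOWALPHA.'|'.UPALPHA.')');
///   (?:[a-z]|[A-Z])

// Hand-optimized equivalent of ALPHA: a single character class.
define('ALPHA_OPT', '[a-zA-Z]');

define('DIGIT', '[0-9]');

// alphanum = alpha | digit
define('ALPHANUM', '(?:'.ALPHA.'|'.DIGIT.')');
///   (?:(?:[a-z]|[A-Z])|[0-9])

// Hand-optimized equivalent of ALPHANUM.
define('ALPHANUM_OPT', '[a-zA-Z0-9]');

?>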

The defined expressions ending in _OPT are optimized versions of the regular expression, i.e. they match exactly the same characters, but a single character class like [a-zA-Z] is generally cheaper for the regex engine to execute than an alternation of two classes such as [a-z]|[A-Z].
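An optimized fragment is only a safe substitute if it accepts exactly the same input as the literal ABNF translation. One way to check that for single-character definitions is sketched below; the helper name fragment_mismatches is an assumption for illustration and is not part of UriValidator. The fragments are repeated as plain variables so the sketch stands alone.

```php
<?php
// Hypothetical helper: list every single-byte string (0-255) on which two
// regex fragments disagree. Assumes the fragments contain no '/' delimiter.
function fragment_mismatches(string $a, string $b): array
{
    $mismatches = [];
    for ($i = 0; $i < 256; $i++) {
        $ch = chr($i);
        // \A ... \z anchors force the fragment to match the whole one-byte string.
        $inA = preg_match('/\A' . $a . '\z/', $ch) === 1;
        $inB = preg_match('/\A' . $b . '\z/', $ch) === 1;
        if ($inA !== $inB) {
            $mismatches[] = $i;
        }
    }
    return $mismatches;
}

// Literal ABNF translations and their optimized forms, as in the post.
$alpha       = '(?:[a-z]|[A-Z])';
$alphaOpt    = '[a-zA-Z]';
$alphanum    = '(?:(?:[a-z]|[A-Z])|[0-9])';
$alphanumOpt = '[a-zA-Z0-9]';

var_dump(fragment_mismatches($alpha, $alphaOpt) === []);       // bool(true)
var_dump(fragment_mismatches($alphanum, $alphanumOpt) === []); // bool(true)
```

This only confirms that the two forms accept the same characters. It says nothing about speed, which would need to be measured separately.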

Within the final implementation, expressions have been optimized where possible, but for the most part they mirror the ABNF in the RFC more or less directly. Almost all of the optimization occurs at the lowest level, i.e. in the simplest base expressions from which the more complicated expressions are constructed. This approach seems to work because each base expression ends up embedded many times over in the higher-level ones, so the lower the level at which an optimization is made, the more its benefit is multiplied (loosely speaking, exponentially).
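As an illustration of building a higher-level expression out of the base ones, here is a sketch for the RFC 2396 'scheme' rule, which the RFC defines as alpha *( alpha | digit | "+" | "-" | "." ). The variable names and the matches_fragment helper are assumptions made for this example; the actual UriValidator code may be structured differently.

```php
<?php
// Base fragments, as in the post (plain variables so the sketch stands alone).
$alpha = '[a-zA-Z]';
$digit = '[0-9]';

// RFC 2396: scheme = alpha *( alpha | digit | "+" | "-" | "." )
// The group mirrors the ABNF alternation; '*' is the ABNF "zero or more".
$scheme = $alpha . '(?:' . $alpha . '|' . $digit . '|[+\-.])*';

// Hypothetical helper: true when the whole string matches the fragment.
// Assumes the fragment contains no '/' delimiter.
function matches_fragment(string $fragment, string $s): bool
{
    return preg_match('/\A' . $fragment . '\z/', $s) === 1;
}

var_dump(matches_fragment($scheme, 'http'));    // bool(true)
var_dump(matches_fragment($scheme, 'svn+ssh')); // bool(true)
var_dump(matches_fragment($scheme, '1http'));   // bool(false): must start with alpha
```

Because $alpha appears twice in $scheme, and $scheme would in turn be embedded in larger expressions, any saving in the base fragment is repeated everywhere it is used.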

UriValidator
