Dangerous Programmer: php


Tuesday, March 31, 2009

365 DoC - W1, D2 - C port of PHP Array functions

So today, I decided to 'port' some of the more useful/relevant PHP builtin array handling functions to what their equivalents in C might look like. Sounds kind of boring, I know, but the end result is both useful and re-usable, and is the kind of thing that people write over and over again... although C doesn't really have true arrays, pointer-pointers do the trick for string arrays (char **'s), etc.

Keep in mind that because of the inherent difference between these two languages, there will be a) some significant differences in the way the resulting function 'ports' operate, and b) there will likely be many more ways to implement a version of the same function in addition to what I've done here, both inside the function itself (different logical implementation) and in the declaration and behaviour of the function as well (ie: destructive or otherwise over-writing functions, vs. versions of the function that return/fill in a new array, etc)...

Problem: create functional counterparts of the following PHP functions in C: array_diff, array_intersect, array_merge, array_reverse, array_search, array_slice, array_splice, array_unique.

Note: there are many more PHP array handling functions than those listed in the problem... some don't apply to C, or aren't suitable for a port for various reasons, some are more complex and will be explored separately on another day (ie: reduce, walk, diff/intersect-type functions with user-specified callbacks, etc), and finally the sorting functions are ignored as we'll ultimately spend whole days dedicated to implementing various sorting algorithms, and so don't need to repeat that stuff here.

Solution: the above functions were implemented in the context of a C function performing the same duties, on string-based "arrays" (ie: char **'s), in a non-destructive manner (ie: array_splice returns a filled-in, separate array, not a modified version of the original, etc).

Note: the pre-processor directive

#include <string.h>

is omitted from the following examples, but they will all need at least that one header to function properly.

array_diff and array_intersect were implemented in the same function, due to the extreme similarities between the logic of each (one returns similarities, one differences), however I'll explain the initial array_diff first, and then how it was modified to do both functions:


int array_diff_str(
  char **array1,
  size_t array1Size,
  char **array2,
  size_t array2Size,
  char **arrayResult)
{
  int count1, count2, found, numFound = 0;

  for ( count1 = 0; count1 < (int)array1Size; count1 ++ ) {
    found = 0;

    for ( count2 = 0; count2 < (int)array2Size; count2 ++ )
      if ( 0 == strcmp(*(array1 + count1), *(array2 + count2)) ) {
        found = 1;

        break;
      }

    if ( !found )
      *(arrayResult + numFound ++) = *(array1 + count1);
  }

  return numFound;
}

Pretty simple... the arguments in order are the first array and its size, the second array and its size, and an empty pointer array to hold the result (initialized outside the function to be big enough to hold the largest possible result, ie: the size of the first array). The function returns the number of 'elements' in the result array. We basically just loop through the first array, and for each element we loop through the second until we either find it, at which point we set a flag and break, or we don't find it and so add it to the result array.
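As a usage sketch, here is how a caller might size the result buffer and call the function. The sample data and the demo_diff helper are made up for illustration, and array_diff_str is repeated in compact form only so that the snippet compiles on its own:

```c
#include <stdio.h>
#include <string.h>

/* Compact copy of array_diff_str from above, repeated so this compiles alone. */
int array_diff_str(char **array1, size_t array1Size,
                   char **array2, size_t array2Size,
                   char **arrayResult)
{
  int count1, count2, found, numFound = 0;

  for ( count1 = 0; count1 < (int)array1Size; count1 ++ ) {
    found = 0;

    for ( count2 = 0; count2 < (int)array2Size; count2 ++ )
      if ( 0 == strcmp(array1[count1], array2[count2]) ) {
        found = 1;
        break;
      }

    if ( !found )
      arrayResult[numFound ++] = array1[count1];
  }

  return numFound;
}

/* Hypothetical helper: diff two sample arrays, print the result and return
   its size. result must have room for 4 pointers (the size of the first array). */
int demo_diff(char **result)
{
  char *first[]  = { "apple", "banana", "cherry", "date" };
  char *second[] = { "banana", "date", "elderberry" };
  int i, n;

  n = array_diff_str(first, 4, second, 3, result);

  for ( i = 0; i < n; i ++ )
    printf("%s\n", result[i]);

  return n;
}
```

Only pointers are copied, so the result array shares its strings with the first array; nothing needs to be freed separately.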

Now the cool part - to make this work as an array_intersect as well, we just add one argument to flag whether we're doing a diff or not, and then change the test

if ( !found )

to

if ( found ^ diff )

The XOR here will make the expression evaluate to true only if found is false and diff is true (we're calculating a difference), or the other way around (we're calculating an intersection)... if the element was found AND we're doing a diff, it doesn't end up in the result... likewise, if it wasn't found AND we're doing an intersection, it also won't end up in the result. Here's the final function that does both jobs:


int array_diff_intersect_str(
  char **array1,
  size_t array1Size,
  char **array2,
  size_t array2Size,
  char **arrayResult,
  int diff)
{
  int count1, count2, found, numFound = 0;

  for ( count1 = 0; count1 < (int)array1Size; count1 ++ ) {
    found = 0;

    for ( count2 = 0; count2 < (int)array2Size; count2 ++ )
      if ( 0 == strcmp(*(array1 + count1), *(array2 + count2)) ) {
        found = 1;

        break;
      }

    if ( found ^ diff )
      *(arrayResult + numFound ++) = *(array1 + count1);
  }

  return numFound;
}
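The XOR test can be checked in isolation. The sketch below uses two hypothetical helpers (keeps_element and print_truth_table are not part of the code above) and assumes both flags are exactly 0 or 1:

```c
#include <stdio.h>

/* Hypothetical helper mirroring the test "if ( found ^ diff )": returns
   non-zero when the element should be copied into the result array.
   Assumes found and diff are each exactly 0 or 1. */
int keeps_element(int found, int diff)
{
  return found ^ diff;
}

/* Print all four combinations of the two flags. */
void print_truth_table(void)
{
  int found, diff;

  for ( diff = 0; diff <= 1; diff ++ )
    for ( found = 0; found <= 1; found ++ )
      printf("diff=%d found=%d -> %s\n", diff, found,
             keeps_element(found, diff) ? "keep" : "skip");
}
```

Because ^ is a bitwise operator, a caller passing something like 2 for diff would make the test true in every case. If that matters, !found != !diff is a more defensive way to spell the same logic.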

Next, we have an implementation of array_merge, which we re-name and modify slightly as array_union_str, which instead of simply appending one array onto another, will produce an actual union of the values in each array.


int array_union_str(
  char **array1,
  size_t array1Size,
  char **array2,
  size_t array2Size,
  char **arrayResult)
{
  int count, arrayDiffSize;

  char *arrayDiff[array1Size];

  arrayDiffSize = array_diff_intersect_str(array1, array1Size, array2, array2Size, arrayDiff, 1);

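  /* copy every element of the second array into the result */
  for ( count = 0; count < (int)array2Size; count ++ )
    arrayResult[count] = array2[count];

  /* then append the elements of the first array that aren't in the second */
  for ( count = 0; count < arrayDiffSize; count ++ )
    arrayResult[count + array2Size] = arrayDiff[count];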

  return array2Size + arrayDiffSize;
}

First, we use our previously created array_diff_intersect_str() function to produce the diff of the two arrays to be married (cue drum roll). Once we have that, we just 'append' the elements in the diff to the second array. The first for loop adds all of the second array to the result array, and the next for loop adds the elements from the diff result array. As with array_diff_intersect_str, we're passing two arrays and their sizes, along with an 'empty' result array to be filled in, and we're returning the size of the result array.
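The temporary arrayDiff above is a variable-length array, which needs a C99 compiler and a non-zero array1Size. As an alternative sketch (array_union_direct_str is a hypothetical name, not part of the code above), the same result can be built without the temporary by appending directly from the first array:

```c
#include <string.h>

/* Hypothetical variant of array_union_str: same result order (all of array2,
   then the elements of array1 that are not in array2), but without the
   temporary variable-length array. arrayResult must have room for
   array1Size + array2Size pointers. */
int array_union_direct_str(char **array1, size_t array1Size,
                           char **array2, size_t array2Size,
                           char **arrayResult)
{
  size_t i, j, n = 0;

  /* copy every element of the second array */
  for ( i = 0; i < array2Size; i ++ )
    arrayResult[n ++] = array2[i];

  /* append each element of the first array that has no match in the second */
  for ( i = 0; i < array1Size; i ++ ) {
    for ( j = 0; j < array2Size; j ++ )
      if ( 0 == strcmp(array1[i], array2[j]) )
        break;

    if ( j == array2Size )   /* inner loop ran to the end: no match found */
      arrayResult[n ++] = array1[i];
  }

  return (int)n;
}
```

Like the version above, this keeps any duplicates that already exist within a single input array; it only avoids repeating values that appear in both.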

array_reverse is somewhat simpler than the previous two, and unlike them it does modify the original array (seemed too simple to bother with a result array and copying pointers when we can just move them around in place!)...


void array_reverse_str(
  char **array,
  size_t arraySize)
{
  int count;

  char *tmp;

  for ( count = 0; count < (arraySize - (arraySize % 2)) / 2; count ++ ) {
    tmp = array[count];
    array[count] = array[arraySize - count - 1];
    array[arraySize - count - 1] = tmp;
  }
}

The modulus in the for expression basically causes it to ignore the middle element for an array with an odd number of elements... if the array size is odd, %2 will return 1, and we'll subtract one from the size of the array before dividing it in half to get the number of iterations to perform. Other than that, it's pretty self-explanatory...
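Since integer division in C already truncates, arraySize / 2 yields the same iteration count as the expression with the modulus. Another common way to write the loop, sketched here under a hypothetical name (array_reverse_two_index_str is not part of the code above), is to walk two indices towards each other, which skips the middle element of an odd-sized array without any arithmetic:

```c
#include <stddef.h>

/* Hypothetical two-index variant of array_reverse_str: swap the outermost
   pair of pointers and walk inwards until the indices meet. */
void array_reverse_two_index_str(char **array, size_t arraySize)
{
  size_t left, right;
  char *tmp;

  if ( arraySize < 2 )
    return;   /* nothing to swap; also keeps arraySize - 1 from wrapping around */

  for ( left = 0, right = arraySize - 1; left < right; left ++, right -- ) {
    tmp = array[left];
    array[left] = array[right];
    array[right] = tmp;
  }
}
```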


int array_search_str(
  char **arrayHaystack,
  size_t haystackSize,
  char *needle)
{
  int count;

  for ( count = 0; count < (int)haystackSize; count ++ )
    if ( 0 == strcmp(arrayHaystack[count], needle) )
      return count;

  return -1;
}

Another pretty simple one: we just enumerate through arrayHaystack and compare each 'element' to needle... if it's a match, we return the index of the element, otherwise we'll end up returning -1 to indicate no match was found.
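A short usage sketch follows; report_search and the sample strings are hypothetical, and array_search_str is repeated only so the snippet stands alone. The point to note is that index 0 is a valid match, so callers need to test for a negative return value rather than for "false":

```c
#include <stdio.h>
#include <string.h>

/* Compact copy of array_search_str from above, repeated so this compiles alone. */
int array_search_str(char **arrayHaystack, size_t haystackSize, char *needle)
{
  int count;

  for ( count = 0; count < (int)haystackSize; count ++ )
    if ( 0 == strcmp(arrayHaystack[count], needle) )
      return count;

  return -1;
}

/* Hypothetical helper: print where needle was found, if at all. */
void report_search(char **haystack, size_t size, char *needle)
{
  int index = array_search_str(haystack, size, needle);

  if ( index < 0 )
    printf("'%s' not found\n", needle);
  else
    printf("'%s' found at index %d\n", needle, index);
}
```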


int array_slice_str(
  char **array,
  size_t arraySize,
  int offset,
  int length,
  char ***resultArray)
{
  if ( 0 == length )
    length = (int)arraySize - offset;
  else if ( 0 > length )
    length = (int)arraySize - offset + length;

  /* point the caller's pointer at the first element of the slice */
  *resultArray = array + offset;

  return length;
}

Because we're working with C, we can take a shortcut here and not bother removing the 'old' elements that originally followed the slice - we just set the caller's pointer to the position in array where the slice begins (that is, array + offset), and calculate & return the size of the slice. Note that resultArray is passed as a pointer to a char **, since assigning to a plain char ** parameter would only change the function's local copy and the caller would never see the slice. Again, because we're writing C here, the calling code can't assume there's anything beyond the X elements we tell it are in the result, and so although there are still extra trailing elements, they can safely be left as-is and we don't have to do nearly as much work as we would (because, you know, it's VERY difficult to null-ify a pointer... what a pain that would be!).
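To see the 'no copying' shortcut from the caller's side, here is a small sketch. slice_view_str is a hypothetical name used only here; it reports the slice through a pointer to the caller's char **, which is how a C function hands a new pointer value back. It assumes the offset and length stay inside the array:

```c
#include <stddef.h>

/* Hypothetical slice helper: stores in *resultArray a pointer to the first
   element of the slice (no copying) and returns the number of elements.
   As in the post, a length of 0 means "up to the end of the array" and a
   negative length stops that many elements before the end. Assumes
   0 <= offset <= arraySize and a length that stays inside the array. */
int slice_view_str(char **array, size_t arraySize, int offset, int length,
                   char ***resultArray)
{
  int remaining = (int)arraySize - offset;

  if ( 0 == length )
    length = remaining;
  else if ( 0 > length )
    length = remaining + length;

  *resultArray = array + offset;

  return length;
}
```

A caller would declare char **slice; and pass &slice. Afterwards slice[0] is the element at offset in the original array, so writing through slice also changes the original.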


int array_splice_str(
  char **array,
  size_t arraySize,
  int offset,
  int length,
  char **replacementArray,
  size_t replaceSize,
  char **resultArray,
  size_t resultSize)
{
  int count, tailSize;

  if ( 0 == length )
    tailSize = arraySize - offset;
  else if ( 0 > length )
    tailSize = -length;
  else
    tailSize = arraySize - offset - length;

  for ( count = 0; count < offset; count ++ )
    resultArray[count] = array[count];

  for ( count = 0; count < replaceSize; count ++ )
    resultArray[offset + count] = replacementArray[count];

  for ( count = 0; count < tailSize; count ++ )
    resultArray[offset + replaceSize + count] = array[arraySize - tailSize + count];

  return offset + tailSize + replaceSize;
}

Since it's the most complicated, array_splice_str() deserves more explanation than the last few... this is roughly a port of array_splice, and lets you insert an array into another (possibly replacing/overwriting a slice in the array we're inserting into). First, we calculate tailSize, which is the length of the portion of the original array that falls after the slice we're replacing. If length is zero, then the tailSize is simply the size of the original array minus the value of offset. If it's negative (the slice ends X places from the end of the array), then the tailSize is simply the negation of the length value (ie: a length of -2 means the slice will end 2 places from the end of the array, so the tail size is simply -(-2), or 2). Otherwise, for a normal, positive length value, the tail size is the array size, minus the offset, minus the length of the slice.

Next, we proceed to build the new (spliced) array in three stages: first, we add elements from the original array, up to the offset. Then, we append all the elements from the replacement array. Lastly, we append tailSize elements from the original array, starting with the index equal to the original array size minus the tailSize.

Finally, we return the size of the spliced array: offset plus tailSize plus replaceSize.
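A worked example may help with the three stages. In the sketch below, demo_splice and the sample letters are hypothetical, and array_splice_str is repeated in compact form only so the snippet compiles on its own:

```c
#include <stdio.h>
#include <stddef.h>

/* Compact copy of array_splice_str from above, repeated so this compiles alone. */
int array_splice_str(char **array, size_t arraySize, int offset, int length,
                     char **replacementArray, size_t replaceSize,
                     char **resultArray, size_t resultSize)
{
  int count, tailSize;

  (void)resultSize;   /* accepted but not used, as in the original */

  if ( 0 == length )
    tailSize = (int)arraySize - offset;
  else if ( 0 > length )
    tailSize = -length;
  else
    tailSize = (int)arraySize - offset - length;

  for ( count = 0; count < offset; count ++ )              /* stage 1: head */
    resultArray[count] = array[count];

  for ( count = 0; count < (int)replaceSize; count ++ )    /* stage 2: replacement */
    resultArray[offset + count] = replacementArray[count];

  for ( count = 0; count < tailSize; count ++ )            /* stage 3: tail */
    resultArray[offset + (int)replaceSize + count] =
      array[(int)arraySize - tailSize + count];

  return offset + tailSize + (int)replaceSize;
}

/* Hypothetical helper: splice X, Y, Z into a, b, c, d, e at the given
   offset/length, print the result and return its size. result needs room
   for 8 pointers (5 original + 3 replacement). */
int demo_splice(int offset, int length, char **result)
{
  char *letters[] = { "a", "b", "c", "d", "e" };
  char *replacement[] = { "X", "Y", "Z" };
  int i, n;

  n = array_splice_str(letters, 5, offset, length, replacement, 3, result, 8);

  for ( i = 0; i < n; i ++ )
    printf("%s ", result[i]);
  printf("\n");

  return n;
}
```

With offset 1 and length 2, the tail size is 5 - 1 - 2 = 2 and the result is a X Y Z d e (6 elements). With offset 1 and length -1, the tail size is 1 and the result is a X Y Z e (5 elements). With offset 2 and length 0 nothing is removed, giving a b X Y Z c d e (8 elements).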


int array_unique_str(
  char **array,
  size_t arraySize,
  char **resultArray)
{
  int count, resultSize = 0;

  for ( count = 0; count < arraySize; count ++ )
    if ( 0 > array_search_str(resultArray, resultSize, array[count]) )
      resultArray[resultSize ++] = array[count];

  return resultSize;
}

Our last array-handling PHP function port is array_unique, which returns an array comprised of all unique values in the original. For this implementation, like others, we build on previous ported functions, and make use of array_search_str to do the heavy lifting. That way, all we really need to do now is loop through the elements of array, and check if we have each value in resultArray yet... if not, we can 'append' it and continue.
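As a usage sketch (demo_unique and the sample words are hypothetical; the two functions are repeated in compact form so the snippet compiles on its own), the result buffer only needs to be as large as the input array:

```c
#include <stdio.h>
#include <string.h>

/* Compact copies of array_search_str and array_unique_str from above. */
int array_search_str(char **arrayHaystack, size_t haystackSize, char *needle)
{
  int count;

  for ( count = 0; count < (int)haystackSize; count ++ )
    if ( 0 == strcmp(arrayHaystack[count], needle) )
      return count;

  return -1;
}

int array_unique_str(char **array, size_t arraySize, char **resultArray)
{
  int count, resultSize = 0;

  for ( count = 0; count < (int)arraySize; count ++ )
    if ( 0 > array_search_str(resultArray, resultSize, array[count]) )
      resultArray[resultSize ++] = array[count];

  return resultSize;
}

/* Hypothetical helper: de-duplicate a sample array, print the result and
   return its size. result must have room for 5 pointers. */
int demo_unique(char **result)
{
  char *words[] = { "x", "y", "x", "z", "y" };
  int i, n;

  n = array_unique_str(words, 5, result);

  for ( i = 0; i < n; i ++ )
    printf("%s\n", result[i]);

  return n;
}
```

The first occurrence of each value wins, so the sample input comes out as x, y, z. Each element triggers a linear search of the result so far, which is fine for small arrays but grows quadratically with the input size.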

That's it for today! Stay tuned for another better-late-than-never post (day 3 - some JavaScript date/time handling with a few other goodies mixed in!).

Saturday, March 28, 2009

365 DoC - W1, D1 - Subset Permutations

Problem: based on a given finite set of elements, find every possible combination of 1 to X elements, where X is the maximum number of elements in a combination.

Example: in the finite set of elements (A, B, C), all possible combinations of up to three elements would be ((A), (B), (C), (A, B), (A, C), (B, C), (A, B, C)) - this assumes that we would treat (A, B) and (B, A) as the same permutation of a 2-element subset, etc.

Solution: if you think of the set of elements as digits in a number system, then you simply need to enumerate between 'zero' and the max for the number of elements in each permutation, while eliminating duplicates. For example, decimal (base-10) is a number system comprised of a set of 10 possible elements: the digits 0 through 9. So, given this set, if we wanted to find all the permutations of up to 3 elements, we simply count from 0 to 999, and eliminate duplicates/repeats, ie: '999' would be eliminated because we would already have '9' in the list of possible permutations... likewise, '21' would be omitted, since at that point we'd already have '12' in the set of possible permutations of digit combinations ('21' is just a re-arrangement of the digits in '12').

Sounds complicated, but we really just need two things: a function to 'increment' a set of elements (the permutation), and an easy/efficient way to check for duplicates/repetitions.

The first item will end up being a simple implementation of an elementary school place-value lesson... the pseudo-code will look something like this:

possible elements: A, B, C
permutation to 'increment': (A, C)

- set the current 'place' we're working on to zero (the first 'digit', 'C' in this case if we're working right-to-left)
- is the 'value' of the element at the current place the last possible value in the set of possible elements?
yes: set the value of the current place to the first element in the set, move the place up one, and repeat
no: set the value of the current place to the next element in the set

'A, C' would be "incremented" to 'B, A'

This is actually a good candidate for recursion, although I didn't really plan on talking about recursion in today's exercise... here's a simple implementation of the above 'algorithm' in PHP...


$possibleElements = array('A', 'B', 'C');

function incrementPermutation(&$perm, $place = 0)
{
  global $possibleElements;

  if ( !isset($perm[$place]) )
    $perm[$place] = $possibleElements[0];
  else {
    $valueIndex = array_search($perm[$place], $possibleElements) + 1;

    if ( $valueIndex < count($possibleElements) )
      $perm[$place] = $possibleElements[$valueIndex];
    else {
      $perm[$place] = $possibleElements[0];

      $place ++;

      incrementPermutation($perm, $place);
    }
  }
}

The implementation is slightly incomplete - there's no argument checking (the $perm parameter needs to be an array), no checking of the $possibleElements array, and there are some other updates we could make so it's more general-purpose, but it will work as-is. PHP might optimize it out anyways, but for the sake of writing good code we've kept the $possibleElements declaration/initialization outside of the function body so that it's not re-declared every time we (potentially) recurse.

Instead of commenting the function to explain its operation, let's dissect it piece by piece:


$possibleElements = array('A', 'B', 'C');

This is just setting up an array to hold the full set of elements from which we'll be generating permutations - if we were working with a real numbering system, this would hold our digits, in order... ie: array('0', '1', ..., '9');


function incrementPermutation(&$perm, $place = 0)
{
  global $possibleElements;

Our function declaration... we're passing $perm, which is an array representing the existing permutation to increment, by reference so we don't need to write clumsy $var = func($var) type statements all over. $place defaults to zero so that outside of the recursive calls, ie: elsewhere in our code, we just call incrementPermutation($perm) without worrying about explicitly telling the function to start at the "least-significant" spot in the permutation.

Again using the example of a numbering system, $place indicates the place-position in the number that we're working with. This corresponds to the 'ones' or 'tens' or 'hundreds' position in a decimal numbering system, if you remember back to the explanation you got in grade school.


if ( !isset($perm[$place]) )
    $perm[$place] = $possibleElements[0];

First we check if the $place we're working with has already been set/defined... if not, incrementing is easy: we just set it to the first element in $possibleElements and we're done.


else {
    $valueIndex = array_search($perm[$place], $possibleElements) + 1;

Otherwise, we'll need to do some real work - first we need to figure out where in the set of possible elements the current value of the element at $place in our permutation resides. We increment that because in both places where we're about to use it we need to increment it anyways, so we might as well do that right off the bat...


if ( $valueIndex < count($possibleElements) )
      $perm[$place] = $possibleElements[$valueIndex];

This expression, if True, indicates that we don't need to "carry-over" anything - we can increment the value at the current place without running out of elements in the original set to use. $valueIndex represents the next possible index/key in the array of possible elements. If it's not less than the size of $possibleElements, that means we're already at the last possible element and need to 'carry-over'. Otherwise, it will 'point' to the next possible element, which is used as the incremented value for this $place in our permutation.


else {
      $perm[$place] = $possibleElements[0];

      $place ++;

      incrementPermutation($perm, $place);
    }

If we do have to carry-over a value, we set the current place's value to the first in the set of possible elements, increment the $place variable, and recurse, passing in the new value of $place, in order to perform the whole increment operation on the next place over.
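For comparison with the C functions from the array post, the same carry logic can be sketched in C. This is only an illustration, not the code used in this post: increment_permutation is a hypothetical name, and it stores each place as an index into the element set rather than the element itself, which removes the need for a search on every call:

```c
#include <stddef.h>

/* Hypothetical C rendering of incrementPermutation. The permutation is an
   array of indices into the element set; perm[0] is the least-significant
   place and *permSize is the number of places in use. perm must have room
   for one more place than is currently in use. */
void increment_permutation(int *perm, size_t *permSize, int numElements, size_t place)
{
  if ( place >= *permSize ) {
    /* this place isn't set yet: start it at the first element */
    perm[place] = 0;
    *permSize = place + 1;
  } else if ( perm[place] + 1 < numElements ) {
    /* no carry needed: move this place to the next element */
    perm[place] ++;
  } else {
    /* carry: reset this place and increment the next place over */
    perm[place] = 0;
    increment_permutation(perm, permSize, numElements, place + 1);
  }
}
```

With the elements A, B, C numbered 0, 1, 2, the permutation 'A, C' from the pseudo-code is stored least-significant-first as { 2, 0 }; one call turns it into { 0, 1 }, ie: 'B, A'.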

Next, we need to deal with the task of identifying duplicates/repetitions in our permutations. The solution is relatively simple... we're going to write a function that generates a numerical value for our resulting permutation, based on a bitmask that assigns a different (bit-)value to each possible element in our original set. To generate the value of our permutation, we perform a bitwise OR on the bits corresponding to each 'digit' in the permutation, and then obtain the resulting decimal representation of the number. If this matches the value of a permutation already in our 'found permutations' array, which we're tracking, then we can safely ignore it.

Here's the function:


function permutationValue($perm)
{
    global $possibleElements;

    $value = 0;

    foreach ( $perm as $element )
        $value |= 1 << array_search($element, $possibleElements);

    return $value;
}

It's pretty simple - we're just finding the position (index) of each element in our permutation, and using that as a bitshift operand in our |= operation, which 'adds' bits to $value.
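The same bitmask idea can be sketched in C. Again this is an illustration rather than the post's code: permutation_value is a hypothetical name, and the permutation is assumed to hold indices into the element set instead of the elements themselves:

```c
#include <stddef.h>

/* Hypothetical C rendering of permutationValue: perm holds indices into the
   element set, and each index contributes one bit to the result. Assumes
   every index is non-negative and smaller than the number of bits in an
   unsigned int. */
unsigned int permutation_value(const int *perm, size_t permSize)
{
  unsigned int value = 0;
  size_t i;

  for ( i = 0; i < permSize; i ++ )
    value |= 1u << perm[i];

  return value;
}
```

With A, B, C numbered 0, 1, 2, both (A, B) and (B, A) give the value 3, and so does (A, A, B), because OR-ing the same bit in twice changes nothing. That is exactly why re-arrangements and repeats collapse to a single value.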

And that's pretty much it - using these two functions, we can enumerate all possible combinations of the elements in a given, arbitrary set, and by keeping track of our permutation 'values', we can check for duplicates/repetition before adding each found permutation to the final list (array) of permutations discovered.

Here's the complete version of the code, with a built-in example... you can see the output here.


// define the members of the entire set
$possibleElements = array('A', 'B', 'C');

// the max size of a permutation is the number of elements in the original set
$maxSetSize = count($possibleElements);

// initialize an array to store the found permutations
$permutations = array();

// initialize an array to store the "values" of each permutation found
$permutationValues = array();

// determine the "value" of a permutation
function permutationValue($perm)
{
    global $possibleElements;

    $value = 0;

    foreach ( $perm as $element )
        $value |= (1 << array_search($element, $possibleElements));

    return $value;
}

// "increment" a permutation, using each element of the original/whole set as a possible "digit"
// in our arbitrary number system
function incrementPermutation(&$perm, $place = 0)
{
  global $possibleElements;

  if ( !isset($perm[$place]) )
    $perm[$place] = $possibleElements[0];
  else {
    $valueIndex = array_search($perm[$place], $possibleElements) + 1;

    if ( $valueIndex < count($possibleElements) )
      $perm[$place] = $possibleElements[$valueIndex];
    else {
      $perm[$place] = $possibleElements[0];

      $place ++;

      incrementPermutation($perm, $place);
    }
  }
}

$currentPerm = array();

while ( count($currentPerm) <= $maxSetSize ) {
  // "increment" the current permutation
  incrementPermutation($currentPerm);

  // calculate the value of the new permutation
  $currentPermValue = permutationValue($currentPerm);

  // if it's not already in our array of values, add the new
  // permutation to the list of ones we've found
  if ( FALSE === array_search($currentPermValue, $permutationValues)) {
    array_push($permutations, $currentPerm);
    array_push($permutationValues, $currentPermValue);
  }
}

// Dump out the final list of permutations in semi-pretty/human-readable format
print_r($permutations);

Sunday, March 25, 2007

RFC-Compliant URI Validation

Recently, as part of another project, I needed some code to validate a URI string based on RFC-2396. The goal here was the ability to ensure that a URI was RFC compliant. As such, I decided to use a set of regular expressions which were directly modelled from the ABNF definitions in the RFC. ABNF is by its nature a very close match for regular expressions in terms of usage, syntax and purpose, and so using them seemed like a logical method of building the URI validation code.

I started by creating an expression for the simplest (and first) definitions in the RFC. 'lowalpha' is defined by the ABNF as being one of the characters a-z inclusive, while 'upalpha' is defined as A-Z inclusive. 'alpha' is defined as either a 'lowalpha' or an 'upalpha' character. 'digit' is defined as one of the characters 0-9 inclusive. Lastly, 'alphanum' is defined as being either an 'alpha' or a 'digit' character. Based on these five definitions, I could create five matching regular expressions which would serve the purpose of indicating whether an arbitrary string matches one of these definitions or not.

<?php

define('LOWALPHA', '[a-z]');
define('UPALPHA', '[A-Z]');

define('ALPHA', '(?:'.LOWALPHA.'|'.UPALPHA.')');
///   (?:[a-z]|[A-Z])

define('ALPHA_OPT', '[a-zA-Z]');

define('DIGIT', '[0-9]');

define('ALPHANUM', '(?:'.ALPHA.'|'.DIGIT.')');
///   (?:(?:[a-z]|[A-Z])|[0-9])

define('ALPHANUM_OPT', '[a-zA-Z0-9]');

?>

The defined expressions ending in _OPT are optimized versions of the regular expression - ie: it's generally more efficient to match a single character class like [a-zA-Z] than it is to match an alternation of two classes such as [a-z]|[A-Z], even though both accept exactly the same characters.
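That the two forms accept the same characters can be checked mechanically. The sketch below is an assumption-laden illustration rather than part of the validator: it uses the POSIX regcomp/regexec functions from C instead of PHP's PCRE functions, it uses plain parentheses because POSIX extended syntax has no (?: ) groups, and matches and same_on_ascii are hypothetical helper names. It shows equivalence only and says nothing about speed:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches the POSIX extended regular
   expression in pattern, 0 if it doesn't, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  if ( 0 != regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) )
    return -1;

  rc = regexec(&re, text, 0, NULL, 0);
  regfree(&re);

  return 0 == rc ? 1 : 0;
}

/* Hypothetical helper: returns 1 if both patterns agree on every
   single-character ASCII string (codes 1 through 127), 0 otherwise. */
int same_on_ascii(const char *patternA, const char *patternB)
{
  char text[2] = { 0, 0 };
  int c;

  for ( c = 1; c < 128; c ++ ) {
    text[0] = (char)c;

    if ( matches(patternA, text) != matches(patternB, text) )
      return 0;
  }

  return 1;
}
```

Calling same_on_ascii("^([a-z]|[A-Z])$", "^[a-zA-Z]$") compares the direct translation of the ABNF with its _OPT counterpart; the same check could be repeated for ALPHANUM and ALPHANUM_OPT.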

Within the final implementation, expressions have been optimized where possible but for the most part they mirror the ABNF in the document more or less directly. Almost all the optimization that is present occurs at the lowest level, ie: in the simplest, base expressions from which the further, more complicated expressions are constructed. This approach seems to work because the base expressions are embedded many times over in the higher-level ones, so an optimization made at the lowest level is repeated in every expression built on top of it.

UriValidator
