Monday, April 9, 2007
Tokenize a String with C# Regular Expressions
Using C# and .NET regular expressions, it is easy to parse even the most complex string into tokens.
For example, convert the following string:
"365 + 6 *(6.3 + Count)"
into something like this:
Token[0], Integer, "365"
Token[1], Plus, "+"
Token[2], Integer, "6"
Token[3], Multiply, "*"
Token[4], OpenBracket, "("
Token[5], Double, "6.3"
Token[6], Plus, "+"
Token[7], Variable, "Count"
Token[8], CloseBracket, ")"
Without regular expressions this becomes quite a programming exercise. With regular expressions it becomes quite simple. The regular expression technique even supports simple syntax checking for invalid characters or character sequences.
Here are the steps.
1. Define a Token helper class.
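    // A token is a (group name, matched text) pair.
    public class Token
    {
        public readonly string Name;
        public readonly string Value;
        public Token(string name, string value)
        {
            Name = name;
            Value = value;
        }
    }
2. Define the regular expression pattern using named groups, looking something like this.
    private static string pattern =
        // \s+ rather than \s*: an alternative that can match the empty
        // string would succeed everywhere and hide the other groups.
        @"(?<whitespace>\s+)|" +
        @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
        @"(?<integer>[0-9]+)|" +
        @"(?<plus>\+)|" +
        @"(?<minus>-)|" +
        @"(?<multiply>\*)|" +
        @"(?<invalid>[^\s]+)";
The regular expression explained: The round brackets "(...)" define a group that supports matching a subexpression. The "?<name>" at the start of a group gives the group a name. The "\s" matches a whitespace character, and the "\s+" subexpression matches one or more whitespace characters in a row. The quantifier is "+" rather than "*" because "\s*" also matches the empty string, so as the first alternative it would succeed at every position and none of the other groups would ever be tried.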
Overall, the pattern matches either whitespace or one of the other named groups (subexpressions), such as "variable" or "integer". Notice the "|" character (shift-backslash on the keyboard) after each group. In C# it is the logical 'or' operator; in regular expressions it is known as the 'alternation' character, and each subexpression it separates is known as an alternative. Alternatives are tried from left to right, and as soon as one matches, the remaining alternatives are not tried at that position.
The "invalid" group matches any run of non-whitespace characters that none of the other alternative groups matched. This supports a simple syntax check.
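The left-to-right behaviour of alternation can be checked with a small helper. This is only a sketch: the class name AlternationDemo and the method MatchedGroupNames are made up for illustration and are not part of the article's code. It reports, for each match, the name of the group that succeeded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AlternationDemo
{
    // For every match of the pattern in the input, returns the name of the
    // first named group that succeeded.
    public static List<string> MatchedGroupNames(string pattern, string input)
    {
        Regex regex = new Regex(pattern);
        List<string> names = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            // Group 0 is the whole match, so named groups start at 1.
            for (int i = 1; i < match.Groups.Count; i++)
            {
                if (match.Groups[i].Success)
                {
                    names.Add(regex.GroupNameFromNumber(i));
                    break;
                }
            }
        }
        return names;
    }
}
```

Called with a pattern of the form `(?<integer>[0-9]+)|(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|(?<whitespace>\s+)|(?<invalid>[^\s]+)` and the input "12 ab #", the helper should report integer, whitespace, variable, whitespace, invalid. If the first alternative is instead written as `(?<whitespace>\s*)`, it can match the empty string at every position, so an input such as "42" never reaches the "integer" alternative.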
3. Generate the regular expression code.
    // Requires: using System.Collections.Generic;
    //           using System.Text.RegularExpressions;
    Regex regexPattern = new Regex(pattern);
4. Perform a "foreach" on matches to generate the tokens.
    MatchCollection matches = regexPattern.Matches("365 + 6 * Count");
    List<Token> tokenList = new List<Token>();
    foreach (Match match in matches)
    {
        int i = 0;
        foreach (Group group in match.Groups)
        {
            string matchValue = group.Value;
            bool success = group.Success;
            // ignore capture index 0 and 1 (general and WhiteSpace)
            if (success && i > 1)
            {
                // the group name doubles as the token type
                string groupName = regexPattern.GroupNameFromNumber(i);
                tokenList.Add(new Token(groupName, matchValue));
            }
            i++;
        }
    }
That is all there is to it. A very small amount of code to quickly parse a string.
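The pattern in step 2 does not cover every token in the opening example (the decimal number and the brackets). One way to extend it is sketched below. The group names "double", "openbracket" and "closebracket", and the class ExpressionTokenizer, are assumptions made for this sketch rather than names from the article or from Jetfire. Note that "double" has to come before "integer", otherwise "6.3" would be split at the decimal point.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same shape as the Token helper class from step 1, repeated so that the
// sketch compiles on its own.
public class Token
{
    public readonly string Name;
    public readonly string Value;
    public Token(string name, string value) { Name = name; Value = value; }
}

// Hypothetical wrapper class, for illustration only.
public static class ExpressionTokenizer
{
    private static readonly Regex regexPattern = new Regex(
        @"(?<whitespace>\s+)|" +
        @"(?<double>[0-9]+\.[0-9]+)|" +   // tried before integer on purpose
        @"(?<integer>[0-9]+)|" +
        @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
        @"(?<plus>\+)|" +
        @"(?<minus>-)|" +
        @"(?<multiply>\*)|" +
        @"(?<openbracket>\()|" +
        @"(?<closebracket>\))|" +
        @"(?<invalid>[^\s]+)");

    public static List<Token> Tokenize(string input)
    {
        List<Token> tokens = new List<Token>();
        foreach (Match match in regexPattern.Matches(input))
        {
            // Group 0 is the whole match; find the named group that matched.
            for (int i = 1; i < match.Groups.Count; i++)
            {
                Group group = match.Groups[i];
                if (!group.Success) continue;
                string name = regexPattern.GroupNameFromNumber(i);
                // Whitespace separates tokens but is not a token itself.
                if (name != "whitespace") tokens.Add(new Token(name, group.Value));
                break;
            }
        }
        return tokens;
    }
}
```

With this sketch, ExpressionTokenizer.Tokenize("365 + 6 *(6.3 + Count)") should produce the nine tokens listed at the top of the article, with the group names in lower case (integer, plus, integer, multiply, openbracket, double, plus, variable, closebracket).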
Note: With regular expressions, a single character can easily change the meaning of the whole expression, and techniques such as code inspection will rarely reveal the problem. It is important with all code, but especially with code using regular expressions, to build a good set of unit tests that exercise most, if not all, of the combinations.
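As a rough idea of what such tests could look like, here is a sketch that needs no test framework. The names PatternChecks, Classify, Expect and RunAll are made up for this example, and the pattern is assumed to be the one from step 2 with the whitespace group written as "\s+". Each check feeds one small input to the pattern and verifies which named group matched the whole input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical test helper, for illustration only.
public static class PatternChecks
{
    // Assumption: the step 2 pattern, with \s+ for the whitespace group.
    private static readonly Regex regex = new Regex(
        @"(?<whitespace>\s+)|" +
        @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
        @"(?<integer>[0-9]+)|" +
        @"(?<plus>\+)|" +
        @"(?<minus>-)|" +
        @"(?<multiply>\*)|" +
        @"(?<invalid>[^\s]+)");

    // Returns the name of the group that matched the entire input,
    // or null when no single alternative covers all of it.
    public static string Classify(string input)
    {
        Match match = regex.Match(input);
        if (!match.Success || match.Length != input.Length) return null;
        for (int i = 1; i < match.Groups.Count; i++)
        {
            if (match.Groups[i].Success) return regex.GroupNameFromNumber(i);
        }
        return null;
    }

    private static void Expect(string expectedGroup, string input)
    {
        string actual = Classify(input);
        if (!string.Equals(expectedGroup, actual))
        {
            throw new InvalidOperationException(
                "'" + input + "': expected " + expectedGroup + ", got " + actual);
        }
    }

    // Throws on the first failing check.
    public static void RunAll()
    {
        Expect("integer", "365");
        Expect("variable", "Count");
        Expect("variable", "_tmp$1");
        Expect("plus", "+");
        Expect("minus", "-");
        Expect("multiply", "*");
        Expect("whitespace", " \t");
        Expect("invalid", "#");
        Expect(null, "9lives");   // "9" matches as an integer, the rest does not
    }
}
```

Calling PatternChecks.RunAll() after every change to the pattern gives a quick signal when a single edited character has altered which alternative wins.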
Labels: C Sharp, Regular Expressions
Comments:
Good use for capturing groups. This was just what I needed for my custom formatting function, thanks!
Generally I do not post on blogs, but I would like to say that this post really forced me to do so! really nice post.
Hey very nice blog!! Man .. Beautiful .. Amazing .. I will bookmark your blog and take the feeds also...
I find this article cool, but your example is not working. When I run the code, I get an empty list. What am I doing wrong?
A working version of this code can be found in Jetfire source code. In the source code search for a class named "TjToken".
Interesting blog as for me. I'd like to read a bit more about that matter. Thank you for posting that information.


