Regular Expressions C# Help

Regular expressions are one of those small technology areas that are incredibly useful in a wide range of programs, yet rarely used by developers. You can think of regular expressions as a mini programming language with one specific purpose: to locate substrings within a large string expression. It is not a new technology; it originated in the Unix environment and is commonly used with the Perl programming language. Microsoft ported it onto Windows, where up until now it has been used mostly with scripting languages. Today, however, regular expressions are supported by a number of .NET classes in the namespace System.Text.RegularExpressions. You can also find regular expressions used in various parts of the .NET Framework. For instance, they are used within the ASP.NET Validation server controls.

If you are not familiar with the regular expressions language, this section introduces both regular expressions and their related .NET classes. If you are already familiar with regular expressions, you will probably want to just skim through this section to pick out the references to the .NET base classes. You might like to know that the .NET regular expression engine is designed to be mostly compatible with Perl 5 regular expressions, although it has a few extra features.

Introduction to Regular Expressions

The regular expressions language is designed specifically for string processing. It contains two features:

  1. A set of escape codes for identifying specific types of characters. You will be familiar with the use of the * character to represent any substring in DOS expressions. (For example, the DOS command Dir Re* lists the files with names beginning with Re.) Regular expressions use many sequences like this to represent items such as any one character, a word break, one optional character, and so on.
  2. A system for grouping parts of substrings and intermediate results during a search operation.

With regular expressions, you can perform quite sophisticated and high-level operations on strings. For example, you can:

  - Identify (and perhaps either flag or remove) all repeated words in a string (for example, “The computer books books” to “The computer books”)
  - Convert all words to title case (for example, “this is a Title” to “This Is A Title”)
  - Convert all words longer than three characters to title case (for example, “this is a Title” to “This is a Title”)
  - Ensure that sentences are properly capitalized
  - Separate the various elements of a URI (for example, given http://www.csharpaid.com, extract the protocol, computer name, file name, and so on)

Of course, all of these tasks can be performed in C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases, this would require writing a fair amount of C# code. If you use regular expressions, this code can normally be compressed to just a couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or, even simpler, invoke a static Regex method), pass it the string to be processed, and pass in a regular expression (a string containing the instructions in the regular expressions language), and you’re done.

A regular expression string looks at first sight rather like a regular string, but interspersed with escape sequences and other characters that have a special meaning. For example, the sequence \b indicates the beginning or end of a word (a word boundary), so if you wanted to indicate you were looking for the characters th at the beginning of a word, you would search for the regular expression \bth (that is, the sequence word boundary-t-h). If you wanted to search for all occurrences of th at the end of a word, you would write th\b (the sequence t-h-word boundary). However, regular expressions are much more sophisticated than that and include, for example, facilities to store portions of text that are found in a search operation. This section merely scratches the surface of the power of regular expressions.
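As a minimal sketch of the two word-boundary patterns just described, the following program uses both the instance form and the static form of the Regex class. The class name WordBoundaryDemo, the helper CountMatches, and the sample sentence are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Static form: counts matches without the caller creating a Regex object.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "the path north is smooth"; // made-up sample text

        // Instance form: build a Regex once, then reuse it.
        Regex startsWithTh = new Regex(@"\bth");
        Console.WriteLine(startsWithTh.IsMatch(input));   // True: "the" starts with th

        Console.WriteLine(CountMatches(input, @"\bth"));  // 1: the
        Console.WriteLine(CountMatches(input, @"th\b"));  // 3: path, north, smooth
    }
}
```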

For more on regular expressions, please review the book Beginning Regular Expressions (ISBN 978-0-7645-7489-4).

Suppose your application needed to convert U.S. phone numbers to an international format. In the United States, phone numbers have this format: 314-123-1234, which is often written as (314) 123-1234. When converting this national format to an international format, you have to include +1 (the country code of the United States) and add brackets around the area code: +1 (314) 123-1234. As find-and-replace operations go, that’s not too complicated. It would still require some coding effort if you were going to use the String class for this purpose (which would mean that you would have to write your code using the methods available from System.String). The regular expressions language allows you to construct a short string that achieves the same result.
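One possible way to express that conversion is sketched below with Regex.Replace. The pattern, the class name PhoneFormatter, and the method ToInternational are assumptions made for this illustration; the article itself does not give a pattern for this task.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneFormatter
{
    // Three groups of digits separated by hyphens, for example 314-123-1234.
    private const string UsPattern = @"\b(\d{3})-(\d{3})-(\d{4})\b";

    public static string ToInternational(string text)
    {
        // $1, $2 and $3 refer to the three parenthesized groups in the pattern.
        return Regex.Replace(text, UsPattern, "+1 ($1) $2-$3");
    }

    public static void Main()
    {
        // Prints: Call +1 (314) 123-1234 today
        Console.WriteLine(ToInternational("Call 314-123-1234 today"));
    }
}
```

Text that does not contain a number in the hyphenated format is returned unchanged.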

This section is intended only as a very simple example, so it concentrates on searching strings to identify certain substrings, not on modifying them.

The RegularExpressionsPlayaround Example

For the rest of this section, you develop a short example, called RegularExpressionsPlayaround, that illustrates some of the features of regular expressions and how to use the .NET regular expressions engine in C# by performing and displaying the results of some searches. The text you are going to use as your sample document is an introduction to a csharpaid book on ASP.NET (Professional ASP.NET 3.5: in C# and VB, ISBN 978-0-470-18757-9).

string Text = @"This comprehensive compendium provides a broad and thorough investigation of all aspects of programming with ASP.NET. Entirely revised and updated for the 3.5 Release of .NET, this book will give you the information you need to master ASP.NET
and build a dynamic, successful, enterprise web application.";

This code is valid C# code, despite all the line breaks. It nicely illustrates the utility of verbatim strings that are prefixed by the @ symbol.

This text is referred to as the input string. To get your bearings and get used to the regular expressions .NET classes, you start with a basic plain text search that does not feature any escape sequences or regular expression commands. Suppose that you want to find all occurrences of the string ion. This search string is referred to as the pattern. Using regular expressions and the Text variable declared previously, you can write this:

[Code listing: image not reproduced]
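A minimal sketch of such a search, based on the surrounding description (the static Matches() method, the two flags, and a loop that prints the Index of each match), might look like this. The class name PlainTextSearch and the helper FindAll are assumptions, and the sample text is written as a concatenated string rather than a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainTextSearch
{
    // The article's sample text, written as one concatenated string.
    public const string Text =
        "This comprehensive compendium provides a broad and thorough " +
        "investigation of all aspects of programming with ASP.NET. " +
        "Entirely revised and updated for the 3.5 Release of .NET, " +
        "this book will give you the information you need to master ASP.NET " +
        "and build a dynamic, successful, enterprise web application.";

    public static MatchCollection FindAll(string pattern)
    {
        return Regex.Matches(Text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        // Print the position of every occurrence of "ion" in the input string.
        foreach (Match nextMatch in FindAll("ion"))
        {
            Console.WriteLine(nextMatch.Index);
        }
    }
}
```

With this text, the pattern ion is found in investigation, information, and application.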

This code uses the static method Matches() of the Regex class in the System.Text.RegularExpressions namespace. This method takes as parameters some input text, a pattern, and a set of optional flags taken from the RegexOptions enumeration. In this case, you have specified that all searching should be case insensitive. The other flag, ExplicitCapture, modifies the way that the match is collected in a way that, for your purposes, makes the search a bit more efficient; you see why later (although it does have other uses that we won’t explore here). Matches() returns a reference to a MatchCollection object. A match is the technical term for the result of finding an instance of the pattern in the input string. It is represented by the class System.Text.RegularExpressions.Match. Therefore, you return a MatchCollection that contains all the matches, each represented by a Match object. In the preceding code, you simply iterate over the collection and use the Index property of the Match class, which returns the index in the input text of where the match was found. Running this code results in three matches. The following table details some of the members of the RegexOptions enumeration.

[Table: image not reproduced]

So far, nothing is really new from the preceding example apart from some .NET base classes. However, the power of regular expressions really comes from that pattern string. The reason is that the pattern string does not have to contain only plain text. As hinted earlier, it can also contain what are known as meta-characters, which are special characters that give commands, as well as escape sequences, which work in much the same way as C# escape sequences. They are characters preceded by a backslash (\) and have special meanings.

For example, suppose that you wanted to find words beginning with n. You could use the escape sequence \b, which indicates a word boundary (a word boundary is just a point where an alphanumeric character precedes or follows a whitespace character or punctuation symbol). You would write this:

string Pattern = @"\bn";

Notice the @ character in front of the string. You want the \b to be passed to the .NET regular expressions engine at runtime; you don’t want the backslash intercepted by a well-meaning C# compiler that thinks it’s an escape sequence intended for itself! If you want to find words ending with the sequence ion, you write this:

string Pattern = @"ion\b";
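The two boundary patterns can be tried out with a short program like the one below. It is only a sketch: the class name BoundarySearch, the helper Count, and the sample sentence are invented for illustration. It also shows that a verbatim string and a regular string with a doubled backslash produce the same pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundarySearch
{
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "The nation needs no new station"; // made-up sample text

        Console.WriteLine(Count(input, @"\bn"));    // 4: nation, needs, no, new
        Console.WriteLine(Count(input, @"ion\b"));  // 2: nation, station

        // Without the @ prefix the backslash has to be doubled for the compiler.
        Console.WriteLine(@"\bn" == "\\bn");        // True
    }
}
```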

If you want to find all words beginning with the letter a and ending with the sequence ion (which has as its only match the word application in the example), you will have to put a bit more thought into your code. You clearly need a pattern that begins with \ba and ends with ion\b, but what goes in the middle? You need to somehow tell the application that between the a and the ion there can be any number of characters as long as none of them are whitespace. In fact, the correct pattern looks like this:

string Pattern = @"\ba\S*ion\b";

Eventually you will get used to seeing weird sequences of characters like this when working with regular expressions. It actually works quite logically. The escape sequence \S indicates any character that is not a whitespace character. The * is called a quantifier. It means that the preceding character can be repeated any number of times, including zero times. The sequence \S* means any number of characters as long as they are not whitespace characters. The preceding pattern will, therefore, match any single word that begins with a and ends with ion.
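To see the pattern at work, here is a small sketch that collects every matching word from a made-up sentence. The class name WordPatternDemo, the method FindWords, and the sample sentence are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordPatternDemo
{
    // Returns every word that begins with a and ends with ion.
    public static List<string> FindWords(string input)
    {
        List<string> words = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\ba\S*ion\b"))
        {
            words.Add(match.Value);
        }
        return words;
    }

    public static void Main()
    {
        string input = "an application with automation and a session";

        // Prints application and automation; an, and, a, and session do not match.
        foreach (string word in FindWords(input))
        {
            Console.WriteLine(word);
        }
    }
}
```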

The following table lists some of the main special characters or escape sequences that you can use. It is not comprehensive, but a fuller list is available in the MSDN documentation.

[Table: image not reproduced]

If you want to search for one of the meta-characters, you can do so by escaping the corresponding character with a backslash. For example, . (a single period) means any single character other than the newline character, whereas \. means a dot.

You can request a match that contains alternative characters by enclosing them in square brackets. For example, [1c] means one character that can be either 1 or c. If you wanted to search for any occurrence of the words map or man, you would use the sequence ma[np]. Within the square brackets, you can also indicate a range, for example [a-z] to indicate any single lowercase letter, [A-E] to indicate any uppercase letter between A and E (including the letters A and E themselves), or [0-9] to represent a single digit. If you want to search for an integer (that is, a sequence that contains only the characters 0 through 9), you could write [0-9]+ (note the use of the + character to indicate there must be at least one such digit, but there may be more than one, so this would match 9, 83, 854, and so on).
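The following sketch runs three bracket expressions against short made-up inputs. The class name CharacterClassDemo and the helper Values are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Returns the text of every match of the pattern in the input.
    public static string[] Values(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        // man, map (mat is not matched)
        Console.WriteLine(string.Join(", ", Values("The man read the map on the mat", "ma[np]")));

        // 9, 83, 854
        Console.WriteLine(string.Join(", ", Values("9 apples, 83 pears and 854 plums", "[0-9]+")));

        // B, A (G and F fall outside the range A to E)
        Console.WriteLine(string.Join(", ", Values("Grade B, then grade F, then grade A", "[A-E]")));
    }
}
```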

Displaying Results

In this section, you code the RegularExpressionsPlayaround example, so you can get a feel for how the regular expressions work.

The core of the example is a method called WriteMatches(), which writes out all the matches from a MatchCollection in a more detailed format. For each match, it displays the index of where the match was found in the input string, the string of the match, and a slightly longer string, which consists of the match plus up to ten surrounding characters from the input text: up to five characters before the match and up to five afterward. (It is fewer than five characters if the match occurred within five characters of the beginning or end of the input text.) In other words, a match on the word investigation in the input text quoted earlier would display ough investigation of a (five characters before and after the match), but a match on the final word application would display only one character after the match (the closing period), because after that you get to the end of the string. This longer string lets you see more clearly where the regular expression locates the match:

[Code listing: image not reproduced]
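A sketch of how such a method could be written from the description above follows. The class name MatchDisplay, the helper WithContext, and the exact output format are assumptions; only the name WriteMatches and the five-character window come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchDisplay
{
    private const int ContextLength = 5;

    // Returns the match plus up to five characters on either side,
    // clipped so it never runs past the start or end of the text.
    public static string WithContext(string text, Match match)
    {
        int charsBefore = Math.Min(ContextLength, match.Index);
        int charsLeftAfter = text.Length - match.Index - match.Length;
        int charsAfter = Math.Min(ContextLength, charsLeftAfter);
        return text.Substring(match.Index - charsBefore,
                              charsBefore + match.Length + charsAfter);
    }

    public static void WriteMatches(string text, MatchCollection matches)
    {
        Console.WriteLine("Original text was:\n\n" + text + "\n");
        Console.WriteLine("No. of matches: " + matches.Count);
        foreach (Match nextMatch in matches)
        {
            Console.WriteLine("Index: {0}, String: {1}, {2}",
                nextMatch.Index, nextMatch.Value, WithContext(text, nextMatch));
        }
    }

    public static void Main()
    {
        string text = "a broad and thorough investigation of all aspects";
        WriteMatches(text, Regex.Matches(text, "investigation"));
    }
}
```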

The bulk of the processing in this method is devoted to the logic of figuring out how many characters in the longer substring it can display without overrunning the beginning or end of the input text. Note that you use another property on the Match object, Value, which contains the string identified for the match. Other than that, RegularExpressionsPlayaround simply contains a number of methods with names like Find1, Find2, and so on, which perform some of the searches based on the examples in this section. For example, Find2 looks for any string that contains a at the beginning of a word:

[Code listing: image not reproduced]

Along with this comes a simple Main() method that you can edit to select one of the Find<n>() methods:

[Code listing: image not reproduced]
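The overall shape of such a program might resemble the sketch below. The Find1 and Find2 patterns follow the descriptions in this section; the shortened sample text, the IgnoreCase option, and the Report helper (a simple stand-in for the detailed output method) are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegularExpressionsPlayaround
{
    // Shortened excerpt of the article's sample text (assumption).
    private const string Text =
        "this book will give you the information you need to master ASP.NET " +
        "and build a dynamic, successful, enterprise web application.";

    // Words that begin with a and end with ion.
    public static MatchCollection Find1()
    {
        return Regex.Matches(Text, @"\ba\S*ion\b", RegexOptions.IgnoreCase);
    }

    // Any a at the beginning of a word.
    public static MatchCollection Find2()
    {
        return Regex.Matches(Text, @"\ba", RegexOptions.IgnoreCase);
    }

    // Simple stand-in for the more detailed output method.
    private static void Report(MatchCollection matches)
    {
        Console.WriteLine("No. of matches: " + matches.Count);
        foreach (Match nextMatch in matches)
        {
            Console.WriteLine("Index: {0}, String: {1}", nextMatch.Index, nextMatch.Value);
        }
    }

    public static void Main()
    {
        // Edit this line to select a different search, for example Find2().
        Report(Find1());
    }
}
```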

The code also needs to make use of the RegularExpressions namespace:

using System;
using System.Text.RegularExpressions;

Running the example with the Find1() method shown previously gives these results:

RegularExpressionsPlayaround
Original text was:

This comprehensive compendium provides a broad and thorough investigation of all aspects of programming with ASP.NET. Entirely revised and updated for the 3.5 Release of .NET, this book will give you the information you need to master ASP.NET and build a dynamic, successful, enterprise Web application.

No. of matches: 1
Index: 291, String: application, Web application.

Matches, Groups, and Captures

One nice feature of regular expressions is that you can group characters. It works the same way as compound statements in C#. In C#, you can group any number of statements by putting them in braces, and the result is treated as one compound statement. In regular expression patterns, you can group any characters (including meta-characters and escape sequences), and the result is treated as a single character. The only difference is that you use parentheses instead of braces. The resultant sequence is known as a group.

For example, the pattern (an)+ locates any recurrences of the sequence an. The + quantifier applies only to the previous character, but because you have grouped the characters together, it now applies to repeats of an treated as a unit. This means that if you apply (an)+ to the input text, bananas came to Europe late in the annals of history, the anan from bananas is identified. Yet, if you write an+, the program selects the ann from annals, as well as two separate sequences of an from bananas. The expression (an)+ identifies occurrences of an, anan, ananan, and so on, whereas the expression an+ identifies occurrences of an, ann, annn, and so on.

You might wonder why, with the preceding example, (an)+ picks out anan from the word bananas but doesn’t identify either of the two occurrences of an from the same word as separate matches. The rule is that matches must not overlap. The engine scans from left to right, and because quantifiers are greedy by default, the longest possible sequence is matched at each position.
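The difference between the two patterns can be checked with a few lines of code. This is a sketch only; the class name GroupingDemo and the helper Values are invented for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    public const string Input = "bananas came to Europe late in the annals of history";

    // Returns the text of every match of the pattern in the sample sentence.
    public static List<string> Values(string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match match in Regex.Matches(Input, pattern))
        {
            values.Add(match.Value);
        }
        return values;
    }

    public static void Main()
    {
        // (an)+ repeats the whole group: anan (bananas), an (annals)
        Console.WriteLine(string.Join(", ", Values("(an)+").ToArray()));

        // an+ repeats only the n: an, an (bananas), ann (annals)
        Console.WriteLine(string.Join(", ", Values("an+").ToArray()));
    }
}
```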

However, groups are actually more powerful than that. By default, when you form part of the pattern into a group, you are also asking the regular expression engine to remember any matches against just that group, as well as any matches against the entire pattern. In other words, you are treating that group as a pattern to be matched and returned in its own right. This can actually be extremely useful if you want to break up strings into component parts.

For example, URIs have the format <protocol>://<address>:<port>, where the port is optional. An example of this is http://www.csharpaid.com:4355. Suppose that you want to extract the protocol, the address, and the port from a URI, where you know that there may or may not be whitespace (but no punctuation) immediately following the URI. You could do so using this expression:

\b(\S+)://(\S+)(?::(\S+))?\b

Here is how this expression works: First, the leading and trailing \b sequences ensure that you consider only portions of text that are entire words. Within that, the first group, (\S+)://, identifies one or more characters that don’t count as whitespace, and that are followed by :// (the http:// at the start of an HTTP URI). The parentheses cause the http to be stored as a group. The subsequent (\S+) identifies the string www.csharpaid.com in the URI. This group is intended to end either when it encounters the end of the word (the closing \b) or a colon (:) as marked by the next group. Be aware, however, that \S+ is greedy and a colon is itself a non-whitespace character, so when a port is present this group as written also takes in the colon and the port; a character class such as [^\s:]+ stops the address at the colon.

The next group identifies the port (:4355). The following ? indicates that this group is optional in the match; if there is no :xxxx, this won’t prevent a match from being marked. This is very important because the port number is not always specified in a URI. In fact, it is absent most of the time. However, things are a bit more complicated than that. You want to indicate that the colon might or might not appear too, but you don’t want to store this colon in the group. You’ve achieved this by having two nested groups. The inner (\S+) identifies anything that follows the colon (for example, 4355). The outer group contains the inner group preceded by the colon, and this group in turn begins with the sequence ?:. This sequence indicates that the group in question should not be saved (you only want to save 4355; you don’t need :4355 as well!). Don’t get confused by the two colons following each other: the first colon is part of the ?: sequence that says “don’t save this group,” and the second is text to be searched for.

If you run this pattern on the following string, you’ll get one match: http://www.csharpaid.com.

Hey I’ve just found this amazing URI at http:// what was it — oh yes
http://www.csharpaid.com

Within this match, you will find the three groups just mentioned as well as a fourth group, which represents the match itself. Theoretically, it is possible that each group itself might return no, one, or more than one match. Each of these individual matches is known as a capture. So, the first group, (\S+), has one capture, http. The second group also has one capture (www.csharpaid.com). The third group, however, has no captures, because there is no port number on this URI.

Notice that the string contains a second http://. Although this does match up to the first group, it will not be captured by the search because the entire search expression does not match this part of the text. There isn’t space to show examples of C# code that uses groups and captures, but you should know that the .NET RegularExpressions classes support groups and captures, through classes known as Group and Capture. Also, the GroupCollection and CaptureCollection classes represent collections of groups and captures. The Match class exposes the Groups property, which returns the corresponding GroupCollection object. The Group class correspondingly exposes the Captures property, which returns a CaptureCollection. The relationship between the objects is shown.
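As a rough illustration of the Group and Capture classes, the sketch below runs the URI pattern against the sample sentence and lists each group. The class name UriGroupsDemo and the method Describe are made up. VariantPattern is a hypothetical alternative that is not part of the article: because \S+ is greedy and also matches a colon, the article's pattern leaves a port inside the second group, whereas [^\s:]+ stops the address at the colon.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriGroupsDemo
{
    // Pattern from the article.
    public const string ArticlePattern = @"\b(\S+)://(\S+)(?::(\S+))?\b";

    // Hypothetical variant (assumption): the address group stops at a colon.
    public const string VariantPattern = @"\b(\S+)://([^\s:]+)(?::(\S+))?\b";

    public static void Describe(string input, string pattern)
    {
        foreach (Match match in Regex.Matches(input, pattern))
        {
            Console.WriteLine("Match: " + match.Value);
            // Groups[0] is the whole match, so start at 1.
            for (int i = 1; i < match.Groups.Count; i++)
            {
                Group group = match.Groups[i];
                Console.WriteLine("  Group {0}: success={1}, value={2}, captures={3}",
                    i, group.Success, group.Value, group.Captures.Count);
            }
        }
    }

    public static void Main()
    {
        Describe("Hey I've just found this amazing URI at http:// what was it -- oh yes " +
                 "http://www.csharpaid.com", ArticlePattern);
        Describe("http://www.csharpaid.com:4355", ArticlePattern);
        Describe("http://www.csharpaid.com:4355", VariantPattern);
    }
}
```

For the sentence without a port, the third group reports no success and zero captures, which matches the description above.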

[Figure: image not reproduced]

You might not want to return a Group object every time you just want to group some characters. A fair amount of overhead is involved in instantiating the object, which is not necessary if all you want is to group some characters as part of your search pattern. You can disable this by starting the group with the character sequence ?: for an individual group, as was done for the URI example, or for all groups by specifying the RegexOptions.ExplicitCapture flag on the Regex.Matches() method, as was done in the earlier examples.
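The effect of both techniques can be observed by counting the groups on a match, as in this sketch. The class name CaptureOptionsDemo and the helper GroupCount are hypothetical; the count always includes the group that represents the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureOptionsDemo
{
    // Number of groups reported for the first match in "bananas".
    public static int GroupCount(string pattern, RegexOptions options)
    {
        return Regex.Match("bananas", pattern, options).Groups.Count;
    }

    public static void Main()
    {
        // 2: the whole match plus the captured (an) group
        Console.WriteLine(GroupCount("(an)+", RegexOptions.None));

        // 1: ?: turns off capturing for this one group
        Console.WriteLine(GroupCount("(?:an)+", RegexOptions.None));

        // 1: ExplicitCapture turns off capturing for all unnamed groups
        Console.WriteLine(GroupCount("(an)+", RegexOptions.ExplicitCapture));
    }
}
```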

Summary

You have quite a number of available data types at your disposal when working with the .NET Framework. One of the most used types in your applications (especially applications that focus on the submission and retrieval of data) is the string data type. The importance of string is the reason that this book has a complete chapter focused on how to use the string data type and manipulate it in your applications.

When working with strings in the past, it was quite common to just slice and dice the strings as needed using concatenation. With the .NET Framework, you can use the StringBuilder class to accomplish a lot of this task with better performance than before.

Last, but hardly least, advanced string manipulation using regular expressions is an excellent tool to search through and validate your strings.

Posted on October 29, 2015 in Strings and Regular Expressions
