Regular Expressions
Regular expressions are programming constructs that look like #!?#@!!# and can be powerful tools.
Regular expressions are used to recognize patterns within textual data. Their use has become so widespread that they appear in configuration files, mail filters, text editors, and programming languages. Any application that acts on text may use them.
Regular expressions evaluate text data and return an answer of true or false. That is, either the expression correctly describes the data, or it doesn't. What data the expression evaluates and what happens after a successful match depend on the application. We could substitute new text in place of the text matched by a regular expression. We could save the matched text in a variable for later use. We might want to execute a new program when we see a correct match. And so on.
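As a rough sketch of those three outcomes, here is a small shell script. It assumes the standard Unix tools grep and sed are available; sed is not covered in this tutorial and is used only for illustration, and the sample line is made up.

```shell
#!/bin/sh
# Hypothetical sample input, used only for this illustration.
line='<title>All About Worms</title>'

# 1. True or false: grep -q prints nothing and reports the answer
#    through its exit status (0 for a match, 1 for no match).
if printf '%s\n' "$line" | grep -q '<title>'; then
    matched=yes
else
    matched=no
fi

# 2. Substitute new text in place of the matched text.
replaced=$(printf '%s\n' "$line" | sed 's/Worms/Beetles/')

# 3. Save the matched text (here, whatever sits between the title tags)
#    in a variable for later use.
saved=$(printf '%s\n' "$line" | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p')

echo "matched:  $matched"
echo "replaced: $replaced"
echo "saved:    $saved"
```

Other applications expose the same ideas through their own syntax, so treat this as one possible illustration rather than the only way to do it.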
There are several variants, but all regular expressions consist of literal characters to be matched plus a set of special characters that further describe the data.
In Unix, the grep utility is a simple starting point for understanding how regular expressions work. The expression can be a simple string, and the input data can be a named list of files.
Here are some examples using grep to give you a sense of how regular expressions work.
Let's say we want to find all the <title> tags in a directory of HTML files. The code would look like this:
grep -i '<title>' *.html
grep evaluates whether or not each line in each *.html file matches the description <title>. The -i option makes the comparison ignore case, so <TITLE> matches as well. If the line is a match, then grep's standard behavior is to print out the file name and the matching line.
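To try this without touching real files, the following sketch builds a throwaway directory first. The file names and their contents are invented for the example.

```shell
#!/bin/sh
# Build a temporary directory holding two hypothetical HTML files.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' '<html>' '<TITLE>Garden Worms</TITLE>' '</html>' > a.html
printf '%s\n' '<html>' '<title>Compost Basics</title>' '</html>' > b.html

# Because of -i, both <TITLE> and <title> count as matches.
result=$(grep -i '<title>' *.html)
echo "$result"

# Clean up the temporary directory.
cd / && rm -rf "$dir"
```

Since more than one file is searched, grep prefixes each matching line with its file name, so this should print a.html:<TITLE>Garden Worms</TITLE> followed by b.html:<title>Compost Basics</title>.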
Pretty soon, we'll want to ask more sophisticated questions of our text data. We may want to add further restrictions and qualifications, or we may want to make our expression more general. In short, we'll need to start using regular expressions' set of descriptive "metacharacters." Let's look at a few cases.
Placeholders and repetition:
Let's say our directory of HTML files has 100 files and 100 <title> tags, and we want to narrow our search a little to see only the titles that make reference to "worms."
grep -i '<title>.*worms' *.html
The "." means "any character." The "*" means 0 or more instances of the previous character. What we've said here is "match any line that contains a 'begin title' tag followed by any number of characters, as long as the word 'worms' appears before the end of the line." The "." is very important. If we'd said:
grep -i '<title>*worms' *.html
then we'd be looking for lines that looked like this:
<title>>>>>>>>>>>>>>>>>worms.
(The * character would be looking for 0 or more instances of >, which is not very useful.)
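The difference between the two expressions is easy to see side by side. This sketch runs both against the same made-up file (titles.html and its three lines are assumptions for the demonstration).

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1

# Three hypothetical lines: a real title, a line of stray >'s, and a
# title that never mentions worms.
printf '%s\n' \
    '<title>A Field Guide to Worms</title>' \
    '<title>>>worms' \
    '<title>Compost Basics</title>' > titles.html

# ".*" stands for any run of characters between the tag and "worms".
with_dot=$(grep -i '<title>.*worms' titles.html)

# Without the dot, "*" applies to ">" only, so nothing but >'s may
# appear between "<title" and "worms".
without_dot=$(grep -i '<title>*worms' titles.html)

echo "with dot:"
echo "$with_dot"
echo "without dot:"
echo "$without_dot"

cd / && rm -rf "$dir"
```

The first search reports the first two lines; the second reports only the line of stray >'s and misses the real title.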
Range:
We frequently find that we want to make our expressions much more general. It would be quite inconvenient to enter 10 regular expressions if we're only interested in matching any of the characters from 0 to 9. The range symbol [] allows us to conveniently group characters together: [0-9] matches any single digit. We can also use [.*] to match either of those punctuation characters.
(NOTE: Outside of brackets, we put a backslash before a dot or a star, as in \. and \*, in order to turn off its behavior as a special character. This is called "escaping" the character. Inside brackets, grep already treats dots and stars literally, and a backslash there would match a literal backslash. Perl-style implementations also accept the escaped form [\.\*].)
One especially powerful feature of the range operator is the ability to negate it, so that we match "anything but" the listed characters. In [^1234], the caret at the start of the range means "match any single character other than 1-4."
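Here is a small sketch of ranges in action, again on invented sample data. Note that inside brackets grep treats a dot or star literally, so the example writes [.*] with no backslashes.

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical sample lines for the demonstration.
printf '%s\n' 'chapter 7' 'wait...' 'a*b' 'plain text' '1234' > sample.txt

# Lines containing any digit from 0 to 9.
digits=$(grep '[0-9]' sample.txt)

# Lines containing a literal dot or a literal star.
punct=$(grep '[.*]' sample.txt)

# Lines containing at least one character other than 1, 2, 3, or 4.
not_1234=$(grep '[^1234]' sample.txt)

echo "digits:";   echo "$digits"
echo "punct:";    echo "$punct"
echo "not_1234:"; echo "$not_1234"

cd / && rm -rf "$dir"
```

The last search is the one to watch: the line 1234 is the only one left out, because a negated range still has to match some character, and that line has nothing outside the excluded set.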
Here's a useful example: Find all the hrefs that point to URLs that mistakenly have a space in them. This example uses the extended regular expressions of egrep (equivalent to grep -E).
egrep -i 'href="[^"]* [^"]*"' *.html
In other words, find the href lines that have a space between the begin quote and end quote. We use the range operator here to signify any character other than a quote.
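The sketch below checks that expression against three made-up links. It calls grep -E, the equivalent of egrep, and the file name links.html is an assumption for the example.

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1

# Two hypothetical links with a space inside the quotes, and one whose
# only space is in the link text, outside the quotes.
printf '%s\n' \
    '<a href="my page.html">broken</a>' \
    '<a href="mypage.html">fine link</a>' \
    '<A HREF="http://example.com/a b">also broken</A>' > links.html

# [^"]* can never step over a quote, so the space must fall between
# the opening and closing quotes of the href value.
bad_links=$(grep -E -i 'href="[^"]* [^"]*"' links.html)
echo "$bad_links"

cd / && rm -rf "$dir"
```

Only the first and third lines are reported. The space in "fine link" is ignored because it sits outside the quoted URL.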
Position:
There are two main characters that enable us to restrict our match to a location within the string. We can match either the beginning (^) or the end ($) of our input data. With grep, which works line by line, that means the beginning or end of each line. This is more useful than it might immediately seem.
For example, let's say we want to find the HTML tags that are not closed before the line break.
egrep '<[^>]*$' *.html
In other words, we're looking for a "less than" followed by a continuous chain of characters other than "greater thans" all the way to the end of the line.
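To see both position characters at work, this sketch searches a made-up file, page.html, in which two tags run past the end of their line. It uses grep -E, the equivalent of egrep.

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical page: the <img> and the last <a> tag are not closed
# before the line break.
printf '%s\n' \
    '<p>complete line</p>' \
    '<img src="worm.png"' \
    '  alt="a worm">' \
    '<a href="index.html"' > page.html

# $ anchors the match to the end of the line: a "<" with no ">"
# anywhere between it and the line break.
unclosed=$(grep -E '<[^>]*$' page.html)

# ^ anchors the match to the start of the line: count the lines whose
# very first character is "<".
starts_with_tag=$(grep -c '^<' page.html)

echo "unclosed:"
echo "$unclosed"
echo "lines starting with a tag: $starts_with_tag"

cd / && rm -rf "$dir"
```

The first search reports the <img and <a lines. The count is 3, since the indented alt= line does not begin with a "<".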
That is an introductory start. The set of special descriptive characters will differ across regular-expression implementations, but if you keep in mind that their uses fall into a few basic categories, you'll have no trouble learning them. Position, range, repetition, and placeholders are the foundations of regular expressions.