Instalment 2:
- How I use POSIX utilities to process text
A problem I run into very often is extracting usable links from a very large web page whose source code looks messy when I open it with "View Source".
Have you ever looked at a cool web page with nice visual effects and immediately thought, "How did he do that? I would like those effects on my own site too"?
Or perhaps the page has lots of links to pretty pictures and you just want the links, without the rest of the "fat".
So you view the source, and it looks like this:
CODE
<html><head><title>Fun Pics</title></head><body><a href=http://www.someserverontheinternet.com/images/pic1.jpg>Picture 1</a><br><a href=http://www.someserverontheinternet.com/images/pic2.jpg>Picture 2</a><br><a href=http://www.someserverontheinternet.com/images/pic3.jpg>Picture 3</a><br><a href=http://www.someserverontheinternet.com/images/pic4.jpg>Picture 4</a><br><a href=http://www.someserverontheinternet.com/images/pic5.jpg>Picture 5</a><br><a href=http://www.someserverontheinternet.com/images/pic6.jpg>Picture 6</a><br><a href=http://www.someserverontheinternet.com/images/pic7.jpg>Picture 7</a></body></html>
Now you want to make some sense of that never-ending HTML string.
You probably already have a favorite tool that does this easily, and that's fine. My objective here is to showcase POSIX utilities, which is why I will show you what might look like the hard way of doing it.
For simplicity, I will assume that the above HTML code is stored in file messyHTML.html.
My objective is to obtain a clean list of all the links pointing to jpg files, each one on its own line.
Something like this:
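CODE
http://www.someserverontheinternet.com/images/pic1.jpg
http://www.someserverontheinternet.com/images/pic2.jpg
http://www.someserverontheinternet.com/images/pic3.jpg
http://www.someserverontheinternet.com/images/pic4.jpg
http://www.someserverontheinternet.com/images/pic5.jpg
http://www.someserverontheinternet.com/images/pic6.jpg
http://www.someserverontheinternet.com/images/pic7.jpg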
First of all, I would like to get a rough overview of the structure of the web page by having each HTML tag on its own line:
CODE
# cat messyHTML.html | sed 's#<#\
> <#g'
<html>
<head>
<title>Fun Pics
</title>
</head>
<body>
<a href=http://www.someserverontheinternet.com/images/pic1.jpg>Picture 1
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic2.jpg>Picture 2
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic3.jpg>Picture 3
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic4.jpg>Picture 4
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic5.jpg>Picture 5
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic6.jpg>Picture 6
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic7.jpg>Picture 7
</a>
</body>
</html>
#
OK, so what did I do?
Let's look at the command again. It might seem complex and cryptic at first, so let's break it down.
To eliminate some confusion related to how I configured my shell prompts (the sets of characters "# " and "> "), let's see what the command looks like without them:
CODE
cat messyHTML.html | sed 's#<#\
<#g'
The two utilities cat and sed communicate with each other using a pipe (the character "|"). This pipe redirects the output of the cat command from where it would normally go (the terminal output) into the standard input of the sed command.
Of course, I could have done this with sed alone by pointing it at the file, but I chose to use two commands for two reasons. One is to keep the sed command as easy to understand as possible, which is no small task. The other is to keep this tutorial consistent: you will see what I mean later on, when I use sed once more in a slightly more complex manner.
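For the curious, here is a small sketch of the two forms side by side. It assumes mktemp is available (it is widespread but not part of POSIX), and sample.html, workdir, piped and direct are names I made up for the illustration.

```shell
# Scratch directory and a tiny stand-in page (hypothetical file name).
workdir=$(mktemp -d)
printf '%s\n' '<p>one</p><p>two</p>' > "$workdir/sample.html"

# Form 1: cat feeds the file to sed through a pipe.
piped=$(cat "$workdir/sample.html" | sed 's#<#\
<#g')

# Form 2: sed opens the file itself.
direct=$(sed 's#<#\
<#g' "$workdir/sample.html")

# Both forms should produce exactly the same text.
[ "$piped" = "$direct" ] && echo "identical output"

rm -r "$workdir"
```

Which form you pick is mostly a matter of taste; the sed expression itself does not change.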
You are probably wondering why the command spans two lines. Try to imagine the two lines joined back together, keeping in mind that they are still separated by a newline character.
The imaginary command might look like this:
CODE
cat messyHTML.html | sed 's#<#\{newline character}<#g'
A literal newline inside the replacement part of sed's s command has to be "escaped". Escaping a character means putting a backslash in front of it so that it is treated "as-is": here, so that sed takes the newline as part of the replacement text instead of as the end of the command. The bash shell is not involved at this point, because everything between the single quotes, backslash and newline included, is handed to sed unchanged.
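To see the backslash-newline form in isolation, here is a minimal sketch on a much shorter, made-up string (the sample text and the variable name split are mine, not part of the original page).

```shell
# Inside single quotes the shell passes the backslash and the newline
# through untouched, so sed receives both as part of the replacement.
split=$(printf '%s\n' '<b>x</b><i>y</i>' | sed 's#<#\
<#g')

# Every "<" now starts a new line (the first line is empty because the
# input itself begins with "<").
printf '%s\n' "$split"
```

The result has the same shape as the tag-per-line listing of the full page.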
Now we need to look at the sed command in more detail.
sed is a stream editor that many people use primarily as a tool to mass replace patterned strings of text.
A very basic example of how sed works is this:
CODE
# echo "abcdefghi" | sed 's#de#WWWWW#'
abcWWWWWfghi
#
In this example I have used sed to replace the first occurrence of "de" with "WWWWW" in the string "abcdefghi".
Look at the following example, where I append an additional occurrence of "de" to the end of our string "abcdefghi" and run the same command again:
CODE
# echo "abcdefghide" | sed 's#de#WWWWW#'
abcWWWWWfghide
#
If our objective is to replace ALL occurrences of "de", we simply specify the "g" flag, which does a global replace.
CODE
# echo "abcdefghide" | sed 's#de#WWWWW#g'
abcWWWWWfghiWWWWW
#
On a side note, I personally tend to use the hash mark (the "#" character) as the separator in sed, because I often find myself needing to replace strings that contain slashes, but most people I have seen use the slash as a separator, like this:
CODE
# echo "abcdefghide" | sed 's/de/WWWWW/g'
abcWWWWWfghiWWWWW
#
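Here is a short sketch of why the separator choice matters once the text contains slashes. The path and the variable names are invented for the illustration.

```shell
path='/usr/local/bin/tool'

# With "#" as the separator, the slashes need no special treatment.
with_hash=$(printf '%s\n' "$path" | sed 's#/usr/local#/opt#')

# With "/" as the separator, every slash in the text must be escaped.
with_slash=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

printf '%s\n%s\n' "$with_hash" "$with_slash"
```

Both commands do the same job; the first one is simply easier on the eyes.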
Excellent !!!
I'm only interested in links to pictures, so after quickly eyeballing the structure of the web page, I decide that I would only like to keep those lines that contain the string "images".
CODE
# cat messyHTML.html | sed 's#<#\
> <#g' | grep images
<a href=http://www.someserverontheinternet.com/images/pic1.jpg>Picture 1
<a href=http://www.someserverontheinternet.com/images/pic2.jpg>Picture 2
<a href=http://www.someserverontheinternet.com/images/pic3.jpg>Picture 3
<a href=http://www.someserverontheinternet.com/images/pic4.jpg>Picture 4
<a href=http://www.someserverontheinternet.com/images/pic5.jpg>Picture 5
<a href=http://www.someserverontheinternet.com/images/pic6.jpg>Picture 6
<a href=http://www.someserverontheinternet.com/images/pic7.jpg>Picture 7
#
grep is a very powerful and very underutilized tool - "underutilized" in the sense that it's not utilized to its full potential: 99% of the time people only use 1% of its power. I will probably spend some time in a future instalment exploring its capabilities.
It is that 1% that I am using here as well: basic filtering using plain (unpatterned) text.
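As a small aside, the sketch below shows the same kind of plain-text filtering on three made-up lines. I use the -F option here, which tells grep to treat the pattern as a fixed string; with a word like "images", which contains no special characters, a plain grep behaves the same way. The host example.com and the variable names are assumptions for the illustration.

```shell
lines='<a href=http://example.com/images/a.jpg>A
<br>
<a href=http://example.com/docs/b.html>B'

# Keep only the lines that contain the fixed string "images".
kept=$(printf '%s\n' "$lines" | grep -F 'images')

printf '%s\n' "$kept"
```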
At this point we are one step away from reaching our objective and, as always in UNIX, there is more than one way of performing the next step. Let me explain.
In this particular example, all the links that we are interested in begin at the same offset and have the same length, so one is strongly tempted to leverage this feature and use the cut command. For educational purposes, I will go ahead and show you how cut works, but in real life situations I always use an advanced feature of sed called backreferencing, which I will show you further below, so keep reading.
Let's look at how our output lines are structured:
CODE
<a href=http://www.someserverontheinternet.com/images/pic1.jpg>Picture 1
12345678901234567890123456789012345678901234567890123456789012345678901234567890
         |         |         |         |         |         |         |
         10        20        30        40        50        60
So where's the "beef"?
Well, the "beef" begins at position 9 and ends at position 62.
Let's perform a single test:
CODE
# echo "<a href=http://www.someserverontheinternet.com/images/pic1.jpg>Picture 1" | cut -c9-62
http://www.someserverontheinternet.com/images/pic1.jpg
#
It works as expected, so let's do the whole thing in one fell swoop:
CODE
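Chaining the steps covered so far, the full pipeline would presumably look like the sketch below. To keep it self-contained, I rebuild the sample page in a shell variable instead of reading messyHTML.html (html, base and links are names made up for this sketch, and my copy ends with one extra <br>, which does not affect the result).

```shell
# Stand-in for messyHTML.html: rebuild the one-line page in a variable.
base='http://www.someserverontheinternet.com/images'
html='<html><head><title>Fun Pics</title></head><body>'
for n in 1 2 3 4 5 6 7; do
    html="$html<a href=$base/pic$n.jpg>Picture $n</a><br>"
done
html="$html</body></html>"

# One tag per line, keep the image links, cut out columns 9 to 62.
links=$(printf '%s\n' "$html" | sed 's#<#\
<#g' | grep images | cut -c9-62)

printf '%s\n' "$links"
```

This prints the seven links, one per line, but only because every link happens to start at column 9 and to be 54 characters long.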
Excellent !!! Are we done?
Well, yes and no. That was the lazy man's way; let me now show you the right way.
CODE
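The pipeline with the sed expression at the end would presumably look like the sketch below. As before, I rebuild the sample page in a shell variable so the sketch runs on its own (html, base and links are made-up names, and my copy ends with one extra <br>, which changes nothing).

```shell
# Stand-in for messyHTML.html: rebuild the one-line page in a variable.
base='http://www.someserverontheinternet.com/images'
html='<html><head><title>Fun Pics</title></head><body>'
for n in 1 2 3 4 5 6 7; do
    html="$html<a href=$base/pic$n.jpg>Picture $n</a><br>"
done
html="$html</body></html>"

# One tag per line, keep the image links, then replace each whole line
# with just the part captured between \( and \).
links=$(printf '%s\n' "$html" | sed 's#<#\
<#g' | grep images | sed 's#^.*\(http.*jpg\).*$#\1#')

printf '%s\n' "$links"
```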
Does that look like command line garbage or what?
Take a deep breath, I'm getting ready to explain to you that scary thing at the end:
CODE
's#^.*\(http.*jpg\).*$#\1#'
But before I do, let us make a quick mental note that, although seemingly vastly more complex, this approach makes absolutely no assumptions whatsoever about where the links begin and where they end.
The following is a regular expression pattern that matches the whole line (each and every line in its entirety):
CODE
^.*\(http.*jpg\).*$
It matches the whole line because it begins with the character "^" and it ends with the character "$".
When "^" is the first character of a regular expression pattern, it matches the beginning of the line; when "$" is the last character of the pattern, it matches the end of the line.
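A quick way to get a feel for "^" and "$" is to count matching lines with grep -c. The three sample lines and the variable names below are invented for this sketch.

```shell
text='jpg at the start
ends with jpg
jpg'

# Lines that begin with "jpg" (first and third line).
at_start=$(printf '%s\n' "$text" | grep -c '^jpg')

# Lines that end with "jpg" (second and third line).
at_end=$(printf '%s\n' "$text" | grep -c 'jpg$')

# Lines that consist of "jpg" and nothing else (third line only).
whole=$(printf '%s\n' "$text" | grep -c '^jpg$')

echo "$at_start $at_end $whole"
```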
Note that I am making use of terms which I have not already explained, terms like "regular expression" and "pattern". If you want a formal definition for these terms, please feel free to google them, then come back to this tutorial. I prefer to define these entities by showing you how they are used in practice.
When used in a regular expression pattern, a dot matches any single character.
A dot followed by an asterisk matches any string of characters, including the empty string.
This is because, when used in a regular expression pattern, an asterisk matches zero or more occurrences of the character preceding it. For instance, the pattern a* will match any one of the following strings:
- the empty string
- a
- aa
...
- aaaaaaaaa
etc. ... you get the idea
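The list above can be checked with a small sketch. I use grep with the -x option, which only accepts a line when the pattern matches it in its entirety, and -c to count the accepted lines; the five sample lines are made up.

```shell
# Five candidate lines: the empty string, "a", "aa", "ab" and "b".
# Only the first three consist of zero or more "a" characters.
matches=$(printf '%s\n' '' 'a' 'aa' 'ab' 'b' | grep -c -x 'a*')

echo "$matches"
```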
The regular expression pattern:
CODE
^.*\(http.*jpg\).*$
makes use of four markers, two of which are the anchors we have already discussed (beginning and end of line).
The other two are: "(http" and "jpg)"
Here I am showing the parentheses without their respective preceding backslashes, but keep in mind that the backslashes are required: in the basic regular expressions that sed uses by default, \( and \) delimit a group, while bare parentheses simply match literal parenthesis characters. The single quotes around the whole expression already keep the bash shell from touching any of it.
These two markers "(http" and "jpg)" mean that whatever text matches the pattern "http.*jpg" inside the parentheses will temporarily be stored in a special location in sed's memory. From that moment on, the character string stored in that location can be referenced by the name "\1", which is called a backreference.
sed provides a maximum of nine such backreferences, \1 through \9. Backreferences \2 and above become available as soon as you have more than one set of parentheses.
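To illustrate a second backreference, here is a minimal sketch with two sets of parentheses. The input string and the output layout are arbitrary choices of mine.

```shell
# \1 captures everything before the last dot, \2 everything after it;
# the replacement prints them in reverse order, separated by a colon.
swapped=$(printf '%s\n' 'pic1.jpg' | sed 's#^\(.*\)\.\(.*\)$#\2:\1#')

echo "$swapped"
```

For the input "pic1.jpg" this gives "jpg:pic1".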
It should now be obvious that the command
CODE
sed 's#^.*\(http.*jpg\).*$#\1#'
entirely replaces each and every line from the standard input with the string we want.
One more caveat to pattern matching:
sed's regular expression engine is "greedy", meaning that, should there have been several http.*jpg links on one line, the backreferencing trick would only partially provide the expected results.
Let's give an example using a slightly modified version of the web page:
CODE
# cat test.html
<html>
<head>
<title>Fun Pics
</title>
</head>
<body>
<a href=http://www.someserverontheinternet.com/images/pic1.jpg>Picture 1
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic2.jpg>Picture 2
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic3.jpg>Picture 3
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic4.jpg>Picture 4
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic5.jpg>Picture 5
</a>
<br>
<a href=http://www.someserverontheinternet.com/images/pic6.jpg>Picture 6</a><br><a href=http://www.someserverontheinternet.com/images/pic7.jpg>Picture 7</a>
</body>
</html>
#
#
# cat test.html | grep images | sed 's#^.*\(http.*jpg\).*$#\1#'
http://www.someserverontheinternet.com/images/pic1.jpg
http://www.someserverontheinternet.com/images/pic2.jpg
http://www.someserverontheinternet.com/images/pic3.jpg
http://www.someserverontheinternet.com/images/pic4.jpg
http://www.someserverontheinternet.com/images/pic5.jpg
http://www.someserverontheinternet.com/images/pic7.jpg
#
In this case, the first and second patterns of ".*" both behave in a greedy manner, each one of them trying to "swallow" the character string:
CODE
http://www.someserverontheinternet.com/images/pic6.jpg>Picture 6</a><br><a href=
but precedence is given to the first.
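One way around this, sketched below, is to reuse the very first step of this tutorial and put each tag on its own line before extracting, so that no line ever holds more than one link. The shortened host example.com and the variable names are assumptions made for the illustration.

```shell
line='<a href=http://example.com/images/pic6.jpg>Picture 6</a><br><a href=http://example.com/images/pic7.jpg>Picture 7</a>'

# Extracting directly: the first ".*" swallows the pic6 link,
# so only the last link on the line survives.
greedy=$(printf '%s\n' "$line" | sed 's#^.*\(http.*jpg\).*$#\1#')

# Splitting on "<" first leaves one link per line, so both survive.
split_first=$(printf '%s\n' "$line" | sed 's#<#\
<#g' | grep images | sed 's#^.*\(http.*jpg\).*$#\1#')

printf 'greedy:\n%s\nsplit first:\n%s\n' "$greedy" "$split_first"
```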