sed example from 3/9 lecture
adapted from Do It With Sed

In lecture on Friday, I showed the following sed script to take a text file and center every line (assuming lines have 80 columns):

#!/usr/bin/sed -f
# center all lines of a file, on a 80 columns width
# to change that width, the number in \{\} must be replaced, and the number
# of added spaces also must be changed

# del leading and trailing spaces
# this first line has a tab in the first place and a space in the second place
y/    / /
s/^ *//
s/ *$//

# add 80 spaces to end of line
# the following line has 10 spaces in the second place
s/$/          /
s/ *$/&&&&&&&&/

# keep 1st 80 chars
s/^\(.\{80\}\).*$/\1/

# split trailing spaces, into two halves, 1st for beg, 2nd to end of line
s/\( *\)\1$/#\1%\1/
s/^\(.*\)#\(.*\)%\(.*\)$/\2\1\3/

The problem of centering the file is broken into four sub-problems. First, extra spaces are removed. Second, enough spaces are added to the end of each line that every line has at least 80 characters. Third, all but the first 80 characters of each line are discarded. Finally, any spaces at the end of a line are split in half and half are moved to the front of the line. You should be able to convince yourself that performing these operations will center the lines of the file.

Now, let's look at each piece of this code in more detail.

# del leading and trailing spaces
# this first line has a tab in the first place and a space in the second place
y/    / /
s/^ *//
s/ *$//

To remove any extra white space, we want to remove tabs and spaces at the front or back of the line. First, we turn all tabs into spaces. The command "y/ / /" (where the first white space is a tab, and the second is a single space) replaces each character in the first list (here just a tab character) with the corresponding character in the second list (here a space). Now all tabs have been re-written as spaces. To identify leading spaces, we can use the regular expression "^ *" - the beginning of the line (^) followed by any number of spaces. The s command replaces the longest substring matching this regular expression (as many spaces as can be found at the front of a line) with nothing, because no characters are given in the second slot of the s command. The third line is analogous; spaces at the end of the line are matched with the regular expression " *$" because $ matches the end of the line, and the s command replaces these spaces with nothing.
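
As a quick check of these three commands, here is a small sketch that runs them on a made-up line from the shell. The sample text and the variable names tab and stripped are assumptions for illustration, not part of the lecture script. The tab is produced with printf so that it survives copying and pasting.

```shell
# A literal tab character, stored in a variable so the y command stays portable.
tab=$(printf '\t')

# Hypothetical input: a leading tab, two leading spaces, and trailing blanks.
stripped=$(printf '\t  hello world \t \n' |
  sed -e "y/$tab/ /" -e 's/^ *//' -e 's/ *$//')

# Brackets make any leftover leading or trailing spaces visible.
printf '[%s]\n' "$stripped"   # prints [hello world]
```

The single space between the two words is untouched, because only the spaces anchored at ^ or $ are removed.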

So now, the file has lines with no extra spaces at the beginning or end of lines. Next, we add 80 spaces to the end of each line. This ensures that every line has at least 80 characters in it - if it has fewer than 80, the line will be filled out with spaces. The code to do this is:

# add 80 spaces to end of line
# the following line has 10 spaces in the second place
s/$/          /
s/ *$/&&&&&&&&/

The first line substitutes 10 spaces for the end of every line (the $ does not have to be repeated - this is equivalent to appending 10 spaces to the end of the line). In the next line, all of the spaces at the end of the line are identified by the regular expression " *$". Since we removed all the spaces at the end of the line before adding 10 spaces, exactly 10 spaces will be found at the end of every line. These ten spaces are replaced by eight repetitions of those 10 spaces. Remember, & stands for whatever string matched the regular expression in the first part of the s command - here, every & stands for the 10 spaces. So, 8 repetitions of 10 spaces gives us 80 spaces at the end of every line.
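
The arithmetic can be checked from the shell. This is only a sketch: the input line "hi" and the variable names ten and padded are assumptions, and the ten spaces are built with printf (rather than typed literally, as in the script) so the count is easy to verify.

```shell
# Ten spaces, generated instead of typed so the count cannot drift.
ten=$(printf '%10s' '')

# The same two commands as in the script, applied to the hypothetical line "hi".
padded=$(printf 'hi\n' | sed -e "s/\$/$ten/" -e 's/ *$/&&&&&&&&/')

# 2 characters of content plus 8 x 10 = 80 spaces.
echo "${#padded}"   # prints 82
```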

Note that every non-empty line now has more than 80 characters. We want to remove all but the first 80 characters on the line. To do this, we have:

# keep 1st 80 chars
s/^\(.\{80\}\).*$/\1/

We see a new symbol here, the "\1". Remember that the syntax \(RE\) sets a regular expression apart. In addition, it labels the regular expression. The first regular expression encountered within \( ... \) is labeled \1, the second is labeled \2, and so on. These labels can be used like the & - they resolve to whatever string was found on that line to match the regular expression between the parentheses.

So here, we are looking for the regular expression "^\(.\{80\}\).*$". Reading from the inside out, ".\{80\}" will match any 80 characters, since . matches any character. (Note that these are curly braces, not parentheses.) This regular expression is nested inside parentheses, meaning that \1 will refer to whatever string matches this regular expression - those 80 characters. We then see that the regular expression must come immediately after the start of the line because we have "^\(.\{80\}\)", so the 80 characters that are matched are the first 80 characters of the line. However, we do not stop there - we complete the regular expression by saying that the 80 characters may be followed by 0 or more other characters, and then the end of the line. This is because when we perform the s substitution, we want the entire line to be replaced, not just the first 80 characters. So the regular expression "^\(.\{80\}\).*$" will match any line with at least 80 characters (which we know all lines have), and replace that entire line with the substring matching \1, the first 80 characters of the line.
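
To see the truncation on lines short enough to read, here is a sketch that uses a width of 5 instead of 80. The width, the sample lines, and the variable names long and short are assumptions for illustration only.

```shell
# Hypothetical 5-column version of the truncation command.
long=$(printf 'abcdefghij\n' | sed 's/^\(.\{5\}\).*$/\1/')
short=$(printf 'abc\n' | sed 's/^\(.\{5\}\).*$/\1/')

echo "$long"    # prints abcde
echo "$short"   # prints abc
```

The second line has fewer than 5 characters, so the regular expression does not match and the line is left alone. This is why the script pads every line before truncating.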

Now, every line has exactly 80 characters. If the line has fewer than 80 characters of content, it is filled out with spaces to the end of the line. To finish centering the line, we want to move half of the spaces to the front of the line. We divide this task into two pieces. First, we will take all of the spaces at the end of the line, put a "#" before the spaces begin, and put a "%" in the exact middle of the spaces. (Note: if there are an odd number of spaces, one of the spaces is lumped with the non-space text, and the remainder are used to center the line.) After we label the regions of spaces in this way, we move all of the spaces between the # and the % to the front of the line, while also removing the markers "#" and "%" from the line. To do this, we have the following code:

# split trailing spaces, into two halves, 1st for beg, 2nd to end of line
s/\( *\)\1$/#\1%\1/
s/^\(.*\)#\(.*\)%\(.*\)$/\2\1\3/

The first line places the markers. The regular expression that we are trying to match is "\( *\)\1$". We have some number of spaces matched by " *", and labeled \1 by enclosing the expression in parentheses. That regular expression is followed by the label "\1" and then the end of the line, "$". Because regular expressions try to match the longest possible substring, this means that the regular expression " *" will try to match the longest string of spaces such that it can be followed by exactly that many spaces again and then the end of the line. So, \1 will be a string of spaces that is half the length of the string of spaces at the end of the line (rounded down if that length is odd). And the entire regular expression "\( *\)\1$" will match the longest even-length string of spaces at the end of the line. This entire region of spaces will be replaced by "#\1%\1" - that is, the exact same number of spaces, but with a # inserted before any of them, and a % inserted between the two halves.
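
A short sketch makes the even and odd cases concrete. The sample lines ("ab" followed by 4 and by 5 spaces) and the variable names even and odd are assumptions; printf's %4s and %5s are used to generate the trailing spaces so the counts are explicit.

```shell
# "ab" followed by 4 spaces (even) and by 5 spaces (odd).
even=$(printf 'ab%4s\n' '' | sed 's/\( *\)\1$/#\1%\1/')
odd=$(printf 'ab%5s\n' '' | sed 's/\( *\)\1$/#\1%\1/')

# Brackets make the trailing spaces visible.
printf '[%s]\n' "$even"   # prints [ab#  %  ]
printf '[%s]\n' "$odd"    # prints [ab #  %  ]
```

In the odd case the # lands after the first trailing space, which is the space that gets lumped with the text.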

Finally, we move half the spaces to the front of the line. In the last substitution, we have three labeled regular expressions, all of the form "\(.*\)". The first one, labeled \1, is immediately preceded by the front of the line and immediately followed by a #. The regular expression labeled \2 is immediately preceded by a # and immediately followed by a %. Finally, the third regular expression, labeled \3, is immediately preceded by a % and immediately followed by the end of the line. So, \1 will match all of the text before the # - the actual content of the line. \2 will match the first half of the trailing spaces, and \3 will match the second half of the trailing spaces. And the entire regular expression will match the entire line. So, the entire line will get replaced by "\2\1\3". This is just a reordering of the sections: half of the trailing spaces, those matched by \2, are moved to the front of the line. The # and % were not matched inside any labeled regular expression, so they are not retained.
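
Here is a sketch of the reordering step on two hand-marked lines. The inputs and the names cmd, plain, and hashed are assumptions for illustration. The second input contains a # in its own content, to show that the greedy ".*" in \1 extends to the last # on the line, which is the marker.

```shell
# The final substitution from the script, stored once so both runs use it.
cmd='s/^\(.*\)#\(.*\)%\(.*\)$/\2\1\3/'

# A line as the marker step would leave it: content, #, 2 spaces, %, 2 spaces.
plain=$(printf '%s\n' 'ab#  %  ' | sed "$cmd")

# The same, but with a # inside the content itself.
hashed=$(printf '%s\n' 'a#b#  %  ' | sed "$cmd")

printf '[%s]\n' "$plain"    # prints [  ab  ]
printf '[%s]\n' "$hashed"   # prints [  a#b  ]
```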

To test this script, you can save it in a file script.sed and run it on a file "testfile" using the command "sed -f script.sed testfile". Note that a side effect of this script is that all tabs in the file are changed into spaces.
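
For a quick experiment without creating any files, the whole script can also be sketched as a shell function. This version centers on 20 columns rather than 80 so the result is easy to inspect. Following the comment at the top of the script, both the number in \{\} and the number of added spaces change (here 10 spaces repeated twice). The function name center20, the sample input, and the variable names are assumptions.

```shell
tab=$(printf '\t')        # literal tab for the y command
ten=$(printf '%10s' '')   # ten spaces, generated so the count is explicit

# Hypothetical 20-column variant of the lecture script.
center20() {
  sed -e "y/$tab/ /" -e 's/^ *//' -e 's/ *$//' \
      -e "s/\$/$ten/" -e 's/ *$/&&/' \
      -e 's/^\(.\{20\}\).*$/\1/' \
      -e 's/\( *\)\1$/#\1%\1/' \
      -e 's/^\(.*\)#\(.*\)%\(.*\)$/\2\1\3/'
}

centered=$(printf '  hello\n' | center20)

# "hello" leaves 15 spaces: 7 go in front, and 8 stay behind
# (the odd space plus the other 7).
printf '[%s]\n' "$centered"
echo "${#centered}"   # prints 20
```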