Clean Up Your Writing With Linux Utilities

I kept noticing that I frequently make the error of doubling a word in what I write. When I write the source document in HTML or LaTeX, the line breaks are arbitrary. That means that the error may be obvious in the output, but not so obvious in the input. For example, the file I create might contain this:

Now is the the time for all
good people to come to the
the aid of the party.

The word “the” appears at the end of one line, and then again at the beginning of the next line.

This is a common mistake. Here’s a sign I saw in a train station in Salerno, Italy. It is intended to direct us to access the tracks through the new underpass of the railway station square.

[Photo: a sign in the Salerno railway station, captioned "Access the tracks by the new underground passageway of the railway station square."]

Magic-Marker-wielding language sticklers have corrected it, crossing out the duplicated “dal” in addition to replacing what they saw as sloppy slang and misuse of prepositions.

Let’s figure out a way to spot these errors, and turn this project idea into a Linux shell script!

Fix It Or Just Find It?

You might object, “But isn’t a doubled word appropriate at times?”

Why this is is mysterious, but I had had a suspicion that that would come up. (See what I did there? Three doubles in one sentence!)

There will be legitimate doubled sequences within the text. Also, I find myself starting sections like the following HTML fragment, where a one-word header is followed by a sentence starting with the same word:

<h2> Paris </h2>

Paris is "The City of Light."
As the capital of France, ...

So we aren’t going to automatically change anything. We will just highlight possible problems for the user to examine and fix if needed.

What Would It Take To Find Doubled Words?

Let’s list the needed steps:

  1. Doubled words frequently slip into writing like my first example, where they span a line break. So let’s start by converting the file into one long line.
  2. Mark the ends of headers, so we avoid false alarms like the header and leading word example I showed above.
  3. Strip out the HTML markup.
  4. Remove punctuation.
  5. Force what remains to all lower case.
  6. Find the doubled words and display the blocks in which they appear.
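Step 1 on its own is a quick sanity check we can run right now. This sketch simply feeds the three-line example from the introduction (inlined with printf rather than read from a file) through tr, joining everything into one long line so the doubled "the" can no longer hide across a line break:

```shell
# Join the three-line sample into one long line:
# every newline becomes a space, so line breaks disappear.
printf 'Now is the the time for all\ngood people to come to the\nthe aid of the party.\n' |
    tr '\n' ' '
```

The result is the whole passage on a single line, with both occurrences of "the the" now sitting side by side where a simple adjacent-word comparison can find them.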

How Can We Do That With Linux Utilities?

Actually, any Unix-family utilities will do, including the command-line environment in Mac OS X, and even Windows if you add the GNU utilities. Here are the tools, numbered to match the steps above:

  1. tr to replace every newline with space.
  2. sed to replace the end-of-header HTML markup (</h1>, </h2>, and so on) with a marker word.
  3. sed to replace HTML markup with one blank space each. The trick here is to replace every string that is a “<”, followed by arbitrarily many characters that aren’t “>”, and finally a “>”.
  4. sed again to replace any one of a list of punctuation marks with one space each.
  5. tr to replace the upper case with lower case.
  6. awk to scan through the result looking for identical adjacent words. For every pair, print them with several words of context before and after, but be careful in case the pair is close to the beginning or the end.
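To see steps 2 and 3 in isolation, here is a small demonstration on the Paris fragment from earlier. The exact end-of-header pattern is my reconstruction (matching </h1> through </h6>); the point is that inserting a marker word keeps the header's last word and the following word from becoming adjacent:

```shell
# Mark the ends of headers, then strip all remaining HTML tags.
echo '<h2> Paris </h2> Paris is the capital.' |
    sed 's@</h[1-6]>@ end-of-header @g' |
    sed 's/<[^>]*>/ /g'
```

After these two steps the text contains "Paris end-of-header Paris", so the later adjacent-word check will not flag it as a doubled word.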

Let’s See The Code!


for FILE in "$@"
do
    echo "Doubled words in $FILE:"

    cat "$FILE" | tr '\n' ' ' |               # join all lines into one
        sed 's@</h[1-6]>@ end-of-header @g' | # mark the ends of headers
        sed 's/<[^>]*>/ /g' |                 # strip remaining HTML markup
        sed 's/[.,:;?!"|{}()]/ /g' |          # remove punctuation
        tr 'A-Z' 'a-z' |                      # force lower case
        awk '{  for (i = 1; i < NF; i++) {
            if ($i == $(i+1)) {
                mini = i - 7;
                if (mini < 1) {
                    mini = 1;
                }
                maxi = i + 7;
                if (maxi > NF) {
                    maxi = NF;
                }
                printf("DOUBLE: %s\n", $i);
                for (j = mini; j <= maxi; j++) {
                    printf("%s ", $j);
                }
                printf("\n");
            }
        } }'
done
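To try this out, save the introduction's example as a file and run the core of the pipeline against it. This is a condensed sketch of just the detection logic (the HTML-handling steps are unnecessary for plain text), with the context printing omitted for brevity:

```shell
# Create the sample file from the introduction:
printf 'Now is the the time for all\ngood people to come to the\nthe aid of the party.\n' > sample.txt

# Flatten to one line, force lower case, and report adjacent duplicates:
tr '\n' ' ' < sample.txt |
    tr 'A-Z' 'a-z' |
    awk '{ for (i = 1; i < NF; i++) if ($i == $(i+1)) printf("DOUBLE: %s\n", $i) }'
```

This prints "DOUBLE: the" twice, once for each of the two doubled pairs in the sample.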

To learn more about the individual utilities and how to build powerful scripts that use them, check out Learning Tree’s Linux power tools course.
