Problem #1

grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" frankenstein.txt | sort -f | uniq -ci | sort -nr | head -n 100

Explanation
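
One way to see what each stage of the pipeline contributes is to run it on a tiny input. The sketch below is only an illustration: the two sample sentences and the temporary file are made up and stand in for frankenstein.txt, and head keeps 3 entries instead of 100.

```shell
# Hypothetical sample text standing in for frankenstein.txt.
sample=$(mktemp)
printf '%s\n' "The monster saw the man." \
              "The man didn't see the monster; the end." > "$sample"

# 1. grep -Eo prints every match (a run of letters, optionally with an
#    inner apostrophe) on its own line.
# 2. sort -f puts identical words next to each other, ignoring case.
# 3. uniq -ci collapses each run of identical words (ignoring case)
#    into one line prefixed with the number of occurrences.
# 4. sort -nr orders those lines numerically by count, largest first.
# 5. head keeps only the first few lines (3 here, 100 in the solution).
top_words=$(grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" "$sample" \
    | sort -f | uniq -ci | sort -nr | head -n 3)

printf '%s\n' "$top_words"
rm -f "$sample"
```

The first line of the output is the count for "the", which occurs 5 times once case is ignored. The next two lines are "man" and "monster" with 2 occurrences each; their relative order depends on how sort breaks ties.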


Problem #2

grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" frankenstein.txt | sort -f | uniq -ci | sort -nr | grep -ivwf prepositions.txt | head -n 100

Explanation

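Everything is just like problem #1, except that before taking the top 100 we filter the prepositions out of the list. This is done with grep -ivwf prepositions.txt, which drops every line that contains one of the words listed in prepositions.txt as a whole word, ignoring case.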
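
The effect of the individual grep flags can be checked on a small counted list. This is a sketch under assumptions: the counts are invented, and the temporary file is a hypothetical stand-in for prepositions.txt, assumed to hold one preposition per line.

```shell
# Hypothetical counted word list, in the format produced by uniq -c.
counts=$(printf '%s\n' "      9 the" "      7 of" "      5 often" \
                       "      4 In" "      3 monster")

# Hypothetical stand-in for prepositions.txt: one preposition per line.
preps=$(mktemp)
printf '%s\n' "of" "in" > "$preps"

# -f FILE : read the patterns from FILE, one per line
# -w      : a pattern must match a whole word, so "of" does not hit "often"
# -i      : ignore case, so "in" also hits "In"
# -v      : invert the match, i.e. keep only lines that match no pattern
filtered=$(printf '%s\n' "$counts" | grep -ivwf "$preps")

printf '%s\n' "$filtered"
rm -f "$preps"
```

Only the lines for "the", "often" and "monster" survive. Note that an empty line in the pattern file would match every input line, so with -v nothing would be left.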

Problem #3

Here I was fine if you just did the analysis on everything that is not part of an HTML tag (i.e. everything <tag NOT HERE> ... but here! ... </tag>).

After taking out the tags, this problem turns into problem #1.

tr '\n' ' ' < superbowl_print.html | sed 's/<[^<]*>//g' | grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" | sort -f | uniq -ci | sort -nr | head -n 100

Explanation
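
The tag-stripping step can be tried on its own. The fragment below is a made-up stand-in for superbowl_print.html, chosen so that one tag is split across two lines.

```shell
# Hypothetical HTML fragment standing in for superbowl_print.html.
page=$(mktemp)
printf '%s\n' '<p class="lead">The <b>Giants</b> beat' \
              'the Patriots.</p><a' \
              'href="x.html">More</a>' > "$page"

# tr replaces every newline with a space, so the whole page becomes one
# line and a tag split across lines (the <a ...> tag above) can still be
# matched by a single sed expression.
# sed then deletes every span that starts with "<" and ends with ">",
# leaving only the text between the tags.
text=$(tr '\n' ' ' < "$page" | sed 's/<[^<]*>//g')

printf '%s\n' "$text"
rm -f "$page"
```

What is left is plain text with no angle brackets, which is then fed to the same word-counting pipeline as in problem #1. The expression assumes that the text between tags contains no literal ">" character.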


Problem #4

tr '\n' ' ' < superbowl_print.html | sed 's/<[^<]*>//g' | grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" | grep -B 5 -A 5 -Ei "giants|patriots"

Explanation

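This starts by stripping away the HTML tags as before and then uses grep -o to print every word on its own line. The final grep prints every line that matches "giants" or "patriots" (ignoring case) together with the 5 lines before it (-B 5) and the 5 lines after it (-A 5). Since each line holds exactly one word, this gives the 5 words on either side of each match.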
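
The context options are easier to follow on a short word stream. In this sketch the word list is invented, and the context is reduced from 5 lines to 2 to keep the output small.

```shell
# Hypothetical word stream, one word per line, as produced by grep -o.
words=$(printf '%s\n' one two three Giants four five six seven \
                      eight nine Patriots ten)

# -B 2 and -A 2 print 2 lines before and after every matching line.
# Because each line holds a single word, that is 2 words of context.
# -E enables the "|" alternation, -i makes the match case-insensitive.
context=$(printf '%s\n' "$words" | grep -B 2 -A 2 -Ei "giants|patriots")

printf '%s\n' "$context"
```

The output contains "two" through "five" around "Giants" and "eight" through "ten" around "Patriots". The words "six" and "seven" fall outside both windows and are not printed; grep marks such a gap between groups with a separator line.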

Problem #5

gawk -F ',' '
    /start work/           { work   -= $3 }
    /end work/             { work   += $3 }
    /start run/            { run    -= $3 }
    /end run/              { run    += $3 }
    /start farmers market/ { market -= $3 }
    /end farmers market/   { market += $3 }
    END {
        print "Total work hours", work
        print "Total run hours", run
        print "Total farmers market hours", market
    }
' activity_log.csv

Explanation

This is a simple computational task over a structured file, which is exactly what gawk/awk is designed for. First, we tell gawk that the columns of this file are separated by commas instead of the default (runs of spaces and tabs). This is accomplished via the -F ',' flag (or equivalently by setting FS = "," in a BEGIN block in the code).
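
The two ways of setting the field separator can be compared directly. The log line below is hypothetical: the only thing taken from the solution is that the time is in the third column, and plain awk is used here so the sketch runs where gawk is not installed.

```shell
# Hypothetical log line; the layout of the first two columns is assumed.
line='2012-02-01,start work,9'

# Field separator given on the command line ...
with_flag=$(printf '%s\n' "$line" | awk -F ',' '{ print $3 }')

# ... or set inside the program. It goes in a BEGIN block so that it
# takes effect before the first input line is split into fields.
with_fs=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "," } { print $3 }')

printf '%s %s\n' "$with_flag" "$with_fs"
```

Both commands print the third field, 9.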

Next, as gawk runs through the file line by line, it tests each line against the patterns and, on a match, updates the corresponding variable. Remember that gawk treats a variable that has never been assigned as 0 when it is used as a number, so the totals need no explicit initialization. The number of hours spent on one session of an activity is hours = end - start. So when we see a line that marks the start of an activity, we subtract the time field (the third field, which we refer to as $3) from the running total for that activity. Similarly, when we see the end of an activity, we add the time field to the running total. Summed over all sessions, the subtracted start times and the added end times give exactly the sum of (end - start), so by the end of the file each running total holds the total number of hours spent on that activity.
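
A small worked example makes the running total concrete. The log below is invented: the column layout (date, event, time in hours) is an assumption, and only the work activity is tracked to keep the sketch short.

```shell
# Hypothetical log; only "time is the third column" comes from the solution.
log=$(mktemp)
cat > "$log" <<'EOF'
2012-02-01,start work,9
2012-02-01,end work,17
2012-02-02,start work,10
2012-02-02,end work,15.5
EOF

# The running total evolves as -9, 8, -2, 13.5, that is
# work = (17 - 9) + (15.5 - 10) = 13.5
total=$(awk -F ',' '
    /start work/ { work -= $3 }   # subtract each start time
    /end work/   { work += $3 }   # add each end time
    END          { print work }
' "$log")

printf 'Total work hours %s\n' "$total"
rm -f "$log"
```

The intermediate values of the total are meaningless on their own (it is negative after every start line); only the value after the last end line is the number of hours.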

After running through all the lines in the file, gawk will execute the END block, which here just tells it to print the various totals.