grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" frankenstein.txt | sort -f | uniq -ci | sort -nr | head -n 100
grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" frankenstein.txt
This prints every word in the file on its own line.
The -o option tells grep to print only the parts of each line that match the given pattern, one match per output line, which amounts to printing every word on a separate line. The -E option tells it to use extended regular expression semantics.
Extended regular expressions allow me to use the + and | operators without escaping them. Without the -E the regular expression would look like this:
[[:alpha:]]\+\|[[:alpha:]]\+'[[:alpha:]]\+ which has the same meaning in GNU grep (the \+ and \| forms are GNU extensions to basic regular expressions) but is uglier to look at.
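The regular expression matches two cases: either a run of one or more alphabetic characters ([[:alpha:]]+), or one or more alphabetic characters, then an apostrophe, then one or more alphabetic characters ([[:alpha:]]+'[[:alpha:]]+).
The second case handles situations like you're or don't. Because grep reports the longest match starting at each position, the second case wins whenever it applies, so don't is printed whole rather than as don and t.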
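As a quick check of the pattern, here is a minimal sketch that runs the same grep on a made-up sentence (the sentence and the variable names sample and words are illustrative, not part of the assignment):

```shell
# Hypothetical sample sentence (not taken from frankenstein.txt).
sample="You're late, don't run."

# -o prints each match on its own line; -E lets us write + and | unescaped.
words=$(printf '%s\n' "$sample" | grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+")

printf '%s\n' "$words"
# Prints:
# You're
# late
# don't
# run
```

The punctuation and spaces are dropped because neither alternative can match them.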
sort -f
This sorts the output of the previous grep alphabetically. Sorting is needed for the counting step that follows, because uniq only collapses consecutive identical lines.
The -f flag tells sort to ignore case (more precisely, it folds lowercase letters to uppercase when comparing, so that words differing only in case end up next to each other).
uniq -ci
This collapses each run of consecutive identical lines into a single line prefixed with the number of lines in the run (the -c flag), and treats lines that differ only in case as identical (the -i flag).
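To see why the sort has to come first, here is a small sketch on a made-up word list (the list and variable names are illustrative; LC_ALL=C is set only to keep the ordering reproducible across machines):

```shell
# Hypothetical word list, one word per line.
words='the
The
cat
THE
cat'

# Without sorting, only runs that happen to be adjacent are merged,
# so the same word is counted in several separate groups.
unsorted=$(printf '%s\n' "$words" | uniq -ci)

# With sort -f first, every spelling of a word lands in one run,
# and uniq -ci reports a single count per word.
sorted=$(printf '%s\n' "$words" | LC_ALL=C sort -f | uniq -ci)

printf 'unsorted:\n%s\nsorted:\n%s\n' "$unsorted" "$sorted"
```

The unsorted version reports four groups, while the sorted version reports two: a count of 2 for cat and a count of 3 for the. The exact padding in front of the counts, and which spelling is shown for each group, can differ between systems.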
sort -nr
This sorts the counted list numerically (with the -n flag) in reverse order (with the -r flag). This way, the most frequent words are at the top of the list.
head -n 100
This returns the first 100 lines of the list produced by the prior command, which gives the 100 most frequent words.
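The effect of the -n flag is easiest to see next to a plain reverse sort. A minimal sketch with made-up counts (the data and variable names are illustrative):

```shell
# Hypothetical "count word" lines, in the shape uniq -c produces.
counts='3 the
10 of
2 cat
7 and'

# Numeric reverse sort: 10 is treated as a number, so it comes first.
top=$(printf '%s\n' "$counts" | sort -nr | head -n 2)

# Plain reverse sort compares characters, so "10" sorts below "2".
lex=$(printf '%s\n' "$counts" | sort -r | head -n 2)

printf 'numeric:\n%s\nlexical:\n%s\n' "$top" "$lex"
# numeric:
# 10 of
# 7 and
# lexical:
# 7 and
# 3 the
```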
grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" frankenstein.txt | sort -f | uniq -ci | sort -nr | grep -ivwf prepositions.txt | head -n 100
grep -ivwf prepositions.txt
The -f flag tells grep to read its patterns, one per line, from a file; here that file is the list of prepositions, prepositions.txt.
We want grep to match full words (and not substrings, such that "futon" does not count as a match for the preposition "on"), and that is done using the -w flag.
We also tell grep to ignore case using the -i flag.
Finally, we also tell grep to invert the match and only print lines that do not match the prepositions (that is, we delete the lines with prepositions in them), using the -v flag.
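Here is a small sketch of the filter on made-up data. The three-word pattern file stands in for prepositions.txt, and the temporary file and variable names are assumptions for illustration:

```shell
# Hypothetical stand-in for prepositions.txt: one pattern per line.
tmp=$(mktemp)
printf '%s\n' on of in > "$tmp"

# Hypothetical counted lines, as produced by uniq -c.
counted='5 of
4 futon
3 On
2 monster'

# -f reads patterns from the file, -w requires whole-word matches,
# -i ignores case, and -v keeps only the lines that do NOT match.
kept=$(printf '%s\n' "$counted" | grep -ivwf "$tmp")
rm -f "$tmp"

printf '%s\n' "$kept"
# Prints:
# 4 futon
# 2 monster
```

The line with On is removed thanks to -i, while futon and monster survive because -w refuses to match on inside a longer word.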
Here I was fine if you just did the analysis on everything that is not part of an HTML tag (i.e. <tag NOT HERE> ... but here! ... </tag>).
After taking out the tags, this problem turns into problem #1.
tr '\n' ' ' < superbowl_print.html | sed 's/<[^<]*>//g' | grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" | sort -f | uniq -ci | sort -nr | head -n 100
tr '\n' ' ' < superbowl_print.html
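This converts all newline characters to spaces so that the entire file is on one line. sed processes its input one line at a time, so this lets the next step remove tags that span several lines.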
sed 's/<[^<]*>//g'
This removes all tags from the text by replacing them with nothing. The regular expression matches a string that begins with a <, ends with a >, and has no other < in between. The restriction is needed because sed matches greedily: a looser pattern such as <.*> would match the longest possible span, from the first < on the line to the last >, swallowing the text and any other tags in between.
Forbidding a second < inside the match forces the equivalent of a non-greedy match, since a match can never run into the start of the next tag.
The g flag at the end of the substitution command tells sed to apply the substitution to every match on the line; without it, sed replaces only the first match it encounters.
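The difference between the two patterns can be seen on a tiny made-up fragment (the HTML snippet and variable names are illustrative, not taken from superbowl_print.html):

```shell
# Hypothetical two-line HTML fragment.
html='<p>The <b>Giants</b>
won</p>'

# Join the lines, then delete each tag individually.
text=$(printf '%s' "$html" | tr '\n' ' ' | sed 's/<[^<]*>//g')

# For comparison: the greedy pattern matches from the first < to the
# last > on the line, so the text between the tags is deleted too.
greedy=$(printf '%s' "$html" | tr '\n' ' ' | sed 's/<.*>//g')

printf 'per-tag: [%s]\ngreedy:  [%s]\n' "$text" "$greedy"
# per-tag: [The Giants won]
# greedy:  []
```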
tr '\n' ' ' < superbowl_print.html | sed 's/<[^<]*>//g' | grep -Eo "[[:alpha:]]+|[[:alpha:]]+'[[:alpha:]]+" | grep -B 5 -A 5 -Ei "giants|patriots"
grep -B 5 -A 5 -Ei "giants|patriots"
Remember that the previous steps leave us with one word per line.
The -B 5 flag tells grep to print the 5 lines before a match (which corresponds to the 5 words before a match).
Similarly, the -A 5 flag tells grep to print the 5 lines after a match (which corresponds to the 5 words after a match).
As before, the -E flag tells grep to use extended regular expressions; this is not strictly necessary, but it allows us to write | instead of \|.
The -i flag tells grep to look for matches ignoring case. The expression "giants|patriots" matches either giants or patriots.
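A scaled-down sketch of the context flags, using one word of context instead of five and a made-up sentence (the words and the variable name ctx are illustrative):

```shell
# Hypothetical input, already split into one word per line.
ctx=$(printf '%s\n' the mighty Giants beat the undefeated Patriots in Arizona |
      grep -B 1 -A 1 -Ei "giants|patriots")

printf '%s\n' "$ctx"
# Prints:
# mighty
# Giants
# beat
# --
# undefeated
# Patriots
# in
```

Note the -- line: grep prints it between groups of context that are not adjacent in the input, so it may need to be filtered out if the output is processed further.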
gawk -F ',' '
/start work/ {work -= $3}
/end work/ {work += $3}
/start run/ {run -= $3}
/end run/ {run += $3}
/start farmers market/ {market -= $3}
/end farmers market/ {market += $3}
END {
print "Total work hours", work
print "Total run hours", run
print "Total farmers market hours", market }
' activity_log.csv
This is a simple computational task over a structured file, and gawk/awk is the perfect tool for it.
First, we tell gawk that the columns of this file are separated by commas instead of the default (which is runs of whitespace, i.e. spaces and tabs). This is accomplished via the -F ',' flag (or equivalently by setting FS = ',' in a BEGIN block in the code).
Next, as gawk runs through the file line by line, it tries to match different expressions, and depending on what it matches it updates a variable value. Remember that gawk treats a variable that has never been assigned as 0 when it is used as a number, so the totals need no explicit initialization.
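The two ways of setting the separator can be compared on a single made-up line. The line layout is an assumption for illustration, and plain awk is used here since nothing in it is specific to gawk:

```shell
# Hypothetical log line; the real layout of activity_log.csv may differ.
line='2012-02-06,start work,9'

# Separator given on the command line.
via_flag=$(printf '%s\n' "$line" | awk -F ',' '{print $3}')

# Separator set inside the program, before any input is read.
via_fs=$(printf '%s\n' "$line" | awk 'BEGIN {FS = ","} {print $3}')

printf '%s %s\n' "$via_flag" "$via_fs"
# Prints: 9 9
```

The assignment has to sit in a BEGIN block so that it takes effect before the first line is split into fields.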
The formula for number of hours spent at an activity is hours = end - start. So when we see a line that matches a start of an activity, we subtract the time field (the third field which we refer to by $3) from the running total for that activity.
Similarly, when we see an end of an activity, we add the time field to the running total. By the end of the file, each running total holds the total number of hours spent on that activity.
After running through all the lines in the file, gawk will execute the END block, which here just tells it to print the various totals.
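The subtract-then-add bookkeeping can be traced on a short made-up log. The column layout (date, event, time in hours) and all values are assumptions for illustration only, and plain awk is used in place of gawk since the program relies on nothing gawk-specific:

```shell
# Hypothetical log: date, event, time of day in hours.
log='2012-02-06,start work,9
2012-02-06,end work,17
2012-02-07,start run,18
2012-02-07,end run,19.5
2012-02-07,start work,8
2012-02-07,end work,12'

# Each start subtracts its time and each end adds its time, so every
# session contributes (end - start) to the running total.
totals=$(printf '%s\n' "$log" | awk -F ',' '
/start work/ {work -= $3}
/end work/   {work += $3}
/start run/  {run -= $3}
/end run/    {run += $3}
END { print "work", work; print "run", run }')

printf '%s\n' "$totals"
# Prints:
# work 12
# run 1.5
```

For work, the total is (17 - 9) + (12 - 8) = 12; for run, it is 19.5 - 18 = 1.5. The approach assumes every start line has a matching end line and that times are plain numbers in a single unit.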