Unix Text Processing Pipelines
Discussions revolve around using shell command-line tools like sort, uniq, cut, awk, and grep in pipelines for tasks such as deduplication, sorting, and frequency counting of words or lines, often comparing them to scripts in languages like Perl or Python.
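The recurring pattern behind most of these threads is the line-frequency pipeline. A minimal sketch, assuming a plain line-oriented input file (the name access.log is just a placeholder):

    # Show the ten most frequent lines, each prefixed with its count.
    # "access.log" is a placeholder; any line-oriented input works.
    sort access.log | uniq -c | sort -rn | head -n 10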
Sample Comments
cat | cut -d | sort | uniq or when in doubt, just write a few lines of perl.
Sort lines of text | filter to only print unique lines | print just the first word of each line | filter to print just the unique lines and the count | sort the counts numerically | print the top 10. Given a bunch of lines of text, this reports on the frequency of starting words. Probably the -C should be lower case?
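Read literally, those steps correspond to a pipeline roughly like the one below. This is a reconstruction, not the commenter's actual code: the file name is a placeholder, awk is assumed for the "first word" step, an extra sort is added so identical words are adjacent before counting, and sort -rn | head is assumed for "print the top 10".

    # Frequency of the first word on each line, ten most common first.
    # "input.txt" is a placeholder file name.
    sort input.txt |        # sort whole lines
        uniq |              # keep only unique lines
        awk '{print $1}' |  # print just the first word of each line
        sort |              # regroup identical first words before counting
        uniq -c |           # print each unique word with its count
        sort -rn |          # sort the counts numerically, largest first
        head -n 10          # print the top 10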
depending on what you're doing, awk / sort / uniq / specialized scripts in ruby / sqlite / R. Provide more details.
a set replaces 'sort -u' or 'sort | uniq'. A dictionary replaces 'sort | uniq -c'
Sure it can, but so can one python script. For that matter, so can the four or five pipes. What's so wrong with `grep regex myFile | sort -u | cut -d ',' -f 3` etc.?
I tried a perl script versus the shell pipeline; see my other comments in the thread. It's significantly faster, because using sort and uniq -c is a pretty high-effort way to count words, especially for big lists. Your words seem to work fine with it; it's just splitting on whitespace.
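A single-pass counter is the usual alternative this comment points at: instead of sorting every word so that uniq -c can count runs, keep a running tally. A sketch of the idea in awk rather than perl (the file name is a placeholder):

    # Count word frequencies in one pass, splitting on whitespace, then rank by count.
    # "input.txt" is a placeholder file name.
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print count[w], w }' input.txt | sort -rn | head -n 10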
The "namecount" example is rather silly though. It can be solved by a shell script much shorter than one line (sort|uniq -c).
My intuitions start with: cut, wc, sort, uniq
isn't the "frequency" perl script the same as "sort | uniq -c" ?
I was wondering if there's a CLI that can replace sort | uniq -c | sort -nr. I feel like this is a very commonly used pattern.
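As far as the standard coreutils go, there doesn't appear to be a single command for that pattern, but a small shell function makes it feel like one (the name freq here is arbitrary):

    # Wrap the common "frequency table" pattern in a single command.
    # The function name "freq" is arbitrary.
    freq() {
        sort "$@" | uniq -c | sort -nr
    }
    # Usage: freq words.txt    or    some_command | freq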