I am studying Scripting as part of my Master's Degree in Data Science, and I previously asked the question below:
How to filter all values in Bash's regular expression (Linux) except one?
I learned that ERE is not exactly Bash. But I am stuck again.
Translating the assignment statement:
The regular expression must be used within a Bash script called
script.sh that runs a grep command. The script should work
running as follows:
./script.sh ./headlines_words.csv
Sample data from the file headlines_words.csv:
Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
14;279;2012;India;cricketer;0;3634;0.0;9;20;20;empowerment
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
9887;57556;2021;all countries;hate;312;234227;0.001332041;4;436;429;crime and violence
The output must satisfy three conditions:
- The year is 2010 or 2011
- The value of the freq_rank column is greater than or equal to 910
- The country column is not "all countries"
So, the correct output of ./script.sh ./headlines_words.csv must be:
Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
I used grep -E (I know it is not the most suitable tool, but it is mandatory) with the following code for script.sh:
#!/bin/bash
grep -E '^([^;]*;){2}201[01];([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]]|all countr[^i]]|all countri[^e]|all countrie[^s]);[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;9[1-9][0-9]{2,}' $1
The first part:
^([^;]*;){2}201[01];
and the second part:
([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]]|all countr[^i]]|all countri[^e]|all countrie[^s])
worked both separately and together.
But the last part:
[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;9[1-9][0-9]{2,}
breaks it: when I run ./script.sh ./headlines_words.csv in Ubuntu, the terminal does not print any rows.
Thank you very much for your time!
>Solution :
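Since your project requires grep -E, here is a grep solution. Two things broke the last part of your pattern: the country alternation only consumes the first few characters of the field, so the rest of that field needs its own [^;]*; (hence ([^;]*;){6} below), and 9[1-9][0-9]{2,} requires at least four digits, so it can never match 935. I also removed the stray ] after [^r] and [^i].
grep -E '^([^;]*;){2}201[01];([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]|all countr[^i]|all countri[^e]|all countrie[^s])([^;]*;){6}(9[1-9][0-9]|[1-9][0-9]{3,});' file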
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
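To see how the freq_rank piece of each pattern behaves, it can help to test it in isolation against a few values. This is only a minimal sketch: the check helper and the variable names old and new are made up for illustration, and the patterns are anchored with ^ and $ so that each value is tested as a whole.

```shell
#!/bin/bash
# freq_rank piece from the question: a 9, a digit 1-9, then two or more digits (four digits minimum).
old='^9[1-9][0-9]{2,}$'
# freq_rank piece from the answer: 910-999, or any number with four or more digits.
new='^(9[1-9][0-9]|[1-9][0-9]{3,})$'

# Hypothetical helper: prints "match" or "no match" for value $2 against ERE pattern $1.
check() {
  if printf '%s\n' "$2" | grep -Eq "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

# Try values around the 910 boundary and one four-digit value.
for n in 909 910 935 999 1000; do
  printf '%s: old=%s, new=%s\n' "$n" "$(check "$old" "$n")" "$(check "$new" "$n")"
done
```

With these inputs the question's pattern rejects every value, including 935, while the answer's pattern rejects only 909.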
However, I must add that awk is a far more suitable tool for this job. An awk solution would be:
awk -F ';' '
NR == 1 || ($3 ~ /^201[01]$/ && $4 != "all countries" && $10 >= 910)
' file
Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
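Since the assignment asks for a script.sh that runs grep and the expected output includes the header line, the grep pattern could be wrapped roughly as follows. This is a sketch under assumptions, not the required solution: the function name filter_headlines and the variable names are invented here, and keeping the header by adding a literal ^Unnamed: 0; alternative is one possible choice.

```shell
#!/bin/bash
# Sketch of script.sh; intended usage: ./script.sh ./headlines_words.csv

# Fields 1-3: two arbitrary fields, then a year of 2010 or 2011.
year='^([^;]*;){2}201[01];'
# Start of field 4: any prefix that diverges from "all countries" at some character.
country='([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]|all countr[^i]|all countri[^e]|all countrie[^s])'
# Rest of field 4 plus fields 5-9 (six ";" in total), then freq_rank >= 910.
rank='([^;]*;){6}(9[1-9][0-9]|[1-9][0-9]{3,});'
# Assumption: the header line is kept by matching its literal first field.
header='^Unnamed: 0;'

# Hypothetical wrapper: print the header and every row that meets the three conditions.
filter_headlines() {
  grep -E "${header}|${year}${country}${rank}" "$1"
}

# Only run when a file argument was given.
if [ "$#" -ge 1 ]; then
  filter_headlines "$1"
fi
```

Splitting the expression into named pieces keeps each condition readable and lets each one be tested on its own before combining them.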