How to finish this regex in ERE

I am taking a Scripting course as part of my Master's degree in Data Science, and I previously asked the question below:

How to filter all values in Bash's regular expression (Linux) except one?

I learned that ERE is not the same thing as Bash, but I am stuck again.

Translating the statement of the assignment:

The regular expression must be used within a Bash script called
script.sh that runs a grep command. The script should work
when run as follows:

./script.sh ./headlines_words.csv

The sample data in the file headlines_words.csv is:

Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
14;279;2012;India;cricketer;0;3634;0.0;9;20;20;empowerment
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
9887;57556;2021;all countries;hate;312;234227;0.001332041;4;436;429;crime and violence

I have three conditions to satisfy:

  • The years are from 2010 to 2011
  • The value of the freq_rank column is greater than or equal to 910
  • The country column is not "all countries"

So, the correct output of ./script.sh ./headlines_words.csv must be:

Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes

I used grep -E (I know it is not the most suitable tool, but using it is mandatory) with the following code for script.sh:

#!/bin/bash

grep -E '^([^;]*;){2}201[01];([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]]|all countr[^i]]|all countri[^e]|all countrie[^s]);[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;9[1-9][0-9]{2,}' $1

This part:

^([^;]*;){2}201[01];

and this part:

([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]]|all countr[^i]]|all countri[^e]|all countrie[^s])

worked both separately and together.

But the last part:

[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;9[1-9][0-9]{2,}'

does not work: when I run ./script.sh ./headlines_words.csv on Ubuntu, the terminal prints no rows.

Thank you very much for your time!

Solution:

Since using grep -E is a requirement of your project, here is a grep solution:

grep -E '^([^;]*;){2}201[01];([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]|all countr[^i]|all countri[^e]|all countrie[^s])([^;]*;){6}(9[1-9][0-9]|[1-9][0-9]{3,});' file

8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
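
The tail in the question appears to fail for two reasons. First, the country alternation consumes only the first character or so of the field, so a ; placed right after it cannot match a value such as USA; the ([^;]*;){6} group instead absorbs the rest of the country field plus the five fields that follow. Second, 9[1-9][0-9]{2,} needs at least four digits, so 935 can never match it. The second point can be checked with a small sketch like the one below, where the sample numbers and the helper name compare_patterns are made up for illustration:

```shell
#!/bin/bash
# Compare the original and the corrected freq_rank patterns on sample values.
# compare_patterns is a hypothetical helper name, not part of the assignment.
compare_patterns() {
  local n old new
  for n in 909 910 935 999 1000 9100; do
    old=no
    new=no
    # Original tail: 9, one digit 1-9, then TWO OR MORE digits (4+ digits in total).
    printf '%s;\n' "$n" | grep -Eq '^9[1-9][0-9]{2,}' && old=yes
    # Corrected tail: 910-999, or a number with four or more digits, then ";".
    printf '%s;\n' "$n" | grep -Eq '^(9[1-9][0-9]|[1-9][0-9]{3,});' && new=yes
    echo "$n old=$old new=$new"
  done
}

compare_patterns
```

Only the corrected pattern accepts the three-digit values from 910 to 999, which is the range the sample row with freq_rank 935 falls into.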

However, I must add that awk is a far more suitable tool for this job. An awk solution would be:

awk -F ';' '
 NR == 1 || ($3 ~ /^201[01]$/ && $4 != "all countries" && $10 >= 910)
' file

Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
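
Note that the grep command above prints only the data row, while the expected output in the question also includes the header. One possible way to stay with a single grep -E call is to add a second alternative that matches the header row. The sketch below assumes the header always starts with Unnamed: 0;Unnamed: 0.1;year; as in the sample, and it compares the grep result with the awk filter. The temporary file, the variable names, and the last sample row (year 2011, "all countries", freq_rank 950) are made up for this check:

```shell
#!/bin/bash
# Sketch: header row plus matching rows from one grep -E call, checked against awk.
# The temp file, variable names, and the final sample row are assumptions
# invented for this example; they are not part of the assignment data.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
14;279;2012;India;cricketer;0;3634;0.0;9;20;20;empowerment
8964;52224;2010;USA;witchcraft;0;1912;0.0;10;935;935;female stereotypes
9887;57556;2021;all countries;hate;312;234227;0.001332041;4;436;429;crime and violence
1;2;2011;all countries;example;0;1;0.0;7;950;950;made-up row
EOF

# Alternative 1: the header row, recognised by its first three column names.
header='^Unnamed: 0;Unnamed: 0\.1;year;'
# Alternative 2: year 2010-2011, country not beginning with "all countries", freq_rank >= 910.
rows='^([^;]*;){2}201[01];([^a]|a[^l]|al[^l]|all[^ ]|all [^c]|all c[^o]|all co[^u]|all cou[^n]|all coun[^t]|all count[^r]|all countr[^i]|all countri[^e]|all countrie[^s])([^;]*;){6}(9[1-9][0-9]|[1-9][0-9]{3,});'

grep_out="$(grep -E "$header|$rows" "$sample")"
awk_out="$(awk -F ';' 'NR == 1 || ($3 ~ /^201[01]$/ && $4 != "all countries" && $10 >= 910)' "$sample")"

printf '%s\n' "$grep_out"
[ "$grep_out" = "$awk_out" ] && echo "grep and awk agree on the sample"
rm -f "$sample"
```

In script.sh itself, the same grep -E call would take "$1" in place of the temporary file. Whether a header alternative like this is acceptable depends on how strictly the assignment interprets "a grep command".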