What regular expression should I use to extract stock symbols from a text file?

I would like to use a regular expression to extract all the stock symbols (ALLY, AMZN, AXP, AON, etc.) from a text file.

The following data is in the text file:

Symbol  Holdings    Stake   Mkt. price  Value   Pct of portfolio
Ally Financial Inc  ALLY    29,000,000  9.6%    $25.59  $742,110,000    0.2%
Amazon.com, Inc.    AMZN    10,551,000  0.1%    $137.09 $1,446,436,590  0.4%
American Express Company    AXP 151,610,700 20.8%   $149.79 $22,709,766,753 6.7%
Aon PLC AON 4,335,000   2.1%    $315.64 $1,368,299,400  0.4%
Apple Inc   AAPL    915,560,382 5.9%    $176.54 $161,633,029,838    47.4%
Bank of America Corp    BAC 1,032,852,006   13.0%   $27.25  $28,145,217,164 8.3%
BYD Co. Ltd BYDDF   98,603,142  9.0%    $30.11  $2,968,940,606  0.9%
Capital One Financial Corp. COF 12,471,030  3.3%    $103.99 $1,296,862,410  0.4%
Celanese Corporation    CE  5,358,535   4.9%    $115.37 $618,214,183    0.2%
Charter Communications Inc  CHTR    3,828,941   2.6%    $409.20 $1,566,802,657  0.5%
Chevron Corporation CVX 123,120,120 6.5%    $147.07 $18,107,276,048 5.3%
Citigroup Inc   C   55,244,797  2.9%    $40.83  $2,255,645,062  0.7%
Coca-Cola Co    KO  400,000,000 9.3%    $56.98  $22,792,000,000 6.7%

What regular expression should I use to extract all the stock symbols in the text file?

>Solution :

Ok, so this one will work:

\b[A-Z]+?(?=\s+\d+)\b

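It looks for a group of uppercase letters followed by whitespace and a number. One caveat I realized afterwards: a company name that itself contains an uppercase word followed by a number would match the same pattern and produce a false positive.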
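As a rough illustration, here is how the pattern could be applied in Python to a few rows copied from the question. The names `SAMPLE`, `SYMBOL_RE`, `STRICT_SYMBOL_RE` and `TRICKY_ROW` are made up for this sketch, and `TRICKY_ROW` is an invented row (not part of the original data) that triggers the caveat about company names. The stricter pattern is only one possible workaround, assuming every symbol is followed by a holdings figure and then a percentage:

```python
import re

# A few rows copied from the question's sample data.
SAMPLE = """\
Symbol  Holdings    Stake   Mkt. price  Value   Pct of portfolio
Ally Financial Inc  ALLY    29,000,000  9.6%    $25.59  $742,110,000    0.2%
Aon PLC AON 4,335,000   2.1%    $315.64 $1,368,299,400  0.4%
BYD Co. Ltd BYDDF   98,603,142  9.0%    $30.11  $2,968,940,606  0.9%
Citigroup Inc   C   55,244,797  2.9%    $40.83  $2,255,645,062  0.7%
"""

# The pattern from the answer, unchanged.
SYMBOL_RE = re.compile(r"\b[A-Z]+?(?=\s+\d+)\b")

# Assumed stricter variant: the symbol must be followed by a number
# (digits and commas) and then a percentage, as in the sample rows.
STRICT_SYMBOL_RE = re.compile(r"\b[A-Z]+(?=\s+[\d,]+\s+[\d.]+%)")

symbols = SYMBOL_RE.findall(SAMPLE)
print(symbols)  # ['ALLY', 'AON', 'BYDDF', 'C']

# Hypothetical row, invented to show the caveat: the company name
# contains an uppercase word followed by a number.
TRICKY_ROW = "ABC 123 Holdings   XYZ   1,000   1.0%    $2.00   $2,000  0.1%"
print(SYMBOL_RE.findall(TRICKY_ROW))         # ['ABC', 'XYZ']
print(STRICT_SYMBOL_RE.findall(TRICKY_ROW))  # ['XYZ']
```

Note that `PLC` in "Aon PLC" and `BYD` in "BYD Co. Ltd" are not matched by either pattern, because neither is directly followed by whitespace and a digit.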

Explanation:

- \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
- [A-Z] matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
     +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
- Positive lookahead (?=\s+\d+)
    Asserts that the regex below matches at this position, without consuming any characters
      \s matches any whitespace character (equivalent to [\r\n\t\f\v ])
      + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
      \d matches a digit (equivalent to [0-9])
      + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
- \b (after the lookahead, not inside it) asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
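
The same breakdown can be written directly into the pattern with Python's `re.VERBOSE` flag, which ignores whitespace and allows `#` comments inside the regex. This is just a sketch; the name `SYMBOL_RE_VERBOSE` is made up for illustration, and the pattern itself is unchanged from the answer:

```python
import re

SYMBOL_RE_VERBOSE = re.compile(
    r"""
    \b          # word boundary: the symbol starts a new word
    [A-Z]+?     # one or more uppercase letters, as few as possible (lazy)
    (?=         # positive lookahead: what follows must be...
        \s+     #   one or more whitespace characters
        \d+     #   then one or more digits (start of the holdings figure)
    )
    \b          # word boundary: the symbol ends here
    """,
    re.VERBOSE,
)

# Row copied from the question's sample data.
row = "Coca-Cola Co    KO  400,000,000 9.3%    $56.98  $22,792,000,000 6.7%"
matches = SYMBOL_RE_VERBOSE.findall(row)
print(matches)  # ['KO']
```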

See it working here: https://regex101.com/r/ruBdVj/1

