I would like to use a regular expression to extract all the stock symbols (ALLY, AMZN, AXP, AON, etc.) from a text file.
The text file contains the following data:
Symbol Holdings Stake Mkt. price Value Pct of portfolio
Ally Financial Inc ALLY 29,000,000 9.6% $25.59 $742,110,000 0.2%
Amazon.com, Inc. AMZN 10,551,000 0.1% $137.09 $1,446,436,590 0.4%
American Express Company AXP 151,610,700 20.8% $149.79 $22,709,766,753 6.7%
Aon PLC AON 4,335,000 2.1% $315.64 $1,368,299,400 0.4%
Apple Inc AAPL 915,560,382 5.9% $176.54 $161,633,029,838 47.4%
Bank of America Corp BAC 1,032,852,006 13.0% $27.25 $28,145,217,164 8.3%
BYD Co. Ltd BYDDF 98,603,142 9.0% $30.11 $2,968,940,606 0.9%
Capital One Financial Corp. COF 12,471,030 3.3% $103.99 $1,296,862,410 0.4%
Celanese Corporation CE 5,358,535 4.9% $115.37 $618,214,183 0.2%
Charter Communications Inc CHTR 3,828,941 2.6% $409.20 $1,566,802,657 0.5%
Chevron Corporation CVX 123,120,120 6.5% $147.07 $18,107,276,048 5.3%
Citigroup Inc C 55,244,797 2.9% $40.83 $2,255,645,062 0.7%
Coca-Cola Co KO 400,000,000 9.3% $56.98 $22,792,000,000 6.7%
What regular expression should I use to extract all the stock symbols in the text file?
>Solution :
This pattern works for the data shown:
\b[A-Z]+?(?=\s+\d+)\b
It matches a run of uppercase letters that is followed by whitespace and then a digit. One caveat: a company name that contains an all-uppercase word followed by a number would match the same pattern, so it would be picked up as well.
Explanation:
- \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
- [A-Z]+? matches a single character present in the list [A-Z], repeated lazily
  - A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
  - +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
- (?=\s+\d+) is a positive lookahead: it asserts that the regex inside it matches
  - \s matches any whitespace character (equivalent to [\r\n\t\f\v ])
  - + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \d matches a digit (equivalent to [0-9])
  - + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
- \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
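To try the pattern outside an online tester, here is a minimal Python sketch. It assumes Python's built-in re module and uses a few rows copied from the question as an inline string; the names SAMPLE, SYMBOL_RE and extract_symbols are only illustrative, and in practice the text would be read from the file instead.

```python
import re

# A few rows copied from the question's data (illustrative subset).
SAMPLE = """\
Ally Financial Inc ALLY 29,000,000 9.6% $25.59 $742,110,000 0.2%
Amazon.com, Inc. AMZN 10,551,000 0.1% $137.09 $1,446,436,590 0.4%
BYD Co. Ltd BYDDF 98,603,142 9.0% $30.11 $2,968,940,606 0.9%
Citigroup Inc C 55,244,797 2.9% $40.83 $2,255,645,062 0.7%
Coca-Cola Co KO 400,000,000 9.3% $56.98 $22,792,000,000 6.7%
"""

# Uppercase run that is followed by whitespace and at least one digit.
SYMBOL_RE = re.compile(r"\b[A-Z]+?(?=\s+\d+)\b")


def extract_symbols(text):
    """Return every substring of text matched by SYMBOL_RE, in order."""
    return SYMBOL_RE.findall(text)


print(extract_symbols(SAMPLE))  # ['ALLY', 'AMZN', 'BYDDF', 'C', 'KO']
```

Note that "BYD" in the company name is skipped because it is followed by "Co." rather than a number, and the single-letter symbol "C" is still found.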
See it working here: https://regex101.com/r/ruBdVj/1
This is from https://www.debuggex.com/
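Regarding the caveat that a company name might contain an uppercase word followed by a number: one possible tightening, sketched here in Python, is to also require the percentage column after the holdings figure. This is only an assumption about how the rows are laid out (symbol, holdings, stake as a percentage), and the test row below is made up for illustration; it does not appear in the original data.

```python
import re

# Original pattern: uppercase run followed by whitespace and a digit.
LOOSE_RE = re.compile(r"\b[A-Z]+?(?=\s+\d+)\b")

# Stricter sketch (assumed layout): symbol, then a holdings figure made of
# digits and commas, then a stake written as a percentage such as 9.6%.
STRICT_RE = re.compile(r"\b([A-Z]+)\s+[\d,]+\s+\d+(?:\.\d+)?%")

# Hypothetical row, invented to show the failure mode; not from the file.
ROW = "ABC 123 Holdings Corp XYZ 1,000,000 1.5% $10.00 $10,000,000 0.1%"

loose = LOOSE_RE.findall(ROW)    # also picks up ABC from the company name
strict = STRICT_RE.findall(ROW)  # findall returns the captured group only

print(loose)   # ['ABC', 'XYZ']
print(strict)  # ['XYZ']
```

The stricter pattern consumes the holdings and stake columns instead of using a lookahead, so it relies on those two columns always following the symbol directly.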
