Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Why is my antlr4 grammar not parsing this text correctly

I have the following grammar in antlr4 that I am using to parse betting lines.

grammar betting;

/*
 * Parser Rules
 */

market
    :   zero
    |   array
    |   grid
    ;

zero
    :   (outcomePrice NEWLINE)+ outcomePrice
    ;
    
array
    :  TEXT NEWLINE list NEWLINE TEXT list NEWLINE TEXT list
    ;

grid
    :   (TEXT NEWLINE (linePrice NEWLINE)+)+
    ;

list
    :   (NEWLINE NUMBER)+
    ;
    
outcomePrice
    :   TEXT NEWLINE NUMBER
    ;
    
linePrice
    :   line NEWLINE NUMBER
    |   SCORE NEWLINE NUMBER
    ;
    
line
    :   SINGLE_LINE
    |   SPLIT_LINE
    ;

/*
 * Lexer Rules
 */
 
fragment DIGIT 
    :   [0-9]
    ;
    
fragment CHARACTER
    :   [ \u00a0\u00c0-\u01ffa-zA-Z'\-0-9]
    ;

SPLIT_LINE
    :   SINGLE_LINE ',' SINGLE_LINE
    ;
    
SINGLE_LINE
    :   [+\-] NUMBER
    ;
    
SCORE
    :   DIGIT+ '-' DIGIT+
    ;
    
NUMBER         
    :   DIGIT+ (.DIGIT+)? 
    ;
    
TEXT
    :   CHARACTER+
    ;
    
NEWLINE
    :   '\n'
    ;
    
WS
    :   [ \t\r]+ -> skip
    ;

I am testing against the following text,

Kodagu FC
-1.0,-1.5
1.575
-1.5
1.725
-1.5,-2.0
1.925
-2
2.050
-2.0,-2.5
2.300
Rebels FC
+1.0,+1.5
2.250
+1.5
1.975
+1.5,+2.0
1.875
+2
1.675
+2.0,+2.5
1.550

For some reason suddenly a number is not being recognised, I expect the pattern to continue and the ‘line’ to break/exit at the ‘\n’ character. Why is this the case and are there any tools to easily discover this?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

enter image description here

>Solution :

:   DIGIT+ (.DIGIT+)? 

. means "any character". The above pattern matches either a sequence of digits or two sequences of digits with any character in between. So this will match "12.34", but it can also match "12,34" or, yes, "12\n34".

Put the dot inside quotes so it only matches literal dots.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading