I am trying to extract the SQLCODE which typically is 4 integers from error text like below :
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id 629709103
3. SQLSTATE: 40001, SQLCODE: 2631 Session Id 594700603
4. SQLSTATE: T7547, SQLCODE: 754Session Id613234380
I have below pattern matching regex for the same currently. However, there is aedge case like 4 which is failing.
error_cd = re.findall(r'SQLCODE:\s([^.,\s]+)', err_log)
If there are no 4 integers after SQLCODE, i would like to extract the SQLSTATE text after the letter ‘T’
Expected Output :
1. 3535
2. 2631
3. 2631
4. 7547
Any suggestions on how to achieve this is appreciated. Thank you.
>Solution :
This could probably be done purely in regex, using a lookahead/lookbehind to handle the conditional extraction, but that could get really messy.
SOLUTION 1: Pure Regex:
EDIT: Here’s the pure regex solution… simpler than I thought it would be (though definitely more error-prone than the mixed approach… would need some additional logic to make it more robust):
re.findall(r'((?:(?<=SQLSTATE: T)[0-9]{4})|(?:(?<=SQLCODE: )[0-9]{4}))', err_log)
SOLUTION 2: Regex and Python:
The following solution uses regex to pull both the SQLSTATE and SQLCODE values, and uses a list comprehension to do the conditional extraction:
err_log = '''
1. SQLSTATE: 22018, SQLCODE: 3535.
2. SQLSTATE: 40001, SQLCODE: 2631 Session Id 629709103
3. SQLSTATE: 40001, SQLCODE: 2631 Session Id 594700603
4. SQLSTATE: T7547, SQLCODE: 754Session Id613234380
'''
error_st_cd = re.findall(r'SQLSTATE: [A-Z]*([0-9]+), SQLCODE:\s([0-9]{4})?', err_log)
error_cd = [codes[1] or codes[0] for codes in error_st_cd]
for i, cd in enumerate(error_cd):
print(f'{i+1}. {cd}')
Output:
- 3535
- 2631
- 2631
- 7547