Capture first letter of duplicate

November 24, 2021

I’ve been trying to write a program that converts romaji (romanized Japanese) to hiragana. I would like to match only the first letter of each doubled consonant: ‘kk’, ‘ss’, ‘tt’, and ‘pp’.

My attempt:

re.sub(r'(?=([kstp]))\1', 'っ', string)

I expect 'tooka' to stay 'tooka' and 'yokka' to become 'yoっka', but my regex appears to match every single [kstp] instead.

Is there an easy way I can fix this?

>Solution :

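You put the positive lookahead (?=...) in the wrong position. In your pattern the lookahead captures the next character without consuming it, and \1 then matches that same character, so every single k, s, t, or p matches. Capture the letter first, then look ahead for a repeat of it: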

import re

lst = ['tooka', 'yokka', 'chotto', 'koppu']
print([re.sub(r'([kstp])(?=\1)', 'っ', s) for s in lst])
# ['tooka', 'yoっka', 'choっto', 'koっpu']

Or, more simply, match both letters and put the second one back in the replacement: re.sub(r'([kstp])\1', r'っ\1', s) works too.
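To see the difference side by side, here is a small sketch that runs the original pattern and both fixes over the same words. The function names (sokuon_original, sokuon_lookahead, sokuon_backref) are made up for this illustration and are not part of any library.

```python
import re

# Hypothetical helper names, chosen only for this comparison.

def sokuon_original(s):
    # The lookahead captures the next letter, then \1 consumes that same
    # letter, so any single k/s/t/p is replaced.
    return re.sub(r'(?=([kstp]))\1', 'っ', s)

def sokuon_lookahead(s):
    # Capture a letter, then only assert (without consuming) that the
    # same letter follows. Only the first letter of a pair is replaced.
    return re.sub(r'([kstp])(?=\1)', 'っ', s)

def sokuon_backref(s):
    # Consume both letters of the pair, then write the second one back.
    return re.sub(r'([kstp])\1', r'っ\1', s)

words = ['tooka', 'yokka', 'chotto', 'koppu']
for w in words:
    print(w, sokuon_original(w), sokuon_lookahead(w), sokuon_backref(w))
# tooka  -> original: っooっa, both fixes: tooka
# yokka  -> original: yoっっa, both fixes: yoっka
```

For ordinary romaji input with doubled consonants, the two fixed versions give the same result, so the choice between them is mostly a matter of taste.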