Regular expression for a range of Chinese characters plus selected groups of characters

I’m trying to extract all Chinese sentences from strings that also contain additional character groups such as [NAME] and [PLACE].

I have this string

<DisplayName>凡人战争</DisplayName>
<Desc>[NAME]赶到[PLACE],发现战火正燃,此地百姓饱受战争之苦。</Desc>
<Display>劝停战争</Display>  
<OKResult><![CDATA[me:AddMsg(XT("[NAME]以仙法摄走两军首领,一番劝戒,迫使他们停止了战争 ...

and I want to find

凡人战争
[NAME]赶到[PLACE],发现战火正燃,此地百姓饱受战争之苦
[NAME]以仙法摄走两军首领,一番劝戒,迫使他们停止了战争,消弭了这场祸事
此举手段温和,虽无人知晓,但却顺应天道,[NAME]获得了一些功德

I know the regex for Chinese characters is [\u4e00-\u9fff\uFF0C]+
and for the character groups it is (\u005BNAME\u005D) and (\u005BPLACE\u005D), but how do I combine them?

I tried this in Python:

Array_of_words = re.findall(r'[\u4e00-\u9fff\uFF0C(\u005BNAME\u005D)(\u005BPLACE\u005D)]+', text)

But it additionally matches single letters and brackets, like this:

['N', 'N', '凡人战争', 'N', '[NAME]赶到[PLACE],发现战火正燃,此地百姓饱受战争之苦', '劝停战争', '[C', 'A', 'A[', 'A', 'M', '(', '(', '[NAME]以仙法摄走两军首领,一番劝戒,迫使他们停止了战争,消弭了这场祸事', '此举手段温和,虽无人知晓,但却顺应天道,[NAME]获得了一些功德', '))', 'A', 'P', '(', '(', '))', '()', ']]']

>Solution :

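Inside a character class, parentheses and letters are literal set members, so your pattern also matches any single one of (, ), [, ], N, A, M, E, P, L and C. Match the placeholders as whole tokens with an alternation instead. You can use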

re.findall(r'(?:\[(?:PLACE|NAME)]|[\u4e00-\u9fff\uFF0C])+', text)

Details

  • (?: – start of a non-capturing group:
    • \[(?:PLACE|NAME)] – [, then either PLACE or NAME, and then ]
    • | – or
    • [\u4e00-\u9fff\uFF0C] – a Chinese char pattern of yours
  • )+ – end of the group, match one or more occurrences.
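
As a quick check, here is a minimal sketch that runs the pattern against a shortened version of the sample from the question. It assumes the commas in the real data are fullwidth (U+FF0C), as the pattern implies; the variable names are only for illustration.

```python
import re

# Shortened sample adapted from the question; the comma is fullwidth (U+FF0C).
text = (
    "<DisplayName>凡人战争</DisplayName>\n"
    "<Desc>[NAME]赶到[PLACE]，发现战火正燃。</Desc>\n"
    "<Display>劝停战争</Display>\n"
)

# Alternation: a whole [NAME]/[PLACE] token, or one Chinese char / fullwidth comma.
pattern = re.compile(r'(?:\[(?:PLACE|NAME)]|[\u4e00-\u9fff\uFF0C])+')
matches = pattern.findall(text)
print(matches)
# ['凡人战争', '[NAME]赶到[PLACE]，发现战火正燃', '劝停战争']

# The original attempt: everything inside [...] is a set of single characters,
# so stray letters such as the N in "DisplayName" match on their own.
broken = re.findall(
    r'[\u4e00-\u9fff\uFF0C(\u005BNAME\u005D)(\u005BPLACE\u005D)]+', text
)
print('N' in broken)  # True
```

The ideographic full stop 。 (U+3002) is outside \u4e00-\u9fff, so each match stops just before it, which is what the expected output in the question shows.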

To match any uppercase ASCII letters inside square brackets, replace \[(?:PLACE|NAME)] with \[[A-Z]+].
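
A small sketch of that generic variant follows. The [ITEM] placeholder is made up for this example and does not come from the question's data.

```python
import re

# Generic variant: any run of uppercase ASCII letters inside square brackets
# is treated as a placeholder token.
generic = re.compile(r'(?:\[[A-Z]+]|[\u4e00-\u9fff\uFF0C])+')

# [ITEM] is a hypothetical placeholder used only for illustration.
sample = 'XT("[NAME]获得了[ITEM]，一些功德")'
found = generic.findall(sample)
print(found)
# ['[NAME]获得了[ITEM]，一些功德']

# "[CDATA[" has no closing ] after the letters, so it is not picked up.
print(generic.findall('<![CDATA[me:AddMsg('))
# []
```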
