Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How can I add a new line based on keyword for unstructured data python?

I’ve some text like this

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates
Emirates_Neon_Group:  No address
International_Cricket_Council:  No address
Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates
Allsopp_&_Allsopp:  No address
Lamprell:  No address

My aim is add a new line for every address. so that it will look this.

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group:  No address

International_Cricket_Council:  No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia Dubai , United
Arab Emirates

Allsopp_&_Allsopp:  No address

Lamprell:  No address

so the only indicator that it’s a new address is :. Here the issue is with text wrapping.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I’m trying like

with open('test.txt', 'r') as infile:
    data = infile.read()
final_list = []
for ind, val in enumerate(data.split('\n')):
    final_list.append(val)
    if val == ':':
        final_list.insert(-1, '\n')

My logic is working most of the time, but it is failing in some cases with strings having : in the middle and also fails if there is a text wrapping.

Can you guys suggest me any better way to do this?

>Solution :

Use regex substitution with re.sub on address title (recognized in format <line break><Some_address_title>:)

import re

txt = '''your_input_text'''  # assuming your text
new_text = re.sub(r'\n[^\s:]+:', r'\n\g<0>', txt)
print(new_text)

Output:

Forbes_Middle_East: 309, Building 4, Emaar Business Park , Dubai , United
Arab Emirates

Emirates_Neon_Group:  No address

International_Cricket_Council:  No address

Tourism_Development_Authority: The Ras AI Khaimah Tourism Development Authority
was established in May 2011 under the Government of
Ras AI Khaimah. Its purpose is to develop and
promote the emirate's tourism offering and
infrastructure, both domestically and abroad.
Wikipedia

Allsopp_&_Allsopp:  No address

Lamprell:  No address
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading