Extract only title from hyperlink in pandas column

I have a pandas column with hyperlinks, and I want to extract only the name of the domain, excluding the ".com", "http://", and "www." parts.

The following code works for most of my cases, but there is one where it does not return the desired string:

docs['link_title'] = docs['hyperlink'].str.extract(r'(?<=\.)(.*?)(?=\.)')

Below are examples of hyperlinks and the results:


http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/
-> "traveldailymedia"

https://www.instagram.com/p/BKDJcO-htRs/ -> "instagram"

But this is an example where I don’t get the title of the domain:

http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html
-> "vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes"

Because there is no dot before "dtinews", the lookbehind matches at the wrong position and the capture starts after the first dot, so it misses the name, which is "dtinews".

I would appreciate help with the regex here or some alternative to my approach.

>Solution :

You can use tldextract:

import tldextract
import pandas as pd

docs = pd.DataFrame({'hyperlink': [
    "http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/",
    "https://www.instagram.com/p/BKDJcO-htRs/",
    "http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html",
]})

# tldextract splits a URL into subdomain, domain, and suffix;
# .domain is the registered name without "www." or the TLD.
docs['link_title'] = docs['hyperlink'].apply(lambda x: tldextract.extract(x).domain)

Output:

>>> docs['link_title']
0    traveldailymedia
1           instagram
2             dtinews
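If pulling in tldextract is not an option, a plain-regex sketch can also fix the original pattern's failure mode: anchor on the scheme and capture the first host label after an optional "www.". This assumes the registered name is the first label of the host; hosts with extra subdomains (e.g. news.bbc.co.uk) would still be better served by tldextract:

```python
import pandas as pd

docs = pd.DataFrame({'hyperlink': [
    "http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/",
    "https://www.instagram.com/p/BKDJcO-htRs/",
    "http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html",
]})

# Skip the scheme and an optional "www.", then capture everything
# up to (but not including) the next dot of the host name.
docs['link_title'] = docs['hyperlink'].str.extract(
    r'^https?://(?:www\.)?([^./]+)\.', expand=False)

print(docs['link_title'].tolist())
```

Unlike the lookbehind version, this works whether or not a "www." prefix is present, because the capture starts right after the scheme rather than after a dot.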