Grep the title of a page when it is written with spaces or line breaks


I am trying to get the meta title of some websites.

Some people write the title like this:

`<title>AllHeart Web INC, IT Services Digital Solutions Technology
</title>
`

`<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>`

`<title>
AllHeart Web INC, IT Services Digital Solutions Technology
</title>`

There are more ways than these, but my current focus is on the three above.

I wrote a simple command, but it only captures the second way of writing the title, and I am not sure how to grep the others:

`curl -s https://allheartweb.com/ | grep -o '<title>.*</title>'`

I also wrote some code (very bad, I guess) that gets the line numbers of the opening and closing tags, like this:

`
% curl -s https://allheartweb.com/ | grep -n '<title>'                   
7:<title>AllHeart Web INC, IT Services Digital Solutions Technology

% curl -s https://allheartweb.com/ | grep -n '</title>' 
8:</title>
`

The idea was to store those line numbers and loop over the lines between them to get the title, which I guess is a bad idea.

Is there any way to get the title in all of these cases?

>Solution :

Try this:

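`curl -s https://allheartweb.com/ | tr -d '\n' | grep -oP '(?<=<title>).+?(?=</title>)' | head -n 1`

You can remove newlines from the HTML via tr because they have no meaning in the title. The grep step then prints the shortest string enclosed in <title> and </title>: the lookbehind and lookahead keep the tags out of the output, and `.+?` is non-greedy. Since tr leaves the whole page on one line, `grep -m 1` would not limit the output to a single match (it counts matching lines, not matches), so `head -n 1` keeps only the first one. Note that -P needs a grep with PCRE support, such as GNU grep.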
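
As a rough sketch, the pipeline can be wrapped in a small function and checked against the three layouts from the question. The name extract_title and the inline samples are assumptions for illustration; the pattern assumes GNU grep and a plain lowercase <title> tag without attributes.

```shell
#!/bin/sh
# extract_title is a hypothetical helper name, not part of the original answer.
# Reads HTML on stdin and prints the text of the first <title> element.
extract_title() {
  # tr joins the page into one line; grep -o prints every match on its own line;
  # head -n 1 keeps only the first match (grep -m counts lines, not matches).
  tr -d '\n' \
    | grep -oP '(?<=<title>).+?(?=</title>)' \
    | head -n 1
}

# The three layouts from the question, as inline samples.
printf '<title>AllHeart Web INC, IT Services Digital Solutions Technology\n</title>\n' | extract_title
printf '<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>\n' | extract_title
printf '<title>\nAllHeart Web INC, IT Services Digital Solutions Technology\n</title>\n' | extract_title
```

Against a live page it would be used as `curl -s https://allheartweb.com/ | extract_title`. Because the newlines are deleted rather than replaced, a title whose text itself wraps across lines would have its words joined without a space.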

This is quite a simple approach, of course. xmllint would be better, but it is not available on all platforms by default.
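
For comparison, here is a minimal sketch of the xmllint route, assuming xmllint (from libxml2) is installed. The function name title_with_xmllint is hypothetical. The --html option selects the HTML parser, and the XPath function normalize-space trims the surrounding whitespace and line breaks and collapses internal runs of whitespace.

```shell
#!/bin/sh
# title_with_xmllint is a hypothetical helper name.
# Reads HTML on stdin ("-") and prints the whitespace-normalized title.
# Parser warnings about sloppy real-world HTML go to stderr, so they are discarded.
title_with_xmllint() {
  xmllint --html --xpath 'normalize-space(//title)' - 2>/dev/null
}

# Sample input using the third layout from the question.
printf '<html><head><title>\nAllHeart Web INC, IT Services Digital Solutions Technology\n</title></head></html>\n' | title_with_xmllint
```

With a live page this would be `curl -s https://allheartweb.com/ | title_with_xmllint`. Unlike the grep pattern, this does not depend on how the tag is laid out or capitalized.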
