I have an XML file like this:
<?xml version="1.0" encoding="utf-8"?><!--Generated by Screaming Frog SEO Spider 16.3-->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://orinab.com/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://orinab.com/cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%B4%D9%BE%D8%B2%D8%AE%D8%A7%D9%86%D9%87-%D8%A2%D9%85%D8%A7%D8%AF%D9%87-%D9%81%D9%84%D8%B2%DB%8C-%D8%AF%D8%B1%D8%A8-%DA%86%D9%88%D8%A8%DB%8C</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://orinab.com/sales-associates</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
...
I want to collect every link whose URL contains kitchen-cabinet into a list.
Any suggestions would be appreciated.
>Solution :
I am not that good with XML, but one option is a regular expression:
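import re
# Capture a URL containing "kitchen-cabinet" and stop before the next tag.
# [^<]* (instead of .*) keeps each match inside a single element, even if
# the whole XML sits on one line.
reg = re.compile(r'(https:[^<]*kitchen-cabinet[^<]*)(?=<)')
# xml is the file content as a string
reg.findall(xml)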
>> ['https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC']
# xml variable:
xml = '''
<url>
<loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
'''
reg.findall(xml)
>>> ['https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC',
'https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC',
'https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC']
Edit: to read the XML from a file instead of a string:
with open('file.xml', 'r', encoding='utf-8') as f:
    trim = reg.findall(f.read())
print(trim)
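If the regex turns out to be too fragile, an XML parser is another option. The following is a minimal sketch using the standard library's xml.etree.ElementTree, assuming the file follows the sitemap layout shown in the question (a urlset root in the sitemaps.org namespace, with one loc per url). The function name kitchen_cabinet_links, the needle parameter, and the shortened sample URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <urlset> root element of the sitemap.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def kitchen_cabinet_links(xml_text, needle="kitchen-cabinet"):
    """Return the text of every <loc> element whose URL contains `needle`."""
    root = ET.fromstring(xml_text)
    links = []
    # Elements live in the sitemap namespace, so the path needs the prefix.
    for loc in root.iterfind("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        if needle in url:
            links.append(url)
    return links


# Shortened, hypothetical sample in the same shape as the question's file.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://orinab.com/</loc></url>
<url><loc>https://orinab.com/sales-associates</loc></url>
<url><loc>https://orinab.com/kitchen-cabinet/example</loc></url>
</urlset>"""

print(kitchen_cabinet_links(sample))
```

To work from the file directly, ET.parse('file.xml').getroot() can replace the ET.fromstring call. Unlike the regex, this only looks at loc elements, so it does not depend on how the file is split into lines.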