Extracting and parsing information from a website using html-agility-pack

The code below loads the page at

https://www.ncbi.nlm.nih.gov/myncbi/1dAdNxivfiO5l/bibliography/public/

which contains a list of citations. My end goal is to extract that information into a list of JSON objects, one per citation.

The code finds every citation, but inside the loop this expression always returns the pmid of the first one:

citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value

It keeps printing 35491994, which is the pmid of the first citation found. Why is this happening? Shouldn't the value change for each node assigned to the citation variable?

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;
        
// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath

public class authorCitation
{
    public String pmid { get; set; }

}

public class processPubReferences{
    
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(@url);
        htmlDoc.OptionFixNestedTags = true;

        Console.WriteLine("getRawData>Data Type of htmlDoc is:");
        Console.WriteLine(htmlDoc.GetType());  

        return htmlDoc;
    }
    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");

        Console.WriteLine("getCitations>type of nodetree is ");
        Console.WriteLine(nodetree.GetType());
        return nodetree;
    }
}



class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);        
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;

        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node Info-----------------------------");
            Console.WriteLine(citation.InnerText);
            Console.WriteLine(citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value);
            
            Console.WriteLine("------------------------------End Node Info -----------------------------");

        }        
    }
}

Solution:

The problem is the XPath expression used to select the pmid attribute. An expression that begins with // is evaluated from the root of the document, even when SelectSingleNode is called on a citation node, so "//input[@class='citation-check']" always returns the first matching element in the whole page. To fix this, put a period (.) at the start of the expression so that it is evaluated relative to the current citation node. The revised line is:

Console.WriteLine(citation.SelectSingleNode(".//input[@class='citation-check']").Attributes["pmid"].Value);
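
To see the difference in isolation, here is a minimal sketch that runs both expressions against a small inline HTML string instead of the live page. The markup, the pmid values 111 and 222, and the names XPathContextDemo and CollectPmids are made up for illustration; only the citation and citation-check class names come from the question. It also guards against null results, since SelectNodes and SelectSingleNode return null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathContextDemo
{
    // Hypothetical markup that mimics the structure described in the question.
    private const string Html = @"
<div class='citation'><input class='citation-check' pmid='111' />First citation</div>
<div class='citation'><input class='citation-check' pmid='222' />Second citation</div>";

    // Runs the given XPath against each citation node and collects the pmid values.
    public static List<string> CollectPmids(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        var pmids = new List<string>();
        var citations = doc.DocumentNode.SelectNodes("//div[@class='citation']");
        if (citations == null)
        {
            return pmids; // no citation nodes matched
        }

        foreach (var citation in citations)
        {
            var input = citation.SelectSingleNode(xpath);
            if (input != null)
            {
                // GetAttributeValue falls back to the default when the attribute is missing.
                pmids.Add(input.GetAttributeValue("pmid", ""));
            }
        }
        return pmids;
    }
}
```

Calling CollectPmids("//input[@class='citation-check']") should yield 111 for both citations, because the search restarts at the document root each time, while CollectPmids(".//input[@class='citation-check']") should yield 111 and then 222.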