Extracting and parsing information from a website using html-agility-pack

The code below scrapes the page at

https://www.ncbi.nlm.nih.gov/myncbi/1dAdNxivfiO5l/bibliography/public/

which contains a list of citations. My end goal is to extract that information into a list of JSON objects, one per citation.

The code iterates over every citation, but the pmid value it prints with

citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value

is always 35491994, the pmid of the first citation found. Why is this happening? Shouldn't this value change for each node assigned to the citation variable?

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;
        
// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath

public class authorCitation
{
    public String pmid { get; set; }

}

public class processPubReferences{
    
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(@url);
        htmlDoc.OptionFixNestedTags = true;

        Console.WriteLine("getRawData>Data Type of htmlDoc is:");
        Console.WriteLine(htmlDoc.GetType());  

        return htmlDoc;
    }
    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");

        Console.WriteLine("getCitations>type of nodetree is ");
        Console.WriteLine(nodetree.GetType());
        return nodetree;
    }
}



class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);        
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;

        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node Info-----------------------------");
            Console.WriteLine(citation.InnerText);
            Console.WriteLine(citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value);
            
            Console.WriteLine("------------------------------End Node Info ------------------------------");

        }        
    }
}

>Solution :

The problem is that the XPath expression used to select the pmid attribute is absolute, not relative. An expression that begins with "//", such as "//input[@class='citation-check']", searches from the root of the HTML document regardless of which node SelectSingleNode is called on, so it always returns the first matching input on the page instead of the one inside the current citation node. To fix this, put a period (.) at the beginning of the XPath query to make it relative to the citation node. The revised code is as follows:

Console.WriteLine(citation.SelectSingleNode(".//input[@class='citation-check']").Attributes["pmid"].Value);
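
As a minimal sketch of how the relative query could feed the stated goal of a list of JSON objects, the example below runs against a small inline HTML string instead of the live page. The sample markup, the AuthorCitation and CitationDemo names, and the ExtractCitations helper are assumptions made for illustration; the real page's structure may differ. It uses HtmlAgilityPack's LoadHtml, SelectNodes, SelectSingleNode and GetAttributeValue, plus System.Text.Json for serialization.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using HtmlAgilityPack;

// Hypothetical DTO: one object per citation.
public class AuthorCitation
{
    public string Pmid { get; set; } = "";
}

public static class CitationDemo
{
    // Assumed sample markup that only mimics the shape described in the question.
    public const string SampleHtml = @"
<div class='citation'><input class='citation-check' pmid='111'>First citation</div>
<div class='citation'><input class='citation-check' pmid='222'>Second citation</div>";

    public static List<AuthorCitation> ExtractCitations(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<AuthorCitation>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='citation']");
        if (nodes == null) return result;

        foreach (var citation in nodes)
        {
            // The leading "." keeps the search inside the current citation node.
            var input = citation.SelectSingleNode(".//input[@class='citation-check']");
            if (input == null) continue;

            // GetAttributeValue falls back to "" if the attribute is missing.
            result.Add(new AuthorCitation { Pmid = input.GetAttributeValue("pmid", "") });
        }
        return result;
    }

    public static void Main()
    {
        var citations = ExtractCitations(SampleHtml);
        // Serialize the list as a JSON array, one object per citation.
        Console.WriteLine(JsonSerializer.Serialize(citations));
    }
}
```

The null checks are there because SelectNodes and SelectSingleNode return null when nothing matches, which would otherwise surface as a NullReferenceException on pages where a citation has no matching input.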
