Extracting and parsing information from a website using html-agility-pack

September 11, 2023

The code below loads the page at

https://www.ncbi.nlm.nih.gov/myncbi/1dAdNxivfiO5l/bibliography/public/

which contains a list of citations. My end goal is to extract that information into a list of JSON objects, one per citation.

The code does iterate over every citation, but the pmid value it prints, obtained with:

citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value

is always 35491994, which is the pmid of the first citation found. Why is this happening? Shouldn't this value change for each node assigned to the citation variable?

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;
        
// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath

public class authorCitation
{
    public String pmid { get; set; }

}

public class processPubReferences{
    
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(@url);
        htmlDoc.OptionFixNestedTags = true;

        Console.WriteLine("getRawData>Data Type of htmlDoc is:");
        Console.WriteLine(htmlDoc.GetType());  

        return htmlDoc;
    }
    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");

        Console.WriteLine("getCitations>type of nodetree is ");
        Console.WriteLine(nodetree.GetType());
        return nodetree;
    }
}



class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);        
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;

        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node Info-----------------------------");
            Console.WriteLine(citation.InnerText);
            Console.WriteLine(citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value);
            
            Console.WriteLine("------------------------------End Node Info -----------------------------");

        }        
    }
}

>Solution :

The problem is caused by the XPath expression used to select the pmid attribute. An expression that starts with // is absolute: "//input[@class='citation-check']" searches from the root of the HTML document, not from the current citation node, so SelectSingleNode returns the first matching input on the whole page on every iteration. To fix this, put a period (.) at the beginning of the XPath query to make it relative to the citation node. The revised code is as follows:

Console.WriteLine(citation.SelectSingleNode(".//input[@class='citation-check']").Attributes["pmid"].Value);
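
Since the end goal is a list of JSON objects, the same relative query can feed a small record type that is then serialized. The following is only a sketch under stated assumptions: the inline sample markup is made up to mimic the div.citation / input.citation-check structure used in the question (the real page may differ), CitationRecord and Parse are hypothetical names, and System.Text.Json is assumed to be available (.NET Core 3.0 or later, or the NuGet package).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using HtmlAgilityPack;

// Hypothetical record type: one instance per citation.
public class CitationRecord
{
    public string Pmid { get; set; } = "";
    public string Text { get; set; } = "";
}

public static class CitationDemo
{
    // Made-up markup that only mimics the structure the question relies on.
    public const string SampleHtml = @"
<div class='citation'><input class='citation-check' pmid='111'/>First paper</div>
<div class='citation'><input class='citation-check' pmid='222'/>Second paper</div>";

    public static List<CitationRecord> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<CitationRecord>();

        // Absolute query: find every citation block in the document.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='citation']");
        if (nodes == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (var citation in nodes)
        {
            // Relative query (leading dot): search only inside this citation.
            var input = citation.SelectSingleNode(".//input[@class='citation-check']");

            result.Add(new CitationRecord
            {
                // Fall back to an empty string if the input or attribute is missing.
                Pmid = input?.GetAttributeValue("pmid", "") ?? "",
                Text = citation.InnerText.Trim()
            });
        }

        return result;
    }

    public static void Main()
    {
        var records = Parse(SampleHtml);
        Console.WriteLine(JsonSerializer.Serialize(records));
    }
}
```

With the real page, the string passed to Parse would come from the document returned by getRawData (for example via htmlDoc.DocumentNode.OuterHtml), or Parse could be adapted to accept the HtmlDocument directly. The null checks matter because SelectNodes and SelectSingleNode return null rather than throwing when no node matches.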

html-agility-pack

by MR

Published September 11, 2023
