Extracting and parsing information from a website using html-agility-pack


The code below loads a web page that contains a list of citations; the URL is built in getRawData. My end goal is to extract that information into a list of JSON objects, one per citation.

The loop does visit each citation, but the pmid lookup inside it always returns the first pmid value: it keeps showing 35491994, which is the pmid of the first citation found. Why is this happening? Shouldn't this value change for each node assigned to the citation variable?

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;
// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath

public class authorCitation
{
    public String pmid { get; set; }
}

public class processPubReferences
{
    // Downloads the public bibliography page for the given NCBI id.
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(url);
        htmlDoc.OptionFixNestedTags = true;

        Console.WriteLine("getRawData>Data Type of htmlDoc is:");

        return htmlDoc;
    }

    // Selects every <div class="citation"> element in the document.
    // Note: SelectNodes returns null (not an empty collection) when nothing matches.
    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");

        Console.WriteLine("getCitations>type of nodetree is ");
        return nodetree;
    }
}

class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;

        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node Info-----------------------------");
            Console.WriteLine("------------------------------End Node Info -----------------------------");
        }
    }
}


>Solution :

The problem is caused by using an absolute XPath expression to select the pmid. An expression that begins with //, such as "//input[@class='citation-check']", searches from the root of the HTML document, not from the current citation node, so it always returns the first matching element on the page. To fix this, put a period (.) at the start of the query, ".//input[@class='citation-check']", which makes it relative to the citation node.
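
A minimal sketch of the difference, run against an inline HTML string instead of the live page. The sample markup, the class name PmidXPathDemo, the helper ExtractPmids, and the assumption that the pmid is stored in an attribute named "pmid" on the input element are all hypothetical; adjust them to match the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PmidXPathDemo
{
    // Hypothetical markup: the real page's structure and the "pmid"
    // attribute name are assumptions made for this illustration.
    public const string SampleHtml =
        "<html><body>" +
        "<div class='citation'><input class='citation-check' pmid='111' /></div>" +
        "<div class='citation'><input class='citation-check' pmid='222' /></div>" +
        "</body></html>";

    // Returns one pmid per citation div, using inputXPath for the inner lookup.
    public static List<string> ExtractPmids(string html, string inputXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pmids = new List<string>();
        var citations = doc.DocumentNode.SelectNodes("//div[@class='citation']");
        if (citations == null) return pmids; // SelectNodes returns null when nothing matches

        foreach (var citation in citations)
        {
            // "//input..." is evaluated from the document root;
            // ".//input..." is evaluated from the current citation node.
            var input = citation.SelectSingleNode(inputXPath);
            if (input != null)
                pmids.Add(input.GetAttributeValue("pmid", ""));
        }
        return pmids;
    }

    public static void Main()
    {
        // Absolute query: every iteration finds the first input in the document.
        Console.WriteLine(string.Join(",", ExtractPmids(SampleHtml, "//input[@class='citation-check']")));
        // Relative query: each iteration finds the input inside its own citation.
        Console.WriteLine(string.Join(",", ExtractPmids(SampleHtml, ".//input[@class='citation-check']")));
    }
}
```

With this sample markup, the absolute query yields 111 for both citations, while the relative query yields 111 and 222. In the original loop, each value could then be assigned to the pmid property of a new authorCitation object.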


