The code below scrapes the page at
https://www.ncbi.nlm.nih.gov/myncbi/1dAdNxivfiO5l/bibliography/public/
which contains a list of citations. My end goal is to extract that information into a list of JSON objects, one per citation, each holding the data for that citation.
The code does find each citation, but the pmid it prints for every one of them, obtained with
citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value
is always 35491994,
which is the pmid of the first citation on the page. Why is this happening? Shouldn't this value change for each node assigned to the citation variable?
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using System.Text;
using System.Xml.XPath;
// https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
// https://stackoverflow.com/questions/11017583/extract-content-from-div-class-div-tag-c-sharp-regex
// https://stackoverflow.com/questions/1289756/most-elegant-way-to-query-xml-string-using-xpath
public class authorCitation
{
    public String pmid { get; set; }
}
public class processPubReferences
{
    public HtmlDocument getRawData(String ncbiId)
    {
        String url = "https://www.ncbi.nlm.nih.gov/myncbi/" + ncbiId + "/bibliography/public/";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument htmlDoc = web.Load(@url);
        htmlDoc.OptionFixNestedTags = true;
        Console.WriteLine("getRawData>Data Type of htmlDoc is:");
        Console.WriteLine(htmlDoc.GetType());
        return htmlDoc;
    }
    public HtmlNodeCollection getCitations(HtmlDocument htmlDoc)
    {
        HtmlNodeCollection nodetree = htmlDoc.DocumentNode.SelectNodes("//div[@class='citation']");
        Console.WriteLine("getCitations>type of nodetree is ");
        Console.WriteLine(nodetree.GetType());
        return nodetree;
    }
}
class TestClass
{
    public static void Main(string[] args)
    {
        processPubReferences pr = new processPubReferences();
        String ncbiId = "1dAdNxivfiO5l";
        var htmlDoc = pr.getRawData(ncbiId);
        var citations = pr.getCitations(htmlDoc);
        // var pmidNode;
        foreach (var citation in citations)
        {
            Console.WriteLine("------------------------------Start Node INfo-----------------------------");
            Console.WriteLine(citation.InnerText);
            Console.WriteLine(citation.SelectSingleNode("//input[@class='citation-check']").Attributes["pmid"].Value);
            Console.WriteLine("------------------------------Ende Node Info -----------------------------");
        }
    }
}
>Solution :
The problem is that the XPath expression used to select the pmid attribute is absolute, not relative. An expression that begins with //, such as "//input[@class='citation-check']", is evaluated from the root of the document no matter which node you call SelectSingleNode on, so it always returns the first matching input element in the whole page rather than the one inside the current citation node. To fix this, put a period (.) at the beginning of the XPath query so that the search starts at the citation node. The revised line is:
Console.WriteLine(citation.SelectSingleNode(".//input[@class='citation-check']").Attributes["pmid"].Value);
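To see the difference in isolation, here is a minimal sketch that compares the two expressions on a small in-memory document. It is an illustration under stated assumptions, not the asker's program: it uses the built-in System.Xml classes instead of HtmlAgilityPack so that it runs without a NuGet package, and the sample markup, the pmid values 111/222/333, and the names XPathScopeDemo and CollectPmids are all hypothetical. It also guards against a citation that has no matching input or no pmid attribute, which the original line would turn into a NullReferenceException.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathScopeDemo
{
    // Hypothetical sample markup, loosely modelled on the page in the question.
    private const string Sample =
        "<root>" +
        "<div class='citation'><input class='citation-check' pmid='111'/></div>" +
        "<div class='citation'><input class='citation-check' pmid='222'/></div>" +
        "<div class='citation'><input class='citation-check' pmid='333'/></div>" +
        "</root>";

    // Evaluates inputXPath against each citation node and collects the pmid values found.
    public static List<string> CollectPmids(string inputXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);

        var pmids = new List<string>();
        foreach (XmlNode citation in doc.SelectNodes("//div[@class='citation']"))
        {
            XmlNode input = citation.SelectSingleNode(inputXPath);
            // Skip citations with no matching input element or no pmid attribute.
            var pmid = input?.Attributes?["pmid"]?.Value;
            if (pmid != null)
            {
                pmids.Add(pmid);
            }
        }
        return pmids;
    }

    public static void Main()
    {
        // Absolute expression: searches from the document root every time.
        Console.WriteLine(string.Join(",", CollectPmids("//input[@class='citation-check']")));   // prints 111,111,111
        // Relative expression: searches only below the current citation node.
        Console.WriteLine(string.Join(",", CollectPmids(".//input[@class='citation-check']")));  // prints 111,222,333
    }
}
```

With HtmlAgilityPack the same null check can be applied to the result of citation.SelectSingleNode(".//input[@class='citation-check']") before reading Attributes["pmid"], and each value can then be stored in an authorCitation object instead of being printed.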