Perl XML::LibXML Get data outside of a tag

This is a follow-up to my previous question (Perl XML::LibXML Getting info from specific nodes).

Given the following XML data, I cannot figure out how to get the data that appears after the <tab/> tag (which has no closing tag) without also getting all of the data from the child nodes within the section. See below for more specifics:

XML Sample:

<title number="3">
<catchline>Uniform Agricultural Cooperative Association Act</catchline>
<chapter number="3-1">
<catchline>
General Provisions Relating to Agricultural Cooperative Associations
</catchline>
<section number="3-1-1">
<histories>
<history>
Amended by Chapter
<modchap sess="2010GS">378</modchap>
, 2010 General Session
</history>
<modyear>2010</modyear>
</histories>
<catchline>Declaration of policy.</catchline>
<tab/>
It is the declared policy of this state, as one means of improving the economic position of agriculture, to encourage the organization of producers of agricultural products into effective associations under the control of such producers, and to that end this act shall be liberally construed. THIS IS THE DATA THAT I WANT TO GET
</section>
<section number="3-1-1.1">
<histories>
<history>
Amended by Chapter
<modchap sess="1996GS">79</modchap>
, 1996 General Session
</history>
<modyear>1996</modyear>
</histories>
<catchline>General corporation laws do not apply.</catchline>
<tab/>
<xref depth="1" refnumber="16-10a" start="0">
Title 16, Chapter 10a, Utah Revised Business Corporation Act
</xref>
, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections
<xref depth="3" refnumber="3-1-13.4" start="0">3-1-13.4</xref>
,
<xref depth="3" refnumber="3-1-13.7" start="0">3-1-13.7</xref>
, and
<xref depth="3" refnumber="3-1-16.1" start="0">3-1-16.1</xref>
.
</section>
</chapter>
</title>

Here is my current Perl script:

#!/usr/bin/perl -w


use XML::LibXML;


my $dom = XML::LibXML->load_xml(location => "file.xml");
my $titleName = $dom->findvalue('/title/catchline');
print "Title $titleName\n";

my @chapters = $dom->findnodes('/title/chapter');

for $chapter (@chapters) {
        my $chapterNo = $chapter->getAttribute('number');
        my $chapterName = $chapter->findvalue('catchline');
        print " Chapter #$chapterNo - $chapterName\n";

        my @sections = $chapter->findnodes('section');

        for $section (@sections) {
                my $sectionNo = $section->getAttribute('number');
                my $sectionName = $section->findvalue('catchline');
                my $sectionData = $section->textContent;
                print "  Section #$sectionNo - $sectionName\nSECDATA: $sectionData\n\n";

        }
}

This works, but it does exactly what I asked it to do: $sectionData ends up holding the text of everything inside the <section>.

What I am trying to do is get only the text that follows the <tab/> tag, without the contents of any other tags, such as the child elements <histories>, <history>, <xref>, etc.

So for instance, the string:

, does not apply to domestic or foreign corporations governed by this
chapter, except as specifically provided in Sections

is not contained within any particular tag. How do I get just that data?

The current output is:

Title Uniform Agricultural Cooperative Association Act
 Chapter #3-1 - 
General Provisions Relating to Agricultural Cooperative Associations

  Section #3-1-1 - Declaration of policy.
SECDATA: 


Amended by Chapter
378
, 2010 General Session

2010

Declaration of policy.

It is the declared policy of this state, as one means of improving the economic position of agriculture, to encourage the organization of producers of agricultural products into effective associations under the control of such producers, and to that end this act shall be liberally construed.


  Section #3-1-1.1 - General corporation laws do not apply.
SECDATA: 


Amended by Chapter
79
, 1996 General Session

1996

General corporation laws do not apply.


Title 16, Chapter 10a, Utah Revised Business Corporation Act

, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections
3-1-13.4
,
3-1-13.7
, and
3-1-16.1
.

But what I am looking for is something more like:

Title Uniform Agricultural Cooperative Association Act
 Chapter #3-1 - 
General Provisions Relating to Agricultural Cooperative Associations

  Section #3-1-1 - Declaration of policy.
SECDATA: 
It is the declared policy of this state, as one means of improving the economic position of agriculture, to encourage the organization of producers of agricultural products into effective associations under the control of such producers, and to that end this act shall be liberally construed.


  Section #3-1-1.1 - General corporation laws do not apply.
SECDATA: 
, does not apply to domestic or foreign corporations governed by this chapter, except as specifically provided in Sections

Solution:

If you wanted the text nodes that followed the tab element, you could use

my @post_tab_text_nodes = $section_node->findnodes('tab/following-sibling::text()');
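
As a rough illustration, here is a minimal, self-contained sketch of that XPath approach. It assumes the expression is evaluated relative to the section element, with a `tab/` step and the `following-sibling::` axis (note the double colon). The XML is a shortened, made-up variant of the second section from the question, and the whitespace cleanup at the end is only one possible choice.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened stand-in for the second <section> in the question (illustrative only).
my $xml = <<'XML';
<section number="3-1-1.1">
<catchline>General corporation laws do not apply.</catchline>
<tab/>
<xref depth="1" refnumber="16-10a" start="0">Title 16, Chapter 10a</xref>
, does not apply, except as provided in Sections
<xref depth="3" refnumber="3-1-13.4" start="0">3-1-13.4</xref>
.
</section>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my ($section_node) = $dom->findnodes('/section');

# Text nodes that are later siblings of <tab/>. Text inside <xref> is not
# selected, because those text nodes are children of <xref>, not siblings of <tab/>.
my @post_tab_text_nodes = $section_node->findnodes('tab/following-sibling::text()');

# Join the pieces, then collapse and trim whitespace for display.
my $text = join '', map { $_->nodeValue } @post_tab_text_nodes;
$text =~ s/\s+/ /g;
$text =~ s/^ | $//g;

print "$text\n";
```

The stray space before the final period comes from the line breaks in the source XML; whether to tidy it further depends on how the text will be used.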

But what you want is a lot more complicated than that.

use List::Util  qw( first );
use XML::LibXML qw( XML_ELEMENT_NODE );

my @child_nodes = $section_node->childNodes();

my $tab_node_idx =
   first {
      my $node = $child_nodes[$_];
      (  $node->nodeType() == XML_ELEMENT_NODE
      && !defined( $node->namespaceURI() )
      && $node->nodeName() eq 'tab'
      )
   }
      0..$#child_nodes;

my @post_tab_children =
   defined($tab_node_idx)
      ? @child_nodes[ $tab_node_idx + 1 .. $#child_nodes ]
      : ();

Rendering the resulting nodes as text is an exercise left to the user. You appear to have a mix of element nodes (XML_ELEMENT_NODE) and text nodes (XML_TEXT_NODE), which can be differentiated using $node->nodeType.
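
One possible way to do that rendering is sketched below. The helper name `post_tab_text` and its `$skip_elements` flag are hypothetical, not part of XML::LibXML, and the sample XML is a shortened, made-up variant of the question's second section. The sketch assumes you want either the full text after the tab element (including the text of elements such as xref) or only the section's own text nodes; other node types, such as comments, are ignored.

```perl
use strict;
use warnings;
use List::Util  qw( first );
use XML::LibXML qw( XML_ELEMENT_NODE XML_TEXT_NODE );

# Hypothetical helper: returns the text that follows <tab/> inside a section.
# With a true $skip_elements, only the section's own text nodes are kept and
# elements such as <xref> are dropped.
sub post_tab_text {
   my ($section_node, $skip_elements) = @_;

   my @child_nodes = $section_node->childNodes();

   # Index of the first un-namespaced <tab> child element, if any.
   my $tab_node_idx =
      first {
         my $node = $child_nodes[$_];
         (  $node->nodeType() == XML_ELEMENT_NODE
         && !defined( $node->namespaceURI() )
         && $node->nodeName() eq 'tab'
         )
      }
         0..$#child_nodes;

   return '' if !defined($tab_node_idx);

   my $text = '';
   for my $node (@child_nodes[ $tab_node_idx + 1 .. $#child_nodes ]) {
      my $type = $node->nodeType();
      if ($type == XML_TEXT_NODE) {
         $text .= $node->nodeValue();
      }
      elsif ($type == XML_ELEMENT_NODE && !$skip_elements) {
         $text .= $node->textContent();   # all text inside the element
      }
   }

   # Collapse and trim whitespace left over from the XML layout.
   $text =~ s/\s+/ /g;
   $text =~ s/^ | $//g;
   return $text;
}

# Shortened stand-in for the second <section> in the question (illustrative only).
my $dom = XML::LibXML->load_xml(string => <<'XML');
<section number="3-1-1.1">
<catchline>General corporation laws do not apply.</catchline>
<tab/>
<xref depth="1" refnumber="16-10a" start="0">Title 16, Chapter 10a</xref>
, does not apply, except as provided in Sections
<xref depth="3" refnumber="3-1-13.4" start="0">3-1-13.4</xref>
.
</section>
XML

my ($section_node) = $dom->findnodes('/section');
my $full = post_tab_text($section_node);      # text nodes plus xref text
my $own  = post_tab_text($section_node, 1);   # text nodes only

print "FULL: $full\n";
print "OWN:  $own\n";
```

Passing a true second argument gives output closer to what the question asks for, while the default keeps the cross-reference text in reading order.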
