XMLWalker.jl

After an XML file is loaded with XML.jl, all data is stored in a nested, tree-like structure of Nodes. Each node has a children field, which is a vector of child Nodes, as well as other fields such as attributes and value. XMLWalker.jl provides tools for traversing the XML tree using a set of matchers that check the content of nodes. Matchers can be created from strings and combined into chain strings, forming a sequence of matchers that finds nodes meeting multiple content restrictions. The function XMLWalker.find_nodes converts the input string to matchers, applies these matchers recursively, and returns a vector of nodes that satisfy the specified conditions.

Quick start

import Pkg
Pkg.add(url="https://github.com/Manarom/XMLWalker.jl.git")
import XMLWalker
import XML

xml_file = XML.read(xml_file_path, XML.Node) # xml_file_path (placeholder) is the path to the XML file; the data is loaded as a structure of nested nodes with `xml_file` as the root node
starting_node = xml_file # the search can start from `xml_file` or from any of its children
matched_nodes = XMLWalker.find_nodes(starting_node, "A/[B,C]/D") # finds nodes that match the "A"=>"B"=>"D" or the "A"=>"C"=>"D" route

Chain strings specification

Simple chain by tags

A single string passed to the XMLWalker.find_nodes function without the token separator symbol "/", such as "node1_tag", represents a single token. By default, this token is matched against the tag field of the node, so XMLWalker.find_nodes(starting_node,"node1_tag") finds all nodes (if any) whose tag matches the pattern specified by that string. A different field name can be provided as an optional argument after the string.

XMLWalker.find_nodes(starting_node,"ABB",:attributes) # returns nodes whose `attributes` field has the value "ABB"

The symbol "/" is used as a token separator, indicating that all substrings separated by this symbol should be interpreted as a sequence of matchings. Each token in the string corresponds to a node in the XML tree. The function XMLWalker.find_nodes uses these chain-strings to create matchers that are checked sequentially, searching for nodes that fit the entire sequence. For the chain-string "A/B/C", XMLWalker.find_nodes returns nodes reachable by following the path with tags "A" → "B" → "C".

XMLWalker.find_nodes(starting_node,"A/B/C") # returns a vector of nodes that can be reached by following the "A"->"B"->"C" chain (starting_node must have the field :tag = "A")
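
To make the chain semantics described above concrete, here is a minimal sketch on a toy tree. It is not the package's implementation: `ToyNode` and `follow_chain` are hypothetical names introduced only for illustration, and real XML.jl nodes carry more fields than a tag and children.

```julia
# Hypothetical toy node type (illustration only; not XML.jl's Node).
struct ToyNode
    tag::String
    children::Vector{ToyNode}
end
ToyNode(tag::String) = ToyNode(tag, ToyNode[])

# Follow a chain of tags: the node itself must match the first token,
# one of its children the second token, and so on down the chain.
function follow_chain(node::ToyNode, tokens::Vector{<:AbstractString})
    isempty(tokens) && return ToyNode[]
    node.tag == tokens[1] || return ToyNode[]
    length(tokens) == 1 && return [node]
    found = ToyNode[]
    for child in node.children
        append!(found, follow_chain(child, tokens[2:end]))
    end
    return found
end

# "/" separates the tokens of a chain string.
tokens = split("A/B/C", "/")

tree = ToyNode("A", [
    ToyNode("B", [ToyNode("C"), ToyNode("D")]),
    ToyNode("X", [ToyNode("C")]),
])

matched = follow_chain(tree, tokens)  # only the "C" node below "B" is reached
```

The "C" node under "X" is not returned, because the middle token of the chain requires the tag "B".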

Single token specification

The following sections describe the syntax of a single token. Everything described here also applies to each token of a chain string.

Additional fields values check

When a tag is followed by a dot, as in "tag_name.field_name = field_value", field_name is interpreted as the name of the node's field, and the text after the = (i.e., field_value) is treated as the value of that field. For example, "A.value = ABC" represents a node with the tag field equal to "A" and the value field equal to "ABC". By default, the value is parsed as a string, but if it can be interpreted as a number, it is parsed as a number: "tag_name.field_name = 25.4" means that the field_name of the node tagged tag_name has the value 25.4 (Float64). The annotation ::text can be added to force the value to be parsed as a string, so "tag_name.field_name = 25.4::text" searches for a field_name with the string value "25.4" (String).

XMLWalker.find_nodes(starting_node,"B.value = 356::text") # searches for nodes with the tag "B" whose `value` field equals the string "356"
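
The number-versus-string rule can be sketched in plain Julia. The helper `parse_token_value` below is hypothetical and only mirrors the behaviour described above; how XMLWalker.jl actually parses values is not shown in this document.

```julia
# Hypothetical helper: interpret the right-hand side of "field_name = value".
function parse_token_value(raw::AbstractString)
    s = strip(raw)
    if endswith(s, "::text")
        # The ::text annotation forces a string, even for numeric-looking input.
        return String(strip(chop(s, tail = length("::text"))))
    end
    number = tryparse(Float64, s)            # `nothing` if s is not a number
    return number === nothing ? String(s) : number
end

as_number = parse_token_value("25.4")        # 25.4 (Float64)
as_text   = parse_token_value("25.4::text")  # "25.4" (String)
as_string = parse_token_value("ABC")         # "ABC" (String)
```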

Nodes with dictionary fields

Node fields can also be of the AbstractDict type, which allows searching for specific keys or key-value pairs. The field name is followed by "(...)", as in "tag_name.field_name([key1,key2,key3])". This pattern matches nodes where the field_name dictionary of a tag_name node contains any of the keys "key1", "key2", or "key3". All keys must be enclosed in either "[]" or "{}": "[]" means that any of the keys may be present, and "{}" means that all keys must be present. For example, "A.attributes({p1,p2})" refers to a node with the tag "A" whose attributes field contains both the "p1" and "p2" keys.

It is also possible to search for nodes with specific key-value pairs within the field-dictionary. The syntax is similar to the key search pattern, but each key is followed by an equals sign (=). For instance, "tag_name.field_name({key1=value1,key2=value2})" matches a node with tag "tag_name" whose field_name dictionary contains both "key1"=>"value1" and "key2"=>"value2".

XMLWalker.find_nodes(starting_node,"B.value([p1,p2])") # searches for nodes with the tag "B" whose `value` field contains either of the keys "p1" or "p2"
XMLWalker.find_nodes(starting_node,"B.value({p1=2,p2=20})") # searches for nodes with the tag "B" whose `value` field contains both key-value pairs "p1" => 2.0 and "p2" => 20.0
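
The difference between the "[]" (any) and "{}" (all) forms can be illustrated on an ordinary Julia dictionary. This is only a sketch of the semantics described above: the helper names and the sample `attributes` dictionary are hypothetical and are not part of XMLWalker.jl.

```julia
# Hypothetical attributes dictionary of some node.
attributes = Dict("p1" => "value1", "p3" => "value3")

# "[k1,k2]": at least one of the listed keys is present.
has_any_key(dict, keys_to_find) = any(k -> haskey(dict, k), keys_to_find)

# "{k1,k2}": every listed key is present.
has_all_keys(dict, keys_to_find) = all(k -> haskey(dict, k), keys_to_find)

# "{k1=v1,k2=v2}": every listed key is present with the listed value.
has_all_pairs(dict, pairs_to_find) =
    all(p -> get(dict, first(p), nothing) == last(p), pairs_to_find)

any_match  = has_any_key(attributes, ["p1", "p2"])    # true: "p1" is present
all_match  = has_all_keys(attributes, ["p1", "p2"])   # false: "p2" is missing
pair_match = has_all_pairs(attributes, ["p1" => "value1", "p3" => "value3"])  # true
```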

Special Symbols

All special symbols in this section apply to tags, field values, and field dictionary keys.

`"[...]"` (any) and `"{...}"` (all) Patterns

To find nodes within a specific set of tags, for example "tg1", "tg2", or "tg3", these tag names must be enclosed in square brackets "[]". This pattern returns true if any of the enclosed tags match. For example, "[tg1,tg2,tg3]" searches for nodes with any of the tags "tg1", "tg2", or "tg3".

Similarly, this pattern can be used to match field values, such as "A.field_name = [a,b]", or to match field dictionary keys like "A.field_name([k1,k2])".

XMLWalker.find_nodes(starting_node,"[A,B].field1 = [ab,bc]") # finds nodes with the tag "A" or "B" whose field1 value is "ab" or "bc"

If all patterns need to be matched simultaneously, the {} block can be used. This block is especially useful for specifying field values. For example, "[A,B].field_name({a = 1, b = 2})" will match nodes tagged as "A" or "B" that contain both key-value pairs "a" => 1.0 and "b" => 2.0 in their field_name field.

`*` (always match), `*...` (contains) and `...::regex` (regular expression) patterns

To skip a pattern node in a search tree, the * symbol, which represents an always-match wildcard, can be used. For example, "*.field_name = c" matches nodes with any tag and a field_name value of "c". Hence, the "*.tag = A" token is equivalent to just "A".

When the * symbol appears anywhere within a string, it indicates a partial match (i.e., the pattern is contained within the string). For instance, "*Prop" matches nodes with tags containing "Prop", such as "BulkProp" or "PropertyOne". This rule also applies to field values and dictionary keys. However, partial match patterns using * are not supported for key-value pair matching. If any key in a set of key-value pairs contains *, all values in that set are ignored. For example, "A.field_name([*abb=1,b=2])" behaves the same as "A.field_name([*abb,b])". Neither field names nor values in key-value pairs can contain the * symbol; thus, "node_tag.*partial_name = c" and "A.field_name([a=*b , b=2])" are not supported, but "node_tag.field_name = *ca" is allowed.

If a field value or key is marked with "::regex", it is interpreted as a regular expression: a pattern marked this way matches when Julia's match(::Regex,str) returns something other than nothing. To use a regular expression in a tag search, it must be enclosed in {} or []. For instance, the token that matches nodes whose tag field contains digits and whose attributes dictionary has the key id is "[ [\\d]::regex ].attributes([id])".
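
The three pattern kinds of this section can be compared side by side in plain Julia. The function `matches_pattern` below is a hypothetical sketch of the rules as described here (always match, contains, regular expression); it is not taken from XMLWalker.jl and ignores brackets and field syntax.

```julia
# Hypothetical matcher for a single pattern string against a single text.
function matches_pattern(pattern::AbstractString, text::AbstractString)
    pattern == "*" && return true                 # "*" alone always matches
    if endswith(pattern, "::regex")
        # Regular expression: matches when `match` returns something other than `nothing`.
        rx = Regex(chop(pattern, tail = length("::regex")))
        return match(rx, text) !== nothing
    end
    if occursin('*', pattern)
        # "*" inside a string: the remaining text must be contained in `text`.
        return occursin(replace(pattern, "*" => ""), text)
    end
    return pattern == text                        # plain pattern: exact match
end

always   = matches_pattern("*", "AnyTag")               # true
contains = matches_pattern("*Prop", "BulkProp")         # true
missing_ = matches_pattern("*Prop", "Other")            # false
digits   = matches_pattern("[\\d]::regex", "node1")     # true: the tag contains a digit
```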