Please note: This is only a first introductory example. The examples section will be extended in the near future.
As an example, let’s load an XML file exported from the database of the United Nations Statistics Division. This step may take some time because the XML file is quite large, so be patient when you reproduce these results in R. The dataset is available at the URL used in the call below.
library(flatxml)
xml.dataframe <- fxml_importXMLFlat("https://www.zuckarelli.de/flatxml/worldpopulation.xml")
## No encoding supplied: defaulting to UTF-8.
fxml_importXMLFlat returns a dataframe. This dataframe represents the hierarchical structure of the XML document. All other functions from the flatxml package work with this kind of representation of the XML document.
head(xml.dataframe, 20)
##     elem. elemid. attr.          value. level1 level2 level3 level4
##  1   ROOT       1  <NA>            <NA>   ROOT   <NA>   <NA>   <NA>
##  2   data       2  <NA>            <NA>   ROOT   data   <NA>   <NA>
##  3 record       3  <NA>            <NA>   ROOT   data record   <NA>
##  4  field       4  <NA>     Afghanistan   ROOT   data record  field
##  5  field       4  name Country or Area   ROOT   data record  field
##  6  field       5  <NA>            2016   ROOT   data record  field
##  7  field       5  name            Year   ROOT   data record  field
##  8  field       6  <NA>      Both Sexes   ROOT   data record  field
##  9  field       6  name             Sex   ROOT   data record  field
## 10  field       7  <NA>        27657145   ROOT   data record  field
## 11  field       7  name           Value   ROOT   data record  field
## 12  field       8  <NA>               1   ROOT   data record  field
## 13  field       8  name Value Footnotes   ROOT   data record  field
## 14 record       9  <NA>            <NA>   ROOT   data record   <NA>
## 15  field      10  <NA>     Afghanistan   ROOT   data record  field
## 16  field      10  name Country or Area   ROOT   data record  field
## 17  field      11  <NA>            2016   ROOT   data record  field
## 18  field      11  name            Year   ROOT   data record  field
## 19  field      12  <NA>            Male   ROOT   data record  field
## 20  field      12  name             Sex   ROOT   data record  field
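To make this layout concrete, the following sketch rebuilds rows 3 to 7 of the output by hand in base R and looks up the value and the name attribute of element 4. No flatxml function is involved; the object names flat, elem4.value and elem4.name are purely illustrative, and only the column names are taken from the output above.

```r
# Hand-built copy of rows 3-7 of the flat representation (illustration only)
flat <- data.frame(
  elem.   = c("record", "field", "field", "field", "field"),
  elemid. = c(3, 4, 4, 5, 5),
  attr.   = c(NA, NA, "name", NA, "name"),
  value.  = c(NA, "Afghanistan", "Country or Area", "2016", "Year"),
  stringsAsFactors = FALSE
)

# The element's own value sits in the row where attr. is NA ...
elem4.value <- flat$value.[flat$elemid. == 4 & is.na(flat$attr.)]

# ... while each attribute gets its own row with the same elemid.
elem4.name <- flat$value.[flat$elemid. == 4 & !is.na(flat$attr.) & flat$attr. == "name"]

print(elem4.value)
print(elem4.name)
```

In other words, one XML element can occupy several rows: one for its value and one more for each of its attributes, all sharing the same elemid.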
Now, let’s extract the data from the flat representation (which still contains the hierarchical information of the original document) into a dataframe to which all statistical functions can be applied. This is done with the fxml_toDataFrame function.
As the output above shows, XML element no. 3 is on the level where the actual data is located. We therefore set siblings.of=3, which means that the elements carrying the data are siblings of this element. The field names are given as attributes in our example XML file, so we choose elem.or.attr="attr". The name of the attribute that contains the field names is name, so we have col.attr="name". Finally, we want to exclude all the footnotes from our dataframe. To accomplish this, we drop those fields with exclude.fields=c("Value Footnotes").
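The roles of these arguments can be illustrated with a hand-rolled base R sketch. This is not how fxml_toDataFrame is implemented; it only mimics the idea on a small hand-built flat table with two simplified records: the name attribute of each field supplies the column name, the field's value supplies the cell, and excluded fields are dropped. All object names (flat, record.no, cells, result) are hypothetical, and the sketch finds the records via elem. == "record" rather than via an element ID as siblings.of does.

```r
# Hand-built flat table: two simplified records with three fields each (illustration only)
flat <- data.frame(
  elem.   = c("record", rep("field", 6), "record", rep("field", 6)),
  elemid. = c(3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 9, 10, 10),
  attr.   = c(NA, rep(c(NA, "name"), 3), NA, rep(c(NA, "name"), 3)),
  value.  = c(NA, "Afghanistan", "Country or Area", "Both Sexes", "Sex", "1", "Value Footnotes",
              NA, "Afghanistan", "Country or Area", "Male", "Sex", "1", "Value Footnotes"),
  stringsAsFactors = FALSE
)

# Rows are in document order, so every "record" row starts a new data row
record.no <- cumsum(flat$elem. == "record")

is.field <- flat$elem. == "field"
is.value <- is.field & is.na(flat$attr.)                          # rows holding the cell values
is.name  <- is.field & !is.na(flat$attr.) & flat$attr. == "name"  # rows holding the column names

# Pair each value row with the name row of the same element
cells <- data.frame(
  record = record.no[is.value],
  column = flat$value.[is.name][match(flat$elemid.[is.value], flat$elemid.[is.name])],
  value  = flat$value.[is.value],
  stringsAsFactors = FALSE
)

# Drop the excluded fields, then build one named row per record
cells <- cells[!(cells$column %in% c("Value Footnotes")), ]
rows <- lapply(split(cells, cells$record), function(rec) setNames(rec$value, rec$column))
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(result)
```

The sketch yields one row per record with the columns Country or Area and Sex, which is the same shape that the call to fxml_toDataFrame produces for the full file.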
With this preparation, we are ready to go:
population.df <- fxml_toDataFrame(xml.dataframe, siblings.of=3, elem.or.attr="attr", col.attr="name", exclude.fields=c("Value Footnotes"))
The result looks like this:
head(population.df, 20)
##    Country or Area Year        Sex    Value
##  1     Afghanistan 2016 Both Sexes 27657145
##  2     Afghanistan 2016       Male 14149838
##  3     Afghanistan 2016     Female 13507307
##  4   Åland Islands 2016 Both Sexes    29099
##  5   Åland Islands 2016       Male    14526
##  6   Åland Islands 2016     Female    14573
##  7         Albania 2016 Both Sexes  2886026
##  8         Albania 2016 Both Sexes  2876101
##  9         Albania 2016       Male  1456000
## 10         Albania 2016     Female  1420101
## 11         Algeria 2016 Both Sexes 40835602
## 12         Algeria 2016       Male 20680271
## 13         Algeria 2016     Female 20155331
## 14         Andorra 2016 Both Sexes    72358
## 15         Andorra 2016 Both Sexes    71732
## 16         Andorra 2016       Male    36229
## 17         Andorra 2016       Male    36590
## 18         Andorra 2016     Female    35768
## 19         Andorra 2016     Female    35503
## 20       Argentina 2016 Both Sexes 43590368
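Before applying numeric functions, it is worth checking the column types with str(population.df). The sketch below assumes that the columns arrive as character strings, which the output above does not confirm. It therefore works on a hand-typed copy of rows 1 to 3 and 11 to 13 (called pop here, a hypothetical name) rather than on the real object.

```r
# Hand-typed copy of rows 1-3 and 11-13 of the output above;
# check.names = FALSE keeps the column name that contains spaces
pop <- data.frame(
  "Country or Area" = rep(c("Afghanistan", "Algeria"), each = 3),
  Year  = "2016",
  Sex   = rep(c("Both Sexes", "Male", "Female"), times = 2),
  Value = c("27657145", "14149838", "13507307", "40835602", "20680271", "20155331"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Convert the character columns that hold numbers
pop$Value <- as.numeric(pop$Value)
pop$Year  <- as.integer(pop$Year)

# Sum the male and female counts per country
by.sex <- pop[pop$Sex != "Both Sexes", ]
totals <- tapply(by.sex$Value, by.sex[["Country or Area"]], sum)
print(totals)
```

For these two countries the sums match the corresponding "Both Sexes" rows (27657145 and 40835602).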