Please note: This is only a first introductory example. The examples section will be extended in the near future.

As an example, let’s load an XML file exported from the database of the United Nations Statistics Division. The XML file is quite large, so this step may take some time when you reproduce these results in R. The dataset can be downloaded from the URL used in the call below.

library(flatxml)
xml.dataframe <- fxml_importXMLFlat("https://www.zuckarelli.de/flatxml/worldpopulation.xml")
## No encoding supplied: defaulting to UTF-8.

fxml_importXMLFlat returns a dataframe. This dataframe represents the hierarchical structure of the XML document. All other functions from the flatxml package work with this kind of representation of the XML document.

head(xml.dataframe, 20)
##     elem. elemid. attr.          value. level1 level2 level3 level4
## 1    ROOT       1  <NA>            <NA>   ROOT   <NA>   <NA>   <NA>
## 2    data       2  <NA>            <NA>   ROOT   data   <NA>   <NA>
## 3  record       3  <NA>            <NA>   ROOT   data record   <NA>
## 4   field       4  <NA>     Afghanistan   ROOT   data record  field
## 5   field       4  name Country or Area   ROOT   data record  field
## 6   field       5  <NA>            2016   ROOT   data record  field
## 7   field       5  name            Year   ROOT   data record  field
## 8   field       6  <NA>      Both Sexes   ROOT   data record  field
## 9   field       6  name             Sex   ROOT   data record  field
## 10  field       7  <NA>        27657145   ROOT   data record  field
## 11  field       7  name           Value   ROOT   data record  field
## 12  field       8  <NA>               1   ROOT   data record  field
## 13  field       8  name Value Footnotes   ROOT   data record  field
## 14 record       9  <NA>            <NA>   ROOT   data record   <NA>
## 15  field      10  <NA>     Afghanistan   ROOT   data record  field
## 16  field      10  name Country or Area   ROOT   data record  field
## 17  field      11  <NA>            2016   ROOT   data record  field
## 18  field      11  name            Year   ROOT   data record  field
## 19  field      12  <NA>            Male   ROOT   data record  field
## 20  field      12  name             Sex   ROOT   data record  field
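
To get a feeling for this layout, it can help to query it with plain base R. The following is only a sketch: `flat` is a hand-typed copy of a few of the rows printed above (not the imported file), and the variable names are chosen for illustration. It relies on what the printed output suggests, namely that an element takes one row for its own value and one further row per attribute.

```r
# Hand-typed copy of some rows from the output above (illustrative only)
flat <- data.frame(
  elem.   = c("ROOT", "data", "record", "field", "field",
              "field", "field", "record", "field", "field"),
  elemid. = c(1, 2, 3, 4, 4, 5, 5, 9, 10, 10),
  attr.   = c(NA, NA, NA, NA, "name", NA, "name", NA, NA, "name"),
  value.  = c(NA, NA, NA, "Afghanistan", "Country or Area",
              "2016", "Year", NA, "Afghanistan", "Country or Area"),
  stringsAsFactors = FALSE
)

# Number of rows each element ID occupies in the flat representation
rows.per.element <- table(flat$elemid.)

# IDs of all <record> elements in this excerpt
record.ids <- unique(flat$elemid.[flat$elem. == "record"])

# Rows that describe an attribute rather than the element's own value
attr.rows <- flat[!is.na(flat$attr.), ]
```

In this excerpt `record.ids` contains the IDs 3 and 9, the two record elements visible in the printed rows.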

Now, let’s extract the data from the flat representation (which still contains the hierarchical information of the original document) into a dataframe to which all statistical functions can be applied. This is done with the fxml_toDataFrame function.

As the output above shows, XML element no. 3 is on the level where the actual data is located. We therefore set siblings.of=3, which means that the elements carrying the data are element no. 3 and its siblings. The field names are given as attributes in our example XML file, so we choose elem.or.attr="attr". The name of the attribute that contains the field names is name, so we set col.attr="name". Finally, we want to exclude all the footnotes from our dataframe. To accomplish this, we exclude those fields with exclude.fields=c("Value Footnotes").
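
What these arguments amount to can be imitated in base R on a hand-typed miniature of the flat dataframe. This is a rough sketch of the reshaping idea, not how flatxml implements it. The `record` column is a hypothetical helper added here for illustration (the real flat dataframe encodes the parent through its level columns), and `fields`, `cells`, `labels`, `long` and `wide` are illustrative names.

```r
# Field rows copied by hand from the printed output; `record` is a
# hypothetical helper column holding the ID of the enclosing <record>
fields <- data.frame(
  record  = c(3, 3, 3, 3, 3, 3, 9, 9, 9, 9),
  elemid. = c(4, 4, 7, 7, 8, 8, 10, 10, 12, 12),
  attr.   = c(NA, "name", NA, "name", NA, "name", NA, "name", NA, "name"),
  value.  = c("Afghanistan", "Country or Area", "27657145", "Value",
              "1", "Value Footnotes",
              "Afghanistan", "Country or Area", "Male", "Sex"),
  stringsAsFactors = FALSE
)

# Element rows carry the cell contents ...
cells <- fields[is.na(fields$attr.), c("record", "elemid.", "value.")]

# ... and the "name" attribute rows carry the column names (cf. col.attr="name")
labels <- fields[!is.na(fields$attr.) & fields$attr. == "name",
                 c("elemid.", "value.")]
names(labels)[2] <- "column"

# Pair every cell with its column name via the shared element ID
long <- merge(cells, labels, by = "elemid.")

# Drop unwanted fields (cf. exclude.fields)
long <- long[!(long$column %in% c("Value Footnotes")), ]

# One row per record, one column per remaining field name
wide <- tapply(long$value., list(long$record, long$column), function(v) v[1])
```

Here `wide` is a character matrix with one row per record. Cells for fields that were left out of the miniature, such as the Sex field of record 3, are NA.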

With this preparation, we are ready to go:

population.df <- fxml_toDataFrame(xml.dataframe, siblings.of=3, elem.or.attr="attr",
                                  col.attr="name", exclude.fields=c("Value Footnotes"))

The result looks like this:

head(population.df, 20)
##    Country or Area Year        Sex    Value
## 1      Afghanistan 2016 Both Sexes 27657145
## 2      Afghanistan 2016       Male 14149838
## 3      Afghanistan 2016     Female 13507307
## 4    Åland Islands 2016 Both Sexes    29099
## 5    Åland Islands 2016       Male    14526
## 6    Åland Islands 2016     Female    14573
## 7          Albania 2016 Both Sexes  2886026
## 8          Albania 2016 Both Sexes  2876101
## 9          Albania 2016       Male  1456000
## 10         Albania 2016     Female  1420101
## 11         Algeria 2016 Both Sexes 40835602
## 12         Algeria 2016       Male 20680271
## 13         Algeria 2016     Female 20155331
## 14         Andorra 2016 Both Sexes    72358
## 15         Andorra 2016 Both Sexes    71732
## 16         Andorra 2016       Male    36229
## 17         Andorra 2016       Male    36590
## 18         Andorra 2016     Female    35768
## 19         Andorra 2016     Female    35503
## 20       Argentina 2016 Both Sexes 43590368
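
Before running statistics on such a result, two checks are usually worthwhile: whether the numbers are stored as numbers, and whether a combination of country, year and sex occurs more than once (as it does for Albania and Andorra above). The sketch below works on a hand-typed copy of some of the printed rows, named `population.head` for illustration. It assumes the columns arrive as character strings, which should be verified with str(population.df) on the real data.

```r
# Hand-typed copy of rows 1-3 and 7-10 of the output above; storing the
# columns as character is an assumption, not something the output confirms
population.head <- data.frame(
  `Country or Area` = c(rep("Afghanistan", 3), rep("Albania", 4)),
  Year  = rep("2016", 7),
  Sex   = c("Both Sexes", "Male", "Female",
            "Both Sexes", "Both Sexes", "Male", "Female"),
  Value = c("27657145", "14149838", "13507307",
            "2886026", "2876101", "1456000", "1420101"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Convert the population counts to numbers
population.head$Value <- as.numeric(population.head$Value)

# Flag every row whose country/year/sex combination occurs more than once
key <- population.head[, c("Country or Area", "Year", "Sex")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
duplicates <- population.head[dup, ]

# Consistency check: do Male and Female add up to a "Both Sexes" figure?
afg <- population.head[population.head$`Country or Area` == "Afghanistan", ]
afg.consistent <- sum(afg$Value[afg$Sex != "Both Sexes"]) ==
  afg$Value[afg$Sex == "Both Sexes"]
```

In this excerpt `duplicates` holds the two "Both Sexes" rows for Albania. Which of such competing figures to keep depends on the analysis and possibly on the footnotes that were excluded earlier.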