This document shows how to extract a dataset from an HTML page.

We’ll start by loading two libraries. RCurl downloads the HTML page. XML parses it, since HTML can be treated as a loose form of XML.

library(RCurl)
## Loading required package: bitops
library(XML)

Let R know where to find the HTML page. Then download and parse it.

theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"


webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
doc <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
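To experiment with the parsing step without a network connection, the same parser can be pointed at a string. This is only a minimal sketch: the HTML snippet and the names html_text, doc_demo, rows_demo and second_row are made up for illustration, and the real page is far messier.

```r
library(XML)

# A tiny, made-up HTML page standing in for the downloaded one.
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>No.</th><th>Set</th></tr>",
  "<tr><td>1</td><td>Base Set</td></tr>",
  "<tr><td>2</td><td>Jungle</td></tr>",
  "</table></body></html>"
)

# asText = TRUE says the first argument is HTML content, not a file name.
doc_demo <- htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE)

# Select every table row, then read the text of the cells in the second row.
rows_demo <- getNodeSet(doc_demo, "//tr")
second_row <- xpathSApply(rows_demo[[2]], "./td", xmlValue)

length(rows_demo)
second_row
```

Working on a small string like this makes it easier to see what each node contains before running the same calls on the full page.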

Use XPath to extract all tr (table row) nodes from the HTML page. Some of those tr nodes hold extraneous information, so we’ll trim the list from 70 elements to 67 by dropping the first two rows and the last one.

tr <- getNodeSet(doc, "//*/tr")
tr_with_pokemon_sets <- tr[(4:length(tr)) - 1]  # same rows as tr[3:(length(tr) - 1)]
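The index expression is easy to misread, because in R the colon operator binds more tightly than minus: 4:length(tr) - 1 means (4:length(tr)) - 1, not 4:(length(tr) - 1). A small sketch with a stand-in length of 70 (n and the idx_ names are made up for illustration):

```r
n <- 70                          # stand-in for length(tr)

idx_as_written <- 4:n - 1        # parsed as (4:n) - 1, i.e. 3, 4, ..., 69
idx_explicit   <- 3:(n - 1)      # the same indices, written explicitly
idx_misread    <- 4:(n - 1)      # what a reader might assume was meant

all(idx_as_written == idx_explicit)
length(idx_as_written)           # 67 rows kept
length(idx_misread)              # 66 rows, one fewer
```

So the filter keeps elements 3 through 69, which matches the 70-to-67 reduction described above.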

Let’s look at one example of the HTML. It holds information about one Pokémon set. If your output shows pound signs at the start of the lines, they are not part of the data; they are just part of the printing.

tr_with_pokemon_sets[1]
[[1]]
<tr><th> 1
</th>
<td> 1
</td>
<td>
</td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a>
</td>
<td> Expansion Pack
</td>
<td> 102
</td>
<td> 102
</td>
<td> January 9, 1999
</td>
<td> October 20, 1996
</td></tr> 

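To make sense of that HTML, we’ll use a custom function to process each element in tr_with_pokemon_sets. Generally speaking, the function strips the newlines and HTML markup, reads the remaining text as tab-separated values, names the columns, and keeps three of them. It also defines a vector of column types (cc), although that vector is never passed to read.table.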

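xmlToCsv <- function(xml) {
    # Take the row's text and replace each pair of newlines with a tab
    a <- gsub('\n\n','\t', xmlValue(xml))
    # Put a space between adjacent tabs, then collapse any pairs that remain
    b <- gsub('\t\t','\t \t', a)
    d <- gsub('\t\t','\t', b)
    # Strip a leading space and a trailing tab, then the space after each tab
    e <- gsub('^ |\t$','', d)
    f <- gsub('\t ','\t', e)
    # Note: cc is defined but not passed to read.table, so types are inferred
    cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
    cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
    g <- read.table(text=f, sep="\t", header=FALSE)
    colnames(g) <- cn
    # Keep only the English set number, set name, and card count
    keeps <- c("EngNumber", "EngSet", "EngCardCount")
    return(g[keeps])
}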
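To see what the chain of substitutions does, it can be run by hand on a short string. The input below is made up and only resembles the text xmlValue returns for a row (the real text has nine cells); s, fields and g_demo are illustrative names.

```r
# Made-up row text: cells separated by blank lines, one empty cell in the middle.
s <- " 1\n\n 1\n\n\n\n Base Set\n\n 102\n\n"

a <- gsub('\n\n', '\t', s)      # " 1\t 1\t\t Base Set\t 102\t"
b <- gsub('\t\t', '\t \t', a)   # " 1\t 1\t \t Base Set\t 102\t"
d <- gsub('\t\t', '\t', b)      # unchanged here: no adjacent tabs are left
e <- gsub('^ |\t$', '', d)      # "1\t 1\t \t Base Set\t 102"
f <- gsub('\t ', '\t', e)       # "1\t1\t\tBase Set\t102"

# Split on tabs to inspect the fields; the empty cell survives as "".
fields <- strsplit(f, "\t")[[1]]
fields

# read.table turns the same string into a one-row data frame.
g_demo <- read.table(text = f, sep = "\t", header = FALSE)
ncol(g_demo)
```

The empty string in the third position corresponds to the icon cell, which holds no text in the example row.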

Magic happens next. We apply the custom function to every row, combine the results into a single data.frame, and remove rows that contain NA values.

pokemon_set_dataframe <- na.omit(do.call(rbind, lapply(tr_with_pokemon_sets, xmlToCsv)))
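The pattern of lapply, then do.call(rbind, ...), then na.omit can be tried on a few hand-built rows. This is a sketch with made-up data; rows_demo, combined and cleaned are illustrative names.

```r
# Three one-row data frames, as a per-row function might return them.
rows_demo <- list(
  data.frame(EngNumber = 1, EngSet = "Base Set", EngCardCount = "102",
             stringsAsFactors = FALSE),
  data.frame(EngNumber = NA_real_, EngSet = "Not a set", EngCardCount = NA_character_,
             stringsAsFactors = FALSE),
  data.frame(EngNumber = 2, EngSet = "Jungle", EngCardCount = "64",
             stringsAsFactors = FALSE)
)

# Stack the rows into one data frame; row names become 1, 2, 3.
combined <- do.call(rbind, rows_demo)

# Drop every row that contains an NA; surviving rows keep their old row names.
cleaned <- na.omit(combined)

nrow(cleaned)
rownames(cleaned)
```

Because na.omit keeps the original row names, gaps appear wherever a row was dropped. That is presumably why the row names in the output jump from 11 to 14.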

Print the data frame to see the data so far.

pokemon_set_dataframe
   EngNumber                     EngSet EngCardCount
1          1                   Base Set          102
2          2                     Jungle           64
3          3                     Fossil           62
4          4                 Base Set 2          130
5          5                Team Rocket          83*
6          7              Gym Challenge          132
7          8                Neo Genesis          111
8          9              Neo Discovery           75
9         10             Neo Revelation          66*
10        11                Neo Destiny         113*
11        12       Legendary Collection          110
14        13        Expedition Base Set          165
15        14                  Aquapolis         186*
16        14                  Aquapolis         186*
17        15                   Skyridge         182*
18        15                   Skyridge         182*
19        16         EX Ruby & Sapphire          109
20        17               EX Sandstorm          100
21        18                  EX Dragon         100*
22        19 EX Team Magma vs Team Aqua          97*
23        20          EX Hidden Legends         102*
24        21     EX FireRed & LeafGreen         116*
25        22     EX Team Rocket Returns         111*
26        23                  EX Deoxys         108*
27        24                 EX Emerald         107*
28        25           EX Unseen Forces         145*
29        26           EX Delta Species         114*
30        27            EX Legend Maker          93*
31        28          EX Holon Phantoms         111*
32        29       EX Crystal Guardians          100
33        30        EX Dragon Frontiers          101
34        31           EX Power Keepers          108
35        32            Diamond & Pearl          130
36        33       Mysterious Treasures         124*
37        34             Secret Wonders          132
38        35           Great Encounters          106
39        36              Majestic Dawn          100
40        37           Legends Awakened          146
41        38                 Stormfront         106*
42        40              Rising Rivals         120*
43        41            Supreme Victors         153*
44        42                     Arceus         111*
45        43     HeartGold & SoulSilver         124*
46        44                  Unleashed          96*
47        45                  Undaunted          91*
48        46                 Triumphant         103*
49        47            Call of Legends          106
50        48              Black & White         115*
51        49            Emerging Powers           98
52        50            Noble Victories         102*
53        51             Next Destinies         103*
54        52             Dark Explorers         111*
55        53            Dragons Exalted         128*
56        54         Boundaries Crossed         153*
57        55               Plasma Storm         138*
58        56              Plasma Freeze         122*
59        57               Plasma Blast         105*
60        58        Legendary Treasures         138*
61        59                         XY          146
62        60                  Flashfire         109*
63        61              Furious Fists         113*
64        62             Phantom Forces         122*
65        63               Primal Clash         150+

Notice those extra asterisks and plus signs? The next bit of code removes them.

pokemon_set_dataframe$EngCardCount <- gsub("\\*|\\+", "", pokemon_set_dataframe$EngCardCount)
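The pattern "\\*|\\+" matches a literal asterisk or a literal plus sign; both need escaping because they are regex quantifiers. A short sketch on made-up values shows two other spellings that behave the same way for this kind of input (counts and the stripped_ names are illustrative):

```r
counts <- c("102", "83*", "150+")

stripped_alt   <- gsub("\\*|\\+", "", counts)   # alternation, as in the text
stripped_class <- gsub("[*+]", "", counts)      # character class, no escaping needed
stripped_digit <- gsub("[^0-9]", "", counts)    # broader: drop anything that is not a digit

stripped_alt
```

The third form is more aggressive and would also remove other stray characters, which may or may not be what you want.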

Here is the final dataset.

pokemon_set_dataframe
   EngNumber                     EngSet EngCardCount
1          1                   Base Set          102
2          2                     Jungle           64
3          3                     Fossil           62
4          4                 Base Set 2          130
5          5                Team Rocket           83
6          7              Gym Challenge          132
7          8                Neo Genesis          111
8          9              Neo Discovery           75
9         10             Neo Revelation           66
10        11                Neo Destiny          113
11        12       Legendary Collection          110
14        13        Expedition Base Set          165
15        14                  Aquapolis          186
16        14                  Aquapolis          186
17        15                   Skyridge          182
18        15                   Skyridge          182
19        16         EX Ruby & Sapphire          109
20        17               EX Sandstorm          100
21        18                  EX Dragon          100
22        19 EX Team Magma vs Team Aqua           97
23        20          EX Hidden Legends          102
24        21     EX FireRed & LeafGreen          116
25        22     EX Team Rocket Returns          111
26        23                  EX Deoxys          108
27        24                 EX Emerald          107
28        25           EX Unseen Forces          145
29        26           EX Delta Species          114
30        27            EX Legend Maker           93
31        28          EX Holon Phantoms          111
32        29       EX Crystal Guardians          100
33        30        EX Dragon Frontiers          101
34        31           EX Power Keepers          108
35        32            Diamond & Pearl          130
36        33       Mysterious Treasures          124
37        34             Secret Wonders          132
38        35           Great Encounters          106
39        36              Majestic Dawn          100
40        37           Legends Awakened          146
41        38                 Stormfront          106
42        40              Rising Rivals          120
43        41            Supreme Victors          153
44        42                     Arceus          111
45        43     HeartGold & SoulSilver          124
46        44                  Unleashed           96
47        45                  Undaunted           91
48        46                 Triumphant          103
49        47            Call of Legends          106
50        48              Black & White          115
51        49            Emerging Powers           98
52        50            Noble Victories          102
53        51             Next Destinies          103
54        52             Dark Explorers          111
55        53            Dragons Exalted          128
56        54         Boundaries Crossed          153
57        55               Plasma Storm          138
58        56              Plasma Freeze          122
59        57               Plasma Blast          105
60        58        Legendary Treasures          138
61        59                         XY          146
62        60                  Flashfire          109
63        61              Furious Fists          113
64        62             Phantom Forces          122
65        63               Primal Clash          150

With a bit more work, the row names (the unlabeled first column of numbers) can be left out of the printed output.

x <- as.matrix(format(pokemon_set_dataframe))
rownames(x) <- rep("", nrow(x))
print(x, quote=FALSE)
 EngNumber EngSet                     EngCardCount
  1        Base Set                   102         
  2        Jungle                     64          
  3        Fossil                     62          
  4        Base Set 2                 130         
  5        Team Rocket                83          
  7        Gym Challenge              132         
  8        Neo Genesis                111         
  9        Neo Discovery              75          
 10        Neo Revelation             66          
 11        Neo Destiny                113         
 12        Legendary Collection       110         
 13        Expedition Base Set        165         
 14        Aquapolis                  186         
 14        Aquapolis                  186         
 15        Skyridge                   182         
 15        Skyridge                   182         
 16        EX Ruby & Sapphire         109         
 17        EX Sandstorm               100         
 18        EX Dragon                  100         
 19        EX Team Magma vs Team Aqua 97          
 20        EX Hidden Legends          102         
 21        EX FireRed & LeafGreen     116         
 22        EX Team Rocket Returns     111         
 23        EX Deoxys                  108         
 24        EX Emerald                 107         
 25        EX Unseen Forces           145         
 26        EX Delta Species           114         
 27        EX Legend Maker            93          
 28        EX Holon Phantoms          111         
 29        EX Crystal Guardians       100         
 30        EX Dragon Frontiers        101         
 31        EX Power Keepers           108         
 32        Diamond & Pearl            130         
 33        Mysterious Treasures       124         
 34        Secret Wonders             132         
 35        Great Encounters           106         
 36        Majestic Dawn              100         
 37        Legends Awakened           146         
 38        Stormfront                 106         
 40        Rising Rivals              120         
 41        Supreme Victors            153         
 42        Arceus                     111         
 43        HeartGold & SoulSilver     124         
 44        Unleashed                  96          
 45        Undaunted                  91          
 46        Triumphant                 103         
 47        Call of Legends            106         
 48        Black & White              115         
 49        Emerging Powers            98          
 50        Noble Victories            102         
 51        Next Destinies             103         
 52        Dark Explorers             111         
 53        Dragons Exalted            128         
 54        Boundaries Crossed         153         
 55        Plasma Storm               138         
 56        Plasma Freeze              122         
 57        Plasma Blast               105         
 58        Legendary Treasures        138         
 59        XY                         146         
 60        Flashfire                  109         
 61        Furious Fists              113         
 62        Phantom Forces             122         
 63        Primal Clash               150         
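For data frames, print also accepts row.names = FALSE, which may be a simpler route to a similar result (the column alignment differs from the matrix approach). A minimal sketch on a two-row stand-in; demo and out are made-up names:

```r
demo <- data.frame(
  EngNumber    = c(1, 2),
  EngSet       = c("Base Set", "Jungle"),
  EngCardCount = c("102", "64"),
  stringsAsFactors = FALSE
)

# Print without the row-name column.
print(demo, row.names = FALSE)

# capture.output returns the printed lines: one header line plus one per row.
out <- capture.output(print(demo, row.names = FALSE))
length(out)
```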

And we can plot the number of cards per set against the set number.

plot(pokemon_set_dataframe[c(1,3)])

The EngCardCount column is still a character column, which is not what we want for a count. The transform function converts it to numeric.

pokemon_set_dataframe <- transform(pokemon_set_dataframe, EngCardCount = as.numeric(EngCardCount))
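It is worth checking that a conversion like this produces no NA values, since as.numeric turns anything it cannot parse into NA (with a warning). The sketch below uses made-up values to show why the asterisks and plus signs had to be stripped first; raw, converted and total are illustrative names.

```r
raw <- c("102", "83*", "150")

# Converting without cleaning: the entry with an asterisk becomes NA.
converted <- suppressWarnings(as.numeric(raw))
is.na(converted)

# Cleaning first gives usable numbers.
cleaned <- as.numeric(gsub("[*+]", "", raw))
anyNA(cleaned)
total <- sum(cleaned)
total
```

A check such as anyNA on the converted column is a cheap way to catch rows that slipped through the cleaning step.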

Now it’s possible to sum the card counts.

noquote(format(sum(pokemon_set_dataframe$EngCardCount), big.mark=","))
[1] 7,372