This document shows how to extract a dataset from an HTML page.

We’ll start by loading two libraries. RCurl downloads the HTML page, and XML parses it (HTML can be treated as a loose form of XML).

library(RCurl)
## Loading required package: bitops
library(XML)

Let R know where to find the HTML page. Then download and parse it.

theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"


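# Download the page as a single string
webpage <- getURL(theurl)
# Split the string into a character vector, one element per line
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# Parse the HTML into an internal document; the empty error handler ignores parser errors
doc <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)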
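
The parse step can be tried offline on a small inline HTML string. This is only a sketch: the `html` string and the variable names are made up for illustration, and `htmlParse` is used here as the XML package's shorthand for `htmlTreeParse` with `useInternalNodes = TRUE`.

```r
library(XML)

# Hypothetical two-row table standing in for the downloaded page
html <- "<table><tr><td>Base Set</td><td>102</td></tr><tr><td>Jungle</td><td>64</td></tr></table>"

# asText = TRUE says the argument is HTML content, not a file name or URL
doc_small <- htmlParse(html, asText = TRUE)

# Each tr node becomes one element of the node set
rows <- getNodeSet(doc_small, "//tr")
length(rows)

# xpathSApply applies xmlValue to every node matched by the XPath expression
set_names <- xpathSApply(doc_small, "//tr/td[1]", xmlValue)
set_names
```

Working from a string like this makes it easier to experiment with XPath expressions before pointing the code at a live page.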

Use XPath to extract all tr (table row) nodes from the HTML page. Some of those tr nodes hold extraneous information, so we’ll drop the first two and the last one, trimming the list from 70 elements to 67 elements.

tr <- getNodeSet(doc, "//*/tr")
# Drop the first two rows and the last row (parentheses needed: ':' binds tighter than '-')
tr_with_pokemon_sets <- tr[3:(length(tr) - 1)]
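
R's `:` operator has higher precedence than binary `-`, so an index expression such as `4:n - 1` means `(4:n) - 1`, not `4:(n - 1)`. The quick check below, which assumes 70 nodes as stated in the text, shows which of the two ranges selects 67 elements. The variable names are illustrative only.

```r
n <- 70  # assumed number of tr nodes, taken from the text above

idx_unparenthesized <- 4:n - 1    # evaluated as (4:n) - 1, i.e. 3, 4, ..., 69
idx_parenthesized   <- 4:(n - 1)  # 4, 5, ..., 69

range(idx_unparenthesized)
length(idx_unparenthesized)
length(idx_parenthesized)
```

Writing the parentheses explicitly avoids any doubt about which rows are kept.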

Let’s look at one example of the HTML. It holds information about one Pokémon set. Any pound signs at the start of output lines are not part of the data; they are just part of the printing.

tr_with_pokemon_sets[1]
[[1]]
<tr><th> 1
</th>
<td> 1
</td>
<td>
</td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a>
</td>
<td> Expansion Pack
</td>
<td> 102
</td>
<td> 102
</td>
<td> January 9, 1999
</td>
<td> October 20, 1996
</td></tr> 

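To make sense of that HTML, we’ll use a custom function to process each element in tr_with_pokemon_sets. Generally speaking, the function strips the HTML markup, replaces the newlines between cells with tab separators, and reads the result as a one-row data frame. It also assigns column names and keeps only the three columns of interest; column types are inferred by read.table.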

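xmlToCsv <- function(xml) {
    # Each pair of newlines in the node text marks a cell boundary; turn it into a tab
    a <- gsub('\n\n','\t', xmlValue(xml))
    # Put a space between adjacent tabs, then collapse any double tabs that remain
    b <- gsub('\t\t','\t \t', a)
    d <- gsub('\t\t','\t', b)
    # Trim the leading space or a trailing tab
    e <- gsub('^ |\t$','', d)
    # Remove the space that follows each tab
    f <- gsub('\t ','\t', e)
    # Intended column types; not passed to read.table, which infers types itself
    cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
    cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
    # Read the single tab-separated line as a one-row data frame
    g <- read.table(text=f, sep="\t", header=FALSE)
    colnames(g) <- cn
    # Keep only the columns of interest
    keeps <- c("EngNumber", "EngSet", "EngCardCount")
    return(g[keeps])
}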
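
To see what the chain of gsub calls does, it helps to run it on a plain string. The sketch below assumes that xmlValue returns text shaped like the `raw` string, with each cell starting with a space and cells separated by blank lines. That string is a hypothetical reconstruction of the Base Set row, not output captured from the page, so the real text may differ.

```r
# Hypothetical node text for one row (assumed shape, modeled on the Base Set example)
raw <- " 1\n\n 1\n\n\n\n Base Set\n\n Expansion Pack\n\n 102\n\n 102\n\n January 9, 1999\n\n October 20, 1996\n"

a <- gsub("\n\n", "\t", raw)    # double newlines become tabs
b <- gsub("\t\t", "\t \t", a)   # put a space between adjacent tabs
d <- gsub("\t\t", "\t", b)      # collapse any remaining double tabs
e <- gsub("^ |\t$", "", d)      # drop the leading space or a trailing tab
f <- gsub("\t ", "\t", e)       # drop the space after each tab

# Count the tab-separated fields, then read them as a one-row data frame
fields <- strsplit(f, "\t")[[1]]
length(fields)

g <- read.table(text = f, sep = "\t", header = FALSE)
g[c(1, 4, 6)]   # the positions that become EngNumber, EngSet and EngCardCount
```

Under this assumption the empty icon cell survives as an empty field, so the row still splits into nine columns and the column names line up.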

Magic happens next. We apply the custom function to every row, combine the results into a single data.frame, and remove rows that contain NA values.

pokemon_set_dataframe <- na.omit(do.call(rbind, lapply(tr_with_pokemon_sets, xmlToCsv)))
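
The same apply, combine and filter pattern can be tried on toy data. In this sketch `make_row` is a hypothetical stand-in for xmlToCsv, and the `inputs` list is invented for illustration.

```r
# Hypothetical stand-in for xmlToCsv: returns a one-row data frame per input
make_row <- function(x) {
  data.frame(EngNumber = x$number, EngSet = x$name, stringsAsFactors = FALSE)
}

inputs <- list(
  list(number = 1,  name = "Base Set"),
  list(number = NA, name = "Header row"),  # dropped by na.omit below
  list(number = 2,  name = "Jungle")
)

# lapply builds a list of one-row data frames; do.call(rbind, ...) stacks them
combined <- do.call(rbind, lapply(inputs, make_row))

# na.omit removes rows containing NA but keeps the original row names
cleaned <- na.omit(combined)
cleaned
rownames(cleaned)
```

Because na.omit keeps the original row names, the row labels of the result can skip numbers wherever a row was dropped.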

Printing the data frame shows the data so far.

pokemon_set_dataframe
   EngNumber                     EngSet EngCardCount
1          1                   Base Set          102
2          2                     Jungle           64
3          3                     Fossil           62
4          4                 Base Set 2          130
5          5                Team Rocket          83*
6          7              Gym Challenge          132
7          8                Neo Genesis          111
8          9              Neo Discovery           75
9         10             Neo Revelation          66*
10        11                Neo Destiny         113*
11        12       Legendary Collection          110
14        13        Expedition Base Set          165
15        14                  Aquapolis         186*
16        14                  Aquapolis         186*
17        15                   Skyridge         182*
18        15                   Skyridge         182*
19        16         EX Ruby & Sapphire          109
20        17               EX Sandstorm          100
21        18                  EX Dragon         100*
22        19 EX Team Magma vs Team Aqua          97*
23        20          EX Hidden Legends         102*
24        21     EX FireRed & LeafGreen         116*
25        22     EX Team Rocket Returns         111*
26        23                  EX Deoxys         108*
27        24                 EX Emerald         107*
28        25           EX Unseen Forces         145*
29        26           EX Delta Species         114*
30        27            EX Legend Maker          93*
31        28          EX Holon Phantoms         111*
32        29       EX Crystal Guardians          100
33        30        EX Dragon Frontiers          101
34        31           EX Power Keepers          108
35        32            Diamond & Pearl          130
36        33       Mysterious Treasures         124*
37        34             Secret Wonders          132
38        35           Great Encounters          106
39        36              Majestic Dawn          100
40        37           Legends Awakened          146
41        38                 Stormfront         106*
42        40              Rising Rivals         120*
43        41            Supreme Victors         153*
44        42                     Arceus         111*
45        43     HeartGold & SoulSilver         124*
46        44                  Unleashed          96*
47        45                  Undaunted          91*
48        46                 Triumphant         103*
49        47            Call of Legends          106
50        48              Black & White         115*
51        49            Emerging Powers           98
52        50            Noble Victories         102*
53        51             Next Destinies         103*
54        52             Dark Explorers         111*
55        53            Dragons Exalted         128*
56        54         Boundaries Crossed         153*
57        55               Plasma Storm         138*
58        56              Plasma Freeze         122*
59        57               Plasma Blast         105*
60        58        Legendary Treasures         138*
61        59                         XY          146
62        60                  Flashfire         109*
63        61              Furious Fists         113*
64        62             Phantom Forces         122*
65        63               Primal Clash         150+

Notice those extra asterisks and plus signs? The next bit of code removes them.

pokemon_set_dataframe$EngCardCount <- gsub("\\*|\\+", "", pokemon_set_dataframe$EngCardCount)
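
The pattern can be checked on a few sample values. In this sketch the `counts` vector is a small illustrative sample, and the character-class form is offered as an equivalent alternative rather than as the code used above.

```r
counts <- c("83*", "150+", "102")  # sample values in the style of the table

# Alternation of two escaped metacharacters, as in the call above
stripped_alt <- gsub("\\*|\\+", "", counts)

# Equivalent character class; * and + are literal inside brackets
stripped_class <- gsub("[*+]", "", counts)

stripped_alt
identical(stripped_alt, stripped_class)
```

Note that the result is still a character vector; gsub does not convert it to numbers.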

Here is the final dataset.

pokemon_set_dataframe
   EngNumber                     EngSet EngCardCount
1          1                   Base Set          102
2          2                     Jungle           64
3          3                     Fossil           62
4          4                 Base Set 2          130
5          5                Team Rocket           83
6          7              Gym Challenge          132
7          8                Neo Genesis          111
8          9              Neo Discovery           75
9         10             Neo Revelation           66
10        11                Neo Destiny          113
11        12       Legendary Collection          110
14        13        Expedition Base Set          165
15        14                  Aquapolis          186
16        14                  Aquapolis          186
17        15                   Skyridge          182
18        15                   Skyridge          182
19        16         EX Ruby & Sapphire          109
20        17               EX Sandstorm          100
21        18                  EX Dragon          100
22        19 EX Team Magma vs Team Aqua           97
23        20          EX Hidden Legends          102
24        21     EX FireRed & LeafGreen          116
25        22     EX Team Rocket Returns          111
26        23                  EX Deoxys          108
27        24                 EX Emerald          107
28        25           EX Unseen Forces          145
29        26           EX Delta Species          114
30        27            EX Legend Maker           93
31        28          EX Holon Phantoms          111
32        29       EX Crystal Guardians          100
33        30        EX Dragon Frontiers          101
34        31           EX Power Keepers          108
35        32            Diamond & Pearl          130
36        33       Mysterious Treasures          124
37        34             Secret Wonders          132
38        35           Great Encounters          106
39        36              Majestic Dawn          100
40        37           Legends Awakened          146
41        38                 Stormfront          106
42        40              Rising Rivals          120
43        41            Supreme Victors          153
44        42                     Arceus          111
45        43     HeartGold & SoulSilver          124
46        44                  Unleashed           96
47        45                  Undaunted           91
48        46                 Triumphant          103
49        47            Call of Legends          106
50        48              Black & White          115
51        49            Emerging Powers           98
52        50            Noble Victories          102
53        51             Next Destinies          103
54        52             Dark Explorers          111
55        53            Dragons Exalted          128
56        54         Boundaries Crossed          153
57        55               Plasma Storm          138
58        56              Plasma Freeze          122
59        57               Plasma Blast          105
60        58        Legendary Treasures          138
61        59                         XY          146
62        60                  Flashfire          109
63        61              Furious Fists          113
64        62             Phantom Forces          122
65        63               Primal Clash          150

With a little more work, the row names (the unlabeled first column of numbers) can be removed from the printed output.

x <- as.matrix(format(pokemon_set_dataframe))
rownames(x) <- rep("", nrow(x))
print(x, quote=FALSE)
 EngNumber EngSet                     EngCardCount
  1        Base Set                   102
  2        Jungle                     64
  3        Fossil                     62
  4        Base Set 2                 130
  5        Team Rocket                83
  7        Gym Challenge              132
  8        Neo Genesis                111
  9        Neo Discovery              75
 10        Neo Revelation             66
 11        Neo Destiny                113
 12        Legendary Collection       110
 13        Expedition Base Set        165
 14        Aquapolis                  186
 14        Aquapolis                  186
 15        Skyridge                   182
 15        Skyridge                   182
 16        EX Ruby & Sapphire         109
 17        EX Sandstorm               100
 18        EX Dragon                  100
 19        EX Team Magma vs Team Aqua 97
 20        EX Hidden Legends          102
 21        EX FireRed & LeafGreen     116
 22        EX Team Rocket Returns     111
 23        EX Deoxys                  108
 24        EX Emerald                 107
 25        EX Unseen Forces           145
 26        EX Delta Species           114
 27        EX Legend Maker            93
 28        EX Holon Phantoms          111
 29        EX Crystal Guardians       100
 30        EX Dragon Frontiers        101
 31        EX Power Keepers           108
 32        Diamond & Pearl            130
 33        Mysterious Treasures       124
 34        Secret Wonders             132
 35        Great Encounters           106
 36        Majestic Dawn              100
 37        Legends Awakened           146
 38        Stormfront                 106
 40        Rising Rivals              120
 41        Supreme Victors            153
 42        Arceus                     111
 43        HeartGold & SoulSilver     124
 44        Unleashed                  96
 45        Undaunted                  91
 46        Triumphant                 103
 47        Call of Legends            106
 48        Black & White              115
 49        Emerging Powers            98
 50        Noble Victories            102
 51        Next Destinies             103
 52        Dark Explorers             111
 53        Dragons Exalted            128
 54        Boundaries Crossed         153
 55        Plasma Storm               138
 56        Plasma Freeze              122
 57        Plasma Blast               105
 58        Legendary Treasures        138
 59        XY                         146
 60        Flashfire                  109
 61        Furious Fists              113
 62        Phantom Forces             122
 63        Primal Clash               150         
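
A shorter alternative, offered as a sketch rather than a replacement for the approach above, is the `row.names` argument of print for data frames. The `sets` data frame here is a small hypothetical example.

```r
# Small hypothetical data frame with non-consecutive row names
sets <- data.frame(EngNumber = c(1, 2),
                   EngSet = c("Base Set", "Jungle"),
                   row.names = c("1", "14"),
                   stringsAsFactors = FALSE)

# row.names = FALSE suppresses the row-name column in the printed output
printed <- capture.output(print(sets, row.names = FALSE))
printed
```

With this option strings are right-aligned by default, so the layout differs slightly from the matrix-based output.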

We can also plot the number of cards per set against the set number.

plot(pokemon_set_dataframe[c(1,3)])

The EngCardCount column is still a character vector, which is not the right type for arithmetic. The transform function converts it to numeric.

pokemon_set_dataframe <- transform(pokemon_set_dataframe, EngCardCount = as.numeric(EngCardCount))
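
The conversion is easy to try on a toy data frame. This sketch uses a hypothetical `toy` data frame, and it also shows why stray characters have to be removed first: as.numeric returns NA for a string such as "83*".

```r
# Hypothetical two-row data frame with counts stored as text
toy <- data.frame(EngCardCount = c("102", "64"), stringsAsFactors = FALSE)
class(toy$EngCardCount)

# transform evaluates the expression using the data frame's own columns
toy <- transform(toy, EngCardCount = as.numeric(EngCardCount))
class(toy$EngCardCount)
sum(toy$EngCardCount)

# A value with a stray symbol cannot be parsed and becomes NA (with a warning)
bad <- suppressWarnings(as.numeric("83*"))
bad
```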

Now it’s possible to sum the card counts.

noquote(format(sum(pokemon_set_dataframe$EngCardCount), big.mark=","))
[1] 7,372
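
The formatting step can be looked at on its own. In this sketch `total` is simply the sum reported above, typed in as a plain number.

```r
total <- 7372  # the sum reported above

# format returns a character string; big.mark inserts the thousands separator
formatted <- format(total, big.mark = ",")
class(formatted)

# noquote only changes how the string is printed, dropping the quotation marks
noquote(formatted)
```

Since the formatted value is text, any further arithmetic should use the original number rather than the formatted string.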