Using R to Fetch a List of Pokemon Sets
This document shows how to extract a dataset from an HTML page.
We’ll start by loading two libraries. RCurl is used to download an HTML page. XML is used to parse the HTML, which can be treated as a form of XML for this purpose.
library(RCurl)
## Loading required package: bitops
library(XML)
Let R know where to find the HTML page. Then download and parse it.
theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
doc <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
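To try the parsing step without a network connection, the same XML package can be pointed at a small inline HTML string. This is only a minimal sketch: the sample_html table and the variable names are made up for illustration and are not the Bulbapedia page.

```r
library(XML)

# Hypothetical two-row table standing in for the downloaded page.
sample_html <- "<html><body><table>
<tr><th>1</th><td>Base Set</td><td>102</td></tr>
<tr><th>2</th><td>Jungle</td><td>64</td></tr>
</table></body></html>"

# asText = TRUE says the argument is HTML content, not a file name or URL.
sample_doc <- htmlParse(sample_html, asText = TRUE)

# Collect every tr node, then read the text of the td cells in the first row.
sample_rows <- getNodeSet(sample_doc, "//tr")
first_row_cells <- xpathSApply(sample_rows[[1]], "./td", xmlValue)

print(length(sample_rows))
print(first_row_cells)
```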
Use XPath to extract all tr (table row) nodes from the HTML page. Some of those tr nodes hold extraneous information, so we’ll trim the list from 70 elements to 67 by keeping elements 3 through 69.
tr <- getNodeSet(doc, "//*/tr")
# Explicit form of tr[4:length(tr)-1], which R evaluates as (4:length(tr)) - 1.
tr_with_pokemon_sets <- tr[3:(length(tr) - 1)]
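An index written as 4:length(tr)-1 is evaluated as (4:length(tr)) - 1, because in R the colon operator binds more tightly than subtraction. It therefore selects elements 3 through length(tr) - 1. The sketch below checks this on a made-up 10-element vector; demo and the idx_ names are assumptions for illustration.

```r
# Hypothetical stand-in for the list of tr nodes.
demo <- letters[1:10]

# The colon is applied first, then 1 is subtracted from every index.
idx_implicit <- 4:length(demo) - 1

# The same range written with explicit parentheses.
idx_explicit <- 3:(length(demo) - 1)

print(idx_implicit)
print(all(idx_implicit == idx_explicit))
print(demo[idx_explicit])
```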
Let’s look at one example of the HTML. It holds information about one Pokemon set. Any pound signs at the start of the output lines are not part of the data; they are just part of the printing.
tr_with_pokemon_sets[1]
[[1]]
<tr><th> 1
</th>
<td> 1
</td>
<td>
</td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a>
</td>
<td> Expansion Pack
</td>
<td> 102
</td>
<td> 102
</td>
<td> January 9, 1999
</td>
<td> October 20, 1996
</td></tr>
To make sense of that HTML, we’ll use a custom function to process each element in tr_with_pokemon_sets. Generally speaking, the function removes newlines and HTML syntax. It also assigns column names and keeps only the three columns we need.
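xmlToCsv <- function(xml) {
  # Turn each pair of newlines between cells into a tab.
  a <- gsub('\n\n','\t', xmlValue(xml))
  # Put a space between adjacent tabs so an empty cell keeps its own field.
  b <- gsub('\t\t','\t \t', a)
  # Collapse any double tabs that remain.
  d <- gsub('\t\t','\t', b)
  # Strip the leading space and any trailing tab.
  e <- gsub('^ |\t$','', d)
  # Drop the space that follows each tab.
  f <- gsub('\t ','\t', e)
  # Intended column types; note that cc is not passed to read.table below.
  cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
  cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
  g <- read.table(text=f, sep="\t", header=FALSE)
  colnames(g) <- cn
  # Keep only the English set number, set name, and card count.
  keeps <- c("EngNumber", "EngSet", "EngCardCount")
  return(g[keeps])
}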
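The chain of gsub calls is easier to follow on a literal string. The sketch below applies the same five patterns to raw_text, a shortened, made-up approximation of what xmlValue might return for a row with an empty cell; the exact whitespace on the real page is an assumption.

```r
# Hypothetical text for a row with cells "1", "1", an empty cell, "Base Set", "102".
raw_text <- " 1\n\n 1\n\n\n\n Base Set\n\n 102\n"

step_a <- gsub('\n\n', '\t', raw_text)   # newline pairs become tabs
step_b <- gsub('\t\t', '\t \t', step_a)  # keep the empty cell as its own field
step_d <- gsub('\t\t', '\t', step_b)     # collapse any double tabs that remain
step_e <- gsub('^ |\t$', '', step_d)     # strip leading space and trailing tab
cleaned <- gsub('\t ', '\t', step_e)     # drop the space after each tab

# Parse the tab-separated line; the empty cell is read as NA.
parsed <- read.table(text = cleaned, sep = "\t", header = FALSE,
                     stringsAsFactors = FALSE)
print(parsed)
```

Printing the intermediate step_ values one at a time shows how each pattern changes the string.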
Magic happens next. We apply the custom function to every row, combine the results into a data.frame, and remove rows that contain NA values.
pokemon_set_dataframe <- na.omit(do.call(rbind, lapply(tr_with_pokemon_sets, xmlToCsv)))
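The lapply, rbind, and na.omit combination can be tried on toy data. In this sketch parse_row is a hypothetical stand-in for xmlToCsv, and the comma-separated input strings are made up. It also shows that na.omit keeps the original row names, which is presumably why the row numbers in a listing can show gaps.

```r
# Hypothetical parser: returns a one-row data frame, with NA when the
# set number cannot be read (as might happen for a header-like row).
parse_row <- function(txt) {
  parts <- strsplit(txt, ",")[[1]]
  data.frame(
    EngNumber = suppressWarnings(as.numeric(parts[1])),
    EngSet = parts[2],
    stringsAsFactors = FALSE
  )
}

rows <- c("1,Base Set", "x,Not a set", "2,Jungle")

# lapply builds a list of one-row data frames; do.call(rbind, ...) stacks them.
stacked <- do.call(rbind, lapply(rows, parse_row))

# na.omit drops the row whose EngNumber is NA but keeps the row names 1 and 3.
cleaned_rows <- na.omit(stacked)
print(cleaned_rows)
```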
The information is displayed so you can see the data so far.
pokemon_set_dataframe
EngNumber EngSet EngCardCount
1 1 Base Set 102
2 2 Jungle 64
3 3 Fossil 62
4 4 Base Set 2 130
5 5 Team Rocket 83*
6 7 Gym Challenge 132
7 8 Neo Genesis 111
8 9 Neo Discovery 75
9 10 Neo Revelation 66*
10 11 Neo Destiny 113*
11 12 Legendary Collection 110
14 13 Expedition Base Set 165
15 14 Aquapolis 186*
16 14 Aquapolis 186*
17 15 Skyridge 182*
18 15 Skyridge 182*
19 16 EX Ruby & Sapphire 109
20 17 EX Sandstorm 100
21 18 EX Dragon 100*
22 19 EX Team Magma vs Team Aqua 97*
23 20 EX Hidden Legends 102*
24 21 EX FireRed & LeafGreen 116*
25 22 EX Team Rocket Returns 111*
26 23 EX Deoxys 108*
27 24 EX Emerald 107*
28 25 EX Unseen Forces 145*
29 26 EX Delta Species 114*
30 27 EX Legend Maker 93*
31 28 EX Holon Phantoms 111*
32 29 EX Crystal Guardians 100
33 30 EX Dragon Frontiers 101
34 31 EX Power Keepers 108
35 32 Diamond & Pearl 130
36 33 Mysterious Treasures 124*
37 34 Secret Wonders 132
38 35 Great Encounters 106
39 36 Majestic Dawn 100
40 37 Legends Awakened 146
41 38 Stormfront 106*
42 40 Rising Rivals 120*
43 41 Supreme Victors 153*
44 42 Arceus 111*
45 43 HeartGold & SoulSilver 124*
46 44 Unleashed 96*
47 45 Undaunted 91*
48 46 Triumphant 103*
49 47 Call of Legends 106
50 48 Black & White 115*
51 49 Emerging Powers 98
52 50 Noble Victories 102*
53 51 Next Destinies 103*
54 52 Dark Explorers 111*
55 53 Dragons Exalted 128*
56 54 Boundaries Crossed 153*
57 55 Plasma Storm 138*
58 56 Plasma Freeze 122*
59 57 Plasma Blast 105*
60 58 Legendary Treasures 138*
61 59 XY 146
62 60 Flashfire 109*
63 61 Furious Fists 113*
64 62 Phantom Forces 122*
65 63 Primal Clash 150+
Notice those extra asterisks and plus signs? The next bit of code removes them.
pokemon_set_dataframe$EngCardCount <- gsub("\\*|\\+", "", pokemon_set_dataframe$EngCardCount)
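The same cleanup can be checked on a short vector. This sketch uses three values taken from the listing above and compares the escaped alternation with an equivalent character class; sample_counts is a name assumed here for illustration.

```r
# Three card counts copied from the listing, two of them with markers.
sample_counts <- c("83*", "132", "150+")

# Escaped alternation, as used in the document.
with_alternation <- gsub("\\*|\\+", "", sample_counts)

# Equivalent character class: inside brackets, * and + are literal.
with_class <- gsub("[*+]", "", sample_counts)

print(with_alternation)
print(identical(with_alternation, with_class))
```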
Here is the final dataset.
pokemon_set_dataframe
EngNumber EngSet EngCardCount
1 1 Base Set 102
2 2 Jungle 64
3 3 Fossil 62
4 4 Base Set 2 130
5 5 Team Rocket 83
6 7 Gym Challenge 132
7 8 Neo Genesis 111
8 9 Neo Discovery 75
9 10 Neo Revelation 66
10 11 Neo Destiny 113
11 12 Legendary Collection 110
14 13 Expedition Base Set 165
15 14 Aquapolis 186
16 14 Aquapolis 186
17 15 Skyridge 182
18 15 Skyridge 182
19 16 EX Ruby & Sapphire 109
20 17 EX Sandstorm 100
21 18 EX Dragon 100
22 19 EX Team Magma vs Team Aqua 97
23 20 EX Hidden Legends 102
24 21 EX FireRed & LeafGreen 116
25 22 EX Team Rocket Returns 111
26 23 EX Deoxys 108
27 24 EX Emerald 107
28 25 EX Unseen Forces 145
29 26 EX Delta Species 114
30 27 EX Legend Maker 93
31 28 EX Holon Phantoms 111
32 29 EX Crystal Guardians 100
33 30 EX Dragon Frontiers 101
34 31 EX Power Keepers 108
35 32 Diamond & Pearl 130
36 33 Mysterious Treasures 124
37 34 Secret Wonders 132
38 35 Great Encounters 106
39 36 Majestic Dawn 100
40 37 Legends Awakened 146
41 38 Stormfront 106
42 40 Rising Rivals 120
43 41 Supreme Victors 153
44 42 Arceus 111
45 43 HeartGold & SoulSilver 124
46 44 Unleashed 96
47 45 Undaunted 91
48 46 Triumphant 103
49 47 Call of Legends 106
50 48 Black & White 115
51 49 Emerging Powers 98
52 50 Noble Victories 102
53 51 Next Destinies 103
54 52 Dark Explorers 111
55 53 Dragons Exalted 128
56 54 Boundaries Crossed 153
57 55 Plasma Storm 138
58 56 Plasma Freeze 122
59 57 Plasma Blast 105
60 58 Legendary Treasures 138
61 59 XY 146
62 60 Flashfire 109
63 61 Furious Fists 113
64 62 Phantom Forces 122
65 63 Primal Clash 150
With a bit more work, the row names (the first column of numbers) can be left out of the printed output.
x <- as.matrix(format(pokemon_set_dataframe))
rownames(x) <- rep("", nrow(x))
print(x, quote=FALSE)
EngNumber EngSet EngCardCount
1 Base Set 102
2 Jungle 64
3 Fossil 62
4 Base Set 2 130
5 Team Rocket 83
7 Gym Challenge 132
8 Neo Genesis 111
9 Neo Discovery 75
10 Neo Revelation 66
11 Neo Destiny 113
12 Legendary Collection 110
13 Expedition Base Set 165
14 Aquapolis 186
14 Aquapolis 186
15 Skyridge 182
15 Skyridge 182
16 EX Ruby & Sapphire 109
17 EX Sandstorm 100
18 EX Dragon 100
19 EX Team Magma vs Team Aqua 97
20 EX Hidden Legends 102
21 EX FireRed & LeafGreen 116
22 EX Team Rocket Returns 111
23 EX Deoxys 108
24 EX Emerald 107
25 EX Unseen Forces 145
26 EX Delta Species 114
27 EX Legend Maker 93
28 EX Holon Phantoms 111
29 EX Crystal Guardians 100
30 EX Dragon Frontiers 101
31 EX Power Keepers 108
32 Diamond & Pearl 130
33 Mysterious Treasures 124
34 Secret Wonders 132
35 Great Encounters 106
36 Majestic Dawn 100
37 Legends Awakened 146
38 Stormfront 106
40 Rising Rivals 120
41 Supreme Victors 153
42 Arceus 111
43 HeartGold & SoulSilver 124
44 Unleashed 96
45 Undaunted 91
46 Triumphant 103
47 Call of Legends 106
48 Black & White 115
49 Emerging Powers 98
50 Noble Victories 102
51 Next Destinies 103
52 Dark Explorers 111
53 Dragons Exalted 128
54 Boundaries Crossed 153
55 Plasma Storm 138
56 Plasma Freeze 122
57 Plasma Blast 105
58 Legendary Treasures 138
59 XY 146
60 Flashfire 109
61 Furious Fists 113
62 Phantom Forces 122
63 Primal Clash 150
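Another way to hide the row names is the row.names argument of print for data frames. The sketch below uses a made-up two-row data frame; demo_sets is an assumed name, and its values are copied from the first two rows of the listing.

```r
# Hypothetical two-row data frame with the same columns as the real one.
demo_sets <- data.frame(
  EngNumber = c(1, 2),
  EngSet = c("Base Set", "Jungle"),
  EngCardCount = c("102", "64"),
  stringsAsFactors = FALSE
)

# row.names = FALSE suppresses the leading column of row names.
print(demo_sets, row.names = FALSE)
```

This avoids the conversion to a matrix, at the cost of right-aligned text columns.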
And we can plot the number of cards per set against the set number.
plot(pokemon_set_dataframe[c(1,3)])
The EngCardCount column is actually a character data type, which is not correct for a count. The transform function changes the data type to numeric.
pokemon_set_dataframe <- transform(pokemon_set_dataframe, EngCardCount = as.numeric(EngCardCount))
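A quick way to confirm the conversion is to inspect the column class before and after. This is a minimal sketch on a made-up three-row data frame; demo_counts is an assumed name, and the values are the first three card counts from the listing.

```r
# Hypothetical data frame whose counts start out as character strings.
demo_counts <- data.frame(
  EngCardCount = c("102", "64", "62"),
  stringsAsFactors = FALSE
)
class_before <- class(demo_counts$EngCardCount)

# Same transform call as in the document, applied to the toy data.
demo_counts <- transform(demo_counts, EngCardCount = as.numeric(EngCardCount))
class_after <- class(demo_counts$EngCardCount)

print(c(class_before, class_after))
print(sum(demo_counts$EngCardCount))
```

If a marker such as an asterisk were still present, as.numeric would return NA for that value and emit a warning, so the cleanup step has to come first.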
Now it’s possible to sum the card counts.
noquote(format(sum(pokemon_set_dataframe$EngCardCount), big.mark=","))
[1] 7,372
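The listing contains two identical rows each for Aquapolis and Skyridge, so the total counts those sets twice. If that is not intended, duplicated rows could be dropped before summing. The sketch below shows the idea on five rows copied from the listing; demo_dupes and deduped are assumed names, and whether the duplicates should be removed depends on what the source page means by them.

```r
# Five rows copied from the listing, including the repeated sets.
demo_dupes <- data.frame(
  EngNumber = c(13, 14, 14, 15, 15),
  EngSet = c("Expedition Base Set", "Aquapolis", "Aquapolis",
             "Skyridge", "Skyridge"),
  EngCardCount = c(165, 186, 186, 182, 182),
  stringsAsFactors = FALSE
)

# duplicated() flags rows that repeat an earlier row in full.
deduped <- demo_dupes[!duplicated(demo_dupes), ]

print(sum(demo_dupes$EngCardCount))
print(sum(deduped$EngCardCount))
```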