Using BaiduMap’s Search API in R
Recently a friend asked me for help. The background is that people in large cities, such as Shanghai, sometimes want to “escape from the city” and go to a resort in the suburbs or the countryside. Some villages surrounding Shanghai have recognized this opportunity and started to build such resorts. Being an urban and rural planning major, my friend wanted to see how many among 9000 villages have developed such resorts.
There is no official list of these villages, which means the list has to be obtained from somewhere else. In the beginning, my friend and I thought about scraping data from major travel agency websites such as ctrip, qunar and others. After some exploration, I found to my dismay that these websites have added anti-scraping code to their pages, which made it impossible for me to write a scraper like I did here.
My friend suggested a Baidu Map search, which turned out to return over a thousand results. That is good news, as it means an abundance of data, but how are we going to put it into an organized format? When I was working on race/ethnicity prediction from birth records (link to paper), I used Google Map’s API to obtain the geocode corresponding to each address. Google Map apparently knows less about China than Baidu Map does. After some exploration on Baidu Map’s website, I found that it offers developer APIs too (link). In my friend’s case, a basic web search API would suffice. According to the documentation, at most 20 search results are returned per query, which means that getting a list of all the resorts near Shanghai takes more than 50 queries. How can we query effectively?
Fortunately, the API supports searching within a rectangular region. I therefore partitioned the whole region into 20x20 smaller squares and ran one query for each square. Below is a demonstration, with the only difference being that the region is partitioned into 5x5 squares to save time. Take an arbitrary rectangular region as an example. The first thing to do is the partition.
latmax <- 31.728300; latmin <- 29.912180
longmax <- 122.778190; longmin <- 119.607631
## 6 endpoints, split each side into 5 pieces
latseq <- seq(latmin, latmax, length.out = 6)
longseq <- seq(longmin, longmax, length.out = 6)
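The same partition can also be laid out as a table with one row per square, which makes it easy to check that the grid covers the whole region before spending any API calls. This is only a sketch: the helper `make_grid` and its column names are hypothetical and not part of the original code.

```r
## Hypothetical helper: one row per square, with the corner coordinates
## stored as lower-left (lat_lo, long_lo) and upper-right (lat_hi, long_hi).
make_grid <- function(latmin, latmax, longmin, longmax, n) {
  latseq <- seq(latmin, latmax, length.out = n + 1)
  longseq <- seq(longmin, longmax, length.out = n + 1)
  ## all (i, j) index pairs; i runs over longitude, j over latitude
  idx <- expand.grid(j = seq_len(n), i = seq_len(n))
  data.frame(i = idx$i, j = idx$j,
             lat_lo = latseq[idx$j], long_lo = longseq[idx$i],
             lat_hi = latseq[idx$j + 1], long_hi = longseq[idx$i + 1])
}

grid <- make_grid(29.912180, 31.728300, 119.607631, 122.778190, n = 5)
nrow(grid)
head(grid, 3)
```

Each row then carries everything one query needs, so the 25 squares can be inspected, or re-split where they turn out to be too dense, without touching the loop indices.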
Next, we make a 5x5 container list to hold the query results. A simple and straightforward way is to use a double loop over all 25 squares. “农家乐” (a farm-stay style rural resort) is used as the query keyword. The standard request format for this API can be found in the documentation, so we do not go into details here.
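## 5x5 nested list: pages[[i]][[j]] holds the raw response for square (i, j)
pages <- replicate(5, vector("list", 5), simplify = FALSE)
## percent-encode the Chinese keyword so that it is safe to put in a URL
keyword <- URLencode(enc2utf8("农家乐"))
for (i in 1:5) {
  for (j in 1:5) {
    ## bounds: lower-left lat,long followed by upper-right lat,long
    page <- paste0("http://api.map.baidu.com/place/v2/search?query=", keyword,
                   "&bounds=", latseq[j], ",", longseq[i], ",",
                   latseq[j + 1], ",", longseq[i + 1],
                   "&page_size=10&output=json&ak=[your API key here]")
    pages[[i]][[j]] <- readLines(page, warn = FALSE, encoding = "UTF-8")
  }
}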
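Since one query returns a single page of results, a square that holds more matches than `page_size` needs follow-up requests. The sketch below wraps the URL construction in a hypothetical helper, `build_search_url`, so that the page can be varied. It assumes the API accepts a zero-based `page_num` parameter next to `page_size`; that parameter does not appear in the code above and should be checked against the documentation. `YOUR_AK` is a placeholder for the API key.

```r
## Hypothetical helper that assembles one search URL.
## Assumption: the API accepts a zero-based `page_num` next to `page_size`;
## check the documentation before relying on it.
build_search_url <- function(keyword, lat_lo, long_lo, lat_hi, long_hi,
                             ak, page_size = 10, page_num = 0) {
  paste0("http://api.map.baidu.com/place/v2/search",
         "?query=", utils::URLencode(enc2utf8(keyword), reserved = TRUE),
         "&bounds=", paste(lat_lo, long_lo, lat_hi, long_hi, sep = ","),
         "&page_size=", page_size,
         "&page_num=", page_num,
         "&output=json&ak=", ak)
}

## second page of results for one square of the grid
url <- build_search_url("农家乐", 29.91218, 119.607631, 30.275404, 120.241743,
                        ak = "YOUR_AK", page_num = 1)
url
```

Keeping the URL logic in one function also means the keyword encoding and the order of the `bounds` coordinates are written down in exactly one place.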
What does the queried result look like? Let’s take a peek:
## [1] "{"
## [2] " \"status\":0,"
## [3] " \"message\":\"ok\","
## [4] " \"total\":7,"
## [5] " \"results\":["
## [6] " {"
## [7] " \"name\":\"绿野仙踪渔家乐\","
## [8] " \"location\":{"
## [9] " \"lat\":30.435663,"
## [10] " \"lng\":122.29825"
## [11] " },"
## [12] " \"address\":\"浙江省舟山市岱山县衢山打水村康庄路29号\","
## [13] " \"province\":\"浙江省\","
## [14] " \"city\":\"舟山市\","
## [15] " \"area\":\"岱山县\","
## [16] " \"telephone\":\"13764044949\","
## [17] " \"detail\":1,"
## [18] " \"uid\":\"c472e937fd631bcdfda7780a\""
## [19] " },"
## [20] " {"
## [21] " \"name\":\"舟山岱山静思渔家乐\","
## [22] " \"location\":{"
## [23] " \"lat\":30.456659,"
## [24] " \"lng\":122.402184"
## [25] " },"
## [26] " \"address\":\"浙江省舟山市岱山县衢山镇冷峙村冷峙大岙7号\","
## [27] " \"province\":\"浙江省\","
## [28] " \"city\":\"舟山市\","
## [29] " \"area\":\"岱山县\","
## [30] " \"telephone\":\"15355594393\","
## [31] " \"detail\":1,"
## [32] " \"uid\":\"af721a1efcb6e609ebe582e2\""
## [33] " },"
## [34] " {"
## [35] " \"name\":\"岱山县清之新山庄\","
## [36] " \"location\":{"
## [37] " \"lat\":30.316396,"
## [38] " \"lng\":122.195111"
## [39] " },"
## [40] " \"address\":\"舟山市岱山县东沙镇陈家村128号\","
## [41] " \"province\":\"浙江省\","
## [42] " \"city\":\"舟山市\","
## [43] " \"area\":\"岱山县\","
## [44] " \"detail\":1,"
## [45] " \"uid\":\"49b1b8d569f8de534c6d465f\""
## [46] " },"
## [47] " {"
## [48] " \"name\":\"岱山县衢山镇海娃娃渔家乐\","
## [49] " \"location\":{"
## [50] " \"lat\":30.456769,"
## [51] " \"lng\":122.401615"
## [52] " },"
## [53] " \"address\":\"浙江省舟山市岱山县冷峙村王家横44号(近冷峙沙滩、衢山岛)\","
## [54] " \"province\":\"浙江省\","
## [55] " \"city\":\"舟山市\","
## [56] " \"area\":\"岱山县\","
## [57] " \"telephone\":\"18606800714\","
## [58] " \"detail\":1,"
## [59] " \"uid\":\"485c48a12ae6ed05a8175a4d\""
## [60] " },"
## [61] " {"
## [62] " \"name\":\"舟山金色海景渔家乐\","
## [63] " \"location\":{"
## [64] " \"lat\":30.464759,"
## [65] " \"lng\":122.375081"
## [66] " },"
## [67] " \"address\":\"岱山舟山岱山县衢山镇沙龙村97号\","
## [68] " \"province\":\"浙江省\","
## [69] " \"city\":\"舟山市\","
## [70] " \"area\":\"岱山县\","
## [71] " \"telephone\":\"13065616790\","
## [72] " \"detail\":1,"
## [73] " \"uid\":\"6fe43ecfea41e223ae31e58f\""
## [74] " },"
## [75] " {"
## [76] " \"name\":\"舟山休闲农家乐\","
## [77] " \"location\":{"
## [78] " \"lat\":30.44096,"
## [79] " \"lng\":122.409899"
## [80] " },"
## [81] " \"address\":\"岱山舟山岱山县衢山镇乍门村万荣路20号\","
## [82] " \"province\":\"浙江省\","
## [83] " \"city\":\"舟山市\","
## [84] " \"area\":\"岱山县\","
## [85] " \"detail\":1,"
## [86] " \"uid\":\"3b114986c705acf3f02d8320\""
## [87] " },"
## [88] " {"
## [89] " \"name\":\"良田渔家乐\","
## [90] " \"location\":{"
## [91] " \"lat\":30.438717,"
## [92] " \"lng\":122.425952"
## [93] " },"
## [94] " \"address\":\"浙江省舟山市岱山县衢山镇田涂村港城路2号\","
## [95] " \"province\":\"浙江省\","
## [96] " \"city\":\"舟山市\","
## [97] " \"area\":\"岱山县\","
## [98] " \"detail\":1,"
## [99] " \"uid\":\"885835df910a84904c6d466f\""
## [100] " }"
## [101] " ]"
## [102] "}"
Organized, but also messy! The backslashes are only how R prints the quotation marks, but the quotation marks and trailing commas do get in the way. Before we get our hands dirty and start cleaning the data, let us first pick out the information we need. To locate a resort, we need its name, latitude/longitude, address, province, city, and area. The function to process the results:
processpage <- function(page) {
  ## strip whitespace, double quotes and the trailing comma from every line
  ## (commas inside an address are kept)
  page <- trimws(page, which = "both")
  page <- gsub('"', '', page, fixed = TRUE)
  page <- sub(',$', '', page)
  ## a response with fewer than 12 lines contains no complete result
  if (length(page) < 12) {
    return(NULL)
  }
  ## value of the lines at `idx` if they start with `key`, NA otherwise
  ## (some villages' detailed information is missing)
  field <- function(idx, key) {
    line <- page[idx]
    ifelse(!is.na(line) & startsWith(line, paste0(key, ":")),
           substr(line, nchar(key) + 2, 1000000L), NA)
  }
  ## every result starts with a "name" line; the other fields are expected
  ## at fixed offsets after it
  i <- which(startsWith(page, "name:"))
  data.frame(names = field(i, "name"),
             address = field(i + 5, "address"),
             lat = field(i + 2, "lat"),
             long = field(i + 3, "lng"),
             province = field(i + 6, "province"),
             city = field(i + 7, "city"),
             area = field(i + 8, "area"))
}
The function returns a data.frame, with each column being one attribute we are interested in. Now we can proceed and clean all 25 results we have:
processedpages <- list()
for (i in 1:5) {
  for (j in 1:5) {
    ## append each new result at the end of list
    processedpages[[length(processedpages) + 1]] <- processpage(pages[[i]][[j]])
  }
}
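Because each query returns only one page, a square may hold more matches than were actually downloaded. One way to spot such squares is to read the `total` field from the raw responses. The sketch below assumes that `total` reports the number of matches in the square rather than the number returned on the page, which should be verified against the documentation. The helper `get_total` is hypothetical.

```r
## Hypothetical helper: pull the "total" count out of one raw response.
## Assumes the response contains a line of the form  "total":7,
get_total <- function(page) {
  line <- grep('"total"', page, fixed = TRUE, value = TRUE)
  if (length(line) == 0) {
    return(NA_integer_)
  }
  ## keep only the digits on that line
  as.integer(gsub("[^0-9]", "", line[1]))
}

## A shortened response in the same layout as the output shown earlier.
raw_example <- c("{", "    \"status\":0,", "    \"message\":\"ok\",",
                 "    \"total\":7,", "    \"results\":[", "    ]", "}")
get_total(raw_example)

## With the 5x5 `pages` list, squares holding more matches than the
## page_size of 10 used in the query could be flagged like this
## (rows of `totals` correspond to j, columns to i):
## totals <- sapply(pages, function(col) sapply(col, get_total))
## which(totals > 10, arr.ind = TRUE)
```

Flagged squares can then be split further or queried again for additional pages.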
Join the list into one data.frame, and convert longitude and latitude to the correct data types.
result <- plyr::ldply(processedpages, data.frame)
result$long <- as.numeric(as.character(result$long))
result$lat <- as.numeric(as.character(result$lat))
head(result, 10)
## names address lat
## 1 云石度假庄 杭州市萧山区狮山村 29.98284
## 2 天地休闲农庄 浙江省杭州市富阳区富春街道铁坞口村羊坑垅 30.06246
## 3 大清谷金之满农庄 浙江省杭州市西湖区大清谷清谷路18号 30.19250
## 4 杭州伊甸山庄 浙江省杭州市富阳区伊甸山庄(320国道南280米) 30.15215
## 5 碧雪湖生态农庄 浙江省杭州市临安区锦城街道钱王铺村金头155号 30.23363
## 6 达盟山庄 保俶路39号 30.26805
## 7 乐高运动休闲农庄 云和路附近(勤丰村) 30.12030
## 8 大梁山庄 杭州市临安区万马路延伸段西墅花园后 30.25008
## 9 新沙岛茗悦农家乐 富阳区东洲街道新沙村77-2 30.05262
## 10 问溪山庄 九溪路16号 30.19852
## long province city area
## 1 120.1227 浙江省 杭州市 萧山区
## 2 119.8914 浙江省 杭州市 富阳区
## 3 120.0748 浙江省 杭州市 西湖区
## 4 120.0156 浙江省 杭州市 富阳区
## 5 119.6285 浙江省 杭州市 临安区
## 6 120.1582 浙江省 杭州市 西湖区
## 7 119.9219 浙江省 杭州市 富阳区
## 8 119.7004 浙江省 杭州市 临安区
## 9 119.9974 浙江省 杭州市 富阳区
## 10 120.1235 浙江省 杭州市 西湖区
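The fixed line offsets used in `processpage` depend on the exact layout of the response, so a missing field in the middle of a result can shift the lines that follow it. A JSON parser avoids that dependence. The sketch below is an alternative, not the method used in this post: it assumes the jsonlite package is installed, and `parse_page` and the made-up example response are hypothetical.

```r
## Assumption: the jsonlite package is installed (install.packages("jsonlite")).
library(jsonlite)

## Hypothetical alternative to processpage(): parse the JSON instead of
## relying on fixed line offsets.
parse_page <- function(page) {
  parsed <- fromJSON(paste(page, collapse = "\n"), flatten = TRUE)
  res <- parsed$results
  if (!is.data.frame(res) || nrow(res) == 0) {
    return(NULL)
  }
  ## a field absent from every result is a missing column; fill it with NA
  pick <- function(col) if (col %in% names(res)) res[[col]] else NA
  data.frame(names = pick("name"), address = pick("address"),
             lat = pick("location.lat"), long = pick("location.lng"),
             province = pick("province"), city = pick("city"),
             area = pick("area"), stringsAsFactors = FALSE)
}

## Two made-up results; the second one has no address.
json_example <- c('{"status":0,"message":"ok","total":2,"results":[',
                  '{"name":"A","location":{"lat":30.1,"lng":120.2},',
                  ' "address":"Road 1","province":"P","city":"C","area":"D"},',
                  '{"name":"B","location":{"lat":30.3,"lng":120.4},',
                  ' "province":"P","city":"C","area":"D"}]}')
parsed_example <- parse_page(json_example)
parsed_example
```

With this approach the coordinates arrive as numbers, so the `as.numeric` conversion is not needed, and a result without an address gets `NA` in that column while its other fields stay in place.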
Finally, one more thing we can do is to visualize the points we obtained.