Web Scraping

Dr. Mine Dogucu

Review

Quiz
GitHub collaboration

Note that this week's repo is blank because now you are in full charge of creating files.

Goals

  • HTML & CSS
  • Web scraping
  • Considerations when working with web data

Hypertext Markup Language

An ugly web page

HTML document outline

Paragraphs

<a href="https://www.r-project.org/">R</a>


<a> </a> HTML tag


href attribute (name)


https://www.r-project.org/ attribute (value)


R content
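
The same four parts can be pulled out of the link programmatically. This is a minimal sketch that assumes the rvest package is installed; the variable names are illustrative only.

```r
library(rvest)

# Parse the link from the slide as a tiny HTML document.
doc <- read_html('<a href="https://www.r-project.org/">R</a>')
link <- html_node(doc, "a")

tag_name   <- html_name(link)          # name of the HTML tag
href_value <- html_attr(link, "href")  # value of the href attribute
content    <- html_text(link)          # text between <a> and </a>
```

Here `tag_name` holds the tag, `href_value` the attribute value, and `content` the visible text of the link.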

Spans

Cascading Style Sheets

Styling

Web scraping

read_html() - reads an HTML page.
html_nodes() - extracts the HTML nodes that match a selector.
html_text() - extracts the text of a node.
html_attr() - extracts the value of a node's attribute.
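
A small offline sketch of how the four functions fit together, assuming rvest is installed. The HTML string below is made up for illustration; it only imitates the class names used later in these slides and is not the markup of any real site.

```r
library(rvest)

# Made-up HTML (an assumption for this sketch, not a real page).
toy_html <- '
<html><body>
  <table>
    <tr><td class="titleColumn"><a href="/title/a/">First Movie</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/b/">Second Movie</a></td></tr>
  </table>
</body></html>'

toy_page  <- read_html(toy_html)                   # read the page
links     <- html_nodes(toy_page, ".titleColumn a")  # select matching nodes
link_text <- html_text(links)                      # text of each node
link_urls <- html_attr(links, "href")              # href attribute of each node
```

`link_text` and `link_urls` are character vectors with one element per matched link.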

Load packages

library(rvest)
library(tidyverse)

Check whether a bot has permission to access the page

robotstxt::paths_allowed("http://www.imdb.com")
www.imdb.com
[1] TRUE
robotstxt::paths_allowed("http://www.facebook.com")
www.facebook.com
[1] FALSE

Read the entire page

page <- read_html("http://www.imdb.com/chart/top")
page
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...

Scrape titles

page %>%
  html_nodes(".titleColumn a")
{xml_nodeset (250)}
[1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[17] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[18] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
[20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
...

page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
[1] "The Shawshank Redemption"
[2] "The Godfather"
[3] "The Godfather: Part II"
[4] "The Dark Knight"
[5] "12 Angry Men"
[6] "Schindler's List"
[7] "The Lord of the Rings: The Return of the King"
[8] "Pulp Fiction"
[9] "The Good, the Bad and the Ugly"
[10] "The Lord of the Rings: The Fellowship of the Ring"
[11] "Fight Club"
[12] "Forrest Gump"
[13] "Inception"
[14] "The Lord of the Rings: The Two Towers"
[15] "Star Wars: Episode V - The Empire Strikes Back"
[16] "The Matrix"
[17] "Goodfellas"
[18] "One Flew Over the Cuckoo's Nest"
[19] "Seven Samurai"
[20] "Se7en"
[21] "The Silence of the Lambs"
[22] "City of God"
[23] "Life Is Beautiful"
[24] "It's a Wonderful Life"
[25] "Star Wars: Episode IV - A New Hope"
[26] "Saving Private Ryan"
[27] "Interstellar"
[28] "Spirited Away"
[29] "The Green Mile"
[30] "Parasite"
[31] "Léon: The Professional"
[32] "Hara-Kiri"
[33] "The Pianist"
[34] "The Usual Suspects"
[35] "Terminator 2: Judgment Day"
[36] "Back to the Future"
[37] "Psycho"
[38] "The Lion King"
[39] "Modern Times"
[40] "American History X"
[41] "Grave of the Fireflies"
[42] "City Lights"
[43] "Whiplash"
[44] "Gladiator"
[45] "The Departed"
[46] "The Intouchables"
[47] "The Prestige"
[48] "Casablanca"
[49] "Once Upon a Time in the West"
[50] "Rear Window"
[51] "Cinema Paradiso"
[52] "Alien"
[53] "Apocalypse Now"
[54] "Memento"
[55] "Raiders of the Lost Ark"
[56] "The Great Dictator"
[57] "The Lives of Others"
[58] "Django Unchained"
[59] "Paths of Glory"
[60] "Sunset Blvd."
[61] "WALL·E"
[62] "Avengers: Infinity War"
[63] "Witness for the Prosecution"
[64] "The Shining"
[65] "Spider-Man: Into the Spider-Verse"
[66] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
[67] "Joker"
[68] "Princess Mononoke"
[69] "Oldboy"
[70] "Your Name."
[71] "The Dark Knight Rises"
[72] "Coco"
[73] "Aliens"
[74] "Once Upon a Time in America"
[75] "Capharnaüm"
[76] "Avengers: Endgame"
[77] "Das Boot"
[78] "High and Low"
[79] "Hamilton"
[80] "American Beauty"
[81] "Toy Story"
[82] "3 Idiots"
[83] "Amadeus"
[84] "Braveheart"
[85] "Inglourious Basterds"
[86] "Good Will Hunting"
[87] "Star Wars: Episode VI - Return of the Jedi"
[88] "2001: A Space Odyssey"
[89] "Reservoir Dogs"
[90] "M"
[91] "Vertigo"
[92] "Like Stars on Earth"
[93] "Citizen Kane"
[94] "Come and See"
[95] "The Hunt"
[96] "Requiem for a Dream"
[97] "Singin' in the Rain"
[98] "North by Northwest"
[99] "Eternal Sunshine of the Spotless Mind"
[100] "Bicycle Thieves"
[101] "Ikiru"
[102] "Lawrence of Arabia"
[103] "The Kid"
[104] "Pather Panchali"
[105] "Full Metal Jacket"
[106] "Dangal"
[107] "The Father"
[108] "The Apartment"
[109] "Incendies"
[110] "A Clockwork Orange"
[111] "Taxi Driver"
[112] "Metropolis"
[113] "Double Indemnity"
[114] "The Sting"
[115] "A Separation"
[116] "Scarface"
[117] "1917"
[118] "Snatch"
[119] "Amélie"
[120] "To Kill a Mockingbird"
[121] "Toy Story 3"
[122] "For a Few Dollars More"
[123] "Up"
[124] "Indiana Jones and the Last Crusade"
[125] "Heat"
[126] "L.A. Confidential"
[127] "Dune"
[128] "Yojimbo"
[129] "Ran"
[130] "Die Hard"
[131] "Rashomon"
[132] "Green Book"
[133] "Downfall"
[134] "Monty Python and the Holy Grail"
[135] "All About Eve"
[136] "Batman Begins"
[137] "Some Like It Hot"
[138] "Unforgiven"
[139] "Children of Heaven"
[140] "Howl's Moving Castle"
[141] "The Wolf of Wall Street"
[142] "Judgment at Nuremberg"
[143] "The Great Escape"
[144] "Casino"
[145] "The Treasure of the Sierra Madre"
[146] "There Will Be Blood"
[147] "Pan's Labyrinth"
[148] "A Beautiful Mind"
[149] "The Secret in Their Eyes"
[150] "Raging Bull"
[151] "My Neighbor Totoro"
[152] "Chinatown"
[153] "Lock, Stock and Two Smoking Barrels"
[154] "Shutter Island"
[155] "The Gold Rush"
[156] "No Country for Old Men"
[157] "Dial M for Murder"
[158] "The Seventh Seal"
[159] "Three Billboards Outside Ebbing, Missouri"
[160] "The Thing"
[161] "The Elephant Man"
[162] "The Sixth Sense"
[163] "Klaus"
[164] "Wild Strawberries"
[165] "The Third Man"
[166] "The Truman Show"
[167] "Jurassic Park"
[168] "V for Vendetta"
[169] "Memories of Murder"
[170] "Inside Out"
[171] "Blade Runner"
[172] "Trainspotting"
[173] "The Bridge on the River Kwai"
[174] "Fargo"
[175] "Warrior"
[176] "Finding Nemo"
[177] "Kill Bill: Vol. 1"
[178] "Gone with the Wind"
[179] "Tokyo Story"
[180] "On the Waterfront"
[181] "My Father and My Son"
[182] "Z"
[183] "Stalker"
[184] "Wild Tales"
[185] "The Deer Hunter"
[186] "Sherlock Jr."
[187] "Gran Torino"
[188] "The General"
[189] "Persona"
[190] "The Grand Budapest Hotel"
[191] "Prisoners"
[192] "Before Sunrise"
[193] "Mary and Max"
[194] "Mr. Smith Goes to Washington"
[195] "Catch Me If You Can"
[196] "Room"
[197] "In the Name of the Father"
[198] "Barry Lyndon"
[199] "Gone Girl"
[200] "Hacksaw Ridge"
[201] "Andhadhun"
[202] "The Passion of Joan of Arc"
[203] "To Be or Not to Be"
[204] "Ford v Ferrari"
[205] "12 Years a Slave"
[206] "The Big Lebowski"
[207] "How to Train Your Dragon"
[208] "Autumn Sonata"
[209] "Mad Max: Fury Road"
[210] "Dead Poets Society"
[211] "Ben-Hur"
[212] "Million Dollar Baby"
[213] "Harry Potter and the Deathly Hallows: Part 2"
[214] "The Wages of Fear"
[215] "Stand by Me"
[216] "The Handmaiden"
[217] "Network"
[218] "Logan"
[219] "Cool Hand Luke"
[220] "Hachi: A Dog's Tale"
[221] "The 400 Blows"
[222] "Gangs of Wasseypur"
[223] "La Haine"
[224] "Platoon"
[225] "Spotlight"
[226] "A Silent Voice: The Movie"
[227] "Rebecca"
[228] "Monsters, Inc."
[229] "Monty Python's Life of Brian"
[230] "The Bandit"
[231] "Hotel Rwanda"
[232] "In the Mood for Love"
[233] "Rush"
[234] "Into the Wild"
[235] "Love's a Bitch"
[236] "Rocky"
[237] "Nausicaä of the Valley of the Wind"
[238] "Andrei Rublev"
[239] "It Happened One Night"
[240] "Fanny and Alexander"
[241] "Before Sunset"
[242] "Neon Genesis Evangelion: The End of Evangelion"
[243] "The Battle of Algiers"
[244] "Demon Slayer: Mugen Train"
[245] "The Princess Bride"
[246] "Paris, Texas"
[247] "Nights of Cabiria"
[248] "Three Colors: Red"
[249] "Tangerines"
[250] "Miracle in Cell No. 7"

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

str(titles)
chr [1:250] "The Shawshank Redemption" "The Godfather" ...

Scrape years

page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
[1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
[9] "(1966)" "(2001)" "(1999)" "(1994)" "(2010)" "(2002)" "(1980)" "(1999)"
[17] "(1990)" "(1975)" "(1954)" "(1995)" "(1991)" "(2002)" "(1997)" "(1946)"
[25] "(1977)" "(1998)" "(2014)" "(2001)" "(1999)" "(2019)" "(1994)" "(1962)"
[33] "(2002)" "(1995)" "(1991)" "(1985)" "(1960)" "(1994)" "(1936)" "(1998)"
[41] "(1988)" "(1931)" "(2014)" "(2000)" "(2006)" "(2011)" "(2006)" "(1942)"
[49] "(1968)" "(1954)" "(1988)" "(1979)" "(1979)" "(2000)" "(1981)" "(1940)"
[57] "(2006)" "(2012)" "(1957)" "(1950)" "(2008)" "(2018)" "(1957)" "(1980)"
[65] "(2018)" "(1964)" "(2019)" "(1997)" "(2003)" "(2016)" "(2012)" "(2017)"
[73] "(1986)" "(1984)" "(2018)" "(2019)" "(1981)" "(1963)" "(2020)" "(1999)"
[81] "(1995)" "(2009)" "(1984)" "(1995)" "(2009)" "(1997)" "(1983)" "(1968)"
[89] "(1992)" "(1931)" "(1958)" "(2007)" "(1941)" "(1985)" "(2012)" "(2000)"
[97] "(1952)" "(1959)" "(2004)" "(1948)" "(1952)" "(1962)" "(1921)" "(1955)"
[105] "(1987)" "(2016)" "(2020)" "(1960)" "(2010)" "(1971)" "(1976)" "(1927)"
[113] "(1944)" "(1973)" "(2011)" "(1983)" "(2019)" "(2000)" "(2001)" "(1962)"
[121] "(2010)" "(1965)" "(2009)" "(1989)" "(1995)" "(1997)" "(2021)" "(1961)"
[129] "(1985)" "(1988)" "(1950)" "(2018)" "(2004)" "(1975)" "(1950)" "(2005)"
[137] "(1959)" "(1992)" "(1997)" "(2004)" "(2013)" "(1961)" "(1963)" "(1995)"
[145] "(1948)" "(2007)" "(2006)" "(2001)" "(2009)" "(1980)" "(1988)" "(1974)"
[153] "(1998)" "(2010)" "(1925)" "(2007)" "(1954)" "(1957)" "(2017)" "(1982)"
[161] "(1980)" "(1999)" "(2019)" "(1957)" "(1949)" "(1998)" "(1993)" "(2005)"
[169] "(2003)" "(2015)" "(1982)" "(1996)" "(1957)" "(1996)" "(2011)" "(2003)"
[177] "(2003)" "(1939)" "(1953)" "(1954)" "(2005)" "(1969)" "(1979)" "(2014)"
[185] "(1978)" "(1924)" "(2008)" "(1926)" "(1966)" "(2014)" "(2013)" "(1995)"
[193] "(2009)" "(1939)" "(2002)" "(2015)" "(1993)" "(1975)" "(2014)" "(2016)"
[201] "(2018)" "(1928)" "(1942)" "(2019)" "(2013)" "(1998)" "(2010)" "(1978)"
[209] "(2015)" "(1989)" "(1959)" "(2004)" "(2011)" "(1953)" "(1986)" "(2016)"
[217] "(1976)" "(2017)" "(1967)" "(2009)" "(1959)" "(2012)" "(1995)" "(1986)"
[225] "(2015)" "(2016)" "(1940)" "(2001)" "(1979)" "(1996)" "(2004)" "(2000)"
[233] "(2013)" "(2007)" "(2000)" "(1976)" "(1984)" "(1966)" "(1934)" "(1982)"
[241] "(2004)" "(1997)" "(1966)" "(2020)" "(1987)" "(1984)" "(1957)" "(1994)"
[249] "(2013)" "(2019)"

page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
[1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 2002 1980
[16] 1999 1990 1975 1954 1995 1991 2002 1997 1946 1977 1998 2014 2001 1999 2019
[31] 1994 1962 2002 1995 1991 1985 1960 1994 1936 1998 1988 1931 2014 2000 2006
[46] 2011 2006 1942 1968 1954 1988 1979 1979 2000 1981 1940 2006 2012 1957 1950
[61] 2008 2018 1957 1980 2018 1964 2019 1997 2003 2016 2012 2017 1986 1984 2018
[76] 2019 1981 1963 2020 1999 1995 2009 1984 1995 2009 1997 1983 1968 1992 1931
[91] 1958 2007 1941 1985 2012 2000 1952 1959 2004 1948 1952 1962 1921 1955 1987
[106] 2016 2020 1960 2010 1971 1976 1927 1944 1973 2011 1983 2019 2000 2001 1962
[121] 2010 1965 2009 1989 1995 1997 2021 1961 1985 1988 1950 2018 2004 1975 1950
[136] 2005 1959 1992 1997 2004 2013 1961 1963 1995 1948 2007 2006 2001 2009 1980
[151] 1988 1974 1998 2010 1925 2007 1954 1957 2017 1982 1980 1999 2019 1957 1949
[166] 1998 1993 2005 2003 2015 1982 1996 1957 1996 2011 2003 2003 1939 1953 1954
[181] 2005 1969 1979 2014 1978 1924 2008 1926 1966 2014 2013 1995 2009 1939 2002
[196] 2015 1993 1975 2014 2016 2018 1928 1942 2019 2013 1998 2010 1978 2015 1989
[211] 1959 2004 2011 1953 1986 2016 1976 2017 1967 2009 1959 2012 1995 1986 2015
[226] 2016 1940 2001 1979 1996 2004 2000 2013 2007 2000 1976 1984 1966 1934 1982
[241] 2004 1997 1966 2020 1987 1984 1957 1994 2013 2019

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
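
As an alternative sketch, both parentheses can be stripped in one step with a character class. This assumes the stringr package (loaded as part of the tidyverse); the input vector is a hand-typed sample, not scraped data.

```r
library(stringr)

# Hand-typed sample in the same "(YYYY)" shape as the scraped text.
raw_years <- c("(1994)", "(1972)", "(1974)")

# "[()]" matches either parenthesis; str_remove_all() drops every match.
years_clean <- as.numeric(str_remove_all(raw_years, "[()]"))
```

The two chained str_remove() calls and the single str_remove_all() call give the same numbers for input of this shape.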

Scrape ratings

ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()

imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)

imdb_top_250 %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating)) %>%
  arrange(desc(avg_rating))
# A tibble: 86 × 2
   year avg_rating
  <dbl>      <dbl>
1  1972       9.1
2  1994       8.62
3  1946       8.6
4  1977       8.6
5  1990       8.6
6  1974       8.55
# … with 80 more rows

imdb_top_250 %>%
  filter(year == 1972)
# A tibble: 1 × 3
  title          year rating
  <chr>         <dbl>  <dbl>
1 The Godfather  1972    9.1
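
The top average for 1972 comes from a single movie, so it can help to report how many movies sit behind each average. A minimal sketch on made-up toy data, assuming dplyr is installed; `toy_movies` and `n_movies` are illustrative names.

```r
library(dplyr)

# Made-up toy data with the same columns as imdb_top_250.
toy_movies <- tibble(
  title  = c("A", "B", "C"),
  year   = c(1972, 1994, 1994),
  rating = c(9.1, 9.2, 8.8)
)

yearly <- toy_movies %>%
  group_by(year) %>%
  summarize(
    avg_rating = mean(rating),
    n_movies   = n()   # number of movies behind each average
  ) %>%
  arrange(desc(avg_rating))
```

Sorting by `avg_rating` alone favors years with very few movies; the extra column makes that visible.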

API Wrappers

Two ways of web scraping

Screen scraping: What we have done so far, extracting data from the source code of a website.

Web APIs (application programming interfaces): The website offers a set of HTTP requests that you can use to access data.

However, there are wrapper packages for some APIs. In short, if a wrapper package exists, you can connect to the API without having to know too much about working with APIs.

Example

library(spotifyr)

https://github.com/charlie86/spotifyr

Get Access to the Spotify API

https://developer.spotify.com/dashboard/login

You will get a Client ID and Client Secret

Authentication

Do not use this specific method on a public computer.

usethis::edit_r_environ()

In your .Renviron file, write the following and save the file. Replace each XXXXXXXXX placeholder with the value from your own Spotify developer account.

SPOTIFY_CLIENT_ID = XXXXXXXXXXXXXXXXXXXXXXXXXXX
SPOTIFY_CLIENT_SECRET = XXXXXXXXXXXXXXXXXXXXXXXX
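
R reads .Renviron at startup, so the new values are only visible after restarting the R session. The sketch below checks whether both variables are set; `has_spotify_credentials()` is a hypothetical helper written for this example and is not part of spotifyr.

```r
# Hypothetical helper (not part of spotifyr): checks that both variables
# named in .Renviron are visible to the current R session.
has_spotify_credentials <- function() {
  id     <- Sys.getenv("SPOTIFY_CLIENT_ID")
  secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
  nzchar(id) && nzchar(secret)   # TRUE only if both are non-empty
}

if (!has_spotify_credentials()) {
  message("Spotify credentials not found; edit .Renviron and restart R.")
}
```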

https://open.spotify.com/artist/6vWDO969PvNqNYHIOW5v0m

get_artist("6vWDO969PvNqNYHIOW5v0m")
$external_urls
$external_urls$spotify
[1] "https://open.spotify.com/artist/6vWDO969PvNqNYHIOW5v0m"
$followers
$followers$href
NULL
$followers$total
[1] 29199536
$genres
[1] "dance pop" "pop" "r&b"
$href
[1] "https://api.spotify.com/v1/artists/6vWDO969PvNqNYHIOW5v0m"
$id
[1] "6vWDO969PvNqNYHIOW5v0m"
$images
  height                                                              url width
1    640 https://i.scdn.co/image/ab6761610000e5ebd3d058be8485c8583703b6d2   640
2    320 https://i.scdn.co/image/ab67616100005174d3d058be8485c8583703b6d2   320
3    160 https://i.scdn.co/image/ab6761610000f178d3d058be8485c8583703b6d2   160
$name
[1] "Beyoncé"
$popularity
[1] 86
$type
[1] "artist"
$uri
[1] "spotify:artist:6vWDO969PvNqNYHIOW5v0m"

get_artist("6vWDO969PvNqNYHIOW5v0m") %>%
  str()
List of 10
 $ external_urls:List of 1
  ..$ spotify: chr "https://open.spotify.com/artist/6vWDO969PvNqNYHIOW5v0m"
 $ followers    :List of 2
  ..$ href : NULL
  ..$ total: int 29199536
 $ genres       : chr [1:3] "dance pop" "pop" "r&b"
 $ href         : chr "https://api.spotify.com/v1/artists/6vWDO969PvNqNYHIOW5v0m"
 $ id           : chr "6vWDO969PvNqNYHIOW5v0m"
 $ images       :'data.frame': 3 obs. of 3 variables:
  ..$ height: int [1:3] 640 320 160
  ..$ url   : chr [1:3] "https://i.scdn.co/image/ab6761610000e5ebd3d058be8485c8583703b6d2" "https://i.scdn.co/image/ab67616100005174d3d058be8485c8583703b6d2" "https://i.scdn.co/image/ab6761610000f178d3d058be8485c8583703b6d2"
  ..$ width : int [1:3] 640 320 160
 $ name         : chr "Beyoncé"
 $ popularity   : int 86
 $ type         : chr "artist"
 $ uri          : chr "spotify:artist:6vWDO969PvNqNYHIOW5v0m"
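
Because the result is a nested list, individual fields can be reached with `$` and collected into one row. The sketch below uses a hand-built stand-in list with the same shape as part of the output above, so it runs without a Spotify account; `artist` and `artist_row` are illustrative names.

```r
# Hand-built stand-in that mimics part of the get_artist() result above.
artist <- list(
  name       = "Beyoncé",
  followers  = list(href = NULL, total = 29199536L),
  genres     = c("dance pop", "pop", "r&b"),
  popularity = 86L
)

# Flatten the fields of interest into a one-row data frame.
artist_row <- data.frame(
  name       = artist$name,
  followers  = artist$followers$total,             # nested element
  popularity = artist$popularity,
  genres     = paste(artist$genres, collapse = ", ")  # vector to one string
)
```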

Considerations about web data

Do you need all that data at that speed?

Sampling rather than scraping all of the data may be an option.

You may end up with HTTP error 429 (Too Many Requests). In that case, you may want to reduce the number of requests you send in a given time interval.

scrape_movie <- function(movie_url) {
  Sys.sleep(runif(1))
  #### Remaining code of the function
}

Before scraping each movie's page, this makes the system sleep for a random amount of time between 0 and 1 second.
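
The same idea can be applied to a whole vector of URLs. This is a rough sketch, not a definitive implementation: `scrape_all()`, its `max_delay` argument, and `fake_scrape()` are hypothetical names, and the stand-in scraper only counts characters so the example runs offline.

```r
# Hypothetical wrapper: calls scrape_one() on each URL with a random pause
# of up to max_delay seconds before every request.
scrape_all <- function(urls, scrape_one, max_delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(runif(1, min = 0, max = max_delay))
    results[[i]] <- scrape_one(urls[[i]])
  }
  results
}

# Stand-in for a real scraping function so the sketch runs offline.
fake_scrape <- function(url) nchar(url)

demo <- scrape_all(
  c("https://example.com/a", "https://example.com/bb"),
  fake_scrape,
  max_delay = 0.1
)
```

Raising `max_delay` spreads the requests further apart at the cost of a longer total run time.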

Write your data (if possible)

  • Data online are not static.

  • Web pages change structures.

  • The only way to reproduce the same results may be from the .csv files that you write.
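
A minimal sketch of saving a snapshot and reading it back, using base R only. The data frame and the file name are made up for illustration; in practice the scraped tibble would be written instead.

```r
# Made-up toy data standing in for scraped results.
toy_movies <- data.frame(
  title  = c("A", "B"),
  year   = c(1994, 1972),
  rating = c(9.2, 9.1)
)

# Hypothetical file name; tempdir() keeps the example from touching your project.
csv_path <- file.path(tempdir(), "imdb_top_250_snapshot.csv")

write.csv(toy_movies, csv_path, row.names = FALSE)  # save the snapshot
reloaded <- read.csv(csv_path)                      # later analyses start here
```

Putting the scrape date in the file name is one way to keep several snapshots apart.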

Optional

Use beepr::beep() so that you are notified when your code finishes running.
