A simple function to download a PDF robustly.

download_pdf(url, file, quiet = FALSE, overwrite = FALSE, pause = TRUE)

Arguments

url
The URL of the PDF to download
file
File to which the PDF will be downloaded
quiet
Suppress a message about which URL is being processed [default=FALSE]
overwrite
Overwrite an existing file of the same name [default=FALSE]
pause
Whether to pause for a random 0.5-3 seconds between requests during scraping [default=TRUE]

Value

A data.frame with columns url, destination, success, and pdfCheck

Details

Scraping PDFs from the web can run into small hitches that make writing a scraper annoying. This function simplifies PDF scraping by wrapping the download in a dedicated function, with support functions to, e.g., test that a downloaded file is actually a PDF. It ensures URLs are properly encoded and handles missing URLs gracefully. The filename is the basename of the URL with " " replaced with "_". The pause parameter limits the rate at which requests hit the hosting servers.
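The behaviour described above can be sketched roughly as follows. This is a hypothetical illustration of the logic, not the package's actual implementation; the helper is_pdf() and the internal structure are assumptions.

```r
# Sketch of the core logic (assumption: not the package's real code).
# Uses only base R plus utils::download.file().
is_pdf <- function(path) {
  # A valid PDF file begins with the magic bytes "%PDF".
  if (!file.exists(path)) return(FALSE)
  identical(readBin(path, "raw", n = 4L), charToRaw("%PDF"))
}

download_pdf_sketch <- function(url, file = NULL, quiet = FALSE,
                                overwrite = FALSE, pause = TRUE) {
  # Handle missing URLs gracefully instead of erroring out mid-scrape.
  if (length(url) != 1L || is.na(url) || !nzchar(url)) {
    return(data.frame(url = NA_character_, destination = NA_character_,
                      success = FALSE, pdfCheck = FALSE))
  }
  url <- utils::URLencode(url)                  # ensure URL encoding
  if (is.null(file)) {
    file <- gsub(" ", "_", basename(url))       # basename, " " -> "_"
  }
  if (!quiet) message("Downloading: ", url)
  if (file.exists(file) && !overwrite) {
    # Existing file of the same name: skip the download.
    return(data.frame(url = url, destination = file,
                      success = TRUE, pdfCheck = is_pdf(file)))
  }
  if (pause) Sys.sleep(runif(1, 0.5, 3))        # rate-limit requests
  ok <- tryCatch({
    utils::download.file(url, file, mode = "wb", quiet = quiet)
    TRUE
  }, error = function(e) FALSE, warning = function(w) FALSE)
  data.frame(url = url, destination = file,
             success = ok, pdfCheck = ok && is_pdf(file))
}
```

The PDF check reads only the first four bytes, so it is cheap even for large downloads; it catches the common failure mode where a server returns an HTML error page with a 200 status.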

TODO: Have the overwrite check compare the MD5 hashes of files in the download directory rather than relying on file names.
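One way the TODO above might be implemented is with tools::md5sum(). The helper name and the assumption that the candidate file has already been downloaded to a temporary path are hypothetical:

```r
# Hypothetical MD5-based duplicate check (assumption: not package code).
# Returns TRUE if a byte-identical file already exists in download_dir,
# regardless of its file name.
already_downloaded <- function(tmp_file, download_dir) {
  existing <- list.files(download_dir, full.names = TRUE)
  if (length(existing) == 0L) return(FALSE)
  unname(tools::md5sum(tmp_file)) %in% unname(tools::md5sum(existing))
}
```

Hashing content rather than comparing names would catch the case where the same PDF is served under two different URLs.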

Examples

## Not run: ------------------------------------
# result <- download_pdf(url = "https://goo.gl/I3P3A3",
#                        file = "~/Downloads/test.pdf")
## ---------------------------------------------