The Pew Research Center, in its own words, “is a
nonpartisan fact tank that informs the public about the issues,
attitudes[,] and trends shaping America and the world.” As a result, Pew
is an invaluable source of survey data on a wide range of topics; the
surveys it conducts are available on its website, generally after an
embargo of about six months. Researchers taking advantage of these
datasets, however, are caught in a bind. The terms and conditions for
downloading any Pew dataset state that one “may not reproduce, sell,
rent, lease, loan, distribute or sublicense or otherwise transfer any
Data, in whole or in part, to any other party.”[1] But to ensure that
one’s work can be reproduced, assessed, and built upon by others, one
must provide access to the raw data one employed. The pewdata package
cuts this knot by providing programmatic, reproducible access to Pew’s
datasets from within R.
When used interactively, the pew_download function will ask for the
login information required by Pew: the registered user’s email and
password. After that information is entered once, it will be supplied
automatically for any other download requests made in the same session.
To change this login information within a session, one may set the
argument reset to TRUE when running pew_download again, and the function
will again request the required information.
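In an interactive session, resetting the stored login information might look like the following sketch. The area and file_id values are those of the Gender and Leadership survey described later in this vignette; the guard on interactive() and on the package being installed is an addition for illustration, not something pewdata itself requires.

```r
# Only attempt the download where a prompt can actually be answered
# and the pewdata package is installed.
can_prompt <- interactive() && requireNamespace("pewdata", quietly = TRUE)

if (can_prompt) {
  # reset = TRUE asks for the email and password again instead of
  # reusing the values entered earlier in the session
  pewdata::pew_download(area = "socialtrends",
                        file_id = "gender-leadership",
                        reset = TRUE)
}
```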
An optional, but highly recommended, setup step is to add the login information Pew requires to your .Rprofile, as in the following example:
options("pew_email" = "[email protected]",
"pew_password" = "password123!")
The pew_download function will then have access to the information it
needs to pass on to Pew by default. This means that researchers will not
have to expose their credentials in their R scripts and that others
reproducing their results later will be able to execute those R scripts
without modification. (They will, however, need to enter their own
information into their own .Rprofiles, a detail that should be noted in
the reproducibility materials to avoid confusion.)
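Because a missing option only shows up when a download is attempted, a reproduction script may want to check for the two options up front. The following is a minimal sketch in base R; the helper missing_pew_options is hypothetical and not part of pewdata, and only the option names pew_email and pew_password come from the example above.

```r
# Hypothetical helper (not part of pewdata): return the names of the
# options described above that are not set in the current session.
missing_pew_options <- function() {
  needed <- c("pew_email", "pew_password")
  is_unset <- vapply(needed, function(x) is.null(getOption(x)), logical(1))
  needed[is_unset]
}

unset <- missing_pew_options()
if (length(unset) > 0) {
  message("Add these options to your .Rprofile before downloading: ",
          paste(unset, collapse = ", "))
}
```

Reporting the gap with message rather than stopping leaves the interactive prompt available as a fallback.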
The pew_download function (1) opens a Chrome browser and navigates to
the website of the specified Pew Research Center research area, (2)
enters the required information to log in, (3) navigates to a specified
dataset, (4) downloads the dataset’s files, and, optionally but by
default, (5) decompresses the dataset’s files to a subdirectory within a
specified directory (or, by default, a pew_data directory in the current
working directory).
Datasets are specified using the file_id and area arguments. The area
argument indicates which of the seven Pew research areas produced the
dataset:

| area | Research area |
| --- | --- |
| "politics" | U.S. Politics & Policy (the default) |
| "journalism" | Journalism & Media |
| "internet" | Internet, Science & Tech |
| "religion" | Religion & Public Life |
| "hispanic" | Hispanic Trends |
| "global" | Global Attitudes & Trends |
| "socialtrends" | Social & Demographic Trends |
The file_id argument specifies particular datasets within the area. Pew
identifies datasets by strings in all of its research areas, but the
only way to get the id for a dataset is to copy it from the end of the
dataset’s URL. For the November 2014 Gender and Leadership survey
conducted by Pew Social & Demographic Trends, for example, the file id
is “gender-leadership”.
To reproducibly download this dataset:
pew_download(area = "socialtrends", file_id = "gender-leadership")
For the Pew Forum on Religion & Public Life’s dataset Religion in Latin America, the file id is “religion-in-latin-america”.
To download this dataset reproducibly:
pew_download(area = "religion", file_id = "religion-in-latin-america")
Multiple datasets may be downloaded from the same research area in a
single command by passing a vector of ids to file_id. The following
downloads the September 2018 Political Survey; the July 11-15, 2018
Weekly Survey; and the June 2018 Political Survey from the U.S. Politics
& Policy area (which is the default for area):
pew_download(file_id = c("september-2018-political-survey",
                         "july-11-15-2018-weekly-survey",
                         "june-2018-political-survey"))
After the needed datasets are downloaded, they are, by default, unzipped, and the data file is converted to .RData format and renamed with the file id.[2] The original files are left unmodified.
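Once a converted file exists, it can be read back with base R’s load function. The sketch below makes several assumptions: the helper load_rdata is hypothetical, the path layout (a subdirectory and an .RData file both named for the file id) is a guess at what the download produces, and a small stand-in file is written to a temporary directory so that the example runs without a download.

```r
# Hypothetical helper: load a single-object .RData file and return the
# object, rather than dropping it into the global environment.
load_rdata <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the restored names
  if (length(obj_names) != 1) {
    stop("expected exactly one object in ", path)
  }
  get(obj_names, envir = env)
}

# Stand-in for a converted dataset; the directory layout is an assumption
demo_dir <- file.path(tempdir(), "pew_data", "gender-leadership")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)
demo_path <- file.path(demo_dir, "gender-leadership.RData")
gender_leadership <- data.frame(respondent = 1:3,
                                q1 = c("Yes", "No", "Yes"))
save(gender_leadership, file = demo_path)

survey <- load_rdata(demo_path)
str(survey)
```

Loading into a fresh environment lets the analysis script choose the object’s name instead of depending on whatever name was stored in the file.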
[1] The terms do include an exception that one “may incorporate limited portions of the Data in scholarly, research or academic publications or for the purposes of news reporting” with attribution, but this would appear inadequate for the purposes of reproducibility.
[2] In the event that the download contains more than one data file, all are converted to .RData format, but only the largest of the original files is renamed.