{"id":13330,"date":"2020-12-10T14:15:46","date_gmt":"2020-12-10T13:15:46","guid":{"rendered":"https:\/\/surveyinsights.org\/?p=13330"},"modified":"2023-07-11T17:17:14","modified_gmt":"2023-07-11T16:17:14","slug":"collecting-and-using-always-on-location-data-in-surveys","status":"publish","type":"post","link":"https:\/\/surveyinsights.org\/?p=13330","title":{"rendered":"Collecting and using always-on location data in surveys"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>The traditional model of survey research\u2014a lengthy survey instrument collecting all measures of interest, a high response rate, a random sample from all population members\u2014is in crisis (National Academies of Sciences, 2017, 2018). Research subjects are increasingly intolerant of long questionnaires (Mavletova &amp; Couper, 2015; Tourangeau, Kreuter, &amp; Eckman, 2015), response rates are falling (Brick &amp; Williams, 2013; de Heer &amp; de Leeuw, 2002; The Economist Group, 2018), and incentives are not always effective at offsetting these trends (Mercer, Caporaso, Cantor, &amp; Townsend, 2015). In this environment, how can researchers collect the data necessary to understand society?<\/p>\n<p>Passive data collection may offer one way forward. By passively collected data, we mean data gathered without the direct involvement of research subjects. For example, rather than being asked numerous survey questions about their exercise and sleep, subjects could wear sport watches to track steps, heart rate, sleep, etc. Such data would reduce the recall and time burden placed on subjects and might also provide more accurate data. Surveys could then focus on asking about attitudes, characteristics, and behaviours not available via passive data collection.<\/p>\n<p>This paper explores one specific source of passively collected data: Global Positioning System (GPS) data collected from mobile devices. Data on where a device has been may contain useful information about subjects. 
From these data, we might be able to infer, with varying degrees of accuracy, characteristics such as:<\/p>\n<ul>\n<li>what Census block the subject lives in, which is highly correlated with race and income in the United States;<\/li>\n<li>whether she regularly visits day-care or school (evidence that she has children);<\/li>\n<li>whether she attends religious services;<\/li>\n<li>where she works and what hours; and<\/li>\n<li>how frequently, how long, and in what ways she exercises (at the gym\/running\/ biking).<\/li>\n<\/ul>\n<p>These variables are not present in the location data themselves: GPS sensors record latitude and longitude with a corresponding date-time stamp. However, traces of these characteristics and behaviours are present. If we can impute these characteristics with reasonable accuracy, we could remove them from the survey instrument, which could increase response rates and decrease data collection costs. In this paper, we describe our experiences in collecting and analysing always-on location data alongside survey data in a pilot study. We address the following research questions:<\/p>\n<ol>\n<li>Can we determine where a subject was (grocery store, dentist) from the GPS coordinates?<\/li>\n<li>Do passively collected measures of where subjects were agree with survey responses?<\/li>\n<li>Can we identify subjects\u2019 workplace through mobile device GPS data?<\/li>\n<li>What are subjects\u2019 attitudes toward passive data collection?<\/li>\n<\/ol>\n<p>Small studies such as this pilot are a necessary first step in understanding how we might transition from survey to passive data collection. Although this case study does not offer definitive answers, we hope it will help inform future surveys interested in using passively collected location data.<\/p>\n<h1>Data collection<\/h1>\n<p>We recruited subjects via an e-mail to colleagues in two departments at RTI International. All participants in the study were RTI employees who owned iPhones. 
Subjects were in the study for 2 weeks between January 28 and February 24, 2017. Previous research has shown that 2 weeks of location data is enough to understand subjects\u2019 activity spaces (Stanley, Yoo, Paul, &amp; Bell, 2018). The study protocol was approved by RTI\u2019s Office of Research Protection and the legal and human resources departments.<\/p>\n<p>To meet the aims of our study, we used a combination of survey and passive data collection. All subjects completed daily surveys: participants downloaded an application to their phones which asked survey questions each day. The survey was just two questions long. At the end of the 2 weeks of data collection, subjects completed an outtake survey. This web survey asked about experiences with the study and the subjects\u2019 familiarity with common digital and Internet topics and products. Subjects also installed the Moves application (Evenson &amp; Furberg, 2017), which passively collected location data from the phone: time, date, and GPS coordinates. The Moves application is no longer available from the Apple App Store. Arc App (<a href=\"https:\/\/www.bigpaua.com\/arcapp\/\">https:\/\/www.bigpaua.com\/arcapp\/<\/a>) was developed specifically to replace the Moves application and to offer similar functionality. We have not replicated our data collection with Arc App, however. Both Moves and Arc App are only available for iOS devices. Location data were collected whenever the phone was on and moving.<\/p>\n<p>Forty-six subjects expressed initial interest in the study. After reading the informed consent document for this study, four who had expressed interest chose not to participate. Others dropped out without explicitly giving a reason. We have location data from 24 subjects, and 21 completed all phases of data collection.<\/p>\n<p>The output of the Moves app is not raw sensor data, but rather processed travel or location data. 
The Moves app generates two datasets: (1) a places file\u2014a list of coordinates where the subjects stopped and spent some time, and (2) a traces file\u2014a database of subjects\u2019 travel paths. The algorithm used to determine what constitutes a place is not published by the application developers. (Arc App, however, seems to publish its algorithms on GitHub, though we have not rigorously reviewed them.) We only use the places file because of our focus on locations subjects visited. Across all subjects and days, we collected coordinates of 1,928 places. The number collected per day and subject varied from 1 to 11, median 5, mean 5.6.<\/p>\n<h2>RQ1: Determining where a subject was<\/h2>\n<p>The places dataset contains the date, start time, end time, and latitude and longitude of each location a subject visited. To replace survey responses from these data, however, we must first figure out the real-life sites (office, store, park) each subject visited. We queried three popular online databases of business and other points of interest (PoIs): Google Places (Google Maps Platform, 2018), Yelp (Yelp, 2018), and Foursquare (Foursquare, 2017). We used each site\u2019s Application Programming Interface to generate candidate PoIs within 100 meters of each place coordinate.<\/p>\n<p><strong>Figure 1: <\/strong>Number of Matches per Place<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15038\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_1.png\" alt=\"\" width=\"400\" height=\"409\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_1.png 710w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_1-293x300.png 293w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><\/p>\n<p>Matching a given place to a single PoI was challenging. 
Figure 1 shows the distribution of the number of matches to each of the 1,928 places. The modal number of matches was zero. Nevertheless, 1,377 coordinates had at least one matched PoI. The databases returned 14,643 PoIs within 100 meters of these places. For the purposes of this article, we retained only the closest PoI from each of the three databases, leaving 2,536 candidate PoIs for 1,377 places. There is more than one PoI for each place, on average, because the databases often did not agree on which PoI best matched a place. For example, one coordinate matched to a grocery store in the Google database, a liquor store in the Foursquare database, and a dentist\u2019s office in the Yelp database.<\/p>\n<p>Manual matching of the PoIs across the three datasets by name revealed only 53 three-way matches. The three sources were more likely to agree when the PoI was large and isolated, such as a large retail store, a university, or a large church. Sources were less likely to agree about smaller locations such as restaurants, professional offices, and coffee shops. Foursquare was the clear outlier, having more and different types of matches, such as \u201cWork Break Room\u201d and \u201cPaul\u2019s Apartment.\u201d It contained separate PoIs for many buildings on the RTI campus and the softball field. Foursquare is a user-driven community more so than Google and Yelp, which may account for these differences.<\/p>\n<p>Several sources of error could lead a place to be matched to the wrong PoI. The collected GPS coordinate of the place recorded in the Moves data could be off by several meters: smartphone GPS is accurate to approximately 5 meters under ideal conditions (Van Diggelen &amp; Enge, 2015). The PoI databases could also be wrong about the name or location of a PoI\u2014perhaps the ice cream store recently went out of business and the subject in fact visited a chiropractor\u2019s office. 
Lastly, accurate matching becomes more difficult in dense commercial areas where there are several PoIs near a recorded coordinate. This issue is particularly problematic in mixed-use developments: visiting friends or family living in an apartment above a row of stores could trigger a false positive detection of a ground-floor PoI.<\/p>\n<h2>RQ2: Agreement between survey responses and passively collected data<\/h2>\n<p>Despite the disagreement between PoI databases, we are still interested in how closely the inferred PoI visits from the location data match reported survey responses. For the survey, we used items from the outtake questionnaire asking subjects to report the number of times they had been to day-care centres, grocery stores, and gyms during the study period. For the passively collected location data, we manually coded the closest PoIs for each coordinate to flag the ones falling into these three categories. When the PoI sources disagreed, we considered a subject to have visited a PoI whenever any of the sources indicated that she had. We summed within subjects to get the number of visits to day-care centres, grocery stores, and gyms in the PoI data.<\/p>\n<p>Figure 2 shows scatterplots of the number of visits reported in the outtake survey (on the horizontal axis) and the number found in the PoI data (on the vertical axis). The diagonal black line shows the points of agreement between the two sources. Away from the origin, we see very little agreement between the two sources. The greatest deviation in agreement is for grocery visits. There are several points both below the black line (respondents reported more visits than we see in the data) and above the black line. The smallest deviation occurs for day-care visits, although most of the agreement in this category is from points at the origin, where neither source indicated a visit to a day-care facility. 
Interestingly, although we expected overidentification of PoIs in the passively collected location data, based on our generous assignment across databases, the survey counts are often greater.<\/p>\n<p><strong>Figure 2.<\/strong> Comparison of Visits Counts in Survey Responses and Location Data<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15039\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_2-943x1024.png\" alt=\"\" width=\"400\" height=\"435\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_2-943x1024.png 943w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_2-276x300.png 276w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_2-768x834.png 768w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Figure_2.png 1018w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><\/p>\n<h2>RQ3: Identifying work location<\/h2>\n<p>Although successfully matching subjects\u2019 passively collected coordinates to PoIs is difficult without context, we can begin drawing more compelling insights if we include additional information on subjects or if we include behavioural assumptions as part of matching algorithms. To demonstrate this approach, we used a combination of unsupervised learning methods and common-sense decision rules to identify where subjects work from the places dataset. In our analysis, we included only subjects who work at RTI headquarters (n=22). However, we used this information only to validate the model, not to build it. Thus, our results are applicable to other surveys where the work location is unknown and only GPS coordinates are available. 
We used the Python programming language for wrangling and analysing the data and the scikit-learn library (Pedregosa et al., 2011) for clustering.<\/p>\n<p>First, we filtered each subject\u2019s places coordinates to those occurring between 8:00 AM and 6:00 PM, Monday through Friday. Records were included for further analysis if (1) both the start and end times fell within the interval (e.g., 8:30 AM\u201311:00 AM); (2) the start time fell before the interval and the end time fell after it (e.g., 7:00 AM\u20136:45 PM); or (3) only one of the start and end times fell within the interval and more time was spent inside the interval than outside (e.g., 7:00 AM\u201311:30 AM, corresponding to 1 hour outside and 3.5 hours inside). These hours were chosen to reflect the most common business hours for industries using a 40-hour workweek in the United States. Researchers working with different populations (e.g., students, workers in the hospitality industry) should modify their query to better reflect the expected work patterns for their sample.<\/p>\n<p>Next, we truncated the latitude and longitude coordinates to the thousandths place. Truncating the digits smooths the data and reduces noise in the clusters; we want to identify a workplace instead of detecting different spots in the parking lot. We used the DBSCAN algorithm (Davis, 2014; Ester et al., 1996) to develop clusters of coordinates. DBSCAN is a density-based clustering algorithm that groups dense neighbouring points. The algorithm has several properties that are useful for this type of task: (1) DBSCAN does not require specifying the number of clusters up front, as opposed to other popular clustering methods like K-means (Hartigan &amp; Wong, 1979); (2) DBSCAN has a natural notion of outliers and assigns all points lying in low-density areas to a catch-all \u201coutlier\u201d cluster; and (3) the algorithm\u2019s tuning parameters have a useful interpretation for GPS coordinates. 
The two parameters are (1) the radius surrounding each point that should be considered when determining neighbouring points for cluster assignment, and (2) the minimum number of points that must be densely connected to be considered a cluster. Generally, the larger the radius, the larger the cluster membership will be. For our model, the radius parameter was set to 0.2 km and the minimum points parameter was set to 5.<\/p>\n<p>All clustering algorithms require a distance metric to determine similarity between points. Because our points have a geographic interpretation, we used the haversine formula (Bullock, 2007) to calculate pairwise \u201cgreat-circle\u201d distance in kilometres between each coordinate for each subject. The \u201cgreat-circle,\u201d or orthodromic, distance is the shortest distance between two points on the surface of the Earth. Although using a Euclidean distance approximation is likely fine for determining candidate work locations within a short commute, we opted for the orthodromic distance to help with edge cases where distances travelled are longer, especially in areas farther from the equator where the distortion between distance calculations is more pronounced.<\/p>\n<p>We ran DBSCAN independently on each subject to create clusters. We then coded the cluster where the subject spent the most time 8:00 AM\u20136:00 PM, Monday through Friday, as the workplace location. If the subject spent the most time in the outlier category, then the cluster with the second longest duration was assigned as the workplace.<\/p>\n<p>To assess this method, we compared the predicted workplace location to the location of RTI\u2019s Research Triangle Park, NC, campus. If the predicted workplace location fell within 0.5 km of the RTI campus centroid, we called the prediction a success. Twenty-two of our 24 subjects had location data in the Research Triangle Park area. For these 22, this heuristic correctly identified workplaces for all but 2 subjects (90.9%). 
Upon further inspection, the two misclassified clusters were apartments. Those subjects may have worked from home more often than on campus during the field period. For both misclassified subjects, the second most common cluster was the RTI campus. Thus, our approach was largely successful at identifying subjects\u2019 workplaces in this relatively homogeneous population.<\/p>\n<h2>RQ4: Subjects\u2019 attitudes toward passive data collection<\/h2>\n<p>The outtake survey collected information about the subjects\u2019 experience with the passive data collection. Table 1 shows the negative effects of the passive data collection reported by the subjects. A majority indicated no negative effects on their smartphone performance.<\/p>\n<p><strong>Table 1.<\/strong> Frequency of Problems Encountered by Subjects with Passive Data Collection (n=22)<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_1-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15040\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_1-1.png\" alt=\"\" width=\"400\" height=\"157\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_1-1.png 838w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_1-1-300x117.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_1-1-768x301.png 768w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><\/p>\n<p>One concern with passive data collection is that subjects might change their behaviour when the data are being collected. Similar effects occur with survey data collection (Bach &amp; Eckman, 2018; Crossley et al., 2017; Dholakia, 2010; Traugott &amp; Katosh, 1979). 
Estimating such survey conditioning effects is quite challenging (Bach, forthcoming), and our study was not designed to do so.<\/p>\n<p><strong>Table 2.<\/strong> Frequency of Thinking about Passively Collected Data (n=22)<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_2-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15041\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_2-1.png\" alt=\"\" width=\"400\" height=\"170\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_2-1.png 834w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_2-1-300x127.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_2-1-768x326.png 768w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><\/p>\n<p>However, the outtake survey asked some questions to touch on this issue. One question asked how often the respondents thought about the passive data that the study was collecting via their mobile phones (Table 2). Two respondents reported that they thought about it all the time. Another question asked if respondents changed their behaviour at all in response to the passive data collection (Table 3). One subject indicated that he or she changed behaviour while the data were being collected, but most (81%) said they did not. In addition, 19 participants (90%) said they definitely or probably would participate in another survey that combined survey and passive data collection, including the two who reported that they thought about the collection all the time.<\/p>\n<p><strong>Table 3<\/strong>. 
Changed Behavior Because of Location Data Collection (n=22)<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_3-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-15042\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_3-1.png\" alt=\"\" width=\"400\" height=\"121\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_3-1.png 828w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_3-1-300x91.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2020\/11\/Table_3-1-768x232.png 768w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><\/p>\n<p>Of course, we do not have responses to these questions from those who chose not to take part in the survey. Thus, we do not know what aspects of the study caused them to opt out.<\/p>\n<h1>Discussion<\/h1>\n<p>Although our study was small and limited to our colleagues, it reveals important lessons for other researchers who are considering passive data collection. We suggest that researchers interested in incorporating passive data collection in their studies also start with small data collection studies to gain a hands-on understanding of the challenges involved. It is not cost-effective or ethical to collect data from subjects without a good plan for processing, storing, and analysing them. Researchers should also think carefully about how the data will be stored and transferred and who will have access at each stage.<\/p>\n<p>An important finding from this study was that the location data we collected were challenging to work with. Our approaches to matching coordinates to PoIs were not always successful: we found both too few and too many matches. Even interpreting agreement between passively collected location data and survey responses is complex, because both may have errors. Much more research is necessary before researchers can use location data to impute subject characteristics. 
Future research should investigate the use of the trace data as well as the places data.<\/p>\n<p>Always-on location data must be collected, stored, and used properly, with full knowledge and consent on the part of the study participants. Data such as the places and traces files collected in this study cannot help but reveal where subjects live and work, and where and when they travel around their neighbourhoods. The data should probably be considered personally identifiable information and should not be released to any researchers outside of the study team. The usual methods of anonymizing survey data such as review of outliers and separation of identifiers from survey data do not work with location data (Cassa, Wieland, &amp; Mandl, 2008; Zang &amp; Bolot, 2011). We anticipate a growing interest in passive data collection in the future and encourage researchers to develop standards and best practices for the collection, handling, storage, and release of such data.<\/p>\n<p>Social science researchers are not the only ones working on understanding the places that people visit. Google, Yelp, Facebook, and other technology firms are far ahead in developing these capabilities, in large part because of their extensive data resources and business interest in selling targeted advertising. These firms are unlikely to share their proprietary algorithms with social science researchers, and we cannot share the confidential data we collect from our subjects with them. We hope to find a way for these two sets of researchers to combine efforts in the future. 
We are encouraged by the potential of passive location data and support the multidisciplinary research effort needed to make continued progress.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction The traditional model of survey research\u2014a lengthy survey instrument collecting all measures of interest, a high response rate, a random sample from all population members\u2014is in crisis (National Academies of Sciences, 2017, 2018). Research subjects are increasingly intolerant of long questionnaires (Mavletova &amp; Couper, 2015; Tourangeau, Kreuter, &amp; Eckman, 2015), response rates are falling [&hellip;]<\/p>\n","protected":false},"author":1142,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[630],"tags":[639,319,654,984,983],"class_list":["post-13330","post","type-post","status-publish","format-standard","hentry","category-advancements-in-online-and-mobile-survey-methods","tag-location-data","tag-mobile-surveys","tag-passive-data-collection","tag-survey-data","tag-tools"],"acf":[],"_links":{"self":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/13330","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/users\/1142"}],"replies":[{"embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13330"}],"version-history":[{"count":14,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/13330\/revisions"}],"predecessor-version":[{"id":18810,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/13330\/revisions\/18810"}],"wp:attachment":[{"href":"http
s:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13330"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13330"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13330"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}