{"id":17190,"date":"2023-11-20T12:20:11","date_gmt":"2023-11-20T11:20:11","guid":{"rendered":"https:\/\/surveyinsights.org\/?p=17190"},"modified":"2024-01-09T17:36:33","modified_gmt":"2024-01-09T16:36:33","slug":"exploring-optical-character-recognition-ocr-as-a-method-of-capturing-data-from-food-purchase-receipts","status":"publish","type":"post","link":"https:\/\/surveyinsights.org\/?p=17190","title":{"rendered":"Exploring Optical Character Recognition (OCR) as a Method of Capturing Data from Food-Purchase Receipts"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>This paper explores using optical character recognition (OCR) to capture expenditure data from food-purchase receipts. We employ the open-source Tesseract OCR engine and a custom-tailored algorithm to capture text data from images of food-purchase receipts and then parse and store the data for further analysis. We compare the accuracy of data captured through this OCR algorithm to a manually coded review of food-purchase receipts, as well as respondent-reported expenditures corresponding to those purchases. \u00a0A process that makes receipts machine readable could provide immense cost savings by reducing the level of effort needed to access and analyze the data.<\/p>\n<p>As a record of food expenditure, receipts are a particularly robust source of information. They provide itemized expenses in a recognizable format that includes a description of the purchased good or service, the quantity procured, and the corresponding cost. Receipts also typically identify the retailer, the time and date of purchase, the subtotal, taxes, and the total cost for the expenditure. However, collecting data from receipts can be difficult. First, there is the challenge of gaining cooperation and collecting receipts from respondents which often leads to a reliance on small, nonrepresentative samples (e.g., Rankin et al. 1998, Ransley et al. 2001, French et al. 2009). 
Once receipts have been collected, they must be reviewed, annotated, and coded into a data set (J\u00e4ckle et al. 2021). Due to this high level of effort, very few general population surveys have included the collection and analysis of receipt data.<\/p>\n<p>Of the few general population surveys that have collected receipts from respondents, the most recent is the <em>Understanding Society<\/em> Spending Study 1, which included the collection of expenditures and receipts through a mobile survey application (University of Essex, Institute for Social and Economic Research 2021). 2,112 members of the <em>Understanding Society<\/em> Innovation Panel were invited to participate in the study, and of those who consented and downloaded the app, 270 participants used the app at least once for a total of 11,507 reported expenditures over a 31-day period (J\u00e4ckle et al. 2019). Analysis of the data indicated that nearly half of the app uses were receipt submissions (Read 2019), and the number of times that participants scanned receipts or reported purchases was relatively consistent over the 31-day period (J\u00e4ckle et al. 2019).<\/p>\n<p>In the United States, two federal surveys have captured and analyzed receipts. The first is a small pilot study conducted as part of the Consumer Expenditure (CE) Quarterly Interview Survey Records Study. Researchers recruited 115 households to be interviewed twice in a seven-day period. Participants were asked to keep receipts and other personal records of expenditures, such as credit card statements. Interviewers found that records were available for 36% of the 3,039 expenditures reported in the initial interview, and those records were sufficient to provide evidence of measurement error from both over- and under-reporting (Geisen et al. 2011). 
Subsequently, a protocol for receipt collection was included in the suggestions for redesigning the CE (Westat 2011, National Research Council 2013).<\/p>\n<p>The second survey is the National Household Food Acquisition and Purchase Survey (FoodAPS-1) sponsored by the U.S. Department of Agriculture (U.S. Department of Agriculture, 2012). FoodAPS-1 was a nationally representative survey that captured all food acquired, whether purchased or obtained for free, by all household members older than 12 years of age. Participants were asked to keep all food-purchase receipts for the 7-day data collection period and use them as a reference when reporting details of the expenditure. Similar to the <em>Understanding Society<\/em> Spending Study 1, FoodAPS-1 observed consistently cooperative behavior among active respondents. FoodAPS-1 respondents provided receipts for 80% of 15,998 \u201cfood at home\u201d (FAH) events (Kirlin and Denbaly 2017) and 57% of 23,472 \u201cfood away from home\u201d (FAFH) events for which the respondent paid for food. Similar to the CE, early findings from the FoodAPS-1 receipts motivated suggestions to continue the use of receipt data in future iterations of data collection (Cole and Baxter 2016, Yan and Maitland 2016, Kirlin and Denbaly 2017, Page et al. 2019).<\/p>\n<h1>Alternative Data Collection Methods Study<\/h1>\n<p>In preparation for FoodAPS-2, the Alternative Data Collection Methods (ADCM) study was conducted in 2017. The ADCM tested an online diary format called the \u201cFoodLogger\u201d to reduce reporting burden and improve data quality. Respondents were able to access the FoodLogger on a computer, tablet, or mobile phone. The FoodLogger platform included product identification assistance using Universal Product Codes (UPCs), Google Maps integration for looking up event locations, and the ability to upload images of receipts. Receipts were submitted as digital images in JPEG, PNG, and PDF formats. 
No specific guidance on photographing or scanning was given to respondents during their initial interview. All receipts used for this examination of OCR come from those collected during the ADCM.<\/p>\n<p>The ADCM sample was drawn from an address-based sampling frame of 12 primary sampling units (PSUs) across nine states that were sampled for FoodAPS-1. The ADCM aimed to collect representative data from a target of 500 households, including 150 households participating in the Supplemental Nutrition Assistance Program (SNAP). In total, 430 households reported 4,906 food acquisition events, with 1,598 reports \u201chaving a receipt to upload\u201d as indicated by the respondent, which we will refer to as \u201creceipt-indicated\u201d reports. There were two categories of food acquisition events that respondents could report: a) \u201cfood away from home\u201d (FAFH), which includes meals, snacks or drinks consumed outside the home, or b) \u201cfood at home\u201d (FAH), which includes any food or drink items brought into the home for consumption.<\/p>\n<p>Figure 1 provides annotated examples of two FAFH receipts. In addition to item descriptions, item prices, taxes, and the total cost, there are several other elements that illustrate types of formatting conventions that can be found across establishments. For example, the receipt from Little Caesars, a fast-food pizza chain, lists items individually along the left side of the receipt with corresponding prices aligned to the right. We can infer from the line labeled \u201citem count\u201d that each description represents a singular item, resulting in a total of four items. In comparison, the receipt from Subway, the international sandwich chain, also lists the item description to the left of the item price but additionally includes an indication of quantity preceding the item. 
We also see that the receipt from Subway includes the type of payment used (i.e., cash) and the resulting change from the transaction whereas the receipt from Little Caesars provides no additional information on the type of payment.<\/p>\n<table>\n<tbody>\n<tr>\n<td colspan=\"2\" width=\"557\"><strong>Figure 1: Example Food Away from Home (FAFH) Receipts<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"278\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-18375 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated1-225x300.png\" alt=\"\" width=\"225\" height=\"300\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated1-225x300.png 225w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated1.png 698w\" sizes=\"auto, (max-width: 225px) 100vw, 225px\" \/><\/a><\/td>\n<td width=\"278\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-18376 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated2-225x300.png\" alt=\"\" width=\"225\" height=\"300\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated2-225x300.png 225w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAFH-Annotated2.png 697w\" sizes=\"auto, (max-width: 225px) 100vw, 225px\" \/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Figure 2 provides annotated examples of two FAH receipts. Both contain several additional elements that are not present in the FAFH receipt examples. For example, the multi-line-item descriptions on the receipt from Walgreens, a nationwide drug store chain, include product UPCs and quantity information and the receipt from Giant, a regional grocery store, includes item-level discounts. 
There are also elements that are not of particular interest for this examination of OCR, which we will refer to as \u201cnonessential data\u201d. The receipt from Walgreens, for example, lists the return value for each item and the receipt from Giant includes headers for different items, such as \u201cGROCERY\u201d and \u201cNATURAL FOOD\u201d.<em>\u00a0<\/em><\/p>\n<table>\n<tbody>\n<tr>\n<td colspan=\"2\" width=\"587\"><strong>Figure 2: Example Food at Home (FAH) Receipts<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"331\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-18377 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated1-224x300.png\" alt=\"\" width=\"224\" height=\"300\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated1-224x300.png 224w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated1.png 698w\" sizes=\"auto, (max-width: 224px) 100vw, 224px\" \/><\/a><\/td>\n<td width=\"256\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-18378 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated2.png\" alt=\"\" width=\"524\" height=\"931\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated2.png 524w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/FAH-Annotated2-169x300.png 169w\" sizes=\"auto, (max-width: 524px) 100vw, 524px\" \/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h1>Methods<\/h1>\n<h2>Sampling Receipts<\/h2>\n<p>As part of a data quality review, a random sample of 100 FAFH and 100 FAH events was selected from the 1,598 receipt-indicated reports to evaluate against the reported expenditure data (Kaderabek et al. 2021). 
During the validation process, it became evident that the number of receipts indicated by respondents was not accurate due to several factors: reporting error, non-itemized receipts, images of receipts from stores that did not correspond to the reported establishment, images that were not receipts, corrupted files, and receipt images that were illegible.<\/p>\n<p>Figure 3 provides a breakdown of ADCM receipt-indicated events compared to the actual number of receipts that were available for data validation. It was found that only 1,426 (89%) of the 1,598 receipt-indicated events corresponded to an available image file, and only 1,274 (80%) of the receipt-indicated events corresponded to an itemized and legible receipt. Although the same issues contributed to the reduction of available FAFH and FAH receipts, Figure 3 also shows there was a more pronounced reduction in the number of FAFH receipts.<\/p>\n<table>\n<tbody>\n<tr>\n<td width=\"623\"><strong>Figure 3: Receipt Availability for Food Away from Home (FAFH) and Food at Home (FAH) Events<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"623\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Figure-3-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-18904\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Figure-3-1-1024x665.png\" alt=\"\" width=\"1024\" height=\"665\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Figure-3-1-1024x665.png 1024w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Figure-3-1-300x195.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Figure-3-1-768x499.png 768w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Figure-3-1.png 1462w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of the 100 sampled FAFH events, 18 images were missing (i.e., they were either never uploaded by the respondent or they were 
not included in the archive of images after data collection), 5 images were not of receipts, and 8 receipts were not itemized (i.e., they only indicated a confirmation of payment). This resulted in 69 FAFH receipts that could be compared to the reported data. Of the 100 sampled FAH events, 9 receipt images were missing, 2 receipts were illegible, and 1 receipt was not itemized (and appeared to be from a misreported FAFH event), leaving 88 FAH receipts for comparison. We manually coded the total cost, number of items, and item prices from the receipt data for all sampled receipts. Because there were no standards for how respondents should capture the receipt image, the legibility of each sampled receipt was also coded as \u201ccompletely legible\u201d or \u201cpartly legible\u201d to evaluate the quality of the submitted images.<\/p>\n<h2>OCR Test Cases<\/h2>\n<p>The 69 FAFH and 88 FAH events from the validation study represent the initially available OCR test cases. However, because any limitation to manual interpretation of receipt data will also limit automated interpretation, we used the receipt legibility scores to exclude blurry images and receipts with stains or other peripheral markings that impeded legibility. Restricting our test cases to completely legible receipts reduced the available FAFH images to 61 and the available FAH images to 82. To bolster the number of FAFH test cases, we included 7 of the 8 non-itemized receipts from the FAFH sample to test the possibility of capturing the receipt total when no other information was available, yielding a total of 68 FAFH test cases (the eighth non-itemized receipt only contained a payment confirmation without any cost information). 
This was not possible for FAH receipts because grocery stores and similar FAH establishments in our sample all provided itemized receipts.<\/p>\n<h2>Pre-processing with ImageMagick<\/h2>\n<p>Before the OCR process can be applied, we use ImageMagick (ImageMagick Development Team 2021) for image pre-processing. OCR accuracy with Tesseract is dependent on several image properties (Google 2021). Of critical importance are resolution, clarity, and skew. Resolution, measured in pixels, is defined by the image capture device (e.g., camera or scanner). Clarity is a less objective metric that is dependent on image contrast and focus. Image skew refers to the 2D and 3D alignment of elements in the image. This includes both vertical and horizontal alignment as well as distortions of perspective like keystoning (i.e., converging vertical elements due to the top of an image appearing further away from the camera than the bottom). ImageMagick facilitates cropping images, binarizing (i.e., converting all pixels to black or white), adjusting resolution, and de-skewing images. Each ImageMagick function includes parameters for adjusting preprocessing performance and output; however, due to time constraints, only default parameters were tested during this investigation.<\/p>\n<h3><strong>OCR with Tesseract<\/strong><\/h3>\n<p>Following the image pre-processing, each file was processed using the Tesseract OCR engine (Google 2021). Tesseract identifies pixels in relation to each other and associates the identified shapes with known characters. The resulting interpretation is then converted into a string of text characters with the individually recognized lines of text separated by the new line escape character \u201c\\n\u201d. The next step is to parse the data from strings of text and store it in a way that makes statistical analysis possible.<\/p>\n<p>Figure 4 provides examples of OCR results from each end of the quality spectrum. 
The upper pair of images represent a near perfect OCR capture of a receipt containing 35 items. OCR accurately captured the receipt total, all item descriptions, and all but five item prices. In contrast, the bottom pair of images represent a receipt that was legible to the human eye but was unrecognizable to the OCR.<\/p>\n<table width=\"557\">\n<tbody>\n<tr style=\"height: 15.05pt;\">\n<td colspan=\"2\" width=\"557\"><strong>Figure 4: Successful and Unsuccessful OCR Capture<\/strong><\/td>\n<\/tr>\n<tr style=\"height: 4.1in;\">\n<td width=\"238\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-18382 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good1.png\" alt=\"\" width=\"382\" height=\"684\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good1.png 382w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good1-168x300.png 168w\" sizes=\"auto, (max-width: 382px) 100vw, 382px\" \/><\/a><\/td>\n<td width=\"318\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-18383 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good2.png\" alt=\"\" width=\"603\" height=\"745\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good2.png 603w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/good2-243x300.png 243w\" sizes=\"auto, (max-width: 603px) 100vw, 603px\" \/><\/a><\/td>\n<\/tr>\n<tr style=\"height: 3.1in;\">\n<td width=\"238\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-18384 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad1.png\" alt=\"\" width=\"384\" height=\"451\" 
srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad1.png 384w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad1-255x300.png 255w\" sizes=\"auto, (max-width: 384px) 100vw, 384px\" \/><\/a><\/td>\n<td width=\"318\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-18385 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad2.png\" alt=\"\" width=\"425\" height=\"505\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad2.png 425w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/bad2-252x300.png 252w\" sizes=\"auto, (max-width: 425px) 100vw, 425px\" \/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><strong>Using Regular Expressions (REGEX) to Parse Text<\/strong><\/h3>\n<p>As illustrated above, receipts are often not standardized across establishments. However, there are a number of properties that make receipt data highly recognizable. For example, in the United States, receipts read left to right with the item description appearing first, possibly followed by an indication of quantity or size, and the item-specific price displayed to the right of the description. Prices are aligned and summed to provide the subtotal, tax, and total cost below the listing of item prices. Dollar signs are commonly, but not always, associated with prices. Within the ADCM receipts, prices were universally listed as a numeric value including two decimal places. As seen in the FAH receipt examples (Figure 2), there may also be other elements present such as Universal Product Codes (UPCs), or loyalty-member discounts.<\/p>\n<p>Recognizing these patterns enables humans to employ heuristics, or cognitive short-cuts that enhance recognition and interpretation of the information contained in a receipt. 
In turn, it is possible to employ an algorithm that uses similar logical inference to parse the raw text data that results from using OCR. We use regular expressions (or \u201cregex\u201d) and Boolean logic to create an algorithm that is capable of identifying pertinent elements of the receipt as well as removing text which is not informative. Regex are special sequences used to find or match patterns in strings of text data. These sequences use metacharacters and other syntax to represent sets, ranges, or specific characters. For example, the expression \u201c[0-9]\u201d matches any single digit from 0 to 9. Regex are far more flexible than searching for explicitly defined character strings, which makes them highly useful for searching and manipulating text strings.<\/p>\n<p>It should be noted, however, that there are limitations to using a strictly logic-based approach to parsing the data. Rules will only be enforced if the explicit regex are recognized. For example, if we instruct the algorithm to look for the word \u201cTOTAL\u201d but the OCR errantly perceives \u201cT0TAL\u201d, then the corresponding rule will not be triggered. There are ways to mitigate these simple issues, such as building a dictionary that includes likely alternatives or modifying the regex to be more flexible, but these solutions have limitations as well. Future work in novel text extraction could benefit from incorporating predictive machine learning techniques to mitigate character misinterpretations, particularly if there is significant variation in the quality of the images undergoing OCR. Due to time constraints, our algorithm was limited to a \u201cwhat you see is what you get\u201d treatment of the OCR output.<\/p>\n<h2>Defining and Targeting Receipt Data to Capture<\/h2>\n<p>During the preliminary review of the OCR test receipts, a list of common and consistent elements was created to guide the development of the text parsing algorithm. 
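As a minimal sketch of the kind of regex-based line parsing described above (the patterns, names, and tolerances below are illustrative assumptions for this sketch, not the study's actual algorithm), note how a pattern like "T[O0]TAL" can tolerate the "T0TAL" mis-read discussed earlier:

```python
import re

# Illustrative patterns only -- assumptions for this sketch, not the study's
# actual expressions. A price is a numeric value with two decimal places,
# optionally preceded by a dollar sign.
PRICE = re.compile(r"\$?(\d+\.\d{2})")
# "T[O0]TAL" tolerates the common OCR confusion between "O" and "0"; the
# alternation covers other common total-line keywords.
TOTAL_LINE = re.compile(r"T[O0]TAL|BALANCE|AMOUNT DUE", re.IGNORECASE)
TAX_LINE = re.compile(r"TAX", re.IGNORECASE)

def parse_receipt_text(raw_text):
    """Split OCR output on newlines; pull out item lines, tax, and total."""
    items, tax, total = [], None, None
    for line in raw_text.split("\n"):
        price_match = PRICE.search(line)
        if not price_match:
            continue  # lines without a price are treated as nonessential here
        value = float(price_match.group(1))
        if TOTAL_LINE.search(line):
            # later matches overwrite earlier ones, so a SUBTOTAL line (which
            # also matches the pattern) is superseded by the final TOTAL line
            total = value
        elif TAX_LINE.search(line):
            tax = value
        else:
            # the text to the left of the price is taken as the description
            items.append((line[: price_match.start()].strip(), value))
    return items, tax, total
```

For example, `parse_receipt_text("2 COKE 3.98\nSUBT0TAL 10.49\nTAX 0.63\nTOTAL $11.12")` returns `([("2 COKE", 3.98)], 0.63, 11.12)`, recovering the total even though the subtotal keyword was mis-read by OCR.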
The algorithm was also programmed to include indicator variables when identifying certain receipt elements, including item counts, price discounts, the subtotal, tax, and total. Indicator variables were constructed as binary variables that indicated whether the element-specific regex pattern was present and had a corresponding value. For a given indicator to be coded as 1, both the regex pattern and a numeric value needed to be identified and captured. An indicator was coded as 0 if the specified regular expression was not recognized, or if the expression was recognized but no associated dollar value could be identified. We focus on capturing the following eight elements from the receipt image:<\/p>\n<ol>\n<li>Item descriptions \u2013 specifically a string of characters constrained to a single line of text indicating a food item. We exclude lines containing nonessential information like department headers or lines indicating the quantity of an item.<\/li>\n<li>Item prices \u2013 as present for each individual item description on FAH receipts. Respondents were asked to report individual items included in FAFH combo meals and report the price with only the initial\/parent item.<\/li>\n<li>Item discounts \u2013 in the FAFH test cases, discounts were rare enough that a short custom dictionary was constructed to identify them by the terms \u201cEMP DISC\u201d, \u201cMANAGER MEAL\u201d, and \u201cDISCOUNT\u201d. Discounts among FAH test cases were far more prevalent and universally indicated by a hyphen immediately before or after the price, as seen in the \u201cBONUS BUY SAVINGS\u201d lines of the receipt from Giant in Figure 2.<\/li>\n<li>Item count (FAH receipts only) \u2013 by default, a line item translates to a quantity of one, even if multiples of the same product are purchased in succession. 
As an element of the receipt, we define a quantity line as any line containing an asperand (i.e., \u201c@\u201d) indicating a weight or other quantity, for example \u201c3 @ $2.79\u201d or \u201c1.56 @ 3.99 \/ LB\u201d. We then parse the amount purchased and the per-unit price and associate them with the item description immediately preceding the quantity line to calculate the total item count.<\/li>\n<li>Line count (FAFH receipts only) \u2013 in contrast to the use of the asperand, FAFH receipts tend to list the quantity of each product purchased before the item description, as seen in the receipt from Subway (Figure 1). This pattern, coupled with the frequent presence of numbers at the beginning of item descriptions and the common inclusion of order instructions, made it impossible for the algorithm to capture item and quantity information in any way comparable to both the manually coded receipt data and the reported data for FAFH events. Figure 5 illustrates these formatting issues. Line 1 indicates the purchase of a \u201cbuy one, get one\u201d deal, and the two items included are listed subsequently in lines 2 and 3. Line 4 indicates a quantity of 1 for \u201c2 Burritos\u201d and line 7 similarly indicates a single order of \u201c2 Hash Browns\u201d. Additionally, both lines 6 and 9 are instructions related to different items and not unique food items themselves. We were unable to modify the algorithm to accommodate this level of nuance in the time available for testing. However, it was still important to make some assessment of the OCR\u2019s performance in accurately capturing text from the receipt. As a solution, we manually counted the items listed on the FAFH receipts by line, as illustrated in Figure 5, with no differentiation of quantities or content. We compare this \u201cline count\u201d with the number of lines captured by OCR as a performance metric. 
Since respondents were asked to report the number of food items rather than lines, we are unable to compare the respondent-reported data to the OCR-captured line count.<br \/>\n<strong>Figure 5: Example of Line Count for FAFH Receipts<\/strong><\/p>\n<table>\n<tbody>\n<tr>\n<td width=\"317\"><\/td>\n<\/tr>\n<tr>\n<td width=\"317\"><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/WhyFAFHis-complicated.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-18379 aligncenter\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/WhyFAFHis-complicated.jpg\" alt=\"\" width=\"704\" height=\"889\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/WhyFAFHis-complicated.jpg 704w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/WhyFAFHis-complicated-238x300.jpg 238w\" sizes=\"auto, (max-width: 704px) 100vw, 704px\" \/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li>Subtotal (FAFH only) \u2013 as indicated by the presence of \u201cSUBTOTAL\u201d or \u201cSUB TOTAL\u201d and an associated value; this was the least complex indicator. Because respondents were not asked to report subtotal or tax for FAH events, we did not attempt to capture these data from the FAH receipts using OCR.<\/li>\n<li>Tax (FAFH only) \u2013 only FAFH receipts included tax in addition to the receipt total. The tax indicator targeted patterns that were highly likely to be representations of \u201cTAX\u201d, such as \u201cTAK\u201d or \u201cAX\u201d. As noted above, because respondents were not asked to report tax separate from the receipt total for FAH events, we did not attempt to capture tax from the FAH receipts using OCR.<\/li>\n<li>Total \u2013 the balance due for all purchased items plus tax. 
The total indicator used regex patterns to identify strings that included commonly used terms like \u201cTOTAL\u201d, \u201cBALANCE\u201d, and \u201cAMOUNT DUE\u201d.<\/li>\n<\/ol>\n<h3><strong>Analysis<\/strong><\/h3>\n<p>For our analysis, we refer to the manually coded receipt data as the \u201ctrue\u201d values for each event. To assess the accuracy of OCR as a method of capturing text data, we compare the OCR output to the manually coded receipt data. Specifically, we look at the FAFH values of subtotal, tax, absolute total, and line count (as defined above) along with the FAH values of total and item count. Because respondents were not asked to report the subtotal and tax separately for FAH events, the subtotal and tax were not captured as part of the manually coded review or OCR capture of the FAH receipts.<\/p>\n<p>Where the OCR output differs from the manually coded data, we compare whether OCR or the respondent-reported data are closer to the true values. We also compare the correlation between the OCR and manually coded data to the correlation between the manually coded data and the respondent-reported data. If the OCR produces price and item information that is highly correlated with the manually coded information, then there will be support for OCR\u2019s ability to accurately capture the expenditure information.<\/p>\n<h1>Results<\/h1>\n<p>Table 1 provides a comparison of summary statistics for the manually coded receipt data, the OCR results, and the respondent-reported data. For FAFH events, we present the mean receipt line count, tax, subtotal, and total. For FAH events, we show the mean receipt item count and total. Because of the inclusion of non-itemized receipts and two receipts with missing totals (one FAFH and one FAH receipt), we indicate the number of test cases available for each comparison in the fourth column. 
For both the OCR results and the self-reported data, we then present the number of cases that accurately matched the manually coded receipt data.<\/p>\n<p><strong>Table 1: Summary of Manually Coded Receipt Data, OCR Results, and Self-Reported Data<\/strong><\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-19137\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2-1024x551.png\" alt=\"\" width=\"800\" height=\"431\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2-1024x551.png 1024w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2-300x161.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2-768x413.png 768w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2-1536x827.png 1536w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2022\/07\/Table_1-2.png 1884w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<p>Within the FAFH test case events, the mean line count in the OCR output is 5.11 lines, only slightly higher than the mean line count of 4.74 in the manually coded receipt data. The resulting correlation between the OCR line count and the manually coded receipt is R<sup>2 <\/sup>= 0.94, indicating that OCR is largely consistent in its ability to capture lines of text without other conditions applied. OCR over-estimates the average tax by $0.04 whereas the respondent-reported data under-estimate the average tax by $0.03. Additionally, we see that the correlation of OCR tax results to the manually coded receipt tax (R<sup>2 <\/sup>= 0.87) is nearly the same as the correlation of the self-reported tax data to the manually coded receipt values (R<sup>2 <\/sup>= 0.90), after excluding the two largest tax mis-captures of $80.00 and $833.00 from the OCR results. 
The average subtotal from the OCR results ($19.92) is notably higher than the manually coded data ($16.99). The correlation of the OCR subtotals to the manually coded receipt values is R<sup>2 <\/sup>= 0.68. The respondent-reported subtotals ($15.99), in turn, match the manually coded receipt values more closely, with a correlation of R<sup>2 <\/sup>= 0.94. For the total of FAFH receipts, we see that the manually coded data yield a mean of $17.76 while the OCR results yield a mean of $22.98, with a correlation of R<sup>2 <\/sup>= 0.30.<\/p>\n<p>The final two rows of Table 1 present a similar summary of the FAH data. The OCR results over-estimate the average number of items per receipt (10.61) compared to the manually coded data (9.72). The respondent-reported data, in turn, under-estimate the average item count (7.59). Although we find that OCR correctly captures the exact number of items for only 27% of the FAH receipts, the correlation of OCR item counts to the manually coded receipt values (R<sup>2 <\/sup>= 0.97) is much stronger than the correlation of self-reported item counts to the manually coded receipt data (R<sup>2 <\/sup>= 0.88). Finally, we find that OCR performs quite well at capturing FAH totals. Although the average total based on the OCR results ($23.75) under-estimates the average total from the manually coded receipts ($32.94) by almost $10.00, OCR captures the correct total in 75% of test cases, with a correlation of R<sup>2 <\/sup>= 0.93, after excluding outliers. The respondent-reported average total ($32.60) closely matches the manually coded total and captures the correct total in 85% of test cases, with a correlation of R<sup>2 <\/sup>= 0.99.<\/p>\n<h1>Discussion<\/h1>\n<p>This work set out to explore the feasibility of using open-source OCR software and a custom-tailored algorithm to capture expenditure data from images of food-purchase receipts. 
The value of receipts as a record of expenditure is high, and a process that makes receipts machine readable would provide immense cost savings by reducing the level of effort needed to access and analyze the data. Although the results of OCR were generally less accurate than the reported data, OCR did perform well in some situations. We present this work as evidence that computer vision methods can successfully capture text data.<\/p>\n<p>We found OCR to be successful in capturing more accurate data on FAH item counts, which were one of the most commonly under-reported event details in both FoodAPS-1 and the ADCM (Kaderabek et al. 2021). Additionally, we found OCR to reliably capture several data elements from text when the image quality is sufficient, although the results can be drastically impaired by a lack of consistency across images. It is our belief that a more focused scope could greatly improve the performance of OCR. For example, Walmart accounted for roughly 15% of the FAH receipts submitted during the ADCM. The ability to construct less sophisticated algorithms that focus on specific establishments could provide meaningful insight into patterns of over- and under-reporting food acquisition and expenditure.<\/p>\n<p>One substantial limitation of OCR is that the extracted values can deviate widely from the actual receipt data. If OCR mis-captures text, the result may omit digits or include errant digits altogether. The most extreme example in this study is the OCR-based tax of $833.00 for a receipt indicating a total of $26.64, with a manually coded tax of only $1.69. 
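A lightweight plausibility check can catch such errant captures before analysis. This sketch is our own suggestion rather than part of the study's pipeline; the field names are hypothetical. It simply flags any captured component that exceeds the captured total:

```python
def flag_implausible(parsed):
    """Flag OCR fields where a part exceeds the whole -- e.g. an $833.00
    'tax' captured from a receipt whose total reads $26.64.
    `parsed` maps illustrative field names to captured dollar amounts."""
    total = parsed.get("total")
    if total is None:
        return []
    return [field for field in ("subtotal", "tax")
            if parsed.get(field) is not None and parsed[field] > total]
```

A check like this cannot say which value is wrong, only that the set of captured values is internally inconsistent and should be routed to manual review.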
Interestingly, while the tax captured was almost 500 times higher than the tax based on manually coded receipts, OCR accurately captured all items and corresponding prices, the sum of which matched the receipt subtotal perfectly.<\/p>\n<p>A review of OCR&#8217;s performance in recognizing our specified regex patterns (e.g., &#8220;tax&#8221;, &#8220;subtotal&#8221;, and &#8220;total&#8221;) offered no clear insights into the overall performance of the algorithm aside from supporting concerns that accuracy will decrease with poorer image quality. OCR and parsing took about one second per receipt, and the process was able to batch process images from an existing directory. By comparison, manual coding generally took a single coder between 2 and 10 minutes per receipt, depending on its length, to review and capture the information.<\/p>\n<p>As a method of data collection, we found the use of ImageMagick and Tesseract OCR to be accessible to anyone with a working knowledge of R and\/or Python. This work was conducted with no prior familiarity with ImageMagick, Tesseract, or regex over an eight-week period during the final semester of the University of Michigan\u2019s Graduate Program in Survey and Data Science. Future work should explore iterative rounds of testing to improve overall performance of the algorithm, including predictive classification of text and modularized versions of the algorithm tailored for specific establishments.<\/p>\n<p>In conclusion, the effort involved in capturing receipt data for analysis may still be the largest impediment to expenditure researchers seeking to use receipts as a way to reduce respondent burden and improve measurement. However, receipts will continue to be a robust source of expenditure data. 
These results provide some evidence that capturing text data from receipts can be successful and using OCR as a method of data collection can benefit from further investigation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction This paper explores using optical character recognition (OCR) to capture expenditure data from food-purchase receipts. We employ the open-source Tesseract OCR engine and a custom-tailored algorithm to capture text data from images of food-purchase receipts and then parse and store the data for further analysis. We compare the accuracy of data captured through this [&hellip;]<\/p>\n","protected":false},"author":4562,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[818],"tags":[861,336,860,859,862],"class_list":["post-17190","post","type-post","status-publish","format-standard","hentry","category-food-acquisition-research-methods","tag-expenditure-data","tag-measurement-error","tag-optical-character-recognition","tag-receipt-data","tag-under-reporting"],"acf":[],"_links":{"self":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/17190","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/users\/4562"}],"replies":[{"embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=17190"}],"version-history":[{"count":27,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/17190\/revisions"}],"predecessor-version":[{"id":19463,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/17190\/revisions\/19463"}],"wp:attachment":[{"href":"https:
\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=17190"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=17190"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=17190"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}