{"id":8708,"date":"2017-08-24T16:15:28","date_gmt":"2017-08-24T15:15:28","guid":{"rendered":"http:\/\/surveyinsights.org\/?p=8708"},"modified":"2017-08-29T15:07:39","modified_gmt":"2017-08-29T14:07:39","slug":"comparing-continuous-and-dichotomous-scoring-of-social-desirability-scales-effects-of-different-scoring-methods-on-the-reliability-and-validity-of-the-winkler-kroh-spiess-bidr-short-scale","status":"publish","type":"post","link":"https:\/\/surveyinsights.org\/?p=8708","title":{"rendered":"Comparing Continuous and Dichotomous Scoring of Social Desirability Scales: Effects of Different Scoring Methods on the Reliability and Validity of the Winkler-Kroh-Spiess BIDR Short Scale"},"content":{"rendered":"<h1 style=\"text-align: left;\"><strong>Introduction<\/strong><\/h1>\n<p>A major threat to the validity of survey data is socially desirable responding, \u201cthe tendency to give overly positive self-descriptions\u201d (Paulhus, 2002, p. 50). Accordingly, researchers have developed a number of scales which aim to measure this tendency. These scales are used to identify <em>respondents<\/em> who tend to describe themselves in an overly positive manner or <em>items and scales<\/em> which tend to elicit answers tainted by desirability bias (see Paulhus, 1991; Tourangeau &amp; Yan, 2007); also, measures of desirable responding can be used as covariates in multivariate analyses to remove the influence of desirability on the relationship of interest between other variables (see van de Mortel, 2008).<\/p>\n<p>One of the most popular measures is the Balanced Inventory of Desirable Responding (BIDR; Paulhus, 1991). It is based on the two-dimensional conception of social desirability (Paulhus, 1984) and \u201cusually operationalized via the impression management (IM) and self-deceptive enhancement (SDE) scales\u201d (Trapnell &amp; Paulhus, 2012, p. 44). 
According to the revised interpretation of the inventory (Paulhus &amp; John, 1998), the SDE scale measures <em>agentic<\/em> self-descriptions; high scores on this scale indicate \u201cegoistic bias\u201d, \u201ca self-deceptive tendency to exaggerate one\u2019s social and intellectual status\u201d (Paulhus, 2002, p. 63) and ascribe \u201csuperhero-like\u201d attributes to oneself (Paulhus, 2002, p. 63); the IM scale measures overly positive self-descriptions in terms of <em>communal<\/em> values; high scores on this scale indicate \u201cmoralistic bias\u201d, \u201ca self-deceptive tendency to deny socially-deviant impulses and claim sanctimonious, \u2018saint-like\u2019 attributes\u201d (Paulhus, 2002, p. 64). The BIDR 6 contains 20 statements each to measure SDE and IM. Answers are given on a Likert-type scale ranging from \u201c1 \u2013 NOT TRUE\u201d to \u201c7 \u2013 VERY TRUE\u201d (Paulhus, 1991).<\/p>\n<p>The length of the BIDR limits its utility to survey researchers. Winkler, Kroh and Spiess (2006) have hence developed a six-item German short form of the BIDR. It has proven popular for use in surveys of the general population (Naef &amp; Schupp, 2009a, b; Schneider &amp; Schupp, 2014; Shajek, 2007), consumers (Goetzke, Nitzko, &amp; Spiller, 2014), teachers (Hertzsch, 2012), employees (He\u00df, 2012; Schneider, 2015) and businesses (Schneider, 2015); it has also been used with student samples (Becker &amp; Swim, 2012; Liebig, May, Sauer, Schneider, &amp; Valet, 2015; Linhoff, 2015; Tondello, Wehbe, Diamond, Busch, Marczewski, &amp; Nacke, 2016). 
The short scale has been employed to identify respondents who describe themselves in an overly positive manner (Goetzke et al., 2014), as a covariate in multivariate analyses (Becker &amp; Swim, 2012; He\u00df, 2012; Liebig et al., 2015; Seifried, 2015) and to flag items and scales that correlate with social desirability scores when scales are developed, evaluated and validated (Hertzsch, 2012; Linhoff, 2015; Naef &amp; Schupp, 2009a, b; Schneider, 2015; Schneider &amp; Schupp, 2014; Tondello et al., 2016).<\/p>\n<p>One aim of the present paper is to determine how to use the BIDR short form best. We focus on a consequential detail that has proven contentious: how best to calculate values for SDE and IM from the raw data. Paulhus (1991) recommends \u201cdichotomous scoring\u201d (p. 39): \u201cAfter reversing the negatively keyed items, one point is added for each extreme response (6 or 7)\u201d (p. 37).<\/p>\n<p>This recommendation has been contested. St\u00f6ber, Dette and Musch (2002) suggest that continuous scoring \u2013 taking the mean of the response values (after reversing the negatively keyed items) \u2013 may be preferable. They see three potential reasons for this. First, \u201cit may be plausible to assume that the processes underlying socially desirable responding are continuously distributed variables\u201d (p. 373); second, dichotomous scoring confounds socially desirable responding and the tendency to give extreme answers; third, the dichotomisation leads to a loss of information.<\/p>\n<p>Empirically, St\u00f6ber et al. (2002) and Kam (2013) find that continuous scoring yields superior results. Specifically, internal consistencies (Cronbach\u2019s alphas) are higher when continuous scoring is used (St\u00f6ber et al., 2002, Studies 1-3), though some of these differences are not significant (St\u00f6ber et al., 2002, Study 2). 
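For concreteness, the two scoring procedures just described can be sketched as follows. This is a minimal illustration with hypothetical responses on the original 1-7 scale; on the 0-6 scale used later in this study, the extreme responses would be 5 and 6.

```python
def reverse_key(response, low=1, high=7):
    # Reverse a negatively keyed item, e.g. 7 -> 1 on a 1-7 scale.
    return low + high - response

def score_dichotomous(responses, extreme=(6, 7)):
    # Paulhus (1991): one point for each extreme response (after reversing).
    return sum(1 for r in responses if r in extreme)

def score_continuous(responses):
    # Stoeber et al. (2002): the mean of the (reversed) response values.
    return sum(responses) / len(responses)

# Hypothetical answers to a three-item subscale, already reverse-keyed:
answers = [7, 5, 6]
score_dichotomous(answers)  # -> 2
score_continuous(answers)   # -> 6.0
```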
Convergent validity is significantly higher for continuously scored BIDR results in five out of seven comparisons (Kam, 2013; St\u00f6ber et al., 2002, Study 1). Concerning criterion validity, St\u00f6ber et al. (2002) find that \u201ccontinuous SDE scores display significantly higher correlations with the Big Five personality traits for which previous research has found correlations [. . .] than do dichotomous SDE scores\u201d (p. 385). However, for the two versions of the IM scores the results are equivocal (St\u00f6ber et al., 2002, Study 3).<\/p>\n<p>It is important not to overstate the significance of these findings. Cronbach\u2019s alpha is of limited value: it is in part a function of the number of items in a scale (Cortina, 1993; Schmitt, 1996, p. 350; Sijtsma, 2009, p. 114; Streiner, 2003, p. 101), which means its value can be increased by simply adding items (Boyle, 1991; Rammstedt, 2010, p. 249; Streiner, 2003, p. 102); it can also be increased by narrowing the content of the construct that is measured (Boyle, 1991; McCrae, Kurtz, Yamagata, &amp; Terracciano, 2011, p. 230; Streiner, 2003, p. 102); it may exhibit high values when the underlying structure is multidimensional or low values when the underlying structure is unidimensional (Cortina, 1993; Sijtsma, 2009); and it is a poor predictor of validity (McCrae, Kurtz, Yamagata, &amp; Terracciano, 2011). Accordingly, the relevance of the results concerning Cronbach\u2019s alpha is limited.<\/p>\n<p>Comparisons of criterion validity are hard to interpret unless one knows what the correlation between a measure and its criterion measure ought to be \u2013 contrary to what St\u00f6ber et al. (2002, pp. 381-382) imply, more is not necessarily better in this respect (Stanton, Sinar, Balzer, &amp; Smith, 2002, p. 178). The most convincing evidence on the superiority of continuous scoring comes from the tests of convergent validity, but not all of these tests yield clear results. 
These limitations of the evidence may help explain why some authors (e.g., Mallinckrodt, Miles, Bhaskar, Chery, Choi, &amp; Sung, 2014) continue using dichotomous scoring despite extant research favouring the continuous method.<\/p>\n<p>Our main aim in this paper is to contribute to the resolution of this issue. To do so, we present the first comparisons of dichotomously vs. continuously scored test-retest results based on BIDR data. Retest reliability is a variant of reliability that is devoid of the weaknesses of consistency measures mentioned above and is clearly interpretable in a more-is-better fashion (Streiner, 2003, p. 102). We also supplement previous results by presenting comparisons of measures of internal consistency and associations with external criteria. Our paper is the first to address this question in a survey context and the first to use the BIDR short form (Winkler et al., 2006). Hence, this article is also a validation study of this scale, the first to present retest data. We address both questions \u2013 dichotomous vs. continuous scoring and the validity of the BIDR short scale \u2013 throughout the article.<\/p>\n<p>Another unusual feature of this article is that it uses data representative of an identifiable population, family doctors in Germany. We use post-stratification weights to compensate for differential nonresponse. Unweighted results are displayed in the Appendix, and differences between weighted and unweighted results are noted throughout the text. 
However, as weighted results are likely to be closer to the results that would have been obtained had data on the whole population been available, we base our interpretation of the data almost exclusively on the weighted results.<\/p>\n<p>&nbsp;<\/p>\n<h1 style=\"text-align: left;\" align=\"center\"><strong>Method<\/strong><\/h1>\n<h2><strong>Participants and Procedure<\/strong><\/h2>\n<p>BIDR short scales were included in both waves of a pilot study of a questionnaire addressing family doctors in Germany. The questionnaire contained 53 items in 14 questions. The topic of the survey was abuse and neglect of doctors\u2019 patients in need of long-term care and assistance. We asked about subjective confidence in the respondent\u2019s ability to deal with such problems as well as experiences, attitudes and continuing medical education regarding the survey topic. Questions about sociodemographic information and the proportion of the respondent\u2019s patients in need of long-term care were also included. The study was approved by the review board of \u00c4rztekammer Berlin (Eth-21\/16).<\/p>\n<p>Four versions of the questionnaire were tested simultaneously. Each version dealt with one facet of abuse and neglect: physical violence, sexual abuse, restraint and neglect. Questionnaires were highly similar, with analogous questions asked in the same order in all questionnaires in an attempt to ensure comparability across questionnaire types. The BIDR short scale (described in detail below) was identical in all questionnaires and placed between the substantive portion of the questionnaire and sociodemographic items.<\/p>\n<p>The aim was to draw a small but representative sample of family doctors in Germany. We obtained data from the commercial provider ArztData. 
The provider aims to cover all resident doctors in Germany and claims 99% coverage (ArztData, n.d.); accordingly, this database has been used by a previous large-scale study aiming for a representative sample of German family doctors, the <em>\u00c4SP-kardio<\/em> study (e. g., G\u00f6rig et al., 2014; Diehl et al., 2015). From the provider\u2019s dataset, a random sample of 11,000 was drawn. From this list, 2369 respondents were randomly selected and sent questionnaires by mail. Respondents were randomly assigned to one of the four questionnaires. Personal codes were included. In the cover letter, participants were promised (and later received) a \u20ac 50 Amazon gift card if they returned, by mail or fax, two completed questionnaires on time. When a Wave 1 questionnaire was received, the participant was sent an identical Wave 2 questionnaire. No reminders were sent. The field period for the two Waves was June and July 2016.<\/p>\n<h2><strong>Measures<\/strong><\/h2>\n<p><em>Social desirability.<\/em> The six-item, German language short version of the BIDR developed by Winkler et al. (2006) contains three items each for SDE and IM. For this instrument, the authors chose items on the basis of their psychometric properties from a pool of ten items, which, in turn, were chosen and translated from Paulhus\u2019 original 40 item scale. Responses are given on a Likert-type scale ranging from \u201c1 \u2013 trifft \u00fcberhaupt nicht zu\u201d [\u201c1 \u2013 is not accurate at all\u201d] to \u201c7 \u2013 trifft voll zu\u201d [\u201c7 \u2013 is completely accurate\u201d]; the other points are labelled by integers only. Winkler et al. (2006) report that the items in question load on two different principal components, consistent with the two-component model of social desirability. Internal consistencies are .60 (SDE) and .55 (IM). Correlations with the Big Five personality factors are largely as expected. 
Correlating results of the BIDR short scale with a short version of the Marlowe-Crowne (1960) scale (a test of convergent validity) shows that \u201cthe expected results can be observed, though the relationships are somewhat weak\u201d (Winkler et al., 2006, p. 18).<\/p>\n<p>Winkler et al. (2006) argue that future research should use the scale with some items reverse keyed to reduce confounding of the results with acquiescence. The resulting items are shown in Table 1.<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9125\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_1.png\" alt=\"\" width=\"709\" height=\"365\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_1.png 709w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_1-300x154.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_1-600x308.png 600w\" sizes=\"auto, (max-width: 709px) 100vw, 709px\" \/><\/a><\/p>\n<p>For the purposes of the present research, four slight alterations were made to the scale proposed by Winkler et al. First, while a seven-point scale was used as in Paulhus (1991) and Winkler et al. (2006), the one used here ranged from \u201c0 \u2013 trifft \u00fcberhaupt nicht zu\u201d [\u201c0 \u2013 is not accurate at all\u201d] to \u201c6 \u2013 trifft voll und ganz zu\u201d [\u201c6 \u2013 is completely accurate\u201d] rather than from 1 to 7; as in the original, all intermediate points were labelled by integers only and high values indicate a strong tendency towards socially desirable responding (after recoding appropriate items). This change was made to improve consistency with the rest of the questionnaire. Second, and for the same reason, full stops were added at the end of the items. Third, spelling was adjusted to reflect current German orthography. 
Fourth, the item concerning one\u2019s judgement was moved to the end of the scale to minimize halo effects. Given the context, we assumed that some respondents would interpret this item to refer to <em>medical <\/em>decision making only if it was placed towards the beginning of the list and that any such misunderstanding would be reduced by moving the item to the end of the list. We did not test these assumptions.<\/p>\n<p><em>Subjective confidence<\/em>. As an external criterion, we use the BEACON-C-3, a scale constructed to measure the extent to which the respondent feels competent to take action against abuse of his or her patients in need of long-term care and assistance (Schnapp &amp; Suhr, 2017). The exact wording differs between types of abuse. The measure consists of the question stem \u201cIf I suspected that a patient in need of long-term care was being [abused in the specific manner] . . .\u201d and the three items \u201c. . . I&#8217;d know exactly what to do next.\u201d, \u201c. . . I&#8217;d be unsure how to proceed.\u201d (reverse scored) and \u201c. . . I&#8217;d be well prepared.\u201d It is scored on a five-point scale from 0 to 4, with high values indicating high confidence. As reported in Schnapp &amp; Suhr (2017), this scale yields a one-factorial solution, a retest correlation of .89, a small and marginally significant negative correlation with a measure of interest in further education (used as a criterion) and an average <em>k*<\/em> value for the three items of .84. 
<em>k*<\/em> is a measure of content validity derived from expert judgements of item relevance, with values of .75 or above considered \u201cexcellent\u201d (Polit, Buck, &amp; Owens, 2007).<\/p>\n<p><em>Test-retest interval.<\/em> We estimate the time between test and retest by subtracting the return date of the first questionnaire from the return date for the second questionnaire, allowing one day for each (first or second Wave) questionnaire returned by mail (but zero days if it is returned by fax).<\/p>\n<p><em>Sociodemographic information<\/em>. We estimate participants\u2019 age by subtracting respondents\u2019 self-reported year of birth from 2016. Gender was measured by a standard question.<\/p>\n<p>&nbsp;<\/p>\n<h1 style=\"text-align: left;\" align=\"center\"><strong>Results<\/strong><\/h1>\n<h2>Response rate and weighting procedure<\/h2>\n<p>Twenty-three questionnaires were undeliverable and 14 addressees notified us that they did not work as family doctors. This reduced the effective sample to 2332. Two hundred and sixty questionnaires were received in Wave 1 and 176 in Wave 2, for response rates of 11% and 8%, respectively. The response rates for the different versions of the questionnaire are: restraint, 9% (Wave 1)\/7% (Wave 2); physical violence, 11%\/8%; sexual abuse, 11%\/8%; neglect, 9%\/7% (AAPOR [2016] Response Rate 1). Cases with incomplete data on any of the social desirability items or age are excluded from the analysis, but missing data on other variables are accepted. This results in a sample of 166.<strong><\/strong><\/p>\n<p>The only sociodemographic variable for which data for the universe are available is grouped age as of 31 December, 2015 (Bundes\u00e4rztekammer, n.d., Table 8). While there is other information on the characteristics of doctors working in Germany available (Bundes\u00e4rztekammer, n. d.; Statistisches Bundesamt, 2017), none of it contains sociodemographic data for our universe. 
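Post-stratification weights of this kind follow the standard cell-weighting recipe: each group's weight is its share in the universe divided by its share in the unweighted sample. A minimal sketch with hypothetical age-group shares (not the study's actual values, which appear in Table 2):

```python
# Hypothetical age-group shares in the universe and in the realised sample.
universe_share = {'under 40': 0.10, '40-59': 0.55, '60 and over': 0.35}
sample_share   = {'under 40': 0.05, '40-59': 0.45, '60 and over': 0.50}

# Weight per group: universe proportion / unweighted sample proportion.
weights = {g: universe_share[g] / sample_share[g] for g in universe_share}
# Each respondent then receives the weight of his or her age group, so
# underrepresented groups count more and overrepresented groups count less.
```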
Table 2 shows some noteworthy differences between the sample and the universe. Accordingly, weights were calculated by dividing the proportion of the universe in an age group by the proportion of the unweighted sample. The resulting weights are also displayed.<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9126\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_2.png\" alt=\"\" width=\"701\" height=\"288\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_2.png 701w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_2-300x123.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_2-600x246.png 600w\" sizes=\"auto, (max-width: 701px) 100vw, 701px\" \/><\/a><\/p>\n<h2><strong>Descriptives<\/strong><\/h2>\n<p>Table 3 and Appendix Table A-1 show descriptives for data with and without weighting, respectively. As should be expected, weighted and unweighted results are similar, as are results across waves. While no data is available to compare the characteristics of our weighted sample to those of the universe, one source (Statistisches Bundesamt, 2017) provides the gender (but not age) distribution of family doctors <em>registered with the statutory health insurances<\/em> in 2015. According to this source, 45% of these doctors were female, a value closer to the weighted than the unweighted result. 
This supports the view that weighted results are to be preferred.<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9127\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_3.png\" alt=\"\" width=\"702\" height=\"564\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_3.png 702w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_3-300x241.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_3-600x482.png 600w\" sizes=\"auto, (max-width: 702px) 100vw, 702px\" \/><\/a><\/p>\n<p>The retest interval varies more than it would have under more controlled circumstances but less than one might expect in a postal survey.<\/p>\n<h2><strong>Internal Consistency<\/strong><\/h2>\n<p>Table 4 shows Cronbach\u2019s alphas for both Waves 1 and 2. Internal consistencies for the IM scale are very similar to those reported by Winkler et al. (<em>\u03b1<\/em> = .55), whereas those for the SDE scale are lower than in the original study (<em>\u03b1<\/em> = .60). Particularly surprising is the decrease in alpha for the continuously scored SDE scale between Waves 1 and 2. This result was double-checked and confirmed. It is driven mainly by a decrease in the correlations of the items \u201cI often doubt my own judgement\u201d and \u201cI always know why I like things\u201d. 
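For reference, Cronbach's alpha can be computed directly from raw item scores. The following is a minimal sketch of the standard formula, with hypothetical data and no weighting:

```python
def cronbach_alpha(items):
    # items: one list of scores per item; inner lists have one entry per respondent.
    k, n = len(items), len(items[0])
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    # Total score for each respondent across all items.
    totals = [sum(item[i] for item in items) for i in range(n)]
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Three perfectly correlated items yield alpha = 1.0:
cronbach_alpha([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
```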
All alphas are below .60.<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9128\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_4.png\" alt=\"\" width=\"709\" height=\"291\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_4.png 709w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_4-300x123.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_4-600x246.png 600w\" sizes=\"auto, (max-width: 709px) 100vw, 709px\" \/><\/a><\/p>\n<p>The formula by Feldt, Woodruff and Salih (1987, Equation 22) is employed to test for the significance of differences between alpha<em> <\/em>values. Continuously scored values yield significantly higher alphas for Wave 1 but not for Wave 2. The main difference between the weighted and unweighted samples is that in the latter, the results for dichotomously and continuously scored IM scales are not significantly different (Table A-2).<\/p>\n<h2><strong>Retest reliabilities<\/strong><\/h2>\n<p>Table 5 shows the central results. It compares retest reliabilities for the two scoring procedures. The relevant comparisons are between scoring methods (continuous vs. dichotomous) within a wave and scale. These are comparisons between non-overlapping correlations from dependent samples. For this situation, there is a number of significance tests, none of which is clearly preferable to all the others (Diedenhofen &amp; Musch, 2015). The table displays results for Steiger\u2019s\u00a0(1980, Equation 15) statistic, calculated using the cocor program (Diedenhofen &amp; Musch, 2015). 
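The retest reliabilities compared here are plain Pearson correlations between a respondent's Wave 1 and Wave 2 scale scores (the significance tests themselves are left to cocor). A minimal, unweighted sketch with hypothetical scores:

```python
def retest_reliability(wave1, wave2):
    # Pearson correlation between Wave 1 and Wave 2 scores of the same people.
    n = len(wave1)
    m1, m2 = sum(wave1) / n, sum(wave2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(wave1, wave2))
    s1 = sum((a - m1) ** 2 for a in wave1) ** 0.5
    s2 = sum((b - m2) ** 2 for b in wave2) ** 0.5
    return cov / (s1 * s2)

# Hypothetical scale scores: close agreement across waves gives r near 1.
retest_reliability([2.0, 3.5, 5.0, 6.0], [2.5, 3.0, 5.5, 6.0])
```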
Alternative tests also implemented in cocor yield very similar results; in no case does the choice of test make a difference in terms of significance.<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_51.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9221\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_51.png\" alt=\"\" width=\"699\" height=\"287\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_51.png 699w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_51-300x123.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_51-600x246.png 600w\" sizes=\"auto, (max-width: 699px) 100vw, 699px\" \/><\/a><\/p>\n<p>All weighted retest reliabilities are at least .70 and some exceed .80. Continuous scoring results in a significantly higher retest reliability for the IM subscale. The difference between the two versions of the SDE subscale is very small and not significant. The combined social desirability scale exhibits a somewhat higher retest reliability for continuous scoring, but the result is only marginally significant.<\/p>\n<p>The most noteworthy difference between the weighted and the unweighted dataset is that in the latter, the difference between the two scoring methods applied to the IM scale is not quite significant at the conventional level using a two-sided test; <em>z<\/em>(164) = -1.93, <em>p<\/em> = .054 (Table A-3). However, more trust should be put in the result from the weighted dataset, given that it is probably the better estimate of the result that would have been obtained had the whole population been tested. A more serious threat to the results arises from the fact that the estimates are based on a fairly small sample. 
Generally speaking, the smaller a sample, the more likely it is that a coefficient\u2019s statistical significance is the consequence of an overestimation of the coefficient\u2019s magnitude due to chance factors such as sampling and measurement error (Button et al., 2013; Loken &amp; Gelman, 2017). Accordingly, some researchers issue recommendations such as, \u201ca minimum sample of 200-300 respondents [&#8230;] is needed for any good correlational or reliability analysis\u201d (Clark &amp; Watson, 1995, p. 317). However, whether the significant result for the IM subscale is due to such an overestimation need not be decided on the basis of rules of thumb. Instead, it can be assessed using the retrospective design calculation proposed by Gelman and Carlin (2014). This method allows researchers to estimate the \u201cexaggeration ratio\u201d (p. 641). The exaggeration ratio is the expectation of the factor by which an empirically obtained, statistically significant coefficient overestimates the true value in the population the sample was drawn from. Calculating the exaggeration ratio requires as inputs the standard error of the empirically obtained coefficient and a plausible estimate of the true size of the coefficient in the population. This latter estimate needs to be taken from sources other than the data at hand. The standard source is the extant empirical literature (Gelman &amp; Carlin, 2014).<\/p>\n<p>As noted, the extant literature contains no studies measuring the difference in retest reliabilities of dichotomously vs. continuously scored IM scales. However, Cronbach\u2019s alpha is often a reasonable proxy for short-term retest reliability (Gnambs, 2014; McCrae et al., 2011). Given this, differences between alphas seem likely to be reasonable proxies for differences in retest reliabilities. St\u00f6ber et al. 
(2002) give values for the differences between alphas obtained on the basis of continuous and dichotomous scoring of IM scales from three studies. We take the mean of the three differences weighted by their sample sizes. This mean is approximately 0.10 and the exact value (to eight digits) is used as our preferred estimate of the true effect size.<\/p>\n<p>Using this estimate, we find that the exaggeration ratio is 1.00. Varying the assumed true effect size, as recommended by Gelman and Carlin (2014), we find that an assumed effect size of .05 leads to an exaggeration ratio of 1.23 and an assumed effect size of .15 yields an exaggeration ratio of 1.00. Hence, the significance of our finding for the IM subscale is unlikely to be the result of overestimating the effect size to a substantial degree.<\/p>\n<h2><strong>Associations with external measures<\/strong><\/h2>\n<p>In this section we describe the association of dichotomously and continuously scored measures with two external variables, gender and the BEACON-C-3 measure of subjective confidence in dealing with possible cases of abuse and neglect. Previous research suggests that women typically score substantially higher than men on IM measures, but no such clear pattern has been found for SDE measures (Dalton &amp; Ortegren, 2011; Paulhus, 1991; Winkler et al., 2006). We hence expected women to exhibit higher values for IM but made no prediction regarding differences in SDE values.<\/p>\n<p>We expected a positive correlation of the BEACON-C-3 scale with the SDE measure, as \u201ca self-deceptive tendency to exaggerate one\u2019s social and intellectual status\u201d (Paulhus, 2002, p. 63), which the SDE scale aims to measure, seems likely to extend to professional competence. 
In contrast, we made no specific prediction with respect to the IM measure, as \u201ca self-deceptive tendency to deny socially-deviant impulses and claim sanctimonious, \u2018saint-like\u2019 attributes\u201d (Paulhus, 2002, p.64), which it aims to measure, is not obviously relevant in this context.<\/p>\n<p>A higher mean for female respondents on the IM measure and a positive correlation of the BEACON-C-3 with the SDE measure would hence serve to validate the BIDR short scale. In contrast, this section makes no contribution to the question whether scales should be scored in a continuous or dichotomous fashion. As discussed in the introduction, the prediction that there is an association between two measures does not imply that measures yielding particularly strong associations are to be preferred (Stanton et al., 2002, p. 178). Hence, results for both scoring methods are reported in this section, but there is no focus on the differences between them.<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_61.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9222\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_61.png\" alt=\"\" width=\"696\" height=\"284\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_61.png 696w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_61-300x122.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_61-600x244.png 600w\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" \/><\/a><\/p>\n<p>Table 6 displays the results for gender differences, with one case excluded due to missing data on gender and <em>p<\/em> values based on a two-sided <em>t <\/em>test. As expected, women score consistently higher than men on the IM scale, a difference that is significant in 3 out of 4 tests. Women also score higher on the SDE scale, a difference that is significant in 2 out of 4 tests. 
As a consequence, women score significantly higher than men on the combined scale in all four cases.<\/p>\n<p>Unweighting the dataset results in a surprising number of results crossing the threshold from significant to not significant. However, the most important result is the difference for the IM scale using continuous scoring (as both our and extant results show that continuous scoring is preferable to the dichotomous method). This result is largely unaffected (Table A-4).<\/p>\n<p><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_7.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9131\" src=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_7.png\" alt=\"\" width=\"695\" height=\"277\" srcset=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_7.png 695w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_7-300x119.png 300w, https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Table_7-600x239.png 600w\" sizes=\"auto, (max-width: 695px) 100vw, 695px\" \/><\/a><\/p>\n<p>Table 7 displays partial correlations between the BEACON-C-3 and BIDR scales. As described above, our data are based on questionnaires that are very similar but deal with four different types of abuse and neglect. Hence, when estimating the correlation of interest, we partial out the influences of three dummy variables indicating that the BEACON scale made reference to restraint, physical violence or sexual abuse, respectively (with neglect as the reference category). The sample sizes are reduced due to missing data on the BEACON-C-3.<\/p>\n<p>The correlation with the IM scale is close to zero in all cases. The overall scale yields one significant result, a positive association with the dichotomous scale in the second wave. 
Most importantly, the expected positive correlation between the SDE measure and the measure of confidence is observed, with 3 out of 4 tests yielding significant results.<\/p>\n<p>When unweighted data are used, more associations reach statistical significance. The prediction concerning the correlation of the SDE and BEACON-C-3 scales is also borne out in this version of the data (Table A-5).<\/p>\n<p>In sum, all predictions about the directions of associations are borne out by the data, although the results fail to reach statistical significance in a minority of cases.<\/p>\n<p>&nbsp;<\/p>\n<h1 style=\"text-align: left;\"><strong>Discussion and Conclusions<\/strong><\/h1>\n<p>This paper\u2019s aim is to contribute to both the study of dichotomous vs. continuous scoring of BIDR scales and the validation of the BIDR short scale. We discuss the results with an eye to both aims and start with some limitations of the study. All interpretations are based on the results for the weighted dataset only.<\/p>\n<p>This study employed a post-stratified national probability sample of family doctors. This may be seen as an improvement over the convenience samples often used for the development and validation of scales. However, it is unclear whether results reported herein generalise to other groups. In particular, compared to the general population, our sample may suffer from restriction of range issues, and the high socioeconomic status of respondents may be thought to influence retest reliabilities. However, Hemingway, Nicholson and Marmot (1997) found \u201cno effect\u201d of occupational status on the retest reliability of the SF-36, a general health questionnaire (p. 1486). Nonetheless, it may be that retest results would have been different with a more diverse sample. 
We consider it unlikely, however, that our use of a highly educated sample has had much of an effect on the differences between the two scoring methods, as the exact same data are used for both methods.<\/p>\n<p>We studied the question of dichotomous vs. continuous scoring with a specific version of the BIDR scale, the short German instrument developed by Winkler et al. (2006). We believe this increases the value of our results for survey researchers compared to results from a study of the full-length BIDR, the use of which in surveys is not feasible. Nonetheless, it should be noted that the results would likely have been different had the full-length version of the scale been used. In particular, it seems likely that retest reliability would have been higher, given that longer versions of an instrument typically yield higher retest correlations than shorter versions of the same instrument (see, e.g., Gnambs, 2014). With respect to other measures, and differences between the scoring methods, the direction of the differences (if any) is unclear.<\/p>\n<p>All Cronbach\u2019s alphas are below .60 and hence lower than usually desired. Results also show that the continuous method yields significantly higher consistencies in one of two waves. The findings on internal consistency hence appear to bolster the case for continuous scoring while casting doubt on the utility of the BIDR short scale. However, we advise against putting too much weight on our or extant results concerning this question, given the large literature on the limitations of the Cronbach\u2019s alpha measure discussed in the introduction (Cortina, 1993; McCrae et al., 2011; Rammstedt, 2010; Schmitt, 1996; Sijtsma, 2009; Streiner, 2003). While other measures of internal consistency exist, they all share with alpha the weakness of rewarding narrowness of the construct actually measured (irrespective of the theoretical construct the researcher has in mind; Boyle, 1991). 
It is worth remembering in this context that consistency measures were developed as, and should still be seen as, substitutes for measures of retest reliability when the latter are not available (Guttman, 1945; Sijtsma, 2009).<\/p>\n<p>In the present study, retest reliabilities are available. They are .70 or above for all subscales using either scoring method. This is an attractive feature of the BIDR short scale. The scoring method makes no noteworthy difference for the SDE scale, while continuous scoring is superior when used with the IM scale. When interpreting these results, one should keep in mind that the average interval between test and retest was fairly short. While short intervals reduce the threat of measurement error due to change in the unobserved true values, they increase the threat of measurement error due to memory effects (Rammstedt, 2010). If memory effects are a more serious threat than changes in the true values, then the results presented here should be seen as upper-bound estimates of the retest reliability of the BIDR short scale. Again, assessments of the relative merits of dichotomous and continuous scoring seem likely to be unaffected, as both scoring methods are based on the exact same data.<\/p>\n<p>Tests of criterion validity do not contribute to the resolution of the question regarding scoring methods but help assess the validity of the short scale. All associations with external variables were in the expected direction and 6 out of 8 were statistically significant.<\/p>\n<p>The results presented herein make a fairly strong case for using the BIDR short scale devised by Winkler et al. (2006). Given the combination of results concerning reliability and validity, future survey researchers may want to consider this instrument as a measure of socially desirable responding.<\/p>\n<p>As mentioned in the introduction, St\u00f6ber et al. 
(2002) gave three reasons why continuous scoring may be superior: (i) the underlying processes are best conceived of as continuously distributed; (ii) possible confounding with extremity bias is reduced; (iii) all of the information in the scores is preserved. While we cannot distinguish between these explanations, our results suggest that continuous scoring is indeed superior. Taken by itself, the present paper makes only a weak case for this conclusion: The most important tests concern retest reliability; these show a significant difference for only one of the subscales (IM) and this difference is small. However, our evidence points in the same direction as previous findings. As a result, researchers faced with the decision of how to score the BIDR will find that all three publications on the topic converge on the same conclusion despite using different types of samples, different versions of the BIDR and a variety of tests. The results presented herein hence strengthen the case for using continuous rather than dichotomous scoring of BIDR scales.<\/p>\n<p align=\"center\"><strong>\u00a0<\/strong><\/p>\n<h1 style=\"text-align: left;\" align=\"center\"><strong><a href=\"https:\/\/surveyinsights.org\/wp-content\/uploads\/2017\/08\/Appendix_Schnapp_et_al.pdf\">Appendix<\/a><\/strong><\/h1>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction A major threat to the validity of survey data is socially desirable responding, \u201cthe tendency to give overly positive self-descriptions\u201d (Paulhus, 2002, p. 50). Accordingly, researchers have developed a number of scales which aim to measure this tendency. 
These scales are used to identify respondents who tend to describe themselves in an overly positive [&hellip;]<\/p>\n","protected":false},"author":521,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[41],"tags":[398,400,401,83,399,397],"class_list":["post-8708","post","type-post","status-publish","format-standard","hentry","category-questionnaire_design","tag-balanced-inventory-of-desirable-responding-bidr","tag-impression-management","tag-self-deceptive-enhancement","tag-social-desirability","tag-socially-desirable-responding","tag-test-retest-reliability"],"acf":[],"_links":{"self":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/8708","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/users\/521"}],"replies":[{"embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8708"}],"version-history":[{"count":93,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/8708\/revisions"}],"predecessor-version":[{"id":9233,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=\/wp\/v2\/posts\/8708\/revisions\/9233"}],"wp:attachment":[{"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8708"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8708"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/surveyinsights.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8708"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}