Psychological Assessment, 2012, Vol. 24, No. 2, 476–489
© 2011 American Psychological Association. 1040-3590/11/$12.00 DOI: 10.1037/a0026100

The Wartegg Zeichen Test: A Literature Overview and a Meta-Analysis of Reliability and Validity

Jarna Soilevuo Grønnerød, Fredrikstad, Norway
Cato Grønnerød, University of Oslo

This article was published Online First November 7, 2011.

All available studies on the Wartegg Zeichen Test (WZT; Wartegg, 1939) were collected and evaluated through a literature overview and a meta-analysis. The literature overview shows that the history of the WZT reflects geographical and language-based processes of marginalization, in which relatively isolated traditions have lived and vanished in different parts of the world. The meta-analytic review indicates a high average interscorer reliability of r_w = .74 and a high validity effect size of r_w = .33 for studies with clear hypotheses. Although the results were strong, we conclude that WZT research has not been able to establish cumulative knowledge of the method because of the isolation of research traditions.

Keywords: Wartegg, meta-analysis, reliability, validity, history

The Wartegg Zeichen Test (WZT, or Wartegg Drawing Completion Test) was introduced by Ehrig Wartegg (1939) as a method of personality evaluation within the Gestalt psychological tradition in Leipzig, Germany (on the early history of the method, see Klemperer, 2000; Lockot, 2000; Roivainen, 2009). The WZT form consists of a standard A4-sized paper sheet with eight 4 cm × 4 cm squares in two rows on the upper half of the sheet. A simple sign is printed in each of the squares (see Figure 1). The test person's task is to make a complete drawing using the printed sign as a part of the picture (see Figures 2 and 3) and then give a short written explanation or title of each drawing on the lower part of the sheet.

Wartegg's (1939) early work includes a presentation of how different personality types (synthesizing, analytical, and integrated) react in different ways to the small and simple geometrical figures, producing drawings according to the person's typical ways of perceiving and reacting (see also Roivainen, 2009; Wass & Mattlar, 2000). Theoretically, the Wartegg traditions can be categorized into analytical systems of interpretation, which regard the printed signs as visual stimuli (e.g., Kukkonen, 1962a; Takala, 1957; Takala & Hakkarainen, 1953), and dynamic systems (e.g., Crisi, 1998; Gardziella, 1985; Kinget, 1952; Lossen & Schott, 1952; Wass & Mattlar, 2000), which argue that the printed signs have certain symbolic meanings representing certain areas of individual psychology (Tamminen & Lindeman, 2000). As an example of the latter, Kinget (1952) proposed that the printed signs in Square 3 give an impression of rigidity, order, and progression, linking the interpretation of the test persons' drawings to achievement motivation. Similarly, the dot in Square 1 is placed right in the middle, linking the interpretation to images of the self. The symbolic hypothesis is problematic, however, and has been criticized for the lack of empirical verification (Tamminen & Lindeman, 2000). In general, the theoretical groundwork of the Wartegg method is inadequate, reflecting its development within scattered traditions. We explain this historical process in more detail later in this article.

The methods of interpreting the WZT protocols vary from approaches emphasizing qualitative interpretation (e.g., Avé-Lallemant, 1978; Gardziella, 1985; Wartegg, 1953) to more quantitative scoring systems (e.g., Crisi, 1998, 1999, 2008; Kinget, 1952; Puonti, 2005; Takala, 1957; Wass & Mattlar, 2000). The scoring categories of drawing performance include drawing time, the order of the squares drawn, possible refusals, the size of the drawings, the content of the drawings, crossing of the borders of the squares, shading, drawing line quality, and the written title of the drawing. Although many elements are common, slight differences in scoring definitions between authors and traditions occur. The interpretation is related to many different, but unfortunately not always commonly used or well defined, personality functions. For example, in Wass and Mattlar's (2000) scoring system, which is based on Gardziella (1985), the personality characteristic variables are the following: vitality, initiation and activity, ambition, expansion, spontaneity, energy, ego strength, ego control, independency, objectivity, subjectivity, interest in emotional interaction, empathic ability, and egoism. The Crisi (2008) system produces three types of evaluations of an individual: first, a qualitative description of eight personality areas, one corresponding to each WZT square; second, a three-level classification of the individual's emotional, cognitive, and social maturity (reached, partially reached, not reached); and finally, a clinical evaluation (well-structured personality, need for a closer evaluation, and psychopathological condition).

The appropriateness of the drawing in relation to the printed sign is a key aspect in many scoring systems, the basic rationale being that the ability to perceive and respond to the test stimuli
corresponds to behavior in social environments. The WZT thus Jarna Soilevuo Grønnerød, Fredrikstad, Norway; Cato Grønnerød, De- partment of Psychology, University of Oslo, Oslo, Norway. falls into the category of a “projective” test, or what now more Correspondence concerning this article should be addressed to Cato precisely should be labeled a performance-based (Meyer & Kurtz, Grønnerød, Department of Psychology, University of Oslo, P.O. Box 1094 2006) or free-response method. It seems to have the same type of Blindern, NO-0317 Oslo, Norway. E-mail: [email protected] attraction as the Rorschach method in that the ideographical nature 476 and that empirical research has not managed to scope of its use. Copyright 1997 Hogrefe Verlag GmbH & Co. Pereira. Italy (Crisi. Finland (Kuuskorpi & Keskinen. & Biedma. Example of a simple drawing solution (drawn by the first author based on client material). 1952). Takala. Although Lallemant. 1998. The method several test manuals. 1954) original work. 2003). . 1953. Göttingen. In addition to Wartegg’s (1939. Primi. even though they have admitted Noronha. Biedma & D’Alfonso. Spain (Wartegg. Vetter. WARTEGG ZEICHEN TEST: META-ANALYSIS 477 Figure 1. and Indonesia (Kinget. have recommended against the use of the WZT in Switzerland ously highly problematic for a method used in psychological (Diagnostikkommission des Schweizerischen Verbandes fu¨r Be- practice. the Finnish Committee on Psychological Assess- that the Wartegg method is commonly used in Brazil (de Godoy & ment (Testilautakunta. the WZT included. from the Rorschach method. that the scientific status of the methods used may vary. Wartegg. the ments in the last decade have established the Rorschach method as United States (Kinget. Recent surveys among practicing psychologists have confirmed 2003). 1960). Vetter. Göttingen. which is obvi. develop. the United States by Beck (1930) and Hertz (1935). any method. 2005. 
& Ari’an- the Wartegg traditions in different non-English-speaking countries Gafni. 1969. France (Fetler-Sapin. Sweden (Wass & Mattlar. The Wartegg Zeichen Test form. D’Alfonso. 1974). of academic psychologists in Finland and in Brazil have criticized Figure 2. Italy (Ceccarelli. the theoretical and methodological quality Committees evaluating and monitoring psychological test use of empirical Wartegg studies varies considerably. de Oliveira. Primi. 1991. 1962b. Dantas. introductions. Petzold. of the material appeals to the interpretative skills of clinicians. egg. a personality assessment instrument with empirical backing similar Finland (Gardziella. & Cobeˆro. Ben-Asa. and Switzerland (Deinlein & European in origin. Israel (Ave´-Lallemant. to our knowledge. 1965). 1985. 2000). However. 2000). Noronha. KG. 1999. 2001). and translations have been has since evolved through the interplay of theory and empirical published from the 1940s to the present day in Germany (Ave´- work expressed in the major Rorschach scoring systems. 2008). Lossen & Schott. & Alchieri. to that of other instruments (Society for Personality Assessment. 1957). Argentina (Ave´-Lallemant. Reprinted with permission. 2002). 2005. 1952. rufsberatung. 1960). many challenges still lie ahead (Meyer & Archer. even though the occasional contact. method variables. Germany. 1978. 2004) and in Brazil (Conselho Federal de Psicologia. 1962a. The scientific development of the WZT is markedly different. A number 2003). Also. Copyright 1997 Hogrefe Verlag GmbH & Co. as cited in Roivainen. 2009). Wart- 2005). 1972). has not been verified by any create cumulative knowledge on the reliability and validity of the surveys other than those cited here. In this article. This indicates that the have evolved in relative isolation from each other. with only test is or has been widely known around the world. 1952). Reprinted with permission. KG. Germany. 2005. Kukkonen. Uruguay (de Go´mez & Go´mez however. Renner. 2001. 
the Rorschach method was first introduced in Boss. 1959. & Santarem. 2008) has seen no reason to ban the use of Noronha. we argue that Pinilla. Tamminen and Lin. and used this new bibliography as a base to choose studies for a global 2006. 2007. Carl-Erik and psychosocial database). Padilha. Nummenmaa and Hyöna¨ turned out to be responsible for most irrelevant hits. returned more than 78. Japan) ⫽ 15. A large portion of the hits in a has not been easily available because traditions of research and particular database also appeared in other databases. 1999). The development of online databases and search Grønnerød & Grønnerød. the use of the WZT in psychological practice. see Soilevuo relevant material. 2010). and Bib- problem may not be the complete lack of research on the WZT. meta-analytic review. have argued for the practical value of the method they research more visible and available in a way that was not previously feel they experience in their everyday work (Heiska. Wartegg empirical research (de Souza. 2005. that the literature studying the Wartegg method checked the first 600 of these. Med- Mattlar has been working on a literature review on Finnish War. Nevanlinna. while the WZT finds interested audiences among schol- of proven validity and the extremely small amount of published ars and practitioners in other parts of the world. When the first author of this We first carried out an extensive literature search in several article asked a librarian to perform the widest and most complete databases during the month of September 2007. Thus. The following databases institute in Italy. databases until recently. pendent on the authors’ knowledge of and physical access to Tamminen & Lindeman. but they were systems of interpretation have developed relatively separately. 2005. Reprinted with permission. Example of a more complex drawing solution (drawn by the first author). Noronha. 2000). 2002. ProQuest (dissertation 1950s. 
literature references have escaped the research bibliographies and 2004. Tamminen & Lindeman. conducted searches using the national academic library databases lished manuscript with 80 references (Mattlar.478 SOILEVUO GRØNNERØD AND GRØNNERØD Figure 3. while references to em. the Linda in Finland (42 hits).000 hits (also excluding restaurant). The search term in Wartegg search that was possible in the early 1990s. with a peak in the Web of Knowledge (citation index) ⫽ 14. Roivainen (2009) was able to find 88 hits in pharmacology database) ⫽ 7. PsycINFO (psychology database) ⫽ 91. & Fagan. We also tegg studies and has kindly given us a copy of this as-yet unpub. nonetheless valuable as sources of correction and completion of without cross-references. 1997. excluding hesse and schloss. Ojanen. NII-ELS (general database. KG. Bibsys in Norway (4 hits). 2005. 2008). academic theses . PubMed The Wartegg handbooks cited earlier typically refer to a small (medicine database) ⫽ 7. NORART (Nordic journals). Nummenmaa & Hyöna¨. Copyright 1997 Hogrefe Verlag GmbH & Co. 2009. how The small number of empirical studies using the Wartegg reliable and valid is the method? To answer these questions. Pereira et al. and WorldCat Beta pirical studies are few. has published an online bibliography with 50 did not return any results: CSA (sociology database).se in Sweden (12 hits). HAPI (health entries. 18 hits in a later search). Fifteen years later. books. actually been studied? Second. Göttingen. 2003. Noronha. Roivainen. We found the (2005) reported that they had found only 10 hits in PsycINFO. Second. based on the available studies. Puonti. on the engines in recent years has now made the whole scope of Wartegg other hand. Primi. 2008. & Miguel. first. 2010). ranging from the 1930s to the 2000s. Contrary to these authors.. database) ⫽ 5. a regular Google search the difficulties in finding the studies. It is therefore time to ask: How much has the method sonen. 
number of earlier interpretation manuals. The method is almost unknown in the English-speaking references: journal articles. LinePlus (medicine database. 2000. We initially accepted all types of barriers. Google Scholar Beta (full-text articles) ⫽ 304. test manuals. Practicing psychologists. 326) assumed that international journals are not interested in publishing articles on a method that they (unfortu. we method has indeed been noted by several Wartegg authors as well gathered all the Wartegg method references we could possibly find (Kuuskorpi & Keskinen. Germany. Ko. and we We argue. Alessandro Crisi. This separation is partly due to language already existing references. 2007. which only a dozen hits. A following numbers in each search: EMBASE (biomedical and couple of years later. Mattlar (2008) was aware of the scope of Finnish Literature Overview Wartegg research but has pointed out that literature searches on the method have given modest results. the leader of a Wartegg (library catalogues worldwide) ⫽ 117. ERIC (pedagogy database) ⫽ 1. possible. ISI PsycINFO. deman (2000. but liotek. p. Method nately quite incorrectly) claim is used only in Finland. on the Finnish debate. Earlier literature reviews have been de- 1999. the result was all databases was Wartegg. pointing out the lack countries. In addition. including few empirical works (Crisi. we calculated intercoder reliability in several almost every master’s thesis reference was from Finland. In all cases. we came to a consensus October 2009. qualitative analysis. For scorer blinding and reliabil. Next. 1973) works were excluded since the studies concern the semantic meanings of the printed Coding stimuli on the Wartegg test blank. Full-text articles in studies. In a final round. Cramer’s V in Japanese could not be translated and had to be regarded as not for multicategorical nominal data. Master’s theses were not ordered either. Italian. book reviews. The first author the whole coding manual. 
we lations of books) could be included as separate entries. Refer.g.translate. and the criteria type of publications gathered over several years to begin with.g. country of origin (defined by first author’s affiliation in Three of them examined the WZT method’s sensitivity to cultural journal articles or the country of publication in the case of books). scoring system used. In cases where subject age was only reported available electronic full-text documents were downloaded. Portuguese. we combined Not Reported with either No coding of the validity results (Andreani Dentici. conference coder had coded one coefficient and the other divided this coeffi- papers. 1973.. We then selected 20 of the 37 this stage. findings could not be readily interpreted as supportive or dismis- sented in the Appendix.g.1) ⫽ tially translated using online translation services (Google Transla. and whether there were Ryha¨nen et al. Additional intraclass correlations (ICC) for intercoder reliability was searches were made to complete and correct the database up until ICC(2. on any disagreement on the result variable. because to perform multiple regression analyses with subject age as one of they would have required extra funding that was not available at the variables without excluding the study because of missing data. Three studies were excluded for the then classified the 238 full-text publications as a study or not a result statistics calculations because the results were coded by us in study. no codable results (e. In some studies. 1979). type of statistic reported. Wassing (1974) investi- study design. we were not able to determine the correct ity checks. German. respectively. still on request despite several attempts to order them. In three studies. For validity coefficient. differences (Cuppens. and master’s theses were thus excluded). thus not presenting empirical results using the method. (1978) studied possible negative psychological reported results.. 
or codable and Bing Translator at Irrelevant or Not Reported. the time. We already had a small personal collection contrary to them or merely explorative. also coded whether the statistical test was focused. and as adults. Blinding or No Reliability Check. 1979.microsofttranslator. The first section included publication year. A study was defined as a published text presenting new a manner that did not allow direct agreement estimation (e. subject popu- and even references we were not able to verify or references that lation and age. First.. scorer blinding. and whether the result was were incomplete. whether the Every reasonable effort was made to retrieve as many of the results were in line with specific. meta-analysis studies to independently double-code according to and they were therefore regarded as unavailable. and Swedish. No Codable Results. we inserted the average age of 38 years.1) ⫽ . studies. When we closed the database. Finally. case were subsequently coded by consensus. based on 212 results from 17 studies (excluding the three studies The 37 studies were written in English. other assessment methods used in the clearly formulated hypotheses. based on the 14 other publications were ordered through the university library. Sisley’s (1972. Different versions of the same texts (e. French. publication type. Forsse´n. an excellent level. Dutch. translated texts were often grammatically imperfect. and the additional references not retrieved in the search. Additionally. French. the meaning Certain studies were excluded from the meta-analysis for the of the text was mostly clear enough for our needs. it consisted of 507 references. which stages.90. By excluding this variable. WARTEGG ZEICHEN TEST: META-ANALYSIS 479 of all levels. the average of one the classified as having one of the following three results: no coder’s results equaled the other coder’s single or average result. The number of studies with codable results was 37. 
All results with this type of disagreement tegg results). and In the next stage. which was the case when it was not whenever our language skills reliability was r ⫽ . All applied in the study. which data. Mellberg. and it was not possible to . The second section of the coding manual included effects of various anesthetic substances. the average http://www. the positive or negative We coded the 37 studies according to the coding manual pre. because we could not know Gardziella. –. & Hirvenoja. a reliability or a validity coefficient. references as possible. 1969. reliability checks were performed. and whether pared personality variables in pregnant and not pregnant women. whether gated an alternate Wartegg form.84. one empirical results in some form (unpublished works. sive of the WZT validity as a personality assessment method. section included number of subjects in the sample. conference papers. Studies written We calculated one-way random ICCs for scale data. only descriptive data). Although the automatically were resolved by consensus coding. adult samples that reported age ranges or means.g. unpublished papers. and Faisal-Cury (2005) com- scorers were blind to relevant aspects of test origin. all coding disagreements did not suffice to comprehend the text. Dutch. studies mentioning the WZT and the coding differences would therefore not affect the overall as one of the methods used but not presenting any specific War. following reasons. with result coding disagreements) and study data from all 20 Portuguese. resulted in ICC(2. was r ⫽ .. This allowed us Dissertations from the United States were not ordered. Gardziella. the third only significant results were reported.12 based on six out of 20 disagreements on whether to code tor at http://www. Reported Results. or German were fully or par. at the coding manual definitions. trans. The overall average intercoder reliability for our coding Italian. Seven publications were. sistent definitions.95. 
All studies were cient into three subcomponents). One variable. In six studies. Finnish. language. because we To ensure the coding manual provided us with clear and con- decided to exclude them from the meta-analysis. we first jointly coded 18 studies to fine-tune 238 of these were retrieved in full text. and kappa for dichotomous codable. effect size for the study. 1994. we double-coded the result coding variable (No would have made the meta-analysis biased in this respect. Ihalainen. we calculated intercoder reliability for 20 variables formed the data set for the meta-analysis. In all. or Codable Results) in 138 ence lists of the retrieved full-text publications were scanned for studies initially selected as a study by the first author. 1972). specific Wartegg method results (e. we were only able to retrieve one of its two frequent and. reference databases—not until recently. we entered formulas for the Rosenthal (1991) servation. Sugiura & Yagi. and we therefore opted to randomly select four tables for nificant at the entry of the second block. 1978b. We found a clear dip in WZT interest in the 1970s and 1980s. a 10-year age difference is much larger in younger age than in & Fischel. 2002. assume that the studies relate to a common population effect. 1979c. Other less-known Wartegg traditions have existed in priate rather than a fixed effects model since there is no reason to the Netherlands (Evers. All averages were weighted by sample size.. this is the result of variance and hierarchical stepwise linear regression procedures of the invisibility of the literature. We were surprised to see how few cross-references there procedures. It has not been possible to find in SPSS 16. 2006. 1981) were coded as Teiramaa the WZT. intelligence. 2003). Germany. 1979b.. & Evers. Takayanagi. we opted to code three nonsignificant results. Publication Type. Study Design. 1965). since the significance of. Check. 2002). 
regarded them as being of secondary importance: Publication many of them unpublished. We entered eight variables related naire. ity of the analysis and that the number of studies is too low to Caricchia. 1979c) and Crisi (1998) using an online calculator (Preacher. to our knowledge. leading to a total of 3. five scales from an intelligence measure. Subject Age (transformed). using the spreadsheet formulas these countries as well. and Japanese Wartegg tradition has not been reported in earlier War- Weinberger (2009). Takeuchi. d’Angerio.0. and the analysis rerun. and Brazil has explicit formulas and guidelines for data entry and analysis and has been documented (Roivainen. dissertations. . divided into 230 journal publications. We coded and calculated the basic meta-analytic effect sizes by Compared with earlier accounts on the scope of Wartegg use entering data and formulas into a spreadsheet.15. The isolated nature of the Wartegg traditions is a notable ob- able. parts. velopment. We used the studies in Japanese were. Literature Review Finally. If not available. 1979a. but. Hilsenroth. and Other. A random effects model is appro- been used. 40 (1978a). and the study was therefore not included (Regel. Wartegg studies typically cite sizes were interpreted according to Hemphill’s (2003) guidelines. Brazil. 2000) indicate possible growing traditions in derived from the individual results. if any at all. and Teiramaa We were able to retrieve 507 references of scholarly works on (1978b. our findings are remarkable. Japan. scales from an occupational performance assessment question- Observer. for converting different statistics into correlations whenever avail. Both transformations substantially their codable results. and exclusion level to p ⫽ . Croatian (Kostelic´-Martic´ & Jokic´-Begic´. Based on the small to samples and method. excluded since we were transformed effect sizes as input and the average sample size as not able to translate them (Daitoku & Nishimura. 
dividing them into patients and nonpatients. Although a free spreadsheet. 2005. Age was transformed by the Sugiura & Takanashi. for substantial reasons. and sizes as averages of the Fisher-transformed correlation coefficients Indonesian (Kinget. it provided (2009). 1998). five of which we entered into the first number of reported significant results in relation to the vast block of the linear regression analysis: Scorer Blinding. Although strictly a fixed model analysis. . reduced skewness and kurtosis. & Lonoce. If a variable from the first block was rendered nonsig- jects. and Miguel (2007) corre. log linear function to reduce skewness since adolescents were most 2008). older age. because the works have not been listed in that a random model analysis would unduly increase the complex. with the WZT correlated against each of the three methods. Zaal. as we tailed results of several studies by his colleagues and students. we argue relevant research. We computed study effect al. Suzuki. 1974. Among the 31 countries of origin.10. Inclusion level for the linear regression was set to p ⫽ al. The test history in Germany. In our view. 2000. De Souza. We report the random effects model correla- tegg literature at all. Diagnosis. The final three we entered into the second block. In addition. Kinget (1952). vocational interest. the been used in a meta-analysis primer by Diener. only a small number of earlier publications. say. 1979a. and Wartegg (1953) have presented in this spreadsheet.480 SOILEVUO GRØNNERØD AND GRØNNERØD determine the number of insignificant results (Brönnimann. in 15 tables of codable results. The highest sizes were then fed into a meta-analysis calculation spreadsheet reported number of Wartegg references was 88 in Roivainen made by Diener (2009). A dummy variable was created for lated 141 WZT variables with 16 scales from a personality inven- subject population. 2002) and in France. 
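The effect size pipeline described in this Method section (converting reported statistics into correlations, Fisher-transforming them, and averaging with sample-size weights) can be illustrated with a minimal sketch. This is not the authors' spreadsheet: the function names are hypothetical, the conversion formulas are the standard textbook ones for t and for chi-square with 1 degree of freedom, and weighting each result by its raw sample size N is an assumption based on the statement that all averages were weighted by sample size.

```python
import math


def r_from_t(t, df):
    """Convert a t statistic with df degrees of freedom to a correlation."""
    return math.copysign(math.sqrt(t * t / (t * t + df)), t)


def r_from_chi2(chi2, n):
    """Convert a chi-square with 1 degree of freedom (n subjects) to a correlation."""
    return math.sqrt(chi2 / n)


def weighted_mean_r(results):
    """Sample-size-weighted mean correlation via Fisher's z.

    results: iterable of (r, n) pairs, one per coded result.
    Assumption: each result is weighted by its raw sample size n.
    """
    results = list(results)
    total_n = sum(n for _, n in results)
    # Fisher z transform is atanh(r); the back-transform is tanh(z).
    mean_z = sum(n * math.atanh(r) for r, n in results) / total_n
    return math.tanh(mean_z)


if __name__ == "__main__":
    # Hypothetical coded results, not taken from any of the 37 studies.
    coded = [
        (r_from_t(2.0, 96), 98),
        (r_from_chi2(4.0, 100), 100),
        (0.35, 150),
    ]
    print(round(weighted_mean_r(coded), 3))
```

Averaging in the z metric rather than on raw correlations reduces the bias that comes from the skewed sampling distribution of r. A random effects analysis would additionally estimate between-study variance, which this sketch does not attempt.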
The references indicate that the works by tional coefficients based on Hedges and Olkin’s (1985) procedures Ave´-Lallemant (1994). representing each of the topics the author examined: de. Effect are between traditions. tory (16PF). Italy. and Recent translations of test manuals into Hebrew (Ave´-Lallemant et we want to make inferences beyond the type of studies included in our analysis (Hedges & Vevea. Katsura et weights. Free Response. Sugiura. For one study. Reliability amount of correlations. as mentioned. The number of coded results was also transformed due Two studies were challenging because of the large number of to a highly skewed distribution. Finland. & Kakudate. 2001.807 correlations. 113 books. and Finland Meta-Analysis were the most frequent. A group of estimate error variance for the independent variables. Puonti. followed by a revival in the 1990s and 2000s. and do not We ran moderator analyses using correlations. The study effect and research globally. The Year. there- fore Kuha (1981) was coded as Kuha (1973). Parnitzke. and 124 other types of publications. formed). Chi-square coefficients were calculated from figures reported in Results Teiramaa (1978a. 2010) to get more accurate results. 1979. one-way analysis build hypotheses on earlier findings. and six The criteria variable was recoded into Self-Report. 2009). Italy. Hara. the variable was removed coding. and correlations with free response methods. Primi. and Subject Takala (1964) was especially challenging in presenting the de. and Number of Coded Results (trans- number of individual results far outnumbered the number of sub. 2005). some results were reported in several publications. Population. 00 Venturino et al.68 . Thirty of the 791 results Reliability. (2006) Italy/English 181/181/40 Oth 5/11/4/0/2/5/24 36 .29 Daini.45 . The results consisted of 21 indices (e. (1991) Finland/English 50/50/50 NPg 5/8/8/2/2/0/57 1 .067/1. NPg ⫽ Non-Patients. 
& Panetta (2007) Italy/English 91/91 Oth 5/11/4/0/0/2/29 26 .75d .33 Scarpellini (1964) Italy/French 120/120 Oth 5/14/7/0/0/1/22 36 . d Internal consistency reliability.35 Crisi (1998) Italy/Italian 384/372 Som 4/11/2/0/0/0/11 6 . anxiety index.89 . 1981) from the reference list were all coded as Teiramaa (1978a). Interscorer reliability coefficients averaged to rw ⫽ .37 Markwardt (1961) DDR/German 52/52 Som 5/0/2/0/0/4/10 1 .42 Puonti (2005) Finland/Finnish 29/29/29 NPs 4/10/10/0/2/4/38 2 .17 .g.g.593). results were reported in nine samples. One covered a wide selection of scoring categories (e.. InP ⫽ Inpatients.16 Keith et al. .34 Tamminen & Lindeman (2000) Finland/Finnish 107/81 NPg 5/8/7/0/1/1/18 5 .34 Hyyppa¨ et al. DDR ⫽ former East Germany. (2007) Brazil/Portuguese 121/121 NPg 5/13/7/0/0/3/41 3 .91 169/169/169 NPs 4/10/10/0/2/4/38 2 .79 (15 260 as being in the expected direction.12 Teiramaa (1977) Finland/English 199/99 Som 3/4/2/2/0/3/34 4 . 1964) reported results from two separate samples. scales and expanding the set to 38 samples.10 . NPs ⫽ Non-Patients.067 NPg 5/14/2/0/0/3/27 64 .83 .11 Note. WARTEGG ZEICHEN TEST: META-ANALYSIS 481 The Meta-Analysis were coded as exploratory. (2003) Italy/Italian 389/389 NPg 5/0/2/0/0/2/17 1 . and variables reliability coefficients reported in 15 samples and 791 validity based on more informal evaluations (vitality.04 Takala (1953) Finland/English 60/60 NPg 5/6/7/0/0/3/22 1 .34 . (1966) USA/English 98/32 NPg 5/4/3/0/0/1/11 18 . All reliability coefficients are interscorer reliability except where noted. pen pressure.11 Konttinen & Olkinuora (1968) Finland/English 68/68/68 NPg 5/6/7/0/2/2/14 1 .14 Laukkanen (1993) Finland/Finnish 120/120 OutP 3/8/2/2/0/3/17 25 . Oth ⫽ Other.14 Togliatti et al. study (Takala. a result in the excellent range.23 Silveri et al.75 .09 Takala (1964) Children Finland/English 148/148 NPg 5/6/7/0/0/3/7 60 . 
b Publication Type/Scoring System/Design/Scorer Blinding/Reliability Check/Count of Other Method Types/Average Sample Age. drawing order.64 .77 Mellberg (1972) Finland/English 284/284/284 NPs 5/7/9/0/2/1/15 1 . c Test–retest reliability.72d Roivainen & Ruuska (2005) Finland/English 83/83/83 OutP 5/12/7/0/2/2/45 4 .06 .10 Adolescents Finland/English 583/291 NPg 5/6/7/0/0/3/18 168 . et al. Both reliability and validity extroversion. 1979a. BDR ⫽ former West Germany.87 Teiramaa (1978a)e Finland/English 199/145 Som 5/4/10/2/0/3/34 29 . Lai.13 Daini. We based the meta-analysis on 812 individual formed the basis of the main analysis. (1981) Italy/Italian 390/279/20 NPg 5/12/3/0/2/1/12 6 .12 .37 Chimenti et al.94 . Selection.27 . The Table 1 Sample Coding and Coefficients From 37 Studies Coefficient Directed Sample Country/language Na Subjects Moderatorsb Entries Reliability Validity validity Araja¨rvi et al.10 de Caro & Venturino (1991) Italy/Italian 1.14 .61 .. which Data set. attachment.29 Bokslag (1960) The Netherlands/Dutch 96/96 NPs 5/1/3/0/0/3/16 10 .32 Soilevuo Grønnerød & Grønnerød (2010) Norway/English 351/351/50 NPg 5/9/7/0/2/0/20 3 .10 .46 Brönnimann (1979) Switzerland/German 190/190/190 NPg 3/12/7/0/0/1/15 1 . Bernardini. The contents of the variables results from the 37 studies shown in Table 1 (N ⫽ 7. and abstraction level).06 Flakowski (1957) BDR/German 38/38 NPg 5/12/7/0/0/1/11 1 . e Teiramaa (1978b. schizoid self-esteem. (1994) Italy/Italian 843/843 NPg 5/1/2/0/0/2/33 73 .11 Takala & Rantanen (1964) Finland/English 200/200/200 NPg 5/6/7/0/2/2/14 25 .68 Pesonen (1970) Finland/English 127/127 NPs 5/7/9/2/0/1/11 1 . The studies reported three types of reliability were coded as being in the opposite of the expected direction. and sensitivity to stimulus).08 . Som ⫽ Somatic Patients. 1979c.09 de Souza et al. (1975) Finland/English 100/100 Som 5/4/3/0/0/3/33 7 . OutP ⫽ Outpatients. (2004) Italy/English 40/40 Som 5/1/3/2/0/4/76 1 . 
a Total N/average N/reliability coefficient N (if any).83c Kuha (1973) Finland/English 150/150 Som 3/4/2/0/0/4/33 7 .83 .09 Kuha et al. This left us with 290 results that we could define as directed based on specific hypotheses. coefficients reported in 33 samples. and results. content categories. (1974) Finland/English 151/151 InP 5/0/3/0/0/3/13 1 .17 Juurmaa & Leskinen (1966) Finland/Finnish 260/130/50 Som 5/7/2/0/1/0/14 73 .83 68/68/68 NPg 5/6/7/0/2/2/14 2 . (1991) Finland/English 651/651/50 NPg 5/8/5/0/2/4/50 4 .19 .45c Burbiel & Wagner (1984) BDR/German 37/37/37 InP 5/12/1/2/2/4/38 23 .56 Gardziella (1969) Finland/Finnish 26/26 NPs 4/8/9/2/0/4/17 4 . and control).05 Wass & Mattlar (2000) Sweden/Swedish 131/87/10 NPg 4/9/4/0/1/1/30 77 .77 .60 .14 . the remaining 501 results results from 12 samples). 1979b.09 Mattlar et al. General. When planning this meta-analytic 3-week retest periods). . p ⬍ that resemble our focus on specific hypotheses. This resulted in a list of k ⫽ 51 effect sizes many studies did not report scorer blinding or interscorer reliabil- to be compared. a large effect size (95% confidence interval [CI: . We did not have any preference for The significant heterogeneity (Q ⫽ 225. and other. The Note. p ⬍ . Dahlstrøm. we examined different criteria used in the studies. Studies reported levels around . 1998.20 WZT scoring categories.12. for a review of various levels). Contrast analyses showed relatively strong result. however (e. 2004. b Gardziella (1985). whereas Full Scorer (2000) and de Souza et al. 1962b) 2 .693) as being tas (1993).18.149. publication year 1939 was .916. based on a clear and specific hypothesis. heterogeneity Q ⫽ 304.10 is doubtful whether split-half reliability is relevant for the WZT Other 14 . p ⬍ .482 SOILEVUO GRØNNERØD AND GRØNNERØD calculations were mostly based on single scoring categories cov. Florio. effect size for focus on study quality. too many WZT studies have not been con- was significantly different. 
and we opted to use the first The most important recent critiques of the validity of the War- model. given sufficient produced rather strange predictions.74. This is Crisi (1998) 3 . we grouped samples according to scoring system used. however.26 variables.14 given the unique character of each square. Rosenthal. the tegg method have been presented by Tamminen and Lindeman model predicted an effect size of r ⫽ .30 for the MMPI are quite comparable. t(9942) ⫽ 21. diagnosis. Clearly. we were aware of the The weighted average validity coefficient for all results was debate on the Wartegg method and of the uncompromising posi- rw ⫽ . Clearly. Vetter (1952).0000).000 (averages shown in Table 2). Berry. The Berry.30 (Garb.19. 1999. Renner (1953).35. Puonti (2005). p ⬍ . and their levels of . 1943) have generally lated to variances in effect size levels. 9942) ⫽ 417. One study applied a “split-half reliabil.14. weighted ity. Given the lack of empirical support for specific Kukkonen (1962a. it is difficult to conclude whether the variations in level can be attributed to state. and Brunell-Neuleib (1999) meta- Third. either supporting the validity of the method or not. Hiller et that employed and reported scorer blinding reported. a clear result showing poor validity results would have Second. Graham. 6162) ⫽ 325. therefore. although it Diagnosis 12 .g. p ⬍ . k ⫽ number of samples. 1953). F(4. self-report. Bornstein. rw ⫽ average effect size weighted by weighted average reliability coefficient for all three reliability sample N. on average. al. 2001. Hiller. defining a good Discussion rationale and specifying hypotheses is related to powerful results. . (2007). was significantly different. the regression model identified important variables re. Meyer & Archer. Internal consistency coefficients averaged to rw ⫽ . F(7. the Thematic Apperception Test (Murray. & Grove.or trait-related characteristics. difficult.000. a types was rw ⫽ . 
been easier to communicate to the scientific community. and we see a larger variation in WZT study same was true for free response and observer compared with quality than in studies of other widespread methods. and Fourth. by sample size. & Kraemmer. 1989). The model yielded rw ⫽ .15 other study calculated internal consistency for scales based on Free response 5 . In fact. tions of the opposite camps.678. Validity. The levels were satisfactory. Also. and Wartegg (1939. The overall differences between criteria. Interpretation is somewhat more . the MMPI system. By entering No Scorer Blinding into the equation. a p ⬍ .66). showing a decline in effect We conclude. Bornstein. The results were South American systemsc 1 . Table 2 ering a wide range of phenomena but also in a few instances on Sample Effect Sizes by Criteria (k ⫽ 51) and Scoring System scales based on scoring categories where higher levels should be (k ⫽ 30) expected. Tellegen.06 calculated on the basis of both content scores and scores related to Takala (1957) 4 . t(9942) ⫽ 22. in Fourteen samples used more than one criterion. Criteria ity” procedure without specifying how the spilt was done.28 Observer 7 .47]. our skepticism grew because one effect size for each.000.26]).000 (averages shown in Table 2).87. 2001..431.. that research on the WZT may reach sizes over the 79-year period covered. when looking at studies testing specific that the difference between self-report and free response criteria hypotheses. and we calculated the course of the coding process. (Butcher.0000) cautioned the results of the analysis—we would have been equally satisfied us not to interpret the levels directly but rather to investigate with any result.18. but see Meyer. and the ducted adequately.33. Rosenthal. Scoring system Test–retest reliability.74 (three Subdivision k rw results from two samples).08 Kinget (1952) 5 .33. 
c Biedma and D’Alfonso (1960) and Frei- based on 290 results coded in 14 studies (N ⫽ 3.13 especially true given the short retest periods. since only a few studies represented each Meta-analytic studies of the Rorschach method.29 for the Rorschach and . further the influences that various factors had on the levels. Ave´-Lallemant (1978). Lossen and Schott (1952). was disappointingly Wartegg and othersa 5 . (2000) studied WZT validity against four subscales of the Person- . and clear hypotheses were presented all too seldom. We were thus surprised to find an effect size of rw ⫽ . The results from the random effects model was and Wass and Mattlar (2000).23 samples within 1 week and 1.53 (three results from two Gardziella and othersb 6 . analysis of the Rorschach and the MMPI applied inclusion criteria The systems differed significantly. on the other hand. with a weighted average of rw ⫽ .18 drawing style. and the Self-report 13 . The Hiller. a lower middle magnitude effect size (95% CI [. Tamminen and Lindeman Blinding predicted a level of r ⫽ .31 Several scoring systems 4 . as shown in Table 3. & larger effect sizes than did those that ignored scorer blinding.10 low. other moderator was publication year. The second step model levels comparable to other assessment methods. Wartegg method may become a useful addition to a practicing report methods.).449ⴱⴱⴱ a Two-tailed univariate correlation with untransformed effect size. El test de dibujos Wartegg: Su aplicacio´n en method was introduced in Peru as a part of the strong German nin˜os.) en van een verrichtingstest (Block Design uit recent years.513ⴱⴱⴱ Step 2 .. Also in our data. J. practice. 19 –52. Pensiero logico e immaginazione negli ado- typically have not referred to relevant earlier results. On specific available. Roivainen.037 . set aside. possibly applies for similar countries as well. C. for personality evaluation.513 0. Argen- to assume that this is also the case in other countries. Munich. 
we argue that there is (Spielberger. ing systems and related to different personality characteristics. Table 3 Weighted Regression Model of Moderator Influences Moderator R Constant ra B SE β Step 1 . However. Israel: Keter. however. 1969.. (1994). 2000.