Crowdsourcing and Scholarly Culture: understanding expertise in an age of popularism

Alan Dix1, Rachel Cowgill2, Christina Bashford3, Simon McVeigh4, Rupert Ridgewell5.

1. Computational Foundry, Swansea University, UK
2. School of Music, Humanities and Media, University of Huddersfield, UK
3. School of Music, University of Illinois at Urbana-Champaign, USA
4. Department of Music, Goldsmiths, University of London, UK
5. British Library, UK

Chapter in Macrotask Crowdsourcing: Engaging the Crowds to Address Complex Problems, Editors: Vassilis-Javed Khan, Konstantinos Papangelis, Ioanna Lykourentzou, Panos Markopoulos.

draft chapter (PDF, 4.6M)

The increasing volume of digital material available to the humanities creates clear potential for crowdsourcing. However, tasks in the digital humanities typically do not satisfy the standard requirement for decomposition into microtasks, each of which must require little expertise on the part of the worker and little context of the broader task. Instead, humanities tasks require scholarly knowledge to perform, and even where sub-tasks can be extracted, these often depend on the broader context of the document or corpus from which they are extracted. That is, the tasks are macrotasks, resisting simple decomposition. Building on a case study from musicology, the In Concert project, we will explore both the barriers to crowdsourcing in the creation of digital corpora and examples where elements of automatic processing or less-expert work are possible in a broader matrix that also includes expert microtasks and macrotasks. Crucially, we will see that the macrotask–microtask distinction is nuanced: it is often possible to create a partial decomposition into less-expert microtasks with residual expert macrotasks, and to do this in ways that preserve scholarly values.

Keywords: crowdsourcing, human-computer interaction, digital humanities, macrotask, musicology, intelligent interfaces, HCI.


  • [Ac22]  Ackerman, Phyllis. (1922).  Catalogue of the Retrospective Loan Exhibition of European Tapestries, Taylor and Taylor, NY.
  • [AI07]  Ahmed, E., Ipeirotis, P. and Verykios, V. (2007). Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 19 (1):1–16. doi:10.1109/TKDE.2007.9
  • [vA08] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum, 2008. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 321(5895):1465–1468.
  • [BC00]  Bashford, C., Cowgill, R. and McVeigh, S. (2000).  The Concert Life in Nineteenth-Century London Database, in Nineteenth-Century British Music Studies, 2, ed. by J. Dibble and B. Zon (Aldershot: Ashgate, 2000), 1–12.
  • [Be04]  Bell, D. (2004).  Infinite Archives, SubStance, Vol. 33, No. 3, Issue 105, pp. 148-161, University of Wisconsin Press.
  • [BL89]  Tim Berners-Lee (1989). Information Management: A Proposal.  CERN internal report, March 1989, May 1990.
  • [BG07]  Bhattacharya, I. and Getoor, L. (2007). Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data, 1(1):
  • [BL12]  Bodleian Library (2012/2019).  What's the Score at the Bodleian? Bodleian Library.  accessed 1/5/2019
  • [Bo46]  Borges, J. (1946).  Del rigor en la ciencia. (tr. ‘On Exactitude in Science’) Los Anales de Buenos Aires 1.3 (Mar. 1946):53
  • [BS97]  Brown, J. and Stratton, S. (1897).  British Musical Biography: a dictionary of musical artists, authors and composers, born in Britain and its colonies.  S.S. Stratton, Birmingham.
    OCR text: 
    searchable and data version:
  • [DT15] Justin Cheng, Jaime Teevan, Shamsi T. Iqbal, and Michael S. Bernstein. 2015. Break It Down: A Comparison of Macro- and Microtasks. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15). ACM, New York, NY, USA, 4061-4064. DOI: 10.1145/2702123.2702146
  • [CL97]  Concert Life in 19th-Century London database project, funded by the University of Huddersfield and Oxford Brookes University (1997–2001), and the Arts and Humanities Research Board (UK) and University of Leeds (2001–04).
  • [CP04]  Concert Programmes online database.  created 2004–2007. accessed 29/9/2018.
  • [CR12]  Cowgill, R. and Poriss, H. (eds) (2012).  The Arts of the Prima Donna in the Long Nineteenth Century. Oxford University Press.
  • [DB00]  Dix, A., Beale, R. and Wood, A. (2000).  Architectures to make Simple Visualisations using Simple Systems.  Proc. AVI2000, ACM, pp. 51–60.
  • [DC14]  Alan Dix, Rachel Cowgill, Christina Bashford, Simon McVeigh, and Rupert Ridgewell. 2014. Authority and Judgement in the Digital Archive. In Proceedings of the 1st International Workshop on Digital Libraries for Musicology (DLfM '14). ACM, New York, NY, USA, 1-8. DOI:10.1145/2660168.2660171
  • [DC16]  A. Dix, R. Cowgill, C. Bashford, S. McVeigh and R. Ridgewell (2016).  Spreadsheets as User Interfaces. Proc. AVI2016, ACM, pp.192-195.  DOI: 10.1145/2909132.2909271
  • [Dx19] Dix, A. (2019).  Creativity – understanding and enhancing technical creativity and innovation. Accessed 11/1/2019.
  • [DP18]  Distributed Proofreaders (2018).  Distributed Proofreaders: preserving history one page at a time.  Accessed 2/9/2018
  • [Du46]  Dunn, H. (1946). Record Linkage. American Journal of Public Health 36 (12): pp. 1412–1416. doi:10.2105/AJPH.36.12.1412
  • [FS17]  Florian Fink, Klaus U. Schulz, and Uwe Springmann. 2017. Profiling of OCR'ed Historical Texts Revisited. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage (DATeCH2017). ACM, New York, NY, USA, 61-66. DOI: 10.1145/3078081.3078096
  • [GS10]  Di Gioia, M., Scannapieco, M. and Beneventano, D.  (2010).  Object Identification across Multiple Sources. Proc. of the Eighteenth Italian Symposium on Advanced Database Systems, SEBD 2010, Rimini, Italy, June 20–23, 2010.
  • [Go16]  Michael Gove (2016) Sky News interview with Faisal Islam, 6 June 2016.
  • [Gr00]  Grove, George, ed.; A Dictionary of Music and Musicians 1450–1889 (1900).
  • [HA15] Daniel Haas, Jason Ansel, Lydia Gu, and Adam Marcus. 2015. Argonaut: macrotask crowdsourcing for complex data processing. Proc. VLDB Endow. 8, 12 (August 2015), 1642-1653. DOI: 10.14778/2824032.2824062
  • [IC16] In Concert (2014–2016).  accessed 3/1/2016
  • [LT18]  Leverhulme Trust (2018).  Research Project Grants. Accessed 4/9/2018.
  • [MV92] McVeigh, S. (1992–2014)  Calendar of London Concerts 1750–1800. (Dataset) Goldsmiths, University of London.
  • [NA12]  Nikolov, A.,  d'Aquin, M., and Motta, E. (2012). Unsupervised learning of link discovery configuration. In Proc. ESWC'12, Springer-Verlag, Berlin, Heidelberg, 119–133. doi: 10.1007/978-3-642-30284-8_15
  • [ND16] T. Nurmikko-Fuller, A. Dix, D. M. Weigl, and K. R. Page (2016) In Collaboration with In Concert: Reflecting a Digital Library as Linked Data for Performance Ephemera. In Proceedings of the 3rd International workshop on Digital Libraries for Musicology (DLfM 2016). ACM, New York, NY, USA, 17-24.  DOI: 10.1145/2970044.2970049
  • [OR18]  OpenRefine: Reconciliation Service API. (accessed 24/9/2018).
  • [REF12] Part 2D: Main Panel D criteria, Panel criteria and working methods, REF2014, Research Excellence Framework.  January 2012.
  • [RS06]  Rendle, S. and Schmidt-Thieme, L. (2006). Object identification with constraints. Data Mining, 2006, 1026–1031.
  • [Ri07]  Chris Rusbridge (2007). Arts and Humanities Data Service decision. DCC News, 6 June, 2007. Digital Curation Centre.
  • [ST15]  Scannapieco, M., Tosco, L., Valentino, L., Mancini, L., Cibella, N., Tuoto T. and Fortini, M. (2015). Relais User's Guide – Version 3.0. Technical Report, Italian National Institute of Statistics (Istat). July 2015, doi:10.13140/RG.2.1.1332.5922
  • [SL18] Heinz Schmitz and Ioanna Lykourentzou. 2018. Online Sequencing of Non-Decomposable Macrotasks in Expert Crowdsourcing. Trans. Soc. Comput. 1, 1, Article 1 (January 2018), 33 pages. DOI: 10.1145/3140459
  • [TM16]  Transforming Musicology (accessed 3/1/2016).
  • [VG14]  Vobl, Thorsten, Annette Gotscharek, Uli Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2014. “PoCoTo - an Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts.” In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 57–61. DATeCH ’14. New York, NY, USA: ACM. doi: 10.1145/2595188.2595197.
  • [Wi19]  Wikipedia (2019)  Arts and Humanities Data Service.  Accessed 1/5/2019


Figure 1. The digital archive process (from [DC14]).


Figure 2. Expertise and task decomposition.


Figure 3. Residual expert macrotasks.


Figure 4. Decomposing microtasks.


Figure 5. Microtasks lead to understanding.


Figure 6. Portion of Brown and Stratton's British Musical Biography [BS97].


Figure 7. Prototype web interface for link checking.


Figure 8. Links displayed with provenance.


Figure 9. Printed spreadsheet for grouping by hand.

Alan Dix 18/9/2019