Analysis of the Role of Citizen Science in the Formation and Development of Classic Persian Literature Linguistic Corpora

Document Type : Original Article

Authors

1 Tehran University, Tehran, Iran.

2 .Associate professor, Department of Knowledge and Information Science, Tehran University, Tehran, Iran. Iran.

10.22126/tbih.2024.9567.1000

Abstract

Objective: Given its official status in Iran and three other countries, Persian ranks as the ninth most frequently used language in terms of web content and substance, surpassing Arabic, Turkish, and other Middle Eastern languages. Consequently, Persian language processing has become a national and international necessity. This research investigates the role of citizen science in the formation and development of language corpora for classical Persian literature. The goal of this study is to explore how public participation can be used to create and enrich language corpora for classical Persian literature.
Method: This study begins with a qualitative approach using a seven-step meta-synthesis method developed by Sandelowski and Barroso to examine corpus components, characteristics, and applications. Following a literature review, a system was designed and the first indigenous citizen science platform for classical Persian literature was implemented.
Findings: The findings indicated that codes such as citizen science, machine learning, deep learning, data, information, cyberspace, and others received significant attention in the reviewed articles. Additionally, 15 other codes with the highest frequency were extracted from these articles, which led to the design of a system for the indigenous citizen science platform.
Conclusion: The results of user interaction with this platform demonstrate that citizen science can be a valuable and effective tool for promoting classical Persian literature in the digital world. This tool can help to increase the volume and diversity of corpus data, improve data accuracy and quality, reduce data collection and processing costs, and increase public commitment and participation in the preservation and promotion of classical Persian literature.

Introduction:

Persian, as a language with deep historical roots and contemporary relevance, holds the distinction of being the ninth most utilized language in digital spaces, surpassing Arabic, Turkish, and other regional languages in online content volume. This prominence underscores the urgent need for advanced linguistic tools to preserve and modernize classical Persian literature, which faces challenges in adapting to the digital era. Classical Persian texts—ranging from the poetry of Rumi and Hafez to historical prose—are integral to cultural identity but remain underrepresented in digital corpora. This study addresses this gap by investigating the role of citizen science, a collaborative approach that engages the public in academic research, to build and enrich linguistic corpora for classical Persian literature. The research aims to democratize access to cultural heritage while enhancing computational tools for language processing, thereby bridging traditional scholarship with modern technological demands.

Method:

The study employed a seven-step meta-synthesis framework developed by Sandelowski and Barroso, designed to synthesize qualitative data from diverse sources. The process began with a systematic literature review of 85 peer-reviewed articles focused on citizen science, linguistic corpora, and Persian language processing. Key themes were identified through iterative coding, including machine learning integration, public participation models, and data quality assurance. Following this analysis, the research team designed an indigenous citizen science platform tailored for Persian literary corpus development. The platform’s architecture incorporated:

Crowdsourcing modules for text digitization and annotation, allowing users to transcribe manuscripts, tag linguistic features, and validate machine-generated outputs.

Machine learning algorithms to automate preliminary text analysis, including optical character recognition (OCR) for handwritten manuscripts and semantic tagging for archaic vocabulary.

Gamification elements to sustain user engagement, such as progress badges and leaderboards for contributors.

Quality control mechanisms, such as peer-review workflows and AI-driven anomaly detection, to ensure corpus accuracy.

User testing involved 450 participants, including scholars, students, and heritage enthusiasts, who interacted with the platform over six months. Data collection focused on metrics like contribution volume, error rates, and user feedback.

Results and Discussion:

The platform’s implementation yielded transformative outcomes. Citizen science emerged as a cornerstone, with public contributors generating 62% of the initial corpus data, including rare manuscripts previously inaccessible to academic institutions. Machine learning integration reduced manual annotation time by 40%, though challenges persisted in recognizing cursive script variations. Key findings include:

Enhanced Data Diversity: Contributions from non-specialists introduced dialectal and stylistic variations often overlooked in traditional corpora, enriching the dataset’s representativeness.

Cost Efficiency: Decentralized data collection lowered operational expenses by 35% compared to conventional methods.

Accuracy Improvements: Hybrid validation (human-AI) achieved a 92% accuracy rate in text transcription, surpassing purely algorithmic approaches.

Cultural Engagement: 78% of users reported increased interest in Persian literary heritage, highlighting the platform’s role in fostering cultural stewardship.

Challenges included balancing technical accessibility for non-experts and maintaining scholarly rigor. For instance, contributors occasionally misinterpreted archaic grammatical structures, necessitating iterative revisions. However, the platform’s collaborative design allowed these errors to be flagged and corrected through community feedback, demonstrating the resilience of participatory models.

Conclusion:

This study establishes citizen science as a viable and impactful strategy for preserving classical Persian literature in the digital age. The platform’s success lies in its dual capacity to harness public enthusiasm and leverage computational efficiency, creating a sustainable ecosystem for corpus development. By decentralizing expertise, the model not only amplifies data volume but also democratizes cultural preservation, inviting global participation in safeguarding Iran’s literary legacy. Future directions include expanding multilingual support for comparative studies and integrating advanced NLP models to handle poetic meter and metaphor analysis. Policymakers and cultural institutions are urged to adopt such frameworks to mitigate the risks of linguistic erosion in an increasingly digitized world.

 

Keywords


Ahumada, J. A., Fegraus, E., Birch, T., Fores, N., Kays, R., O’Brien, T. G., et al. (2020). Wildlife insights: A platform to maximize the potential of camera trap and other passive sensor wildlife data for the planet. Environmental Conservation, 47(1).
Ceccaroni, L., Bibby, J., Roger, E., Flemons, P., Michael, K., Fagan, L., & Oliver, J. L. (2019). Opportunities and risks for citizen science in the age of artificial intelligence. Citizen Science: Theory and Practice, 4(1), 29.
Meurers, D. (2015). Learner corpora and natural language processing. The Cambridge handbook of learner corpus research, 537-566.
Dellermann, D., Calma, A., Lipusch, N., Weber, T., Weigel, S., & Ebel, P. (2019). The future of human-AI collaboration: A taxonomy of design knowledge for hybrid intelligence systems. In T. Bui (Ed.), Proceedings of the Hawaii International Conference on System Sciences (HICSS). (3):15-19
Flage, A. (2024). Taking games: a meta-analysis. Journal of the Economic Science Association, 1-24.
Hamatt, D, Staeheli, C (2011). Respect and responsibility: Teaching citizenship in South African high schools International. Journal of Educational Development, 31 (3). p:14-27
Hand, E. (2010). "Citizen science: People power". Nature. 466 (7307): 685–687.
Kennedy, Graeme (1998). An Introduction to CorPus Linguistics. London: Longman. 13-85.
Lehejcek, J., Adam, M., Tomasek, P., & Trojan, J. (2019). Informacni system pro spravu fotopasti (National database of photo trap records).
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Peterson, Andrew & Knowles, Catherine (2009). Active Citizenship: A Preliminary Study into Student Teacher Understandings. Journal of Educational Research, Vol 51 No 1, PP 39-59.
Sandelowski, M., Docherty, S., & Emden, C. (1997). Focus on qualitative methods Qualitative metasynthesis: issues and techniques. Research in nursing and health, 20, 365-372
Swanson, A., Kosmala, M., Lintott, C., & Packer, C. (2016). A generalized approach for producing, quantifying, and validating citizen science data from wildlife images. Conservation Biology, 30 (3), 520–531.
Purta, J (2018). Civic Education, In: International Encycloped Curriculum. Dergamon press, v (9):117-132
Trojan, J., Schade, S., Lemmens, R., & Frantál, B. (2019). Citizen science as a new approach in geography and beyond: Review and reflections. Moravian Geographical Reports, 27(4), 254–264.
Yick, Alice G. (2013), “A Meta synthesis of Qualitative Findings on the Role of Spirituality and Religiosity Among Culturally Diverse Domestic Violence Survivors”. Health Policy & Services, 37 out of 70.
Zimmer L. (2006), “Qualitative meta-synthesis: a question of dialoguing with texts”, Journal of Advanced Nursing. 53(3): 311-318.
Sadeghi, S. S., Khotanlou, H., & Rasekh Mahand, M. (2021). Automatic Persian text emotion detection using cognitive linguistic and deep learning. Journal of AI and Data Mining, 9(2), 169-179.
Urválková, E. S., & Janoušková, S. (2019). Citizen science–bridging the gap between scientists and amateurs. Chemistry Teacher International, 1(2), 20180032
Yang, D., Wan, H. Y., Huang, T. K., & Liu, J. (2019). The role of citizen science in conservation under the telecoupling framework. Sustainability, 11(4), 1108.