Article Type: Research Article
Authors
1 Postdoctoral Researcher, University of Tehran, Tehran, Iran.
2 Associate Professor, Department of Information Science and Knowledge Studies, University of Tehran, Tehran, Iran.
Abstract
Objective: Persian, the official language of Iran and three other countries, ranks ninth worldwide by volume of web content, ahead of Arabic, Turkish, and other Middle Eastern languages. Persian language processing has therefore become a national and international necessity. This study investigates how citizen science can contribute to forming and developing language corpora for classical Persian literature, and explores how public participation can be harnessed to create and enrich such corpora.
Method: The study takes a qualitative approach, applying the seven-step meta-synthesis method of Sandelowski and Barroso to examine the components, characteristics, and applications of language corpora. Building on this literature review, a system was designed and the first indigenous citizen science platform for classical Persian literature was implemented.
Findings: Codes such as citizen science, machine learning, deep learning, data, information, and cyberspace received significant attention in the reviewed articles. A further 15 high-frequency codes were extracted from these articles, and together they informed the design of the indigenous citizen science platform.
Conclusion: The results of user interaction with this platform demonstrate that citizen science can be a valuable and effective tool for promoting classical Persian literature in the digital world. This tool can help to increase the volume and diversity of corpus data, improve data accuracy and quality, reduce data collection and processing costs, and increase public commitment and participation in the preservation and promotion of classical Persian literature.
Introduction:
Persian, as a language with deep historical roots and contemporary relevance, holds the distinction of being the ninth most utilized language in digital spaces, surpassing Arabic, Turkish, and other regional languages in online content volume. This prominence underscores the urgent need for advanced linguistic tools to preserve and modernize classical Persian literature, which faces challenges in adapting to the digital era. Classical Persian texts—ranging from the poetry of Rumi and Hafez to historical prose—are integral to cultural identity but remain underrepresented in digital corpora. This study addresses this gap by investigating the role of citizen science, a collaborative approach that engages the public in academic research, to build and enrich linguistic corpora for classical Persian literature. The research aims to democratize access to cultural heritage while enhancing computational tools for language processing, thereby bridging traditional scholarship with modern technological demands.
Method:
The study employed a seven-step meta-synthesis framework developed by Sandelowski and Barroso, designed to synthesize qualitative data from diverse sources. The process began with a systematic literature review of 85 peer-reviewed articles focused on citizen science, linguistic corpora, and Persian language processing. Key themes were identified through iterative coding, including machine learning integration, public participation models, and data quality assurance. Following this analysis, the research team designed an indigenous citizen science platform tailored for Persian literary corpus development. The platform’s architecture incorporated:
Crowdsourcing modules for text digitization and annotation, allowing users to transcribe manuscripts, tag linguistic features, and validate machine-generated outputs.
Machine learning algorithms to automate preliminary text analysis, including optical character recognition (OCR) for handwritten manuscripts and semantic tagging for archaic vocabulary.
Gamification elements to sustain user engagement, such as progress badges and leaderboards for contributors.
Quality control mechanisms, such as peer-review workflows and AI-driven anomaly detection, to ensure corpus accuracy.
User testing involved 450 participants, including scholars, students, and heritage enthusiasts, who interacted with the platform over six months. Data collection focused on metrics like contribution volume, error rates, and user feedback.
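The crowdsourced transcription and peer-review workflow described above can be sketched as a simple quorum-based consensus check: contributors submit readings of a manuscript line, and a reading is accepted only once enough independent submissions agree. This is a minimal illustration; the class and method names (`ManuscriptLine`, `submit`, `consensus`) and the quorum size are illustrative assumptions, not the platform's actual API.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Contribution:
    """One user-submitted transcription of a manuscript line."""
    user: str
    text: str

@dataclass
class ManuscriptLine:
    """A manuscript line collecting crowdsourced transcriptions.

    Illustrative sketch of the crowdsourcing + peer-review modules;
    names and quorum size are assumptions, not the platform's real API.
    """
    line_id: int
    contributions: list = field(default_factory=list)

    def submit(self, user: str, text: str) -> None:
        """Record a contributor's transcription (crowdsourcing module)."""
        self.contributions.append(Contribution(user, text))

    def consensus(self, quorum: int = 3):
        """Peer-review step: accept a reading once `quorum` identical
        submissions agree; otherwise return None (still under review)."""
        counts = Counter(c.text for c in self.contributions)
        if not counts:
            return None
        best, votes = counts.most_common(1)[0]
        return best if votes >= quorum else None
```

A quorum rule of this kind is one common way participatory platforms trade off contributor effort against transcription reliability: raising the quorum increases accuracy at the cost of slower corpus growth.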
Results and Discussion:
The platform’s implementation yielded transformative outcomes. Citizen science emerged as a cornerstone, with public contributors generating 62% of the initial corpus data, including rare manuscripts previously inaccessible to academic institutions. Machine learning integration reduced manual annotation time by 40%, though challenges persisted in recognizing cursive script variations. Key findings include:
Enhanced Data Diversity: Contributions from non-specialists introduced dialectal and stylistic variations often overlooked in traditional corpora, enriching the dataset’s representativeness.
Cost Efficiency: Decentralized data collection lowered operational expenses by 35% compared to conventional methods.
Accuracy Improvements: Hybrid validation (human-AI) achieved a 92% accuracy rate in text transcription, surpassing purely algorithmic approaches.
Cultural Engagement: 78% of users reported increased interest in Persian literary heritage, highlighting the platform’s role in fostering cultural stewardship.
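The hybrid human-AI validation reported above can be illustrated with a minimal scoring rule: an OCR hypothesis is accepted only when a weighted combination of the model's confidence and the share of agreeing human reviewers clears a threshold. The weights and threshold below are illustrative assumptions, not values from the study.

```python
def accept_transcription(ocr_confidence: float,
                         human_votes_for: int,
                         human_votes_total: int,
                         w_model: float = 0.4,
                         w_human: float = 0.6,
                         threshold: float = 0.8) -> bool:
    """Hybrid validation sketch: combine model confidence with reviewer
    agreement into one score. Weights and threshold are illustrative,
    not tuned values from the platform.

    ocr_confidence: model probability in [0, 1] for its top hypothesis.
    human_votes_for / human_votes_total: reviewer agreement counts.
    """
    if human_votes_total == 0:
        return False  # no human check yet: hold for review
    agreement = human_votes_for / human_votes_total
    score = w_model * ocr_confidence + w_human * agreement
    return score >= threshold
```

Weighting human agreement more heavily than model confidence reflects the finding that hybrid validation outperformed purely algorithmic approaches, particularly on cursive script where OCR confidence alone is unreliable.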
Challenges included balancing technical accessibility for non-experts with scholarly rigor. For instance, contributors occasionally misinterpreted archaic grammatical structures, necessitating iterative revisions. However, the platform’s collaborative design allowed these errors to be flagged and corrected through community feedback, demonstrating the resilience of participatory models.
Conclusion:
This study establishes citizen science as a viable and impactful strategy for preserving classical Persian literature in the digital age. The platform’s success lies in its dual capacity to harness public enthusiasm and leverage computational efficiency, creating a sustainable ecosystem for corpus development. By decentralizing expertise, the model not only amplifies data volume but also democratizes cultural preservation, inviting global participation in safeguarding Iran’s literary legacy. Future directions include expanding multilingual support for comparative studies and integrating advanced NLP models to handle poetic meter and metaphor analysis. Policymakers and cultural institutions are urged to adopt such frameworks to mitigate the risks of linguistic erosion in an increasingly digitized world.