
Multilingual Corpora Enable Social Science Concept Study, with Data Extracted from Company Websites and Annual Reports

Quantum Zeitgeist

The study of how new ideas emerge and spread presents a significant challenge for researchers in the social sciences and humanities, often requiring analysis of language across multiple cultures. Revekka Kyriakoglou and Anna Pappa, both from Université Paris 8, address this need by presenting a new method for constructing multilingual corpora, collections of texts used for linguistic analysis. Their work details the creation of a corpus focused on the concept of “non-innovation”, drawing data from company websites and annual reports in both French and English. The approach yields a robust and expandable resource, enabling detailed investigation of how language reflects evolving concepts, providing valuable data for advanced natural language processing applications, and ultimately offering a powerful tool for understanding the spread of ideas across different contexts.

Automated Lexicon Development for Emerging Concepts

Scientists developed a methodology for automatically constructing lexicons for emerging concepts, particularly within the context of corporate text data and innovation studies. This automated approach overcomes limitations of existing lexicon-based methods by adapting to novelty and capturing evolving terminology.

The team established a multi-stage pipeline beginning with large-scale data collection through web scraping, carefully considering legal and ethical implications. Scraped data underwent language identification and standard natural language processing steps, including tokenization and lemmatization. The core of the methodology leverages existing multilingual semantic networks, such as WordNet, BabelNet, and ConceptNet, combined with corpus-based techniques like term frequency analysis and co-occurrence analysis to identify and define emerging concepts. Identified concepts are then linked to existing knowledge graphs, enriching their semantic representation. This process culminates in a multilingual, semantically annotated corpus, named MOSAICo, designed for research purposes.

The research contributes a complete pipeline for automated lexicon development, a valuable multilingual corpus for researchers in the humanities and social sciences, and a novel integration of knowledge graphs and corpus-based techniques. The project also prioritizes FAIR data principles, ensuring the data is Findable, Accessible, Interoperable, and Reusable.
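The corpus-based step of that pipeline, co-occurrence analysis over tokenized text, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, window size, and toy documents below are invented for illustration only:

```python
from collections import Counter

def cooccurrence_counts(docs, window=5):
    """Count how often two distinct terms appear within a sliding window.

    `docs` is a list of tokenized (and ideally lemmatized) documents;
    pair keys are sorted so (a, b) and (b, a) count as the same pair.
    """
    pairs = Counter()
    for tokens in docs:
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                if tokens[i] != tokens[j]:
                    pairs[tuple(sorted((tokens[i], tokens[j])))] += 1
    return pairs

# Toy, pre-tokenized documents (illustrative, not corpus data)
docs = [
    ["non", "technological", "innovation", "strategy"],
    ["innovation", "strategy", "report"],
]
counts = cooccurrence_counts(docs, window=3)
print(counts[("innovation", "strategy")])  # prints 2: co-occurs in both documents
```

Pairs with unusually high counts relative to the individual term frequencies are typical candidates for an emerging-concept lexicon, which can then be checked against resources such as WordNet or BabelNet.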

The team explicitly addressed the ethical and legal challenges of web scraping, emphasizing responsible data collection practices and user privacy. Future work will extend the methodology to other conceptual domains, normalize metadata for improved interoperability, and integrate the corpus into larger data ecosystems like SDHSS, fostering collaboration and open science.

Multilingual Corpus Construction for Innovation Studies

Scientists engineered a hybrid methodology for constructing a multilingual corpus, designed to facilitate the study of concepts within the humanities and social sciences, and demonstrated this approach through an investigation of “non-technological innovation”.

The team established the corpus using two complementary sources: automatically extracted textual content from company websites and annual reports systematically collected and filtered based on criteria including year, format, and avoidance of duplication. This pipeline incorporates automatic language detection, filtering of irrelevant content, extraction of pertinent segments, and enrichment with structural metadata to ensure data quality and consistency.

From this initial corpus, the team created a derived dataset specifically for machine learning applications, focusing on the English language. For each instance of a term identified from an expert lexicon, scientists extracted a contextual block encompassing five sentences, two preceding and two following the sentence containing the term, to capture surrounding linguistic information. Each occurrence then received annotation with its associated thematic category, structuring the data for supervised classification tasks and enabling detailed analysis of conceptual variations.

This methodology guarantees reproducibility and extensibility, providing a resource suitable for analyzing lexical variability around emerging concepts and generating datasets for natural language processing. Researchers meticulously documented the entire process, ensuring traceability of results and ethical compliance in data sharing. This work builds upon existing research utilizing web-sourced data, confirming its value for observing lexical and semantic variability in real-world contexts, and leverages advancements in automatic language identification to enhance data collection and pre-processing pipelines. The resulting corpus offers a robust foundation for investigating complex and evolving concepts within the humanities and social sciences.
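The five-sentence context-block extraction can be sketched as follows. This is a hedged approximation, not the published pipeline; the example sentences, the one-term lexicon, and the simple whitespace-token matching heuristic are all illustrative assumptions:

```python
def extract_context_blocks(sentences, lexicon, before=2, after=2):
    """For each sentence containing a lexicon term, return a block of up to
    five sentences: two before, the matching sentence, and two after
    (truncated at document boundaries)."""
    blocks = []
    for i, sent in enumerate(sentences):
        hits = set(sent.lower().split()) & lexicon
        if hits:
            start = max(0, i - before)
            block = sentences[start : i + after + 1]
            blocks.append({"term": sorted(hits)[0], "context": " ".join(block)})
    return blocks

# Invented example document, split into sentences
sentences = [
    "The firm restructured its teams.",
    "Managers resisted new tools.",
    "This reflects non-innovation in practice.",
    "Budgets stayed flat.",
    "No new products launched.",
]
lexicon = {"non-innovation"}
blocks = extract_context_blocks(sentences, lexicon)
print(len(blocks))  # prints 1
```

Each returned block would then be paired with its expert-assigned thematic category to form a supervised classification instance.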

Multilingual Corpus for Studying Non-Innovation

This research details a novel methodology for building a multilingual corpus specifically designed to support research into concepts within the humanities and social sciences, demonstrated here through a study of “non-innovation”.

The team successfully combined automated text extraction from company websites and annual reports with expert validation to create a robust and extensible resource. The processing pipeline incorporates automatic language detection, content filtering, and structural metadata enrichment, resulting in a dataset suitable for both qualitative lexical analysis and supervised machine learning tasks. From this initial corpus, a derived English-language dataset is created to facilitate machine learning applications. For each instance of a term identified from an expert lexicon, a contextual block of five sentences is extracted, two preceding and two following the sentence containing the term, providing rich linguistic context. Each occurrence is then annotated with its associated thematic category, enabling the construction of data suitable for supervised classification tasks and detailed thematic analysis.

The fully vectorized dataset is openly available under a license promoting research and collaboration, and is compatible with tasks such as classification, lexical detection, and semantic modeling. The authors acknowledge that the current work focuses on a specific domain and language pair; future research will extend the protocol to other conceptual areas and incorporate standardized metadata adhering to FAIR principles. They also envision integrating these corpora into shared infrastructures like SDHSS, furthering the accessibility and interoperability of resources for humanities and social science research. This work highlights the synergy between computational methods and linguistic analysis, offering a systematic approach to exploring evolving lexical dynamics within complex conceptual fields and addressing a critical need for tailored resources in the humanities and social sciences.

More information: Multilingual corpora for the study of new concepts in the social sciences and humanities. ArXiv: https://arxiv.org/abs/2512.07367
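For readers curious how a vectorized dataset of annotated context blocks might look in code, here is a minimal stdlib-only bag-of-words sketch; the example texts and thematic categories are invented for illustration and are not drawn from the released corpus:

```python
def vectorize(texts):
    """Turn context blocks into bag-of-words count vectors,
    one column per vocabulary word, for supervised classification."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for t in texts:
        row = [0] * len(vocab)
        for w in t.lower().split():
            row[index[w]] += 1
        rows.append(row)
    return vocab, rows

# Hypothetical annotated examples: (context block, thematic category)
examples = [
    ("new management practices adopted", "organisational"),
    ("new marketing channels adopted", "marketing"),
]
texts = [t for t, _ in examples]
labels = [c for _, c in examples]
vocab, X = vectorize(texts)
print(len(vocab), len(X))  # prints 6 2
```

A real pipeline would likely use denser representations (TF-IDF or embeddings), but the shape of the data, one vector per annotated occurrence plus a category label, is the same.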

