The OpenAIRE Graph's Quality Framework and Its Impact on Scholarly Communication

short talk × friday × 13.30-15.00

Stefania Amodeo

OpenAIRE AMKE
Athens, Greece

Paolo Manghi

OpenAIRE AMKE
Athens, Greece

Miriam Baglioni

ISTI-CNR
Pisa, Italy

Claudio Atzori

ISTI-CNR
Pisa, Italy

Giulia Malaguarnera

OpenAIRE AMKE
Athens, Greece

Leonidas Pispiringas

OpenAIRE AMKE
Athens, Greece

In today’s Open Science landscape, research quality can be assessed in terms of transparency, reproducibility, findability, and impact. Maintaining high-quality research metadata is essential for effective scholarly communication [1] and for putting the FAIR principles into practice. The OpenAIRE Graph [2] is a comprehensive research infrastructure that contributes to this effort, processing more than 400 million research-related records monthly, including over 290 million scientific publications, 82 million datasets, and one million software entries. Beyond mere metadata aggregation, the OpenAIRE Graph transforms heterogeneous metadata into an interconnected research ecosystem, creating meaningful links between research outputs, researchers, organizations, and funding bodies to enhance research assessment. This interconnected ecosystem enables comprehensive citation tracking, usage analysis, and research impact assessment, providing valuable insights for the scholarly community.

This presentation examines the complex quality enhancement mechanisms for scholarly data in the OpenAIRE Graph and the practical applications that establish it as one of the world’s largest and most comprehensive scholarly knowledge graphs.

A cornerstone of OpenAIRE’s data quality framework is its comprehensive entity identification and deduplication system. This process begins with precise author identification through ORCID validation, cross-referencing author information with the official registry. When no cross-reference with an author’s ORCID profile is found, the ORCID information found in the metadata is retained, but marked as not validated. For other entities, OpenAIRE employs a sophisticated hybrid deduplication pipeline that combines automated algorithms with expert human curation [3]. For research products (publications, datasets, and software), an automated algorithm identifies and merges duplicate entities while preserving all valuable connections between research outputs. Data sources undergo additional expert verification of algorithm-proposed matches and manual searching for additional matches, with over 8,000 data sources curated to date. For organizational metadata, OpenAIRE has developed the OpenOrgs service, which leverages a network of over 100 experts across 40 countries who have successfully curated more than 100,000 organizations, addressing challenges such as multilingual name variations, diverse identifier systems, and complex institutional hierarchies. This careful curation has significantly improved both the accuracy and coverage of organizational affiliations. In the representative record obtained via deduplication, the provenance of the information is retained for all the duplicates, ensuring transparency and traceability of the information.
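
The idea of a representative record that keeps the provenance and connections of all its duplicates can be illustrated with a minimal sketch. This is not the OpenAIRE deduplication algorithm described in [3]; the matching rule (normalized DOI, else normalized title), the record fields, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def normalize_title(title):
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())

def dedup_key(record):
    # Hypothetical matching rule: prefer a normalized DOI, else fall back to the title.
    doi = record.get("doi")
    if doi:
        return ("doi", doi.strip().lower())
    return ("title", normalize_title(record["title"]))

def merge_duplicates(records):
    # Group records that share a key, then build one representative per group.
    groups = defaultdict(list)
    for record in records:
        groups[dedup_key(record)].append(record)
    representatives = []
    for members in groups.values():
        representatives.append({
            "title": members[0]["title"],
            "doi": next((m["doi"] for m in members if m.get("doi")), None),
            # Provenance: every source that contributed a duplicate is retained.
            "provenance": sorted({m["source"] for m in members}),
            # Links from all duplicates are preserved on the representative.
            "links": sorted({link for m in members for link in m.get("links", [])}),
        })
    return representatives

# Illustrative records (invented data, not taken from the Graph).
records = [
    {"title": "Graph Quality", "doi": "10.1234/ABC", "source": "repo-a", "links": ["project:1"]},
    {"title": "Graph quality.", "doi": "10.1234/abc", "source": "repo-b", "links": ["dataset:7"]},
    {"title": "Another Paper", "doi": None, "source": "repo-a", "links": []},
]
merged = merge_duplicates(records)
```

In this toy run the first two records collapse into a single representative that lists both sources and both links, while the third record remains on its own.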

Going beyond deduplication, the OpenAIRE Graph implements enrichment processes to enhance its metadata quality and analytical capabilities. Advanced text mining algorithms process full-text publications to extract additional metadata and identify relationships between research entities that might not be immediately apparent. The system also integrates with specialized external tools to provide comprehensive impact measurements: Bip! Finder [4] analyzes citations, popularity, and community importance, while Usage Counts [5] measures views and downloads from provider web pages and other registered services. The SciNoBo classifier [6] categorizes research according to Fields of Science and maps contributions to UN Sustainable Development Goals. These impact measurements and indicators are presented in OpenAIRE Monitor [8], which is designed to give national bodies, funders, institutions, and research infrastructures an overview of their research activities and their links with Open Science.

A further enrichment of the Graph comes from its propagation process, which leverages information already present in the Graph to add new properties to results or to create new relations. Together, these enrichment processes transform the Graph into a dynamic, interconnected research ecosystem that significantly improves both the discoverability of research and the accuracy of impact assessments.
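
As a rough illustration of propagation, the sketch below copies a property from one entity to another along an existing relation. It assumes a hypothetical rule (a result inherits the country of its affiliated organizations) and invented field names; the actual propagation rules used in the Graph are not specified here.

```python
def propagate_country(results, affiliations, organizations):
    # Hypothetical rule: a result inherits the countries of its affiliated organizations.
    enriched = []
    for result in results:
        inherited = {
            organizations[org_id]["country"]
            for org_id in affiliations.get(result["id"], [])
            if org_id in organizations
        }
        # Map each country to how it was obtained; harvested values are never overwritten.
        countries = dict(result.get("countries", {}))
        for country in sorted(inherited):
            countries.setdefault(country, "propagated")
        enriched.append({**result, "countries": countries})
    return enriched

# Illustrative data (invented identifiers).
results = [{"id": "r1", "countries": {"IT": "harvested"}}, {"id": "r2"}]
affiliations = {"r1": ["o1", "o2"], "r2": ["o2"]}
organizations = {"o1": {"country": "IT"}, "o2": {"country": "GR"}}

enriched = propagate_country(results, affiliations, organizations)
```

Marking each added value as "propagated" keeps inferred information distinguishable from harvested metadata, in the same spirit as the provenance tracking described for deduplication.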

The OpenAIRE Graph builds on these quality enhancement mechanisms to deliver concrete benefits to the research community. Research institutions can use the enriched metadata and advanced analytics to comprehensively assess their impact by tracking citations and research influence, measuring funding efficiency by connecting grants to high-impact research, and using bibliometric indicators for strategic planning and funding applications. The Graph also maps global research collaborations, revealing patterns in international partnerships and research trends. Its Fields of Science classification system helps institutions track interdisciplinary research and identify collaboration opportunities. Compliance monitoring represents another crucial feature, with robust tools for tracking adherence to Open Science policies such as Plan S, Horizon Europe, or national requirements. The system delivers detailed analytics on open access publication patterns, APC expenditure, and policy compliance rates. Furthermore, the Graph’s classification system enables measurement and evaluation of research outputs’ contributions to the United Nations Sustainable Development Goals (SDGs), facilitating institutional alignment of research strategies with global sustainability objectives. By combining metadata enrichment, deduplication and full provenance tracking, the OpenAIRE Graph provides a high standard of data quality and integrity, and is a transparent and trustworthy foundation for scholarly communication. It can be used by stakeholders to make informed decisions, track research impact, and ensure compliance with Open Science principles.

The presentation will review the main OpenAIRE Graph features discussed above and provide guidance on using its public APIs [7], which were recently updated to provide better access to the Graph’s extensive resources. Participants will receive reference materials on how to use the APIs, including query examples for common tasks such as retrieving publication metadata, tracking research impact, and monitoring open science compliance. Additionally, we will showcase our customised dashboards that allow institutions to create targeted visualizations and reports tailored to their specific needs using the Graph’s comprehensive dataset.
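
A query for publication metadata might be assembled along the following lines. This is only a sketch: the base URL, the `researchProducts` path, and the `search` and `pageSize` parameter names are assumptions that should be checked against the API documentation [7] before use, and the helper `build_query` is hypothetical. The code only builds the request URL and does not contact any service.

```python
from urllib.parse import urlencode

# Assumed base URL; verify against the API documentation [7].
BASE_URL = "https://api.openaire.eu/graph/v1"

def build_query(entity, **params):
    # Drop unset parameters and URL-encode the rest.
    clean = {key: value for key, value in params.items() if value is not None}
    return f"{BASE_URL}/{entity}?{urlencode(clean)}"

# Assumed entity path and parameter names, shown for illustration only.
url = build_query("researchProducts", search="knowledge graph", pageSize=10)
print(url)
```

The resulting URL could then be passed to any HTTP client; paging, filtering, and authentication details are left to the official documentation.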

Whether through APIs or dashboards, researchers and institutions can harness the Graph to inform evidence-based decision-making in research assessment, policy development, and funding impact analysis. With ongoing commitment to metadata quality, the OpenAIRE Graph serves as a cornerstone infrastructure for the global research community, driving the advancement of open, transparent, and efficient scholarly communication.

keywords

data quality ; knowledge graphs ; metadata ; open science ; scholarly communication

References

[1] Manghi, P. (2024). Challenges in building scholarly knowledge graphs for research assessment in open science. Quantitative Science Studies, 5(4), 991–1021. https://doi.org/10.1162/qss_a_00322

[2] Manghi, P., Atzori, C., Bardi, A., Baglioni, M., Dimitropoulos, H., La Bruzzo, S., Foufoulas, I., Mannocci, A., Horst, M., Iatropoulou, K., Kokogiannaki, A., De Bonis, M., Artini, M., Lempesis, A., Ioannidis, A., Manola, N., Principe, P., Vergoulis, T., & Chatzopoulos, S. (2025). OpenAIRE Graph Dataset (9.0.1) [Data set]. OpenAIRE. https://doi.org/10.5281/zenodo.14851262

[3] De Bonis, M., Baglioni, M., Artini, M., Atzori, C., & Bardi, A. (2023). The three processes for de-duplication of organisations, data sources, and results of the OpenAIRE Graph. Zenodo. https://doi.org/10.5281/zenodo.8398198

[4] Vergoulis, T., Chatzopoulos, S., Kanellos, I., Deligiannis, P., Tryfonopoulos, C., & Dalamagas, T. (2019). BIP! Finder: Facilitating scientific literature search by exploiting impact-based ranking. Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), Beijing, China. http://doi.acm.org/10.1145/3357384.3357850

[5] https://usagecounts.openaire.eu/

[6] Kotitsas, S., Pappas, D., Manola, N., & Papageorgiou, H. (2023). SCINOBO: a novel system classifying scholarly communication in a dynamically constructed hierarchical Field-of-Science taxonomy. Frontiers in Research Metrics and Analytics, 8. https://doi.org/10.3389/frma.2023.1149834

[7] https://graph.openaire.eu/docs/apis/graph-api/

[8] https://oamonitor.ireland.openaire.eu/

presenter's biography

Leonidas Pispiringas is a Data Scientist and Technical Manager at OpenAIRE. He is the technical manager for the Institutional Dashboard in OpenAIRE MONITOR, designing advanced metrics and dashboards to track research trends and Open Access uptake. He is also a member of the OpenAIRE Guidelines team and Service Manager of the Metadata and FAIR Validator, helping to shape the development of standards and tools for metadata quality and interoperability. Leonidas is also a data curator in the OpenAIRE Graph’s Aggregation Team, driving metadata harvesting, harmonisation, and quality control. He actively contributes to major EU projects, including CRAFT-OA, OSTrails, and GraspOS, focusing on open access publishing, FAIR data practices, and next-generation research assessment. He holds an M.Sc. in Informatics & Management and is pursuing a Ph.D. in Applied Informatics, specializing in bibliometrics, data mining, and machine learning.
