The rapid growth of scientific and technological publications has increased the demand for automated methods capable of extracting structured knowledge from unstructured textual data. This problem is particularly relevant for chemical and technological texts, where essential information is often represented through chemical reactions, physical formulas, and domain-specific terminology that standard natural language processing techniques handle poorly. This paper proposes a hybrid information extraction method that combines Named Entity Recognition (NER), a rule-based MiningNOUN algorithm, and Conditional Random Fields (CRF) to improve the identification of entities and relationships in domain-specific scientific texts. The proposed approach integrates statistical and ontological principles, enabling the recognition of substances, processes, physical quantities, and formally structured expressions that are typically missed by baseline NER models. The method was evaluated on a corpus of chemical and technological texts describing experimental procedures and reaction processes. The results show that the combined NER + MiningNOUN + CRF configuration significantly increases the coverage of extracted entities compared to a standard NER pipeline, allowing the system to capture information expressed in both natural language and formal notation. The extracted entities and relations are integrated into an ontological knowledge graph compliant with RDF/OWL standards and further applied within a Retrieval-Augmented Generation (RAG) architecture. The proposed method supports the development of more reliable knowledge graphs for intelligent scientific data processing and can be adapted to other technical domains with complex symbolic representations.
Keywords
Named Entity Recognition; Conditional Random Fields; Ontology-Based Information Extraction; Knowledge Graph Construction; Scientific Text Processing
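The hybrid idea summarized in the abstract, a baseline recognizer complemented by rules for formal notation with the results written out as RDF, can be illustrated with a minimal sketch. This is not the authors' MiningNOUN algorithm or their CRF component, neither of which is specified in the abstract. The lexicon lookup standing in for a NER model, the small element list, the unit list, the `http://example.org/` namespace, and all function names are assumptions made purely for illustration.

```python
import re
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # hypothetical namespace, not from the paper

# Assumed rule: a formula is two or more element symbols, each with an optional count.
# The element list is deliberately tiny; a real rule set would need the full table.
FORMULA = re.compile(r"\b(?:(?:Na|Cl|Ca|Fe|H|C|N|O|S|K)\d*){2,}\b")
# Assumed rule: a number followed by one of a few unit symbols.
QUANTITY = re.compile(r"\d+(?:\.\d+)?\s*(?:°C|mol|mL|g|K)\b")


def baseline_entities(text, lexicon):
    """Stand-in for a baseline NER model: case-insensitive lexicon lookup."""
    found = []
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            found.append((m.start(), m.end(), m.group(), label))
    return found


def rule_entities(text, pattern, label):
    """Rule-based pass that tags every match of `pattern` with `label`."""
    return [(m.start(), m.end(), m.group(), label) for m in pattern.finditer(text)]


def merge(primary, secondary):
    """Keep all primary spans; add secondary spans that overlap none of them."""
    merged = list(primary)
    for s in secondary:
        if all(s[1] <= p[0] or s[0] >= p[1] for p in merged):
            merged.append(s)
    return sorted(merged)


def extract(text, lexicon):
    """Baseline pass first, then the formula and quantity rules fill the gaps."""
    entities = baseline_entities(text, lexicon)
    entities = merge(entities, rule_entities(text, FORMULA, "Formula"))
    entities = merge(entities, rule_entities(text, QUANTITY, "Quantity"))
    return entities


def to_ntriples(entities):
    """Serialize each entity as one rdf:type statement in N-Triples syntax."""
    lines = []
    for _, _, surface, label in entities:
        subject = BASE + "entity/" + quote(surface)
        lines.append(f"<{subject}> <{RDF_TYPE}> <{BASE}onto#{label}> .")
    return lines


TEXT = "Sulfuric acid (H2SO4) was heated to 80 °C and mixed with NaCl."
LEXICON = {"sulfuric acid": "Substance"}

if __name__ == "__main__":
    for line in to_ntriples(extract(TEXT, LEXICON)):
        print(line)
```

In this sketch the lexicon pass alone finds only the substance name, while the rule passes add the two formulas and the temperature, which mirrors the coverage gain the abstract attributes to combining approaches. The overlap check in `merge` gives the baseline pass priority, a design choice made here for simplicity rather than one taken from the paper.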