Please use this identifier to cite or link to this resource: http://dx.doi.org/10.25673/78511
Title: Learning chemistry : exploring the suitability of machine learning for the task of structure-based chemical ontology classification
Author(s): Hastings, Janna
Glauer, Martin
Memariani, Adel
Neuhaus, Fabian
Mossakowski, Till
Date of publication: 2021
Type: Article
Language: English
URN: urn:nbn:de:gbv:ma9:1-1981185920-804651
Keywords: Chemical ontology
Automated classification
Machine learning
LSTM
Abstract: Chemical data is increasingly openly available in databases such as PubChem, which contains approximately 110 million compound entries as of February 2021. With the availability of data at such scale, the burden has shifted to organisation, analysis and interpretation. Chemical ontologies provide structured classifications of chemical entities that can be used for navigation and filtering of the large chemical space. ChEBI is a prominent example of a chemical ontology, widely used in life science contexts. However, ChEBI is manually maintained and as such cannot easily scale to the full scope of public chemical data. There is a need for tools that are able to automatically classify chemical data into chemical ontologies, which can be framed as a hierarchical multi-class classification problem. In this paper we evaluate machine learning approaches for this task, comparing different learning frameworks including logistic regression, decision trees and long short-term memory artificial neural networks, and different encoding approaches for the chemical structures, including cheminformatics fingerprints and character-based encoding from chemical line notation representations. We find that classical learning approaches such as logistic regression perform well with sets of relatively specific, disjoint chemical classes, while the neural network is able to handle larger sets of overlapping classes but needs more examples per class to learn from, and is not able to make a class prediction for every molecule. Future work will explore hybrid and ensemble approaches, as well as alternative network architectures including neuro-symbolic approaches.
URI: https://opendata.uni-halle.de//handle/1981185920/80465
http://dx.doi.org/10.25673/78511
Open access: Open access publication
License: (CC BY 4.0) Creative Commons Attribution 4.0 International
Sponsor/Funder: OVGU-Publikationsfonds 2021
Journal title: Journal of Cheminformatics
Publisher: BioMed Central
Place of publication: London
Volume: 13
Original publication: 10.21203/rs.3.rs-107431/v1
First page: 1
Last page: 20
Appears in collections: Fakultät für Informatik (OA)

Files in this resource:
File: Hastings et al._Learning_2021.pdf
Description: Secondary publication
Size: 2.28 MB
Format: Adobe PDF