DI-2021 @ KDD 2021: Heng Ji Talk Recording

Recording of invited talk 4 of 6 at the Document Intelligence Workshop @ KDD 2021, given by Heng Ji, professor in the Computer Science Department and affiliated faculty member in the Electrical and Computer Engineering Department at the University of Illinois Urbana-Champaign.
Title: What’s in a Chemical Entity? https://youtu.be/JYkth7jk3a8
Abstract: As in many scientific fields, the chemistry literature has grown at a staggering pace, with tens of thousands of papers released every month. In our newly created U.S. NSF AI Institute on Molecular Synthesis, we are applying knowledge extraction techniques to automatically construct knowledge bases from scientific literature. The constructed knowledge bases include chemical entities and the reactions between them, so they can be used to predict chemical reactions, products, and properties such as yield, toxicity, and water solubility, both for creating new molecules and for improving the manufacture of target molecules. However, existing information extraction techniques developed for the news domain, or even for biomedical literature, are not directly effective for chemistry literature. One reason is that chemical entities often have complex, formula-like names (e.g., 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid). Moreover, many chemicals have never been given any name in natural language. Chemical entity mentions are therefore essentially rare terms that cannot be learned well by a language model alone. To address this challenge, we propose a novel multimodal embedding approach for constructing a shared common semantic space among multiple data modalities: (1) 2-D images of molecules, representing the underlying molecules or reactions; (2) text-based molecule descriptors; (3) chemical graph structures; (4) natural language definitions and descriptions; and (5) structured properties in external databases. I will then present applications of this common semantic space: building an end-to-end knowledge extraction system for chemistry literature, using the constructed knowledge base for cross-modal chemical entity retrieval with natural language, and generating molecule descriptor strings from molecular diagram images.
I’ll present a new benchmark that includes 81 million molecules and 100 chemistry papers fully annotated with a new fine-grained chemistry ontology. I’ll also talk about remaining challenges and ongoing work on representing chemical reactions.
Program committee (alphabetical): Doug Burdick, Dave Lewis, Yijuan Lu, Hamid Motahari, Sandeep Tata
Chair: Benjamin Han
Originally posted on LinkedIn.