Enhancing Prediction of Drug Indication and Side Effects through Named Entity Recognition and Jointly Learning of Syntactic Structures of Sentences
European Journal of Molecular & Clinical Medicine,
2020, Volume 7, Issue 6, Pages 170-176
Abstract: Drug discovery requires considerable time and cost to find drugs that treat patients effectively. Both the beneficial and the unintended effects of candidate drugs must be recognized, because unforeseen actions of a candidate drug may cause severe injury to patients. Text mining is an effective technique for this purpose: it can find hidden relations among genes, diseases and drugs in large volumes of data. Predict drug Indications and Side effects using TOpic modeling and Natural language processing (PISTON) is a text mining method used to find drug-disease and drug-side effect associations. It uses Natural Language Processing (NLP) to identify words that describe associations between drugs and genes in sentences collected from the literature in which words representing drugs and genes co-occur. Topic modeling then represents the relation between drugs and genes by building a drug-topic probability matrix. From this matrix, the drugs for phenotypes can be identified by training a classifier on the high-rank topics of drugs. PISTON also predicts associations between drugs and side effects. However, the expressive power of named entities and their potential for enhancing the quality of discovered topics have not received much attention in PISTON. This paper therefore proposes an Improved PISTON (IPISTON), which enhances the quality of discovered topics through a named entity recognition system and by inducing syntactic structure from unannotated sentences. Initially, sentences are extracted from the collected literature data and a dependency graph is constructed using NLP. After that, a Gene Regulation Score (GRS) is calculated for each sentence to define the relationship between genes and diseases.
The topic modeling is enhanced by finding the biomedical entities in the biomedical repository using a Conditional Random Field (CRF) and a Bi-directional Long Short-Term Memory-CRF (BLSTM-CRF). CRF is a sequence modeling framework that finds biomedical entities through their conditional probability distributions over the collected documents. BLSTM-CRF is a deep learning technique used to enhance the performance of CRF-based named entity recognition. Moreover, the syntactic structure of sentences is calculated through a syntactic distance measure. The syntactic structure, the biomedical entities and the drug-topic probability matrix are given as input to CRF, BLSTM-CRF, Naïve Bayes, CART and Logistic classifiers for prediction of drug-phenotype and drug-side effect associations.
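As an illustration of how a linear-chain CRF assigns entity labels to a sentence, the following minimal sketch decodes the highest-scoring label sequence with the Viterbi algorithm. It is not the implementation used in the paper: the label set, the `EMISSION` and `TRANSITION` score tables and the function names are hypothetical, and the scores are hand-set rather than learned from a biomedical corpus.

```python
# Toy linear-chain CRF decoding with the Viterbi algorithm.
# All labels and scores below are hypothetical, hand-set values for
# illustration; a real CRF learns them from annotated training data.

LABELS = ["O", "DRUG", "GENE"]

# Hypothetical emission scores: how strongly a (lowercased) token suggests a label.
EMISSION = {
    "aspirin": {"DRUG": 2.0},
    "ptgs2": {"GENE": 2.0},
}

# Hypothetical transition scores between adjacent labels (missing pairs score 0.0).
TRANSITION = {
    ("DRUG", "DRUG"): -1.0,
    ("GENE", "GENE"): -1.0,
}


def emission_score(token, label):
    """Score of `label` for `token`; unknown tokens mildly favour 'O'."""
    scores = EMISSION.get(token.lower(), {"O": 1.0})
    return scores.get(label, 0.0)


def viterbi(tokens):
    """Return the label sequence with the highest total score."""
    if not tokens:
        return []
    # best[t][label] = best score of any label path ending in `label` at position t
    best = [{lab: emission_score(tokens[0], lab) for lab in LABELS}]
    back = []  # back[t][label] = best previous label for `label` at position t + 1
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for lab in LABELS:
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + TRANSITION.get((p, lab), 0.0),
            )
            scores[lab] = (
                best[-1][prev]
                + TRANSITION.get((prev, lab), 0.0)
                + emission_score(tok, lab)
            )
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["Aspirin", "inhibits", "PTGS2"]))
```

In a BLSTM-CRF, the emission scores would come from a bi-directional LSTM over the sentence instead of a lookup table, while the transition scores and the decoding step keep the same role.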