
Nermeen Shaltout
Lancaster Scholarship
Previous Work and Research Proposal
February 8, 2012

1 Previous Work

The research work in bioinformatics described below spans two disciplines, biology and computer science, and follows the usual machine learning workflow. The previous project was a survey of existing data mining techniques for classifying the influenza virus according to host: Swine, Avian, and Human. The work proceeded in three steps:
1. Data preprocessing
2. Feature selection
3. Classification

1.1 Data Preprocessing

The data preprocessing stage uses well-known bioinformatics tools to prepare the data for the feature selection process. It involves:
1. Downloading data from online gene banks such as www.ncbi.org [1] and www.fludb.org [2]. The data is in the form of genetic material, specifically nucleotides, the letters that represent the building units of genes. The downloaded data is DNA data and consists of four letters, A, G, C, and T, arranged in patterns that vary with the virus and the host.
2. Aligning the downloaded genetic data using tools such as Bioedit [3] and Mafft [4]. Alignment makes feature extraction easier by arranging similar data segments together.
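As a minimal sketch of how the aligned data might be read by later stages, the following assumes the aligned sequences are exported as FASTA text. The function names and the toy records are hypothetical and are not taken from the original Matlab code.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def alignment_length(records: List[Tuple[str, str]]) -> int:
    """Return the common sequence length, or raise if the lengths differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned: lengths {sorted(lengths)}")
    return lengths.pop()


# Toy aligned input; '-' marks a gap introduced by the alignment tool.
example = """>seq1 Human
ACGT-ACG
>seq2 Avian
ACGTTACG
>seq3 Swine
ACG--ACG
"""
records = parse_fasta(example)
width = alignment_length(records)
```

Checking that every sequence has the same length is a cheap way to confirm that the alignment step actually ran before features are extracted column by column.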

1.2 Data Extraction

Genetic data is huge in size, which is why features must be selected. This was done from scratch by coding in Matlab.
1. The data extraction is performed using information gain across all three virus classes.
2. The data with the highest information gain is the most suitable for classification and is selected in order to classify the genes.
3. The information gain is computed in two tiers because the influenza virus data is so massive. Influenza viruses can also be divided by subtype, which is determined by proteins on the surface of the virus known as antigens, the two main antigens being H and N. In the first tier, the information gain is maximized against the four H antigen subtypes: H1, H2, H3, and H5. In the second tier, it is optimized against the influenza host types: Human, Avian, and Swine. The classification is likewise divided into two tiers, as discussed in the next section.
4. The information gain is calculated by subtracting the remainder from the entropy. The entropy of the class labels is calculated as defined in [5].
5. The remainder, the entropy left after splitting the data on a feature, is calculated as defined in [5].

The information gain is finally obtained as the difference between the two [5].
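The calculation can be sketched as follows. It assumes the standard Shannon definitions: entropy H(S) = -sum over classes c of p_c * log2(p_c), remainder = sum over feature values v of (|S_v| / |S|) * H(S_v), and gain = H(S) - remainder. It also assumes that each alignment column is treated as one candidate feature. The function names are illustrative and do not reproduce the original Matlab implementation.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def remainder(column: Sequence[str], labels: Sequence[str]) -> float:
    """Expected entropy of the labels after splitting on the column's symbols."""
    groups: Dict[str, List[str]] = {}
    for symbol, label in zip(column, labels):
        groups.setdefault(symbol, []).append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(column: Sequence[str], labels: Sequence[str]) -> float:
    """Entropy of the labels minus the remainder for this column."""
    return entropy(labels) - remainder(column, labels)


def rank_positions(sequences: List[str], labels: List[str]) -> List[Tuple[int, float]]:
    """Rank alignment positions by information gain, highest first."""
    gains = []
    for pos in range(len(sequences[0])):
        column = [seq[pos] for seq in sequences]
        gains.append((pos, information_gain(column, labels)))
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Toy data: only position 1 differs between the two hosts.
sequences = ["AAGT", "AAGT", "ACGT", "ACGT"]
hosts = ["Human", "Human", "Avian", "Avian"]
ranking = rank_positions(sequences, hosts)
```

For the two-tier scheme, the same ranking could be run first with subtype labels (H1, H2, H3, H5) and then with host labels within each subtype.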

1.3 Classification

Classification can be achieved using more than one technique, and some techniques are more suitable for certain data than others. The following steps were carried out:
1. Checking the classification accuracy using neural networks.
2. Checking the classification accuracy using decision trees.
3. Measuring the performance of both techniques.
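A minimal sketch of how the accuracy of two techniques could be measured on the same held-out data is shown below. The trainer is a deliberately trivial stand-in (it always predicts the most common training label); a neural network or decision tree trainer with the same interface would take its place. All names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

Classifier = Callable[[str], str]  # maps one sequence to a predicted host
Trainer = Callable[[Sequence[str], Sequence[str]], Classifier]


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def train_majority(sequences: Sequence[str], labels: Sequence[str]) -> Classifier:
    """Stand-in trainer: ignores the sequences, predicts the majority label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda _sequence: most_common


def holdout_accuracy(train_fn: Trainer, sequences: List[str],
                     labels: List[str], n_test: int) -> float:
    """Train on all but the last n_test samples, then score on those."""
    split = len(sequences) - n_test
    model = train_fn(sequences[:split], labels[:split])
    predictions = [model(seq) for seq in sequences[split:]]
    return accuracy(predictions, labels[split:])


seqs = ["ACGT", "ACGA", "ACGT", "TCGT", "ACGT", "TCGA"]
hosts = ["Human", "Human", "Human", "Avian", "Human", "Avian"]
score = holdout_accuracy(train_majority, seqs, hosts, n_test=2)
```

Scoring every technique through the same `holdout_accuracy` call keeps the comparison fair, since each one sees exactly the same training and test samples.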

Improving Influenza A Host Classification using Novel Feature Selection and Classification Techniques

Main Points of Research

3.1 Research Proposal

There are several strains of the influenza virus, the main strain causing problems being the Influenza A virus. Influenza A has a high tendency to mutate and can therefore cause pandemics. Recently the virus gained the ability to infect more than one host: in previous cases, Avian Flu was able to infect both Avian and Human hosts, and Swine Flu both Swine and Human hosts, which made these pandemics spread faster. The viruses can also kill the host in the case of Avian Flu, or threaten the lives of individuals within a certain age range in the case of Swine Flu. Influenza pandemics have afflicted many countries, especially the infamous variants Swine Flu and Avian Flu. If Influenza A can be classified according to host using its genetic information, future pandemics can be tracked and contained more easily. However, the classification must be done swiftly.

3.2 Goals and Methodology

Feature selection and classification techniques already exist and were surveyed in the previous project. Unlike the first project, the main goal here is to create an optimized feature selection, classification, or preprocessing system that produces the same results in less time. Bioinformatics calculations are usually very computationally intensive, so building algorithms or organizing techniques that achieve the same results in less time is crucial to advancing research. The model can be further abstracted for use with mutating viruses other than influenza by grouping the different data mining techniques using scripts. (Usually a range of separate programs is used to process the genetic data, instead of a unified system.) My main future contribution would be to use graph-based decision trees to see whether they improve the speed of the feature extraction and classification algorithms [7]. If done right, the graph-based decision tree would be able to perform feature selection and classification in one step, with minimal data preprocessing.

3.3 Time Plan

Main phases of research work:
Phase I - Existing Paper and Literature Review
Phase II - Designing and Developing the System
Phase III - System Implementation, Deployment and Testing
Phase IV - Taking the Research to the Next Level

Phase I - Literature Review (Already implemented)
1. Data Mining in Bioinformatics [6]
2. Data Mining genetic data in other viruses [5]
3. Data Mining applied to the influenza virus
4. Novel techniques in classification of genetic data [7]

Phase II - Designing and Developing (In progress)
1. Linking the different modules together via a script. More than one program is usually used for alignment, feature selection, and classification. Making a script to group all three together would be beneficial, especially in the abstraction phase.
2. Building an improved graph-based classification system.
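The idea of linking the separate modules through one script can be sketched as a chain of named stages. The two stages below are toy stand-ins, not real tools: padding with gaps is not a true alignment, and dropping constant columns is only a crude form of feature selection. In practice each stage would wrap the actual program used for that step. All names are hypothetical.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(data: Any, stages: List[Stage], log: List[str]) -> Any:
    """Feed data through each named stage in order, recording stage names."""
    for name, stage in stages:
        data = stage(data)
        log.append(name)
    return data


def pad_to_equal_length(seqs: List[str]) -> List[str]:
    """Toy stand-in for alignment: right-pad sequences with gap characters."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def keep_variable_columns(seqs: List[str]) -> List[str]:
    """Toy stand-in for feature selection: drop columns that never vary."""
    keep = [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]
    return ["".join(s[i] for i in keep) for s in seqs]


stages: List[Stage] = [
    ("align", pad_to_equal_length),
    ("select", keep_variable_columns),
]
log: List[str] = []
features = run_pipeline(["ACGT", "ACT", "AGGT"], stages, log)
```

Because each stage is just a callable, a stage for a different virus or a different classifier can be swapped in without changing the driver, which is the abstraction the plan aims for.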

Phase III - Testing the System's Performance
1. Performance testing and evaluation of the old methods of preprocessing, feature extraction, and classification.
2. Performance testing and evaluation of the novel graph-based decision tree.
3. Comparing both results to measure the improvement in performance.
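One way the comparison in Phase III could be measured is with a small timing harness such as the sketch below. The two workloads are placeholders for the old and the new method, and the names are hypothetical.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


# Placeholder workloads standing in for the two methods being compared.
def old_method() -> int:
    return sum(i * i for i in range(20000))


def new_method() -> int:
    return sum(i * i for i in range(2000))


old_seconds = best_time(old_method)
new_seconds = best_time(new_method)
```

Taking the best of several runs reduces the influence of background load on the measurement. A speed comparison is only meaningful if both methods reach comparable classification accuracy on the same data.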

Phase IV - Taking the Research to the Next Level
1. Publishing the theoretical part of the paper before March 15.
2. Transferring the research abroad for further development at Lancaster University, or transferring to an available project at Lancaster that matches my research expertise.

3.4 Expected Results

Phase II should be completed before March 15th, with a proposal submitted and defended during February before the supervisors of both disciplines. Phase III should take place from March 15th to May 15th, when the research is passed to Kyoto University or another suitable university. Phase IV should take place from September 2013 or Spring 2014, depending on availability. The final result of the research will be a system that takes genetic input from online research databases and outputs the host of origin of the virus, at a faster rate than before.

References

[1] The Influenza Research Database. N.p. Web. <http://www.flu.lanl.gov/>.

[2] "National Center for Biotechnology Information." Influenza Virus Resource. N.p. Web. <http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html>.

[3] Hall, Tom. "Bioedit: Biological Sequence Alignment Editor." Ibis Biosciences, Carlsbad, CA 92008, n.d. Web. <http://www.mbio.ncsu.edu/bioedit/bioedit.html>.

[4] Katoh, Kazutaka. "Mafft - a multiple sequence alignment program." CBRC, AIST, n.d. Web. <http://mafft.cbrc.jp/alignment/software/>.

[5] Leung, KS, Eddie YT Ng, KH Lee, Henry LY Chan, Stephen KW Tsui, Tony SK Mok, Chi-Hang Tse, and Joseph JY Sung. "Data Mining on DNA Sequences of Hepatitis B Virus by Nonlinear Integrals." n. page. Web. 8 Feb. 2013. <http://www.f.waseda.jp/watada/News2006/SLE20060819Prof Leung/SLE20060819ProfLeung.pdf>

[6] Saeys, Y., I. Inza, and P. Larrañaga. "A Review of Feature Selection Techniques in Bioinformatics." Bioinformatics 23.19 (2007): 2507-17. Print. <http://www.ncbi.nlm.nih.gov/pubmed/17720704>.

[7] Geamsakul, Warodom, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, and Takashi Washio. "Constructing a Decision Tree for Graph Structured Data." n. page. Web. 8 Feb. 2013. <http://www.ar.sanken.osakau.ac.jp/~washio/list/1.pdf>
