
Combining literature text mining with microarray data: advances for system biology modeling
Alberto Faro, Daniela Giordano and Concetto Spampinato
Abstract
A huge amount of important biomedical information is hidden in the bulk of research articles in biomedical fields. At the same time, the publication of databases of biological information and of experimental datasets generated by high-throughput methods is in great expansion, and a wealth of annotated gene databases, chemical, genomic (including microarray datasets), clinical and other types of data repositories are now available on the Web. Thus a current challenge of bioinformatics is to develop targeted methods and tools that integrate scientific literature, biological databases and experimental data for reducing the time of database curation and for accessing evidence, either in the literature or in the datasets, useful for the analysis at hand. Under this scenario, this article reviews the knowledge discovery systems that fuse information from the literature, gathered by text mining, with microarray data for enriching the lists of down and upregulated genes with elements for biological understanding and for generating and validating new biological hypotheses. Finally, an easy to use and freely accessible tool, GeneWizard, that exploits text mining and microarray data fusion for supporting researchers in discovering gene–disease relationships is described.
Keywords: literature text mining; microarray data; biological databases; knowledge discovery
Corresponding author. Concetto Spampinato, Department of Informatics and Telecommunication Engineering – University of Catania, Viale Andrea Doria, 6 – 95127 – Catania, Italy. Tel: +39 (0) 95 7382372; Fax: +39 (0) 95 7382397; E-mail: cspampin@diit.unict.it

Alberto Faro is Full Professor of Artificial Intelligence at the Engineering Faculty of the University of Catania, Italy, where he is also the Dean of the Computer Engineering degree. His current research interests include: dynamic systems theory of cognition, intelligent learning environments, computer vision and mobile computing.

Daniela Giordano holds the Laurea degree in Electronic Engineering, grade 110/110 cum laude, from the University of Catania, Italy (1990), and a PhD in Educational Technology from Concordia University, Montreal (1998). Since 2001 she has been Associate Professor of Information Systems of the DIEEI Department, Engineering Faculty of the University of Catania, where she teaches the graduate level course ‘Cognitive Systems and Human-Computer Interaction’. Her research activity has developed along the following tracks: (i) Knowledge Management; (ii) Data Mining, information retrieval and visualization; (iii) Image and signal processing with soft-computing techniques; and (iv) Advanced learning technologies.

Concetto Spampinato received the Laurea in Computer Engineering in 2004, grade 110/110 cum laude, and the PhD in 2008 from the University of Catania, where he is currently Research Assistant. His research interests include image and signal processing for environmental applications, biomedical image processing, multimedia retrieval and bioinformatics.

BRIEFINGS IN BIOINFORMATICS. VOL 13. NO 1. 61–82
doi:10.1093/bib/bbr018
Advance Access published on 15 June 2011
The Author 2011. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com
Downloaded from http://bib.oxfordjournals.org/ by guest on May 19, 2016

INTRODUCTION
A huge amount of biomedical information is hidden in millions of research articles published in the last 20 years and this quantity is bound to increase exponentially [1]. Similarly, the publication of biological databases is in great expansion, and a wealth of annotated gene databases, chemical, genomic, clinical and other types of data repositories, including drugs and microarray experiments, are available on the Web. Thus a topical challenge of bioinformatics is to leverage the combination of multi-type information sources, for a more effective system biology modeling and knowledge discovery [2, 3]. A first important step towards the organization and integration of multi-type biomedical information is the National Center for Biotechnology Information’s (NCBI) Entrez Cross-Database [4] that interconnects PubMed abstracts with NCBI’s databases on DNA sequence and chemical structure, thus speeding up the search for data related to a given disease. However, this system does not include disease–gene or disease–protein compendia and it is not
capable of linking its results to external databases, e.g. drug databases. This linkage is essential to support the cross-linking of textual information with the relevant biological databases and to reinforce the connection between annotations in biological databases. Moreover, given the fast development of high-throughput methods with the consequent release of experimental data, there is a strong demand for targeted bioinformatics tools and methods that combine literature, biological databases and experimental data for reducing the time of database curation and for knowledge discovery in literature [5]. Under this scenario, a first aim of this article is to review the knowledge discovery systems that integrate literature information, gathered by text mining, with microarray data. Usually, this is performed with the goal either of enriching the lists of down and up-regulated genes with elements for biological understanding or of generating and validating new biological hypotheses. To illustrate how text mining and microarray data integration may be achieved, we first provide, in the next section, an overview of the methods and tools that are used to perform text mining, i.e. information retrieval (IR), named entity recognition (NER), information extraction (IE) and knowledge discovery (KD), using as a starting point previous reviews [6–9], and focusing on the aspects of the integration between biological data and text data that have been recently investigated. This overview allows us, in the ‘Combining text and microarray data’ section, to point out the differences among the current attempts at integrating experimental data in the mining loop. Finally, in the ‘GeneWizard’ section we illustrate a new tool, GeneWizard, that uses microarray data to evaluate and validate biological hypotheses mined from text. GeneWizard, based on the methods proposed by Faro et al. [10, 11], proposes novel relationships between genes and diseases by integrating literature discoveries and gene sets gathered from microarray data analysis. In detail, starting from a gene–disease relationship, it extracts a set of genes (related to the gene of the derived association) involved in the disease. Biological functions, namely, biological processes, cellular components and molecular functions, are then associated to the validated set of genes by using Gene Ontology (GO) (http://www.geneontology.org/). Finally, in the conclusions we outline the key challenges to advancing this type of integrated knowledge discovery systems.
TOOLS AND METHODS FOR LITERATURE TEXT MINING
The primary goal of literature text mining [12] is to distill knowledge that is hidden in the text of published papers and to present it to the users in a coherent and concise form. More formally, the ultimate goal of text mining concerns the discovery of new, previously unknown information, by automatic processing of text resources. Generally, systems for literature text mining include four main modules [13]: (i) IR to gather relevant text by querying databases of biomedical papers; (ii) NER to find the biological entities (e.g. genes, proteins) within text; (iii) IE to identify predefined relationships among biological entities from explicit statements in text; and (iv) KD to elicit relationships hidden in the information derived by the previous module. Recent text-mining systems have started taking into consideration the integration between literature and biological, chemical, medical and drugs databases. In the next sections each module of a text-mining system is reviewed, focusing on how the integration aspect is taken into account.
IR
Information retrieval is the first step of any literature text-mining system and aims at finding documents related to the user’s query [9] or at identifying the text segments (articles, abstracts, etc.) pertaining to a specific topic. The most famous IR tool for biomedical papers is PubMed, which is mainly based on two search models: (i) a model that uses Boolean operators to retrieve documents by performing queries in the form of <Disease X> and <Gene Y> and (ii) a vector space model [14] that represents each document by a vector of index terms, in which each term is characterized by a value according to a frequency-based weighting system. The vector space model is used to train machine learning methods for discriminating relevant papers from irrelevant papers with respect to the queries issued by the user. However, in order to exploit the full potential of IR systems for making scientific knowledge more accessible and enabling automatic knowledge discovery, some systems have extended text-based searching to operate on other sources of data (biological, chemical, medical, drugs annotated databases). Examples of these tools are: (i) Query Chem [15] that combines text-based IR on biochemical databases and Web API to retrieve the information and relationships between compound structures and
(ii) EBIMed [16] that retrieves sentences based on co-occurrences between biological entities and identifies relationships between protein/gene names and drugs.
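
The two search models described above can be illustrated with a minimal sketch. It is not how PubMed is implemented: the whitespace tokenizer, the TF-IDF weighting formula, the function names and the three toy documents are all assumptions made for illustration. The Boolean model returns the documents containing every query term, while the vector space model weights terms by frequency and ranks documents by cosine similarity to the query.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def boolean_and(docs, terms):
    """Boolean model: indices of documents that contain every query term."""
    wanted = [t.lower() for t in terms]
    return [i for i, d in enumerate(docs)
            if all(t in set(tokenize(d)) for t in wanted)]

def tfidf_vectors(docs):
    """Vector space model: one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumed weighting: raw term frequency times (log(N/df) + 1).
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(docs, query):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(tokenize(query))
    q = {t: q_tf[t] * idf.get(t, 0.0) for t in q_tf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]

# Toy corpus, invented for illustration only.
DOCS = [
    "brca1 mutation increases breast cancer risk",
    "magnesium deficiency and migraine headache",
    "brca1 expression in yeast",
]
```

With this corpus, `boolean_and(DOCS, ["BRCA1", "cancer"])` keeps only the first document, whereas `rank(DOCS, "brca1 cancer")` also orders the partially matching third document ahead of the unrelated second one.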
Of course, IR should not rely only on methods for query term matching, because term ambiguity may cause low precision and low recall. To address this issue, a number of IR tools that exploit established domain ontologies, to support semantic search in biomedical repositories and to guarantee more precision with respect to the Boolean search systems, have been proposed. For instance, GoPubMed (http://www.gopubmed.org/) [17] classifies abstracts using GO terms, GoWeb (http://www.gopubmed.org/goweb) [18] combines keyword-based Web search with text mining and ontologies to organize and navigate the results and facilitate question answering, whereas Textpresso (http://www.textpresso.org/) [19] uses a custom ontology to query a collection of documents for information on specific classes of biological concepts (gene, cell, etc.) and their relations.

IR systems are currently focusing on how to present and distill the search results and how to cross-link these results with the biological databases used in the retrieval process [17, 20]; in fact, long lists of retrieved papers provide a scarce overview of the problem and may confuse users about which sources of data have been used. For example, iHOP [21] converts the information in PubMed into a navigable multi-source network of genes and proteins (that also includes phenotypes, pathologies and gene function), thus providing an intuitive alternative way of accessing the ten million abstracts in PubMed.
NER
Biological entities are the backbone of any text-mining system, but often the naming of the entities is inconsistent and imprecise [22] since they are cited with a variety of terms. Therefore, the main goal of a NER system is to find the biological entities (mainly genes and proteins) that are mentioned within a text and to associate them with known names or identifiers (IDs). Usually, this task is performed in two steps: first, the recognition of the words that refer to entities and then, the unique identification of such entities.

The earliest NER systems relied on rule-based approaches (e.g. in [23]), i.e. they were based on manually crafted rules that described common naming structures for certain term classes, based on morphological, orthographic and syntactic characteristics [24]. As annotated corpora (in which gene and protein names are categorized) have become available, the newer systems have relied on machine-learning algorithms [25, 26] to recognize the names on the basis of their peculiar features. Differently, methods relying on dictionaries [27, 28] depend on lists of synonyms of entity names that are matched in documents using algorithms that recognize variations in how the names appear (e.g. gene ‘BRCA1’ may be written as ‘Brca1’, ‘BRCA 1’, ‘brca1’, etc.). However, the most effective and recent NER systems are based on the curation of entity name lists to reduce the aliases [6, 29], even though their main difficulty is the lack of standardization of names (e.g. each gene has many names and abbreviations). Under this scenario, ontologies, taxonomies and controlled vocabularies are of strategic importance for NER systems since they provide semantic interpretation of bio-entities [30–32].
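
As a rough illustration of the dictionary-based approach, the sketch below matches synonym-list entries while tolerating case changes and an optional space or hyphen between letter and digit groups, so that ‘Brca1’ and ‘BRCA 1’ both resolve to the same identifier. The synonym dictionary, the function names and the variation rule are hypothetical; they are not taken from any of the cited systems, which use much larger curated lists and more elaborate matching.

```python
import re

# Hypothetical synonym dictionary: canonical identifier -> known surface names.
GENE_SYNONYMS = {
    "BRCA1": ["BRCA1", "breast cancer 1"],
    "TP53": ["TP53", "p53"],
}

def variant_pattern(name):
    """Build a case-insensitive regex that tolerates an optional space or
    hyphen between the alphabetic and numeric parts of a name."""
    parts = re.findall(r"[A-Za-z]+|\d+", name)
    body = r"[\s\-]?".join(re.escape(p) for p in parts)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def find_entities(text, synonyms=GENE_SYNONYMS):
    """Return sorted (offset, matched text, canonical identifier) tuples."""
    hits = []
    for gene_id, names in synonyms.items():
        for name in names:
            for match in variant_pattern(name).finditer(text):
                hits.append((match.start(), match.group(0), gene_id))
    return sorted(set(hits))
```

The two steps named above are both visible here: the regular expression performs the recognition of candidate mentions, and the dictionary key provides the unique identification.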
A relevant example of controlled vocabulary that can be used for NER is Medical Subject Headings (MeSH), containing about 30 000 terms and mainly used for indexing articles in MEDLINE (MEDLINE is one of the components of PubMed that indexes the records using MeSH controlled vocabularies), i.e. each article is summarized by a set of controlled terms. MeSH covers protein functions in cellular systems, but it is not exhaustive.

Currently, the tendency for NER systems is to integrate different vocabularies or ontologies in order to provide a structured, accurate and complete list of the biological entities that can overcome the aforementioned drawback. In this direction, one valuable approach, based on integration of several controlled vocabularies, is SemCat [33], consisting of a large number of semantically categorized terms coming from different biomedical knowledge resources (e.g. Unified Medical Language System (UMLS) [34], Gene Ontology (GO) [35], Entrez Gene [36], ProtScan [37] and ChemID [38]) and open-domain corpora [39]. An example of NER system based on SemCat was developed by Tanabe et al. [40]. This approach builds a priority model for entity recognition based on the position of the words in a sentence, i.e. a word on the right side of a sentence is more likely an entity with respect to a word on the left side. Similarly, knowledge-based NER approaches using platforms that integrate different
types of ontologies [from gene terms to genetic pathways (a genetic pathway is a linear sequence of gene activities resulting from the functional interactions between different genes), to proteins, to clinical trials], such as BioPortal (http://www.bioontology.org/ncbo/faces/index.xhtml) and Open Biological Ontologies (http://www.obofoundry.org/), have been investigated [41].

A recent approach based on web services is Whatizit [42], which implements a NER based on morphological variability of terms. In particular, it is provided with numerous modules for annotating different entities: chemical entities (whatizitChemical), diseases (whatizitDiseaseUMLS, for accessing the UMLS Metathesaurus using the tool MetaMap (http://metamap.nlm.nih.gov/)), drugs (whatizitDrugs, which maps drugs in the text to terms of a controlled vocabulary built using Drugbank (http://redpoll.pharmacy.ualberta.ca/drugbank/)) and genes (whatizitGO, which searches for gene ontology terms).

Most recently, algorithms able to identify and disambiguate acronyms automatically, even if these are not mapped in any standard nomenclature, have been investigated to improve NER systems performance [43, 44].
IE
IE from literature aims at extracting predefined types of facts in the form of relationships between biological entities from the retrieved documents. The inputs to this step are sentences, whereas the outputs are relationships among biological entities. Generally, two main approaches exist:

• Co-occurrences processing: these approaches identify entities that co-occur within the text, i.e. terms that appear in the same texts tend to be related. Often these methods are able to detect co-occurrences to extract single relationships of a certain type: gene–gene, gene–disease, protein–protein, etc. Several works have used co-occurrence frequencies for extracting known single relationships [45]. For example, Al-Mubaid and Singh [46] proposed a text mining approach based on co-occurrence and term frequency analysis, by which they found and validated six significant genes for Alzheimer’s disease. Co-occurrence approaches have been investigated also for extracting facts that involve multi-type data, in line with the current research trend of text and biological data integration. For instance, Mukhopadhyay et al. [47] identify multi-way relationships involving more than two biological entities, i.e. genes, proteins, diseases, drugs and chemicals, etc. An example of identified relationship is ‘gene A activates protein B in disease C for organ D under influence of chemical E’. Co-occurrence approaches tend to provide better recall than precision, and errors arise in complex sentences containing multiple relationships. These approaches are unable to extract directional relationships (i.e. A involves B but B does not involve A) and to distinguish different types of relationships, e.g. they cannot identify relationships in the form ‘A is not connected to B’. Precision can be improved by integrating co-occurrence methods with rule or pattern-based approaches. However, these approaches tend to be dataset dependent, i.e. the rule or pattern sets are derived from training data and are often not applicable to data different from the ones used during the training [48].

• Patterns parsing by Natural Language Processing (NLP): all the above mentioned issues are addressed by NLP approaches, which combine the analysis of syntax and semantics in a text for obtaining relationships between facts. The workflow of these approaches is: first, the text is tokenized to identify the boundaries of the words and sentences, then a part-of-speech tagging system (e.g. [49, 50]) assigns labels such as noun, verb, adjective to each word. Afterwards, a syntax tree is computed for each sentence to detect noun phrases and represent their relationships [6]. NER is then used to tag the relevant biological entities in these relationships. Finally, in order to identify the evidence for entity relationships, a rule set based on the syntax tree and on the semantic labels [51, 52] is used. For example, Fundel et al. [53] developed RelEx to obtain dependency trees from MEDLINE abstracts by using Stanford Lexicalized Parser (http://nlp.stanford.edu/downloads/lex-parser.shtml). These trees are then enriched with genes and proteins by using ProMiner [54], a dictionary-based NER. Finally, a set of three simple rules (e.g. ‘A activates B’, ‘Activation of A by B’ and ‘Interaction between A and B’) is applied to obtain candidate relationships that are then submitted to a filtering module that uses negation check, enumeration resolution
and restriction to the domain of interest for screening the candidate relationships. Often partial language-parsing approaches are unable to detect relationships that span multiple sentences [6], and full parsing, providing more elaborate syntactic information, is adopted to achieve potentially better results [55]. A typical full grammar parsing example is the Pro3Gres dependency parser [56] that integrates hand-written grammar with a statistical language model for parsing unrestricted text by using deep knowledge of the English language. Full parsing approaches have been recently integrated with multi-type data for extracting facts involving more concepts: InfoPubMed [57] recognizes different types of interaction between genes and proteins on MEDLINE abstracts by combining full parsing, machine learning techniques and ontologies for NER; Pharmspresso [58] identifies important pharmacogenomics facts in articles referenced to human genes, polymorphisms, drugs and diseases by full text parsing and by exploring biological, chemical and drugs databases.
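
The co-occurrence approach in the first item of the list above can be sketched in a few lines. This is only an illustration under simplifying assumptions: entity names are single tokens, the sentence is the co-occurrence window, and the placeholder names (`GENE1`, `DISEASE_X`, etc.) and sentences are invented. Note how the third sentence, which denies a link, is still counted, which is exactly the limitation on negated and typed relationships described above.

```python
from collections import Counter
from itertools import product

def cooccurrence_counts(sentences, genes, diseases):
    """Count, per (gene, disease) pair, the sentences mentioning both."""
    counts = Counter()
    for sentence in sentences:
        cleaned = sentence.lower().replace(",", " ").replace(".", " ")
        tokens = set(cleaned.split())
        found_genes = [g for g in genes if g.lower() in tokens]
        found_diseases = [d for d in diseases if d.lower() in tokens]
        for pair in product(found_genes, found_diseases):
            counts[pair] += 1
    return counts

# Invented placeholder sentences; no biological claim is intended.
SENTENCES = [
    "GENE1 is overexpressed in DISEASE_X.",
    "GENE1 and GENE2 interact.",
    "GENE1 is not linked to DISEASE_X.",
    "GENE2 was studied in DISEASE_Y, not DISEASE_X.",
]

COUNTS = cooccurrence_counts(
    SENTENCES, ["GENE1", "GENE2"], ["DISEASE_X", "DISEASE_Y"]
)
```

Here `COUNTS[("GENE1", "DISEASE_X")]` is 2 even though one of the two sentences is a negative statement, so high recall comes at the cost of precision.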
An approach based on full-text processing that also uses the ‘not’ concept in protein relationships, i.e. protein A binds protein B but not protein C, was developed by Kim [59]. This work found 41 471 protein–protein contrasts, available at the web address http://biocontrasts.biopathways.org/.
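
To make the idea of simple extraction rules followed by a negation check more concrete, here is a small sketch. It is an assumption-laden simplification: the systems discussed above apply their rules to parse trees, whereas this example uses plain regular expressions over the sentence string, and the entity pattern, rule list, negation cue list and protein names (`ProtA`, `ProtB`) are all hypothetical.

```python
import re

# Assumed entity shape: a single alphanumeric token starting with a letter.
ENTITY = r"([A-Za-z][A-Za-z0-9]*)"

# Three surface patterns mirroring 'A activates B', 'Activation of A by B'
# and 'Interaction between A and B'; each yields (agent, relation, target).
RULES = [
    (re.compile(ENTITY + r" activates " + ENTITY),
     lambda m: (m.group(1), "activates", m.group(2))),
    (re.compile(r"[Aa]ctivation of " + ENTITY + r" by " + ENTITY),
     lambda m: (m.group(2), "activates", m.group(1))),
    (re.compile(r"[Ii]nteraction between " + ENTITY + r" and " + ENTITY),
     lambda m: (m.group(1), "interacts_with", m.group(2))),
]

# Crude negation cue list (an assumption, far from exhaustive).
NEGATION = re.compile(r"\b(not|no|never|neither)\b", re.IGNORECASE)

def extract_relations(sentence):
    """Apply the rules, discarding any sentence that contains a negation cue."""
    if NEGATION.search(sentence):
        return []
    relations = []
    for pattern, build in RULES:
        for match in pattern.finditer(sentence):
            relations.append(build(match))
    return relations
```

Unlike plain co-occurrence, the rules preserve direction: `extract_relations("Activation of ProtA by ProtB was observed.")` yields `("ProtB", "activates", "ProtA")`. Dropping every negated sentence is a blunt filter; a system that models the ‘not’ concept would instead keep such sentences as explicit negative relationships.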
The current trend is to favor full texts over abstracts since biological entities identified from mining only abstracts can be strongly underestimated because of abstracts’ concise nature. Methods for mining full biomedical texts need to be improved substantially, especially in converting PDF or HTML documents to plain text and in handling grammatical errors [60]. Another shortcoming of current methods is that they do not consider information hidden in tables and figures. Recently, approaches that integrate text data, biological databases and non-textual data (e.g. images, graphics, etc.) have been proposed and a comparative list is provided in [7]. An example is SLIF [61] that combines figure caption mining, image processing and specific domain ontologies to extract biomedical information from fluorescence microscopy images. A biological entity recognition system finds protein and cell type names in the mined captions and these entities are associated with the patterns extracted from the related images. Finally, a web interface and an XML-based web service allow users to investigate and query the derived information.
KD
Trying to discover hidden or implicit biomedical links and to propose them as potential scientific hypotheses is the main goal of knowledge discovery systems. In fact, the previously described IE systems extract only pre-identified or explicit relationships. Swanson, pioneer of the research in knowledge discovery from text, demonstrated in [62], by using the semi-automated Arrowsmith system [63], how new knowledge can be inferred from existing literature. Inferring indirect relationships implies using facts in the form: if A leads to B and B leads to C, then a relationship may be inferred between A and C. In detail, the user provides a hypothesis between two biological entities (A is related to C) that is further proved by searching for related terms (B) supporting the given hypothesis. Examples of inferred relationships are the one ‘fish oil–Raynaud’s disease’ discovered by Swanson [64] and the relationship between magnesium deficiency and migraine headache [65]. These two discoveries were confirmed experimentally [66, 67]. Several methods relying on natural language processing exist to discover knowledge about gene regulation [68], protein phosphorylation [69, 70], gene–disease or gene–gene interaction [71–73].
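
The A, B, C inference scheme can be expressed as a short graph traversal. The sketch below is not the Arrowsmith implementation; it assumes that first-order relationships are already available as undirected term pairs and simply proposes, for a starting term A, every term C that is reachable only through shared intermediate terms B, ranked by the number of such intermediates. The pairs are a toy input loosely modeled on the fish oil example, with `other condition` as an invented distractor.

```python
from collections import defaultdict

def abc_candidates(cooccurring_pairs, a_term):
    """Return (C term, set of bridging B terms) for terms not directly
    linked to a_term, ordered by decreasing number of bridges."""
    neighbors = defaultdict(set)
    for x, y in cooccurring_pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)
    direct = set(neighbors[a_term])
    bridges = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            # Keep only indirect targets: skip A itself and its direct links.
            if c != a_term and c not in direct:
                bridges[c].add(b)
    return sorted(bridges.items(), key=lambda item: (-len(item[1]), item[0]))

# Toy first-order relationships, for illustration only.
PAIRS = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
    ("blood viscosity", "other condition"),
]
```

Calling `abc_candidates(PAIRS, "fish oil")` places `Raynaud's disease` first because two intermediate terms support it. Such an output is a hypothesis to be checked, not a confirmed finding.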
One of the most complete systems that use NLP is GeneWays [74], which examines entire articles to extract the physical interactions among disease and genes hidden in the literature. Differently, many other systems are based on co-occurrence, i.e. the idea that two concepts (biological entities) are related if they occur in the same contexts in the literature. They can be based either on (i) first-order co-occurrences, e.g. entity A co-occurs with entity B [10, 73] or (ii) second-order co-occurrences [75, 76], i.e. entity A co-occurs with entity B which co-occurs with entity C, therefore there is a relationship between entities A and C. These approaches share the assumption that hidden and valid relationships may be found by suitably screening the huge number of facts retrieved by the co-occurrence approaches. For instance, Jelier et al. [76] propose the associative concept space (ACS) to filter the irrelevant relationships and terms obtained by applying a second-order co-occurrence approach. In detail, ACS reflects not only the co-occurrence of
two entities, but also indirect, multi-step relationships between entities.

Very few systems have been designed to extract complex and multiple types of relationships (e.g. find all the genes involved in a disease and all the related proteins) that necessarily require different types of data to be integrated. Anni 2.0 [77] uses an ontology-based interface to MEDLINE to identify different types of associations between biomedical concepts, including genes, proteins and diseases. It resorts to the idea of concept profiling, i.e. a list of concepts is presented where each concept is associated to the analyzed text together with a score describing its importance. An example of association derived with Anni 2.0 is: ‘Gene KLK3 is bound to the prostate cancer, more specifically with malignant neoplasm of prostate’. PolySearch [78] is a recently developed web tool able to identify associations from published abstracts and many well-annotated databases. It enables users to perform queries in the form: ‘Find all genes associated with a prostate cancer’. To date it supports more than 50 classes of queries, mining more than a dozen text, abstract and bioinformatics databases. A key functionality of PolySearch is that it extracts and analyzes text data not only from PubMed but also from other databases such as DrugBank [79] and Human Gene Mutation Database (HGMD) [80]. Gendoo [81] identifies disease-relevant genes and aims at understanding their mechanisms by interpreting data provided by genome sequences and transcriptomics. In detail, in Gendoo the On-line Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/omim/) knowledge-based system, which contains about 20 000 entries for human genes and for genetic diseases, is re-organized by using MeSH, thus improving OMIM’s exploitability by computer automation.

In summary, several KD methods have been proposed in the last years where integration between biological data and unstructured/structured text has been achieved in at least one of the IR, NER and IE sub-systems. However, to realize the full potential of text mining, new methods that integrate complex texts, biological and also raw experimental data are needed, with a focus on enabling biologists to exploit biological knowledge more effectively. This is necessary because any knowledge discovery method generates hypotheses about relationships to be validated empirically, and reusing available experimental data is an effective strategy to speed up scientific progress.
COMBINING TEXT AND MICROARRAY DATA
Literature text-mining methods are useful to discover hidden or indirect relationships; however, their integration with high-throughput methods (e.g. microarray) is heavily demanded. To fulfill this need, in the last years bioinformatics efforts have been directed toward the implementation of tools supporting the integration of biological and experimental data with literature information in order to infer biological hypotheses that can assume the form of pathways, gene regulatory networks, or, more in general, biological networks involving different entities such as genes, proteins, diseases, drugs from experimental data gathered from high-throughput methods.

DNA microarray technology, one of the most common high-throughput methods, allows researchers to come across biological functions on a genomic scale. However, the list of the produced down and upregulated genes is very cryptic, thus requiring a huge effort in data interpretation [13, 82]. Moreover, the selection of such lists of genes (i.e. the clusters to be analyzed) is left to the researchers who, given the amount of data involved, might not pursue the best selection since it is difficult to catch the correlation between the cluster and the biological aspect to be investigated.

Therefore, in the last 10 years, the attention of the bioinformatics community has been directed mainly to eliciting such correlations to understand the biological meaning of the produced lists of genes, instead of investigating novel clustering and statistics methodologies. Understanding the biological meaning of a set of up and down regulated genes derived from microarray experiments is one aspect of current bioinformatics efforts in data integration; the other one foresees text mining combined with microarray data targeted to the generation and to the evaluation of biological hypotheses, which can be obtained either by mining microarray data, which involves data clustering and manual selection of the clusters to be analyzed, or by mining the literature using the knowledge discovery systems described above, or by exploiting the knowledge of the biomedical researchers.
Understanding biological meaning
The most natural way to assign a biological meaning to a set of genes that has been obtained by mining microarray data is to project it onto biological
processes that can be represented in the form, in order of increasing complexity, of GO terms, pathways and gene regulatory networks (a gene regulatory network represents a collection of segments of DNA that maps gene regulations in living cells) previously identified or manually compiled by researchers [82] (see Figure 1).
The most common (and simple) approach for gene list projection is to use GO for interrelating a list of genes with a biological process and/or a molecular function and/or a cellular component. GO is also used to rank significant genes (produced in the microarray experiment) in relationship to the GO categories. A list of about 70 methods and tools that carry out GO-based microarray analysis is reviewed in [83]. An approach that goes beyond simple GO classification is Onto-Express [84], since it associates lists of up and down regulated genes with functional profiles built by correlating GO terms (biological processes, cellular components, molecular functions) with expression profiles.
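
One common way to relate a gene list to annotation categories is an over-representation test. The sketch below uses a one-sided hypergeometric test, a distribution this article names among those used for estimating pathway relevance; applying it to GO categories here is an illustrative assumption, not a description of any specific tool cited above. The category names, gene identifiers and population size are invented.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap): probability of seeing at least `overlap` annotated
    genes when `sample` genes are drawn without replacement from a
    population of `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

def rank_categories(gene_list, categories, population):
    """Return (p-value, category) pairs, most over-represented first."""
    genes = set(gene_list)
    scored = []
    for name, members in categories.items():
        overlap = len(genes & members)
        p = hypergeom_pvalue(population, len(members), len(genes), overlap)
        scored.append((p, name))
    return sorted(scored)

# Invented toy annotation: two categories over a population of 10 genes.
CATEGORIES = {
    "cat_a": {"g1", "g2", "g3"},
    "cat_b": {"g7", "g8", "g9"},
}
RANKED = rank_categories(["g1", "g2", "g3"], CATEGORIES, population=10)
```

In this toy case all three regulated genes fall in `cat_a`, giving a p-value of 1/120, while `cat_b` has no overlap and a p-value of 1. In practice a multiple-testing correction would be needed when many categories are scored.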

Figure 1: Understanding biological meaning of a set of regulated genes. The most common ways for understanding the biological meaning of a set of genes are: (i) to project it onto biological processes represented in the form of GO terms, pathways and gene regulatory networks (blue rectangle) and/or (ii) to annotate the lists of regulated genes based on literature profiling (red rectangle). Pathways and gene regulatory networks are usually derived by manual literature analysis.

However, the GO classification does not provide exhaustive information about the biological context of a given set of up and down regulated genes. This can be achieved by pathway analysis and/or by regulatory network analysis. Pathway analysis mainly investigates the functional and physical interaction among genes instead of using the gene-centered view as GO-based approaches. These systems try to map genes derived from microarray experiments onto precompiled pathways derived by manually analyzing the literature. Most non-commercial systems for pathway analysis rely on the KEGG database (http://www.genome.jp/kegg/) that contains a collection of pathways representing the current knowledge on gene and molecular interaction. A comprehensive list of tools for pathways analysis can be found at the weblink http://www.geneontology.org/GO.tools.microarray.shtml. Pathway mapping of microarray data usually generates more than one pathway; therefore, it is necessary to rank them according to their relevance to the dataset. Pathways ranking is provided in GenMAPP
Combiningliteraturetextminingwithmicroarraydata
67
 by guest on May 19, 2016
http://bib.oxfordjournals.org/
Downloaded from 
2 [85],where the users can rank, , andat thesame
time, customize the pathways; an extension of GenMAPP 2 [85] proposes
Fisher’s exact test for ranking the relationships between genes and
pathways. PathExpress [86, 87], instead, identifies the most relevant
metabolic pathways associated with a subset of genes using P-values. The
KEGG-based web tool KOBAS [88] proposes a controlled vocabulary for
gene-pathway mapping, and the relevance of the discovered pathways is
estimated using binomial, chi-square and hypergeometric distribution
tests. Although pathway-based approaches provide deeper information on
biological processes possibly relevant to a set of genes, their main
shortcoming is that biological processes usually depend on more than one
pathway and the connections between such pathways are related to the
biological context. The interconnection of pathways is defined as a gene
regulatory network. This network cannot be easily derived by simply
combining precompiled pathways because the network’s morphology changes
with the biological context. The earliest attempts at building gene
regulatory networks were successful only for lower eukaryotes with simple
genomes [89, 90]. Current approaches (both stand-alone and combined with
GO classification and pathways), instead, are directed toward more complex
mammalian systems.
For instance, ARACNe [91] builds regulatory networks in mammalian cells by
identifying transcriptional interactions among genes from microarray
expression profiles. An interesting effort is represented by MONET [92], a
method based on Bayesian networks for inferring gene regulatory networks.
It mainly consists of two steps: the first aims at splitting the whole
gene set into overlapping groups that contain genes whose GO annotations
or microarray expression patterns are highly correlated. The second step
then infers Bayesian networks over each group and integrates such groups
into global regulatory networks. BioCAD [93] integrates both the above
inference tools (ARACNe and MONET) for building gene regulatory networks.
The tool also supports validation of the inferred networks by integrating
gene and protein regulatory networks derived from MEDLINE abstracts using
a text-mining system based on STRING-IE [94].
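
The pathway ranking step mentioned above (Fisher’s exact test in the
GenMAPP 2 extension, the hypergeometric test in KOBAS) can be illustrated
with a minimal sketch. It is not the implementation of any of the cited
tools: the gene symbols, the pathway memberships and the background size
are hypothetical, and the code assumes that every regulated gene belongs
to the background population. The one-sided hypergeometric tail computed
here is equivalent to a one-sided Fisher’s exact test on the corresponding
2x2 table.

```python
from math import comb  # requires Python 3.8+


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): chance of seeing at least k pathway genes when n genes
    are drawn from a background of N genes, K of which are in the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def rank_pathways(regulated, pathways, background_size):
    """Return (pathway, overlap, p_value) tuples, most significant first."""
    regulated = set(regulated)
    results = []
    for name, members in pathways.items():
        members = set(members)
        overlap = len(regulated & members)
        p = hypergeom_pvalue(overlap, len(members), len(regulated),
                             background_size)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: two precompiled pathways and four regulated genes.
pathways = {
    "apoptosis": {"TP53", "BAX", "CASP3", "BCL2"},
    "glycolysis": {"HK1", "PFKM", "PKM", "GAPDH"},
}
regulated = ["TP53", "BAX", "CASP3", "HK1"]
ranking = rank_pathways(regulated, pathways, background_size=100)
for name, overlap, p in ranking:
    print(f"{name}: overlap={overlap}, p={p:.3g}")
```

In practice the P-values of many pathways would also need a correction
for multiple testing, which this sketch omits.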
The described approaches provide as outcomes precomputed relationships
between genes and biological processes. However, the literature may enrich
the information about the relationships between regulated genes and
biological processes much more than structured ontologies or precompiled
pathways can do. To extract the additional information hidden in the
literature, several methods that annotate the lists of regulated genes
based on literature profiling have been proposed [95, 96]. Most of these
approaches are based on keyword over-representation in a set of genes,
similarly to GO-based microarray analysis, but the keywords to be
associated with the gene set are gathered by directly mining MEDLINE and
are used to interpret genes in domains scarcely covered by GO. In detail,
such methods retrieve a subset of MEDLINE abstracts associated with one or
more genes, e.g. a cluster of genes derived by gene set analysis methods
[97, 98]. Then, these abstracts are used to identify relevant keywords in
the text or annotated MeSH terms (medical subject heading terms), thus
helping the characterization of the gene sets.
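
The general scheme just described (collect the abstracts linked to a gene
set, then look for keywords that are unusually frequent in them) can be
sketched as follows. This is only an illustration under stated
assumptions: the corpus, the gene-to-abstract links and the identifiers
are hypothetical, each abstract is assumed to be already reduced to a set
of keywords or MeSH terms, and a plain frequency ratio stands in for the
statistics used by the cited methods.

```python
from collections import Counter


def overrepresented_keywords(genes, gene_to_pmids, corpus, min_abstracts=2):
    """Score keywords by how much more often they occur in the abstracts
    linked to `genes` than in the whole corpus (simple frequency ratio)."""
    subset = set()
    for gene in genes:
        subset |= {p for p in gene_to_pmids.get(gene, ()) if p in corpus}
    if not subset:
        return []
    # Document frequencies: each keyword counts once per abstract.
    df_subset = Counter(t for p in subset for t in corpus[p])
    df_corpus = Counter(t for terms in corpus.values() for t in terms)
    scores = []
    for term, n in df_subset.items():
        if n < min_abstracts:
            continue  # ignore keywords supported by too few abstracts
        ratio = (n / len(subset)) / (df_corpus[term] / len(corpus))
        scores.append((term, ratio))
    return sorted(scores, key=lambda s: (-s[1], s[0]))


# Hypothetical corpus: abstract identifier -> set of keywords.
corpus = {
    1: {"apoptosis", "tumor"},
    2: {"apoptosis", "caspase"},
    3: {"glucose", "metabolism"},
    4: {"tumor", "metabolism"},
    5: {"apoptosis", "tumor"},
    6: {"glucose"},
}
gene_to_pmids = {"TP53": {1, 2}, "BAX": {2, 5}, "HK1": {3, 6}}
scores = overrepresented_keywords(["TP53", "BAX"], gene_to_pmids, corpus)
for term, ratio in scores:
    print(f"{term}: {ratio:.2f}x more frequent than in the corpus")
```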
For example, GenCLIP [99], one of the most recent tools, builds functional
clusters of genes related to disease pathogenesis starting from a list of
genes from microarray. The tool first identifies keywords as terms that
co-occur in at least two of the analyzed genes by mining literature
abstracts and then clusters the list of genes based on keyword
occurrences, thus obtaining functional clusters. Differently, Chagoyen et
al. [100] propose a system for literature profiling of large sets of genes
or proteins that can be used to find similarities among genes. The method
starts by creating a pool of documents related to a specific gene.
Afterwards, the pool of documents is converted into a vector space
representation and, finally, non-negative matrix factorization [101] is
applied to the vector space, thus obtaining for each gene a literature
profile (a literature profile can be seen as a picture of the functional
relationships, derived from scientific papers, between sets of genes).
CoPub [102] provides an insight into the biological mechanisms related to
a set of regulated genes for liver pathologies by calculating statistics
for gene-keyword co-occurrences using the entire set of MEDLINE abstracts,
instead of only a subset, as the previous approaches do. The inputs of the
tool are a subset of genes obtained by microarray data processing and a
set of keywords, whereas a navigable network of MEDLINE abstracts where
the genes and the keywords co-occur is provided as output. The text mining
method extracts networks of abstracts by analyzing the co-occurrences of
human, mouse and rat genes with keywords describing liver pathologies,
pathways, GO terms, diseases, drugs and tissues. An
approach that exploits literature for supporting gene list interpretation
is the one proposed by Jelier et al. [103], which uses associations
derived from literature (using the Anni 2.0 tool) to provide an
interpretation of gene expression changes. In detail, they propose the
literature-weighted global test to compute the correlation between
associations (in the form gene-biomedical concept) obtained by mining the
literature and the list of genes extracted from microarray, and they
provide as output the scores reflecting the importance of a gene for a
concept of an association. Table 1 shows some of the web-available tools
for understanding the biological meaning of a set of regulated genes
derived from experimental data.
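
The keyword-based clustering idea described above for GenCLIP (keep only
keywords shared by at least two genes, then group genes by their keyword
occurrences) can be illustrated with a small sketch. The clustering
procedure used here, single linkage over Jaccard similarity with an
arbitrary threshold, is an assumption chosen for brevity and is not
claimed to be the algorithm of the tool; the gene-to-keyword data are
hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_keywords(gene_keywords, min_genes=2):
    """Keep only keywords found for at least `min_genes` genes."""
    support = defaultdict(set)
    for gene, words in gene_keywords.items():
        for word in words:
            support[word].add(gene)
    return {w: g for w, g in support.items() if len(g) >= min_genes}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_genes(gene_keywords, min_genes=2, threshold=0.5):
    """Single-linkage grouping: two genes are linked when the Jaccard
    similarity of their shared-keyword profiles reaches `threshold`."""
    keep = shared_keywords(gene_keywords, min_genes)
    profiles = {g: {w for w in words if w in keep}
                for g, words in gene_keywords.items()}
    parent = {g: g for g in profiles}  # union-find forest

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for a, b in combinations(profiles, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for g in profiles:
        clusters[find(g)].add(g)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical keywords mined from the abstracts of each gene.
gene_keywords = {
    "TP53": {"apoptosis", "cell cycle", "p53 signaling"},
    "BAX": {"apoptosis", "mitochondria"},
    "CASP3": {"apoptosis", "mitochondria", "protease"},
    "HK1": {"glycolysis", "glucose"},
    "PKM": {"glycolysis", "pyruvate"},
}
clusters = cluster_genes(gene_keywords)
print(clusters)
```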
Hypothesis generation
Hypothesis generation aims to suggest undiscovered associations between
biomedical concepts; this is different from the attempts at providing a
biological meaning to a set of facts (e.g. lists of genes) extracted from
experimental data. The most explored avenue for hypothesis generation
concerns the discovery of genes and other biological entities (together
with their role) involved in a specific disease (disease-centric
analysis). In fact, predicting biological entities (and their role)
involved in a disease before experimental analysis may save time and
effort by indicating where the research should look. Text mining and
microarray data have been combined in two main ways to achieve this goal:
(i) starting from a microarray related to a specific disease, a list of
genes (the hypothesis) is extracted (e.g. one or more clusters) and then
the role of such genes (i.e. the prioritization) in the given disease is
explored using information extracted from literature and (ii) the
hypothesis is generated from literature mining in the form of associations
between genes and a given disease and then these associations are filtered
and validated by resorting to microarray data.
Approaches of the first kind follow this workflow: given a set of genes
(the hypothesis) either gathered directly from microarray data analysis or
previously stored in public databases (such as Gene Expression Omnibus),
the literature (mainly MEDLINE) is mined, starting from this set of genes,
in order to elicit a gene prioritization for the given disease or to find
out other biological concepts involved in the same disease (disease
modeling) (see Figure 2). The workflow is similar to that of the
biological understanding approach, with the difference that in this case
the outcome is a refinement and a close examination of the input
hypothesis regarding concepts and their role in the given disease, whereas
in the context of biological understanding the output is the association
of a meaning to a list of genes (facts).
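
A minimal sketch of this first workflow is given below: a gene list
obtained from a microarray is prioritized for a disease by counting how
often each gene is mentioned together with the disease in a set of
abstracts. The abstracts, the gene list and the scoring rule (number and
fraction of co-mentioning abstracts) are hypothetical choices made for
illustration; the cited systems rely on richer annotations such as MeSH
and GO.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prioritize_genes(genes, disease, abstracts):
    """Rank genes by how many abstracts mention both gene and disease.

    Returns (gene, co_mentions, fraction) tuples, where `fraction` is the
    share of the gene's abstracts that also mention the disease."""
    disease = disease.lower()
    scored = []
    for gene in genes:
        with_gene = [a for a in abstracts if gene.lower() in tokens(a)]
        with_both = [a for a in with_gene if disease in a.lower()]
        fraction = len(with_both) / len(with_gene) if with_gene else 0.0
        scored.append((gene, len(with_both), fraction))
    return sorted(scored, key=lambda s: (-s[1], -s[2], s[0]))


# Hypothetical abstracts standing in for a MEDLINE query result.
abstracts = [
    "BRCA1 mutations increase the risk of breast cancer.",
    "BRCA1 interacts with TP53 in DNA repair.",
    "TP53 is frequently mutated in breast cancer and other tumors.",
    "Loss of BRCA1 function is a hallmark of hereditary breast cancer.",
    "HK1 catalyzes the first step of glycolysis.",
]
gene_ranking = prioritize_genes(["HK1", "TP53", "BRCA1"], "breast cancer",
                                abstracts)
for gene, co_mentions, fraction in gene_ranking:
    print(f"{gene}: {co_mentions} co-mentions ({fraction:.0%} of its abstracts)")
```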

Table 1: List of the web-available tools for understanding the biological
meaning of a set of regulated genes derived from experimental data

Tool | Description | Used resources | Available at
GenMAPP 2 | Visualize gene expression data on biological pathways | GO, KEGG | http://www.genmapp.org/
PathExpress | Mapping of a set of genes onto pathways | KEGG, Swiss-Prot database (a), Blastx (b) | http://bioinfoserver.rsbs.anu.edu.au/utils/PathExpress/
KOBAS | Identify statistically enriched pathways for a set of genes or proteins | Pathways databases: KEGG, PID Curated (c), BioCyc (d) and Panther (e) | http://kobas.cbi.pku.edu.cn/home.do
ARACNe | Estimate gene regulatory networks in mammalian cells using microarray expression profiles | Expression profile dataset of human B lymphocyte cells built by the authors | http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE
GenCLIP | Clustering of gene lists by literature profiling | NCBI E-Utilities for text mining; gene list: HUGO (f), Entrez Gene or UniGene (g) | http://www.genclip.com
CoPub | Find biomedical concepts from Medline linked to a gene set (Affymetrix identifiers) | NCBI E-Utilities for text mining | http://services.nbic.nl/cgi-bin/copub/CoPub.pl

(a) http://expasy.org/sprot/; (b) http://blast.ncbi.nlm.nih.gov/;
(c) http://pid.nci.nih.gov/; (d) http://biocyc.org/;
(e) http://www.pantherdb.org/pathway/; (f) http://www.genenames.org/;
(g) http://www.ncbi.nlm.nih.gov/unigene.

One of the most complete tools for hypothesis generation from microarray
is G2D [104, 105]. It performs gene prioritization related to inherited
diseases by combining MeSH annotations in MEDLINE and a set of genes with
the GO annotations of entries in NCBI RefSeq [106] (collection of
annotated sequences, including genomic DNA, transcripts and
proteins). More specifically, it receives as input a genomic region and an
OMIM disease identifier and provides as output the genes potentially
involved in the given disease. In detail, for the disease under analysis
the MeSH terms from the ‘Disease Category’ associated with publications in
OMIM are retrieved. These terms are then associated with chemicals, drugs
and molecular functions by using GO. The detected molecular functions are
used to identify a sequence of DNA by querying the RefSeq protein
database. This sequence is integrated with a chromosomal location for the
given disease provided by the OMIM database in order to obtain a list of
genes related to the analyzed disease. G2D was originally developed only
for Mendelian diseases; currently, it also works for complex genetic
diseases [107]. Likewise, Tiffin et al. [108] propose gene prioritization
according to the relationship between a disease and the affected tissue.
The method integrates literature discoveries (co-occurring disease and
tissue names in MEDLINE) and human gene expression data from the Ensembl
database [109] to link gene expression to diseases by using an anatomical
ontology. First, the tool associates anatomical terms from an ontology for
human anatomical systems and cell types (eVOC [110]) with disease names,
based on their co-occurrence in PubMed abstracts. Each term of the eVOC
ontology is then ranked according to the frequency of annotation. The
top-scoring terms are compared with the terms already annotated to
candidate disease genes using the Ensembl database. The genes that
mismatch with the already annotated genes represent the list of genes to
be explored.
Approaches of the second kind (Figure 3) generate hypotheses (in the form
of gene-disease associations) by mining literature; then they validate
such hypotheses by checking whether there is any evidence of the
discovered relationships in the experimental data. To the best of our
knowledge, few methods use knowledge gathered from the literature for
hypotheses generation and validate these sets using high-throughput
methods, thus allowing the identification of novel relationships between
biological entities.
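
The second workflow can be sketched in a few lines: gene-disease
associations proposed by literature mining are kept only if the
microarray data show a marked expression change for the gene. The
candidate pairs, the intensities and the fold-change threshold below are
hypothetical, and a bare log2 fold-change filter is used here as a
stand-in for a proper differential expression test; it is not the
validation procedure of any cited method.

```python
from math import log2
from statistics import mean


def validate_associations(candidates, expression, min_abs_log2fc=1.0):
    """Keep literature-derived (gene, disease) pairs whose gene shows a
    marked expression change between disease and control samples.

    `expression` maps a gene to (disease_intensities, control_intensities),
    both lists of positive values."""
    supported = []
    for gene, disease in candidates:
        if gene not in expression:
            continue  # gene not measured on the array: cannot be validated
        case, control = expression[gene]
        log2fc = log2(mean(case) / mean(control))
        if abs(log2fc) >= min_abs_log2fc:
            supported.append((gene, disease, round(log2fc, 2)))
    return supported


# Hypothetical associations mined from the literature.
candidates = [
    ("BRCA1", "breast cancer"),
    ("TP53", "breast cancer"),
    ("HK1", "breast cancer"),
    ("GENE_X", "breast cancer"),  # placeholder gene absent from the array
]
# Hypothetical intensities: (disease samples, control samples).
expression = {
    "BRCA1": ([100, 120, 110], [400, 450, 470]),
    "TP53": ([300, 310, 290], [305, 295, 300]),
    "HK1": ([800, 820, 780], [200, 190, 210]),
}
supported = validate_associations(candidates, expression)
print(supported)
```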

Figure 2: Hypothesis generation by microarray data analysis. The first
approach follows this workflow: given a set of genes (the hypothesis)
either gathered directly from microarray data analysis or previously
stored in public databases, the literature is mined, starting from this
set of genes, in order to elicit a gene prioritization for the given
disease or to find out other biological concepts involved in the same
disease (disease modeling).

An approach in this direction is the one proposed by Faro et al. [11],
where hypothesis generation about gene–disease relationships is made by
mining specialized literature using the co-occurrence processing approach
described in [10] and the inferred relationships are selected and then
validated by means of microarray data analysis. The text-mining algorithm
used tends to provide better recall than precision, i.e. it provides more
relationships