2 [85], where the users can rank and, at the same time,
customize the pathways; an extension of
GenMAPP2 [85] proposes the Fisher’s exact test
for ranking the relationships between genes and
pathways. PathExpress [86, 87], instead, identifies
the most relevant metabolic pathways associated
with a subset of genes using P-values. The
KEGG-based web tool KOBAS [88] proposes a
controlled vocabulary for gene-pathway mapping,
and the relevance of the discovered pathways is
estimated using binomial, Chi-square and
hypergeometric distribution tests. Although pathway-based
approaches provide deeper information on biological
processes possibly relevant to a set of genes, their
main shortcoming is that biological processes usually
depend on more than one pathway, and the connections
between such pathways are related to the biological
context. The interconnection of pathways is
defined as a gene regulatory network. This network
cannot be easily derived by simply combining
precompiled pathways because the network's morphology
changes with the biological context. The
earliest attempts at building gene regulatory networks
have been successful only for lower
eukaryotes with simple genomes [89, 90]. Current
approaches (both stand-alone and combined
with GO classification and pathways), instead, are
directed toward more complex mammalian systems.
For instance, ARACNe [91] builds regulatory networks
in mammalian cells by identifying transcriptional
interactions among genes from microarray
expression profiles. An interesting effort is represented
by MONET [92], a method based on
Bayesian networks for inferring gene regulatory
networks. It mainly consists of two steps: the first
aims at splitting the whole gene set into overlapped
groups that contain genes whose GO annotations or
microarray expression patterns are highly correlated.
Finally, the second step infers Bayesian networks
over each group and integrates such groups into
global regulatory networks. BioCAD [93] integrates
both the above inference tools (ARACNe and
MONET) for building gene regulatory networks.
The tool also supports validation of the inferred networks
by integrating gene and protein regulatory
networks derived from MEDLINE abstracts using a
text-mining system based on STRING-IE [94].
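
The pathway-ranking statistics mentioned earlier in this section (Fisher's exact test, hypergeometric test) can be illustrated with a short sketch. This is not the implementation of GenMAPP, PathExpress or KOBAS; it is a minimal example, assuming a hypothetical background gene list and hypothetical pathway memberships, of how a one-sided hypergeometric tail probability (equivalent to a one-sided Fisher's exact test) might be used to rank pathways for a list of regulated genes. The function names `hypergeom_enrichment_p` and `rank_pathways` are illustrative only.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    population   -- number of genes in the background
    pathway_size -- number of background genes annotated to the pathway
    selected     -- number of regulated genes drawn from the background
    overlap      -- number of regulated genes that fall in the pathway
    """
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_pathways(gene_list, pathways, background):
    """Rank pathways by enrichment p-value (smallest first).

    `pathways` maps a pathway name to an iterable of gene identifiers;
    all names and data here are hypothetical inputs, not a real database.
    """
    bg = set(background)
    genes = set(gene_list) & bg
    results = []
    for name, members in pathways.items():
        members = set(members) & bg
        hits = genes & members
        p = hypergeom_enrichment_p(len(bg), len(members), len(genes), len(hits))
        results.append((name, len(hits), p))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    background = [f"g{i}" for i in range(20)]
    pathways = {
        "pathway_A": [f"g{i}" for i in range(0, 5)],
        "pathway_B": [f"g{i}" for i in range(10, 15)],
    }
    regulated = ["g0", "g1", "g2", "g10"]
    for name, hits, p in rank_pathways(regulated, pathways, background):
        print(name, hits, round(p, 4))
```

In practice the resulting p-values would also need a multiple-testing correction, which this sketch omits.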
The described approaches provide as outcomes
precomputed relationships between genes and
biological processes. However, the literature may enrich
the information about the relationships between regulated
genes and biological processes much more than
structured ontologies or precompiled pathways can do.
To extract the additional information hidden in the
literature, several methods that annotate the lists of
regulated genes based on literature profiling have
been proposed [95, 96]. Most of these approaches
are based on keyword over-representation for a set
of genes, similarly to GO-based microarray analysis,
but the keywords to be associated with the gene
set are gathered by directly mining MEDLINE, and
they are used to interpret genes in domains scarcely
covered by GO. In detail, such methods retrieve a
subset of MEDLINE abstracts associated with one or
more genes, e.g. a cluster of genes derived by gene
set analysis methods [97, 98]. Then, these abstracts
are used to identify relevant keywords in the text or
annotated MeSH terms (medical subject heading
terms), thus helping the characterization of the gene sets.
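
As a rough illustration of the keyword over-representation idea just described, the following sketch scores keywords by how much more often they appear in the abstracts linked to a gene set than in the whole collection. It is not the method of any cited tool: the tiny stop-word list, the `abstracts_by_gene` mapping, the thresholds `min_fold` and `min_docs`, and the fold-enrichment score itself are all assumptions made for the example, and a real system would query MEDLINE and apply a proper statistical test.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a curated resource).
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "with", "for", "by"}


def tokenize(abstract):
    """Return the set of distinct lowercase keywords found in one abstract."""
    words = re.findall(r"[a-z][a-z0-9-]+", abstract.lower())
    return {w for w in words if w not in STOPWORDS}


def document_frequency(abstracts):
    """Count, for each keyword, the number of abstracts that contain it."""
    df = Counter()
    for text in abstracts:
        df.update(tokenize(text))
    return df


def overrepresented_keywords(gene_set, abstracts_by_gene, min_fold=2.0, min_docs=2):
    """Score keywords by fold enrichment in the gene set's abstracts.

    `abstracts_by_gene` is a hypothetical mapping from a gene identifier to
    the list of abstract texts retrieved for that gene.
    """
    all_abs = [t for texts in abstracts_by_gene.values() for t in texts]
    sel_abs = [t for g in gene_set for t in abstracts_by_gene.get(g, [])]
    if not sel_abs:
        return []
    bg_df = document_frequency(all_abs)
    sel_df = document_frequency(sel_abs)
    scored = []
    for word, n_sel in sel_df.items():
        if n_sel < min_docs:
            continue  # ignore keywords seen in too few selected abstracts
        fold = (n_sel / len(sel_abs)) / (bg_df[word] / len(all_abs))
        if fold >= min_fold:
            scored.append((word, n_sel, fold))
    # Highest fold enrichment first; ties broken alphabetically.
    return sorted(scored, key=lambda r: (-r[2], r[0]))


if __name__ == "__main__":
    abstracts_by_gene = {
        "G1": ["Apoptosis signalling in hepatocytes",
               "Caspase activation drives apoptosis"],
        "G2": ["Apoptosis of liver cells"],
        "G3": ["Glucose metabolism in muscle"],
        "G4": ["Lipid metabolism and glucose uptake"],
        "G5": ["Muscle fibre contraction"],
        "G6": ["Glucose transport in muscle tissue"],
    }
    for word, n, fold in overrepresented_keywords({"G1", "G2"}, abstracts_by_gene):
        print(word, n, round(fold, 2))
```

Counting each keyword once per abstract (document frequency) rather than counting every occurrence keeps a single long abstract from dominating the score; this is one possible design choice among several.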
For example, GenClip [99], one of the most recent
tools, builds functional clusters of genes related to
disease pathogenesis starting from a list of genes
from microarray. The tool first identifies keywords
as terms that co-occur in at least two of the analyzed
genes by mining literature abstracts and then clusters
the list of genes based on keyword occurrences, thus
obtaining functional clusters. Differently, Chagoyen
et al. [100] propose a system for literature profiling of
large sets of genes or proteins that can be used to find
similarities among genes. The method starts by
creating a pool of documents related to a specific
gene. Afterwards, the pool of documents is converted
into a vector space representation and finally,
non-negative matrix factorization [101] is applied
to the vector space, thus obtaining for each gene a
literature profile (a literature profile can be seen as a
picture of the functional relationships, derived from
scientific papers, between a set of genes). CoPub
[102] provides an insight into the biological mechanisms
related to a set of regulated genes for liver
pathologies by calculating statistics for gene-keyword
co-occurrences using the entire set of MEDLINE abstracts,
instead of only a subset, as the previous approaches
do. The inputs of the tool are a subset of genes
obtained by microarray data processing and a set of
keywords, whereas a navigable network of
MEDLINE abstracts where the genes and the keywords
co-occur is provided as output. The text
mining method extracts networks of abstracts by
analyzing the co-occurrences of human, mouse and
rat genes with keywords describing liver pathologies,
pathways, GO terms, diseases, drugs and tissues. An
Faro et al.