There are a number of open-source text analytic applications. The Natural Language
Processing Group at Sheffield University, United Kingdom, has developed “GATE –
General Architecture for Text Engineering” [16], which claims a large user community but
comes with a modular structure that may confuse newcomers. RapidMiner Enterprise [17]
offers a free community edition of its data mining software, but I have not been able to
download the associated text analysis plug-in. AutoMap, software developed by
Carnegie Mellon University’s Center for Computational Analysis of Social and
Organizational Systems (CASOS) [18] (Carley 2009), has good text preparation tools (see
Sidebar below). Tutorials and sponsorship indicate that lately it has been enhanced
chiefly for terrorist network detection.
This is a fast-moving field; documentation, public upgrades, and any reasonable
effort to make informed selections can barely keep pace with it. Without an education in
linguistics, even the statistically minded outsider can participate only up to a point. For
the hurried consultant (and for the harried monitor), these applications seem too heavy
and too solitary in their community. An exception could be made for situations in which
network structures – physical, social and semantic – have to be investigated deeply. An
example that comes to mind is Moore et al.’s (2003) study of the Mozambique 2000
flood response, assuming that the authors collected a considerable number of documents
from the 65 NGOs in the network.
[Sidebar:] Text preprocessing for effective analysis
TextSTAT takes word forms as they come. The search string “empower”, for example, returns
“empower”, “empowering”, and “empowerment”. Their instances must be inspected in a separate
concordance for each form. This is not optimal in the search for concept prominence and for
underlying meaning structures. As we have already seen in the Wordscores section, reducing
words to their radicals, through an operation called stemming, may make for better and easier
analysis. Stemming is one of several text preprocessing steps that natural language processing
software such as AutoMap provides beyond the elementary functionalities of TextSTAT and
Wordscores. A brief enumeration of these operations may give a first idea of the processes
involved. I largely follow Leser (2008), with some additions from Carley (2009):
• Format conversion: The software may require conversion of all corpus documents into
one particular format that it can read, such as .txt.
• Removal of special characters and/or numbers: This facilitates indexing and
searching.
• Conversion to lower case: Combines words that happen to be lower or title case by
accident of sentence position, but loses abbreviations and makes named entity
recognition more difficult.
• Stop word removal: Removes frequent words whose absence does not normally change
document meaning in text analytics (it would in everyday language, including our normal
reading!). The ten most frequent stop words in English are: the, of, and, to, a, in, that, is,
was, it. Removing the top six (the, of, and, to, a, in) typically eliminates a fifth of the
tokens.
16 http://gate.ac.uk/index.html
17 http://rapid-i.com/component/option,com_frontpage/Itemid,1/lang,en/
18 http://www.casos.cs.cmu.edu/projects/automap/. Updated in June 2009, after my initial trial.
• Named entity recognition: Proper names are important both for naive understanding
and for the search for latent meaning in texts. Many consist of more than one token. The
Lutheran World Federation is not some federation of all Lutheran worlds, whatever this
could mean, but the worldwide federation of Lutheran churches.
• Part-of-speech tagging: Attaching to each word a tag indicating its supposed function
within its sentence later helps with processing the text in “actor – organization –
activities” and similar schemes.
• Anaphora resolution: In natural language, the meaning of most pronouns is made clear
by grammar and context. “The LWF delegates passed two resolutions. They discussed
them again the following morning” means “The delegates discussed the resolutions
again the following morning.” In text analytics, such references may need to be made explicit.
• Stemming: Reduces words to their base forms so that different word forms with the
same meaning are collapsed. Often these are neither a standard word in the language
(e.g., “theolog”), nor the exact linguistic root.
• Thesaurus creation: A set of fixed terms and relationships between them allows texts to
be organized in a hierarchical manner. Thus “group liability” may be part of “loan
repayment”, but not of “technical support”. Both may be part of “microcredit”.
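The part-of relationships in the last bullet can be sketched in a few lines of Python. The terms are the illustrative ones from the text above; the parent-link representation is my own toy device, not a format that AutoMap or any other package actually uses:

```python
# Toy thesaurus as parent links: each term points to its broader term.
PARENT = {
    "group liability": "loan repayment",
    "loan repayment": "microcredit",
    "technical support": "microcredit",
}

def ancestors(term):
    """Return all broader terms above `term` in the hierarchy, narrowest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

print(ancestors("group liability"))   # ['loan repayment', 'microcredit']
print(ancestors("technical support")) # ['microcredit']
```

Such a structure lets an analyst roll passages tagged with narrow terms up to broader concepts; “group liability” counts toward “microcredit”, but never toward “technical support”.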
To repeat, the intent is not to equip oneself for all these operations, but to acquire a sensibility for
some of the ways in which modern text analytics deals with linguistic complexity.
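As a rough illustration of how several of these steps combine, here is a minimal Python sketch of lower-casing, stop word removal and stemming. The suffix list and the stemmer are deliberately crude stand-ins for real algorithms such as Porter’s, and nothing here reflects how AutoMap or TextSTAT actually work internally:

```python
import re

# Toy suffix list -- a crude stand-in for a real stemming algorithm.
SUFFIXES = ["ment", "ing", "ed", "er", "s"]

def stem(word):
    """Strip the longest matching suffix, keeping a radical of 3+ letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The six most frequent English stop words named in the text above.
STOP_WORDS = {"the", "of", "and", "to", "a", "in"}

def preprocess(text):
    """Lower-case, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

sample = "The delegates spoke of empowering women and of empowerment in the villages."
print(preprocess(sample))
# "empowering" and "empowerment" collapse to the same radical, "empower",
# which a form-by-form concordance would have kept apart.
```

Even this toy pipeline shows why stemmed radicals need not be standard words, and how removing a handful of stop words visibly thins out the token stream.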
Qualitative research software
Much of this paper has so far dealt with word lists or term lists. Common sense and
linguistics, however, tell us that meaning resides in sentences rather than in words. In fact,
it resides in word, sentence and wider context – in what preceded, and (if already known
or anticipated) in what follows.
The necessity to pay close attention to meaning structures is one of numerous reasons that
have spawned an explosion of qualitative research, and more recently also of “mixed
methods” approaches (the combined use of qualitative and quantitative methods). The
methodological field is vast and growing (Denzin and Lincoln (2005) is one among many
large handbook-type works) and does not concern us here except to point to the existence
of text analysis applications specifically couched in qualitative research traditions.
Apart from commercial packages, some of which have attracted a community of users of
consequential size and support – leaders include “Atlas.ti” (Hwang 2008) and “QDA
Miner” (Lewis and Maas 2007) – a few open-source applications are available (for links
to some, see again Altman, op.cit.). Notable for the institutional prominence of their
sponsor, EZ-Text [19] and AnSWR [20] are two applications created within the US Centers for
Disease Control and Prevention (CDC), primarily to support qualitative research with patients.
Given scant experience with such applications, I limit my observations to two, regarding
both text analytics and qualitative research:
19 http://www.cdc.gov/hiv/topics/surveillance/resources/software/ez-text/index.htm
20 http://www.cdc.gov/hiv/topics/surveillance/resources/software/answr/index.htm. I installed this software
on a computer some years back, but found the documentation insufficient. The AnSWR Web page has not
been modified since May 2007.
First, the learning curve is clearly much steeper. Some of the research institutes or sellers
behind such software organize training courses; typically these last a full week. As to
freeware, more than once I found that the documentation was outdated (it taught an
earlier version) or too abridged to guide self-learners through the initial hurdles.
Second, it is true that humanitarian and development evaluation ToR occasionally
demand qualitative approaches. The buzzword for these kinds of expectations is
“triangulation”. But one may wonder whether the desk officers drafting the ToR are
conscious of the demands that serious triangulation places on an evaluation team and its
host organization. As far as the computer applications are concerned, some other factors
conspire against their use in evaluations and similar assignments. Apart from rare and
lucky partnerships with local academics already familiar with the particular program that
the expatriate team member brings to the task, reliance on advanced software during team
work may turn the user into a social and cognitive isolate.
Discussing barriers to successful mixed-method approaches, Bryman (2007) explicitly
mentions synchronization issues: “The timelines of the quantitative and qualitative
components may be out of kilter so that one is completed sooner than the other” (ibd.:
14). Which side advances faster depends also on institutional barriers to acquiring
documents speedily – say, policy documents from capital city headquarters vs.
spreadsheets from decentralized field monitoring units [21]. Bamberger et al. (op.cit.: 84)
are generally pessimistic about the use of qualitative data analysis packages under time
pressure: “they take a long time to set up and the purpose is usually to provide more
comprehensive analysis rather than to save time.”
This is not to deny that there are situations in evaluations and field research in which
advanced text analytical and qualitative research software significantly enhances
productivity. Davis, in a workshop report on panel surveys and life history methods
(Baulch and Scott 2006), relates the use of such a program for the subsequent
categorization of life histories that he collected among the poor of Bangladesh (ibd.: 8).
Yet, by and large, the decision to invest the time (and, for commercial products, money)
in learning and working with such applications must be weighed by the individual
researcher considering her personal situation.
21 I have been embroiled in similar dilemmas myself. At one time, I was hired as the number cruncher in a
politically sensitive review of a large UN humanitarian program (and creatively designated as “relief
economist”). All the other researchers in the team were qualitative-leaning. Due to the accident of data
acquisition, I was the only one with “showable output” by the time the team presented at a conference
attended by openly hostile government bureaucrats. Predictably, the presentation of relief goods
transportation scenarios was singled out for contextual gaps. These were caused by the delay in working up
historical and institutional aspects. Research software played a minor role in this to the extent that the
political sensitivity obliged my fellow team members to reference hosts of slowly arriving documents in
time-consuming watertight bibliographical annotations.
[Sidebar:] Food vendors and meaning structures
Another young man in Dili, East Timor, let me take this picture of his pretty arrangement of
clementines, exuding a tranquility free from all time pressures. Unlike his age mates on
the title page, he sells an article unqualified by any texts, carefully managed with a local
technology.
Yet, tranquility is the exception, not the rule. Itinerant food vendors move almost continuously,
rapidly commuting between places and hours that incline their customers to buy. The work is hard,
competition stiff. The man bore two such clusters, dangling from a shoulder yoke.
The picture does drive home a point in text analysis. Shaped by the physical properties of fruit
and string, as well as by the man’s stamina, marketing savvy and personal preferences, the
cluster behaves as an analog computer. We notice the hexagonal compaction; the position of
every fruit can be described with just a few parameters of an almost perfect lattice.
At the same time, this high degree of order conveys no knowledge whatsoever of the properties
of other emergent levels. Seeing the cluster tells us nothing about whether these fruit have seeds,
or how much money the vendor makes when he sells them. Similarly, the statistical structures
that text analysis may detect say nothing about the ultimate meaning of a text as a whole, let
alone of its pragmatic consequences. They do give us internal landmarks that facilitate the holistic
quest.
Outlook: The dictatorship of time and the community of learners
This paper made four basic assumptions:
1. Humanitarian and development workers at times work with voluminous, complex
or otherwise difficult text documents.
2. Such situations may necessitate more than revision, ultimately prompting a new
text that interprets those at hand.
3. Often this type of work needs to be done within tough time constraints.
4. Computer-assisted text analysis can make it more efficient.
The temporal dimension is thus the leading one in this rationale. This can be questioned.
The social and substantive dimensions of working with the text documents of relief
agencies, social movement NGOs, the Red Cross, etc. may seem, in the minds of some,
to hold more important directives. After all, what follows from the fact of life that time is
always short?
The social dimension covers the reliability of text analysis – would another consultant
interpret the same texts differently? – as well as such other aspects as the impact of the
digital divide on collaborative arrangements. In the substantive dimension, there are
validity challenges. It is not unknown to find reports, some with far-reaching claims, in
which “text analysis” is hardly more than a codeword for insufficient field exposure. And,
do the constructs and metrics of text analysis actually prove anything beyond, or distinct
from, what the original texts purport to convey?
These are important questions, but I defend the “dictatorship of time” on two grounds:
• First, any intelligent reading of texts is time-consuming. The discussion and synthesis
of the findings, in working teams and then with principals, may take even more
time. Devices that accelerate the initial processing of texts liberate time for later
synthesis, debate and other important activities such as field visits. They help to
redistribute the elements of learning processes while at the same time giving us a
firmer handle on those texts of which we must take note.
• Second, besides the chronological and social time of the group that works with a
shared set of texts, every participant lives his or her own biographical time. This
includes the rhythms at which we replenish our professional and technical skills.
You and I lose some skills inadvertently, shed obsolete ones deliberately, strive
for some beneficial new ones, and remain ignorant of many others that would pay
even greater dividends. We don’t do it alone. Yet, the windows for learning
together, across social boundaries and divergent agendas, remain open for brief
moments only. Alone I learn for years; this particular group together – maybe for
one hour. If others are to use my tools, I need to arrange a rapid transfer.