Chapter 12. Full Text Search
to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats became rat, and the punctuation sign - was ignored.
The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, rats became rat because one of the dictionaries recognized that the word rats is a plural form of rat. Some words are recognized as stop words (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are a, on, and it. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign - because there are in fact no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration (Section 12.7). It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.
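When no configuration is given explicitly, the one named by the default_text_search_config setting is used. As a quick check (the value shown depends on your installation's settings), it can be inspected and changed for the current session:

```sql
-- Show the configuration used when to_tsvector is called without one
SHOW default_text_search_config;

-- Use a specific predefined configuration for this session
SET default_text_search_config = 'pg_catalog.english';
```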
The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.
Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document:

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
Here we have used setweight to label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section 12.4.1 gives details about these operations.)
12.3.2. Parsing Queries
PostgreSQL provides the functions to_tsquery and plainto_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than plainto_tsquery, but is less forgiving about its input.
to_tsquery([ config regconfig, ] querytext text) returns tsquery
to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the Boolean operators & (AND), | (OR) and ! (NOT). These operators can be grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11. The difference is that while basic tsquery input takes the tokens at face value, to_tsquery normalizes each token to a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:

SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery
---------------
 'fat' & 'rat'
As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:

SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB
Also, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery
--------------------------
 'supern':*A & 'star':*AB
Such a lexeme will match any word in a tsvector that begins with the given string.

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'
Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND or OR operator.
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
plainto_tsquery transforms unformatted text querytext to tsquery. The text is parsed and normalized much as for to_tsvector, then the & (AND) Boolean operator is inserted between surviving words.

Example:

SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'
Note that plainto_tsquery cannot recognize Boolean operators, weight labels, or prefix-match labels in its input:
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery
---------------------
 'fat' & 'rat' & 'c'
Here, all the input punctuation was discarded as being space symbols.
12.3.3. Ranking Search Results
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

This function requires lexeme positional information to perform its calculation. Therefore, it ignores any "stripped" lexemes in the tsvector. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the strip function and positional information in tsvectors.)
For both these functions, the optional weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
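As an illustrative sketch (the numeric score depends on the ranking formula and the data, so no result is shown), a custom weights array is passed as the first argument; the array below simply restates the defaults, so this call behaves the same as omitting it:

```sql
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector('english', 'The Fat Rats'), 'A'),
               to_tsquery('english', 'rats'));
```

Raising the last (A-weight) entry would further boost matches on lexemes labeled A, such as title words.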
Since a longer document has a greater chance of containing a query term it is reasonable to take
into account document size, e.g., a hundred-word document with five instances of a search word is
probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer normalization option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).
• 0 (the default) ignores the document length
• 1 divides the rank by 1 + the logarithm of the document length
• 2 divides the rank by the document length
• 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
• 8 divides the rank by the number of unique words in document
• 16 divides the rank by 1 + the logarithm of the number of unique words in document
• 32 divides the rank by itself + 1
If more than one flag bit is specified, the transformations are applied in the order listed.
It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.
Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
12.3.4. Highlighting Results
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.
ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. The configuration to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is used.
If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:
• StartSel, StopSel: the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. You must double-quote these strings if they contain spaces or commas.
• MaxWords, MinWords: these numbers determine the longest and shortest headlines to output.
• ShortWord: words of this length or less will be dropped at the start and end of a headline. The default value of three eliminates common English articles.
• HighlightAll: Boolean flag; if true the whole document will be used as the headline, ignoring the preceding three parameters.
• MaxFragments: maximum number of text excerpts or fragments to display. The default value of zero selects a non-fragment-oriented headline generation method. A value greater than zero selects fragment-based headline generation. This method finds text fragments with as many query words as possible and stretches those fragments around the query words. As a result query words are close to the middle of each fragment and have words on each side. Each fragment will be of at most MaxWords and words of length ShortWord or less are dropped at the start and end of each fragment. If not all query words are found in the document, then a single fragment of the first MinWords in the document will be displayed.
• FragmentDelimiter: When more than one fragment is displayed, the fragments will be separated by this string.
Any unspecified options receive these defaults:
StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "
For example:
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'));

                        ts_headline
------------------------------------------------------------
 containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'),
  'StartSel = <, StopSel = >');

                     ts_headline
-------------------------------------------------------
 containing given <query> terms
 and return them in order of their <similarity> to the
 <query>.
ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care. A typical mistake is to call ts_headline for every matching document when only ten documents are to be shown. SQL subqueries can help; here is an example:
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
      FROM apod, to_tsquery('stars') q
WHERE ti @@ q
ORDER BY rank DESC
LIMIT 10) AS foo;
12.4. Additional Features
This section describes additional functions and operators that are useful in connection with text search.
12.4.1. Manipulating Documents
Section 12.3.1 showed how raw textual documents can be converted into tsvector values. PostgreSQL also provides functions and operators that can be used to manipulate documents that are already in tsvector form.
tsvector || tsvector

The tsvector concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing to_tsvector on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)
One advantage of using concatenation in the vector form, rather than concatenating text before applying to_tsvector, is that you can use different configurations to parse different sections of the document. Also, because the setweight function marks all lexemes of the given vector the same way, it is necessary to parse the text and do setweight before concatenating if you want to label different parts of the document with different weights.
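A minimal sketch of the concatenation behavior described above; note how the positions of the right-hand vector are offset past the left-hand maximum (here, 2):

```sql
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
         ?column?
---------------------------
 'a':1 'b':2,5 'c':3 'd':4
```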
setweight(vector tsvector, weight "char") returns tsvector

setweight returns a copy of the input vector in which every position has been labeled with the given weight, either A, B, C, or D. (D is the default for new vectors and as such is not displayed on output.) These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.

Note that weight labels apply to positions, not lexemes. If the input vector has been stripped of positions then setweight does nothing.
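For example (every position receives the new label A, including the position previously labeled B; D, being the default, would not be displayed):

```sql
SELECT setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A');
           setweight
-------------------------------
 'cat':3A 'fat':2A,4A 'rat':5A
```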
length(vector tsvector) returns integer

Returns the number of lexemes stored in the vector.
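For example (three distinct lexemes, regardless of how many positions each has):

```sql
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
 length
--------
      3
```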
strip(vector tsvector) returns tsvector

Returns a vector which lists the same lexemes as the given vector, but which lacks any position or weight information. While the returned vector is much less useful than an unstripped vector for relevance ranking, it will usually be much smaller.
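For example (positions and the A weight label are discarded, leaving only the lexemes):

```sql
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
       strip
-------------------
 'cat' 'fat' 'rat'
```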
12.4.2. Manipulating Queries
Section 12.3.2 showed how raw textual queries can be converted into tsquery values. PostgreSQL also provides functions and operators that can be used to manipulate queries that are already in tsquery form.
tsquery && tsquery

Returns the AND-combination of the two given queries.

tsquery || tsquery

Returns the OR-combination of the two given queries.

!! tsquery

Returns the negation (NOT) of the given query.
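A short illustration of two of these operators (output formatting may vary slightly between server versions; note that && parenthesizes the OR-subquery to preserve its grouping):

```sql
SELECT 'fat | rat'::tsquery && 'cat'::tsquery;
         ?column?
---------------------------
 ( 'fat' | 'rat' ) & 'cat'

SELECT !! 'cat'::tsquery;
 ?column?
----------
 !'cat'
```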
numnode(query tsquery) returns integer

Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the query is meaningful (returns > 0), or contains only stop words (returns 0).
Examples:
SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo & bar'::tsquery);
 numnode
---------
       3
querytree(query tsquery) returns text

Returns the portion of a tsquery that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms. For example:

SELECT querytree(to_tsquery('!defined'));
 querytree
-----------

12.4.2.1. Query Rewriting
The ts_rewrite family of functions search a given tsquery for occurrences of a target subquery, and replace each occurrence with a substitute subquery. In essence this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule. A collection of such rewrite rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries (Section 12.6.4). However, you can modify a set of rewrite rules on-the-fly without reindexing, whereas updating a thesaurus requires reindexing to be effective.
ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns tsquery

This form of ts_rewrite simply applies a single rewrite rule: target is replaced by substitute wherever it appears in query. For example:

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'
ts_rewrite (query tsquery, select text) returns tsquery

This form of ts_rewrite accepts a starting query and a SQL select command, which is given as a text string. The select must yield two columns of tsquery type. For each row of the select result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current query value. For example:

CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' & 'c'