calibre User Manual, Release 2.57.1
characters including actual characters, numbers, punctuation and so-called whitespace (linebreaks, tabulators etc.).
Please note that generally, uppercase and lowercase characters are not considered the same, thus “a” being a different
character from “A” and so forth. In calibre, regular expressions are case insensitive in the search bar, but not in the
conversion options. There’s a way to make every regular expression case insensitive, but we’ll discuss that later. It gets
complicated because regular expressions allow for variations in the strings they match, so one expression can match
multiple strings, which is why people bother using them at all. More on that in a bit.
Care to explain?
Well, that’s why we’re here. First, this is the most important concept in regular expressions: A string by itself is a
regular expression that matches itself. That is to say, if I wanted to match the string "Hello, World!" using
a regular expression, the regular expression to use would be Hello, World!. And yes, it really is that simple.
You’ll notice, though, that this only matches the exact string "Hello, World!", not e.g. "Hello, wOrld!" or
"hello, world!" or any other such variation.
That doesn’t sound too bad. What’s next?
Next is the beginning of the really good stuff. Remember where I said that regular expressions can match multiple
strings? This is where it gets a little more complicated. Say, as a somewhat more practical exercise, the ebook you
wanted to convert had a nasty footer counting the pages, like “Page 5 of 423”. Obviously the page number would rise
from 1 to 423, thus you’d have to match 423 different strings, right? Wrong, actually: regular expressions allow you
to define sets of characters that are matched. To define a set, you put all the characters you want to be in the set into
square brackets. So, for example, the set [abc] would match either the character “a”, “b” or “c”. Sets will always
only match one of the characters in the set. They “understand” character ranges, that is, if you wanted to match all the
lower case characters, you’d use the set [a-z]; for lower- and uppercase characters you’d use [a-zA-Z] and so on.
Got the idea? So, obviously, using the expression Page [0-9] of 423 you’d be able to match the first 9 pages,
thus reducing the expressions needed to three: The second expression Page [0-9][0-9] of 423 would match
all two-digit page numbers, and I’m sure you can guess what the third expression would look like. Yes, go ahead.
Write it down.
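If you’d like to try sets outside of calibre, here is a minimal sketch using Python’s re module (the same regular
expression flavour the Python documentation describes). The variable names are made up for illustration, and the third
expression is deliberately left out so the exercise above stays yours.

```python
import re

# A set matches exactly one character out of those listed in the brackets.
found = re.findall(r"[abc]", "cabbage")

# Ranges work inside sets: one lowercase or uppercase letter.
is_letter = re.fullmatch(r"[a-zA-Z]", "Q") is not None

# The first two page-footer expressions from the text.
one_digit = r"Page [0-9] of 423"
two_digits = r"Page [0-9][0-9] of 423"

print(found)
print(is_letter)
print(re.fullmatch(one_digit, "Page 5 of 423") is not None)
print(re.fullmatch(one_digit, "Page 42 of 423") is not None)
print(re.fullmatch(two_digits, "Page 42 of 423") is not None)
```

re.fullmatch is used here so that an expression only counts as matching when it covers the whole footer.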
Hey, neat! This is starting to make sense!
I was hoping you’d say that. But brace yourself, now it gets even better! We just saw that using sets, we could match
one of several characters at once. But you can even repeat a character or set, reducing the number of expressions
needed to handle the above page number example to one. Yes, ONE! Excited? You should be! It works like this:
Some so-called special characters, “+”, “?” and “*”, repeat the single element preceding them. (Element means either
a single character, a character set, an escape sequence or a group (we’ll learn about those last two later), in short, any
single entity in a regular expression.) These characters are called wildcards or quantifiers. To be more precise, “?”
matches 0 or 1 of the preceding element, “*” matches 0 or more of the preceding element and “+” matches 1 or more
of the preceding element. A few examples: The expression a? would match either “” (which is the empty string, not
strictly useful in this case) or “a”, the expression a* would match “”, “a”, “aa” or any number of a’s in a row, and,
finally, the expression a+ would match “a”, “aa” or any number of a’s in a row (Note: it wouldn’t match the empty
string!). Same deal for sets: The expression [0-9]+ would match every non-negative integer number there is! I know what you’re
thinking, and you’re right: If you use that in the above case of matching page numbers, wouldn’t that be the single one
expression to match all the page numbers? Yes, the expression Page [0-9]+ of 423 would match every page
number in that book!
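The three quantifiers can be checked with a small Python sketch. The helper matches_whole is a hypothetical name
introduced here for illustration; it simply reports whether an expression covers an entire string.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

# "?" means 0 or 1, "*" means 0 or more, "+" means 1 or more.
optional_a = [matches_whole(r"a?", s) for s in ("", "a", "aa")]
any_a = [matches_whole(r"a*", s) for s in ("", "a", "aaaa")]
some_a = [matches_whole(r"a+", s) for s in ("", "a", "aaaa")]

# One expression for every footer in the 423-page book.
footer = r"Page [0-9]+ of 423"
footers_ok = all(matches_whole(footer, f"Page {n} of 423") for n in range(1, 424))

print(optional_a, any_a, some_a, footers_ok)
```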
Note: A note on these quantifiers: They generally try to match as much text as possible, so be careful when using
them. This is called “greedy behaviour”, and I’m sure you get why. It gets problematic when you, say, try to match a tag.
Consider, for example, the string "<p class="calibre2">Title here</p>" and let’s say you’d want to
match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You’d think that the
expression <p.*> would match that tag, but actually, it matches the whole string! (The character “.” is another special
character. It matches anything except linebreaks, so, basically, the expression .* would match any single line you can
think of.) Instead, try using <p.*?> which makes the quantifier "*" non-greedy. That expression would only match
the first opening tag, as intended. There’s actually another way to accomplish this: The expression <p[^>]*> will
match that same opening tag; you’ll see why after the next section. Just note that there quite frequently is more than
one way to write a regular expression.
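To see greedy and non-greedy matching side by side, here is a short Python sketch run against the example string from
the note. The variable names are illustrative only.

```python
import re

text = '<p class="calibre2">Title here</p>'

# Greedy: ".*" grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<p.*>", text).group(0)

# Non-greedy: ".*?" stops at the first ">" that lets the match succeed.
lazy = re.search(r"<p.*?>", text).group(0)

# Alternative: match any run of characters that are not ">".
negated = re.search(r"<p[^>]*>", text).group(0)

print(greedy)
print(lazy)
print(negated)
```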
Well, these special characters are very neat and all, but what if I wanted to match a dot or a question mark?
You can of course do that: Just put a backslash in front of any special character and it is interpreted as the literal
character, without any special meaning. This pair of a backslash followed by a single character is called an escape
sequence, and the act of putting a backslash in front of a special character is called escaping that character. An escape
sequence is interpreted as a single element. There are of course escape sequences that do more than just escaping
special characters, for example "\t" means a tabulator. We’ll get to some of the escape sequences later. Oh, and by
the way, concerning those special characters: Consider any character we discuss in this introduction as having some
function to be special and thus needing to be escaped if you want the literal character.
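A small Python sketch of what escaping changes. The sample sentence is invented for illustration; re.escape is a
standard library helper that adds the backslashes for you, which is handy when you paste a literal string into an
expression.

```python
import re

text = "Is it 3.14 or 3x14?"

# Unescaped, the dot is a special character and matches any character.
unescaped = re.findall(r"3.14", text)

# Escaped, the dot only matches a literal dot.
escaped = re.findall(r"3\.14", text)

# The same goes for the question mark.
question_marks = re.findall(r"\?", text)

# re.escape builds an expression that matches a string literally.
literal = re.escape("3.14?")

print(unescaped, escaped, question_marks)
print(re.fullmatch(literal, "3.14?") is not None)
```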
So, what are the most useful sets?
Knew you’d ask. Some useful sets are [0-9] matching a single digit, [a-z] matching a single lowercase letter,
[A-Z] matching a single uppercase letter, [a-zA-Z] matching a single letter and [a-zA-Z0-9] matching a single
letter or digit. You can also use an escape sequence as shorthand:
\d is equivalent to [0-9]
\w is equivalent to [a-zA-Z0-9_]
\s is equivalent to any whitespace
Note: “Whitespace” is a term for anything that won’t be printed. These characters include space, tabulator, line feed,
form feed and carriage return.
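The shorthands can be tried out in Python as sketched below. One assumption to be aware of: in current Python
versions \d and \w also match non-ASCII digits and letters by default, so the sketch passes re.ASCII to get exactly
the equivalences listed above. The sample string is made up.

```python
import re

sample = "Page 12\tof_423\n"

# With re.ASCII, \d behaves like [0-9] and \w like [a-zA-Z0-9_].
digits = re.findall(r"\d", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)

# \s matches whitespace such as space, tabulator and line feed.
spaces = re.findall(r"\s", sample)

print(digits)
print(words)
print(spaces)
```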
As a last note on sets, you can also define a set as any character but those in the set. You do that by including the
character "^" as the very first character in the set. Thus, [^a] would match any character excluding “a”. That’s
called complementing the set. Those escape sequence shorthands we saw earlier can also be complemented: "\D"
means any non-number character, thus being equivalent to [^0-9]. The other shorthands can be complemented by,
you guessed it, using the respective uppercase letter instead of the lowercase one. So, going back to the example
<p[^>]*> from the previous section, now you can see that the character set it’s using tries to match any character
except for a closing angle bracket.
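Complemented sets and the uppercase shorthands behave as sketched below in Python. The sample strings are invented
for illustration.

```python
import re

# [^a] matches any single character except "a".
not_a = re.findall(r"[^a]", "banana")

# \D is the complement of \d: anything that is not a digit.
non_digits = re.findall(r"\D", "a1b22")

# \S is the complement of \s: runs of non-whitespace characters.
chunks = re.findall(r"\S+", "two  words")

print(not_a)
print(non_digits)
print(chunks)
```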
But if I had a few varying strings I wanted to match, things get complicated?
Fear not, life still is good and easy. Consider this example: The book you’re converting has “Title” written on every
odd page and “Author” written on every even page. Looks great in print, right? But in ebooks, it’s annoying. You
can group whole expressions in normal parentheses, and the character "|" will let you match either the expression
to its right or the one to its left. Combine those and you’re done. Too fast for you? Okay, first off, we group the
expressions for odd and even pages, thus getting (Title) and (Author) as our two needed expressions. Now we
make things simpler by using the vertical bar ("|" is called the vertical bar character): If you use the expression
(Title|Author) you’ll either get a match for “Title” (on the odd pages) or you’d match “Author” (on the even
pages). Well, wasn’t that easy?
You can, of course, use the vertical bar without using grouping parentheses, as well. Remember when I said that
quantifiers repeat the element preceding them? Well, the vertical bar works a little differently: The expression
Title|Author will also match either the string “Title” or the string “Author”, just as the above example using grouping.
The vertical bar selects between the entire expression preceding and following it. So, if you wanted to match the
strings “Calibre” and “calibre” and wanted to select only between the upper- and lowercase “c”, you’d have to use
the expression (c|C)alibre, where the grouping ensures that only the “c” will be selected. If you were to use
c|Calibre, you’d get a match on the string “c” or on the string “Calibre”, which isn’t what we wanted. In short: If
in doubt, use grouping together with the vertical bar.
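The difference between the grouped and the ungrouped vertical bar shows up quickly in a Python sketch. The names
grouped and ungrouped are just labels chosen for this example.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

header = r"(Title|Author)"
grouped = r"(c|C)alibre"
ungrouped = r"c|Calibre"

# The header expression accepts either running head.
print(matches_whole(header, "Title"), matches_whole(header, "Author"))

# Grouped: only the first letter varies.
print(matches_whole(grouped, "calibre"), matches_whole(grouped, "Calibre"))

# Ungrouped: the choice is between "c" and "Calibre" as a whole.
print(matches_whole(ungrouped, "calibre"), matches_whole(ungrouped, "c"))
```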
You missed...
... wait just a minute, there’s one last, really neat thing you can do with groups. If you have a group that you previously
matched, you can use references to that group later in the expression: Groups are numbered starting with 1, and you
reference them by escaping the number of the group you want to reference, thus, the fifth group would be referenced
as \5. So, if you searched for ([^ ]+) \1 in the string “Test Test”, you’d match the whole string!
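Here is the group reference from above as a small Python sketch. The second test string is an invented counterexample
to show that \1 insists on the same text the group matched, not merely on text of the same shape.

```python
import re

doubled = r"([^ ]+) \1"

# The group captures "Test", and \1 demands exactly that text again.
match = re.fullmatch(doubled, "Test Test")
captured = match.group(1)

# A different second word does not satisfy the reference.
mismatch = re.fullmatch(doubled, "Test Best")

print(captured)
print(mismatch is None)
```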
In the beginning, you said there was a way to make a regular expression case insensitive?
Yes, I did, thanks for paying attention and reminding me. You can tell calibre how you want certain things handled
by using something called flags. You include flags in your expression by using the special construct (?flags go
here) where, obviously, you’d replace “flags go here” with the specific flags you want. For ignoring case, the flag
is i, thus you include (?i) in your expression. Thus, test(?i) would match “Test”, “tEst”, “TEst” and any case
variation you could think of.
Another useful flag lets the dot match any character at all, including the newline, the flag s. If you want to use multiple
flags in an expression, just put them in the same statement: (?is) would ignore case and make the dot match all. It
doesn’t matter which flag you state first, (?si) would be equivalent to the above. By the way, good places for putting
flags in your expression would be either the very beginning or the very end. That way, they don’t get mixed up with
anything else.
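A hedged Python sketch of the two flags. One caveat that applies to Python itself rather than to anything stated in
this tutorial: Python 3.11 and later only accept such whole-expression flags at the very beginning of the expression,
so the sketch always puts them first.

```python
import re

# (?i) makes the whole expression ignore case.
case_variants = [re.fullmatch(r"(?i)test", s) is not None
                 for s in ("Test", "tEst", "TEst")]

# Without the s flag, the dot refuses to match a newline.
dot_plain = re.fullmatch(r"a.b", "a\nb") is not None

# With (?s), the dot matches the newline too.
dot_all = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# Flags can be combined, in either order.
both_is = re.fullmatch(r"(?is)A.B", "a\nb") is not None
both_si = re.fullmatch(r"(?si)A.B", "a\nb") is not None

print(case_variants, dot_plain, dot_all, both_is, both_si)
```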
I think I’m beginning to understand these regular expressions now... how do I use them in calibre?
Conversions
Let’s begin with the conversion settings, which are really neat. In the Search and Replace part, you can input a regexp
(short for regular expression) that describes the string that will be replaced during the conversion. The neat part is the
wizard. Click on the wizard staff and you get a preview of what calibre “sees” during the conversion process. Scroll
down to the string you want to remove, select and copy it, and paste it into the regexp field on top of the window. If there
are variable parts, like page numbers or so, use sets and quantifiers to cover those, and while you’re at it, remember to
escape special characters, if there are some. Hit the button labeled Test and calibre highlights the parts it would replace
were you to use the regexp. Once you’re satisfied, hit OK and convert. Be careful if your conversion source has tags
like this example:
Maybe, but the cops feel like you do, Anita. What's one more dead vampire?
New laws don't change that. </p>
<p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv
<a href="http://www.processtext.com/abclit.html" class="calibre3">erter,
http://www.processtext.com/abclit.html</a></b></p>
<p class="calibre4"> It had only been two years since Addison v. Clark.
The court case gave us a revised version of what life was
(shamelessly ripped out of this thread, see footnote 78). You’d have to remove some of the tags as well. In this example, I’d
recommend beginning with the tag <b class="calibre2">. Now you have to end with the corresponding closing
tag (opening tags are <tag>, closing tags are </tag>), which is simply the next </b> in this case. (Refer to a good
HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using <b.*?>,
the closing tag using </b>, thus we could remove everything between those tags using <b.*?>.*?</b>. But using
this expression would be a bad idea, because it removes everything enclosed by <b>-tags (which, by the way, render
the enclosed text in bold print), and it’s a fair bet that we’ll remove portions of the book in this way. Instead, include
the beginning of the enclosed string as well, making the regular expression
<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>
The \s with quantifiers are included here instead of explicitly using the spaces as seen in the string to catch any
variations of the string that might occur. Remember to check what calibre will remove to make sure you don’t remove
any portions you want to keep if you test a new expression. If you only check one occurrence, you might miss a
mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you
actually wanted to, calibre tries to repair the damaged code after doing the removal.
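To get a feel for what that expression removes, here is a sketch that applies it to the sample above with Python’s
re module. This is not how calibre’s wizard is implemented; it only imitates the replacement. Because the sample spans
several lines and the dot does not match linebreaks by default, the sketch compiles the expression with re.DOTALL
(the s flag), which is an assumption made for this example.

```python
import re

sample = """Maybe, but the cops feel like you do, Anita. What's one more dead vampire?
New laws don't change that. </p>
<p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv
<a href="http://www.processtext.com/abclit.html" class="calibre3">erter,
http://www.processtext.com/abclit.html</a></b></p>
<p class="calibre4"> It had only been two years since Addison v. Clark.
The court case gave us a revised version of what life was"""

# re.DOTALL lets ".*?" run across the line breaks inside the <b> element.
pattern = re.compile(
    r"<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>", re.DOTALL
)

# Replace the advertisement with nothing and keep the rest of the text.
cleaned = pattern.sub("", sample)

print(cleaned)
```

The surrounding <p> tags are left in place, which is the kind of leftover the text above says calibre tries to tidy up.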
Adding books
Another thing you can use regular expressions for is to extract metadata from filenames. You can find this feature in
the “Adding books” part of the settings. There’s a special feature here: You can use field names for metadata fields, for
example (?P<title>) would indicate that calibre uses this part of the string as book title. The allowed field names
are listed in the window, together with another nice test field. An example: Say you want to import a whole bunch
of files named like Classical Texts: The Divine Comedy by Dante Alighieri.mobi (obviously, this is
already in your library, since we all love classical Italian poetry) or Science Fiction epics: The Foundation
Trilogy by Isaac Asimov.epub. This is obviously a naming scheme that calibre won’t extract any meaningful
data out of: its standard expression for extracting metadata is (?P<title>.+) - (?P<author>[^_]+). A regular
expression that works here would be [a-zA-Z]+: (?P<title>.+) by (?P<author>.+). Please note that, inside
the group for the metadata field, you need to use expressions to describe what the field actually matches. And also note
that, when using the test field calibre provides, you need to add the file extension to your testing filename, otherwise
you won’t get any matches at all, despite using a working expression.
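Named groups like (?P<title>...) are ordinary Python regular expression syntax, so the expression can be tried outside
calibre as sketched below. The function parse_filename is hypothetical, and stripping the file extension before
matching is this sketch’s own choice; it does not describe how calibre itself processes filenames.

```python
import os
import re

pattern = re.compile(r"[a-zA-Z]+: (?P<title>.+) by (?P<author>.+)")

def parse_filename(filename):
    """Return a dict with 'title' and 'author', or None if nothing matches."""
    # Drop the extension so it does not end up in the author field.
    stem, _extension = os.path.splitext(filename)
    match = pattern.search(stem)
    if match is None:
        return None
    return match.groupdict()

divine = parse_filename("Classical Texts: The Divine Comedy by Dante Alighieri.mobi")
trilogy = parse_filename("Science Fiction epics: The Foundation Trilogy by Isaac Asimov.epub")

print(divine)
print(trilogy)
```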
Bulk editing metadata
The last part is regular expression search and replace in metadata fields. You can access this by selecting multiple
books in the library and using bulk metadata edit. Be very careful when using this last feature, as it can do Very
Bad Things to your library! Doublecheck that your expressions do what you want them to using the test fields, and
only mark the books you really want to change! In the regular expression search mode, you can search in one field,
replace the text with something and even write the result into another field. A practical example: Say your library
contained the books of Frank Herbert’s Dune series, named after the fashion Dune 1 - Dune, Dune 2 - Dune
Messiah and so on. Now you want to get Dune into the series field. You can do that by searching for (.*?) \d+ - .*
in the title field and replacing it with \1 in the series field. See what I did there? That’s a reference to the first
group you’re replacing the series field with. Now that you have the series all set, you only need to do another search
for .*? - (including the spaces around the hyphen) in the title field and replace it with "" (an empty string), again in
the title field, and your metadata is all
neat and tidy. Isn’t that great? By the way, instead of replacing the entire field, you can also append or prepend to the
field, so, if you wanted the book title to be prepended with series info, you could do that as well. As you by now have
undoubtedly noticed, there’s a checkbox labeled Case sensitive, so you won’t have to use flags to select behaviour
here.
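The two search and replace steps can be rehearsed in Python before touching your library. In this sketch re.sub stands
in for calibre’s bulk edit, and split_series is a hypothetical helper name; the expressions are the ones from the
Dune example.

```python
import re

def split_series(title):
    """Return (series, cleaned_title) for titles like 'Dune 2 - Dune Messiah'."""
    # Step 1: keep only the first group, which is the series name.
    series = re.sub(r"(.*?) \d+ - .*", r"\1", title)
    # Step 2: drop everything up to and including " - " from the title.
    cleaned_title = re.sub(r".*? - ", "", title)
    return series, cleaned_title

first = split_series("Dune 1 - Dune")
second = split_series("Dune 2 - Dune Messiah")

print(first)
print(second)
```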
Well, that just about concludes the very short introduction to regular expressions. Hopefully I’ll have shown you
enough to at least get you started and to enable you to continue learning by yourself. A good starting point would be
the Python documentation for regexps (footnote 79).
78 http://www.mobileread.com/forums/showthread.php?t=75594
79 https://docs.python.org/2/library/re.html
One last word of warning, though: Regexps are powerful, but also really easy to get wrong. calibre provides really
great testing possibilities to see if your expressions behave as you expect them to. Use them. Try not to shoot yourself
in the foot. (God, I love that expression...) But should you, despite the warning, injure your foot (or any other body
parts), try to learn from it.
Credits
Thanks for helping with tips, corrections and such:
• ldolse
• kovidgoyal
• chaley
• dwanthny
• kacir
• Starson17
For more about regexps see The Python User Manual (footnote 80).
1.9.5 Integrating the calibre content server into other servers
Here, we will show you how to integrate the calibre content server into another server. The most common reason for
this is to make use of SSL or more sophisticated authentication. There are two main techniques: Running the calibre
content server as a standalone process and using a reverse proxy to connect it with your main server or running the
content server in process in your main server with WSGI. The examples below are all for Apache 2.x on Linux, but
should be easily adaptable to other platforms.
Contents
• Using a reverse proxy
• In process
Note: This only applies to calibre releases >= 0.7.25
Using a reverse proxy
A reverse proxy is when your normal server accepts incoming requests and passes them on to the calibre server. It then
reads the response from the calibre server and forwards it to the client. This means that you can simply run the calibre
server as normal without trying to integrate it closely with your main server, and you can take advantage of whatever
authentication systems your main server has in place. This is the simplest approach as it allows you to use the binary
calibre install with no external dependencies/system integration requirements. Below is an example of how to achieve
this with Apache as your main server, but it will work with any server that supports reverse proxies.
First start the calibre content server as shown below:
calibre-server --url-prefix /calibre --port 8080
80 https://docs.python.org/2/library/re.html