58
After establishing a mapping between the original spreadsheet and a rela-
tional database schema, we may want to use SQL to query the spreadsheet.
Regarding the house renting information, one may want to know who are the
clients ofthe properties that where rented between January, 2000 and January
2002? Such queriesare difficult to formulate in the spreadsheet environment. In
SQL, the above question can be formulated as follows:
select clientNo from rent
where rentStart between ’1/01/00’ and ’1/01/02’
Below we will demonstrate that the automatically derived mapping can be ex-
ploited to fire such SQL queriesat the original or the optimized spreadsheet.
In the next sections, we will formalize the correspondence between spread-
sheetsandrelationalschemasusingdatarefinementrules. Wewillpresentformal
proofs that guarantee their correctness. Moreover, we will presenta framework
that implements the transformation rules andincludesfrontends to well-known
spreadsheetsystems.Infact,the example presentedinthissectionwasprocessed
byour framework.
3 From Functional Dependencies to RDB Schemas
This section explains how to extract functional dependencies from the spread-
sheetdata andhow to construct the relational schema. We startbyintroducing
some conceptsrelated to relationaldatabases. Then, we presentanalgorithmto
extractfunctional dependencies fromdata. Finally, we describe how to use the
FDsto create a relational schema.
RelationalDatabases and Functional Dependencies A relational database DB=
{R
1
,...,R
n
}isa collection of namedrelations (or tables), R
i
⊆d
1
×...×d
k
,de-
finedoverdatasetsnotnecessarilydistinct.Eachrelation’sR
i
element(d
1
,...,d
k
)
iscalledatuple(orrow)andeachd
i
iscalledanattribute. Eachtuple isuniquely
identified by a minimum nonempty set of attributes called primary key (PK).
It could be the case of existing more then one set suitable for becoming the
primary key. They are designated candidate keys and onlyone ischosen to be-
comes primarykey. A foreign key (FK)isa setofattributeswithinone relation
that matches the primary key of some relation. A relational database schema
(RDB)isa setofrelation schemaseach ofwhich being a setofattributes. These
concepts are illustratedin Figure 3.
Another important concept is functional dependency (FD) between sets of
attributes within a relation [5]. A set B is functionally dependent on a set A,
denoted A B, if eachvalue of A is associated with exactly one value of B.
The normalisation of a database is important to prevent data redundancy.
Although, there are more normal forms, in general, a RDB is considered nor-
malised ifit respectsthe third normal form(3NF) [8], that is, if itrespectsthe
second normal formal (2NF) and all the non-key attributes are only dependent
VB.NET PDF- View PDF Online with VB.NET HTML5 PDF Viewer PDF to text, C#.NET convert PDF to images, C#.NET PDF file & pages edit, C#.NET PDF pages extract, copy, paste, C# Select text and image on PDF document. 2.
search pdf for text in multiple files; search pdf documents for text
38
Attributes
{
Relation
Tuples
Fig.3. An exampleof arelationthat represents part of ourexample.
onthekeyattributes.A relationrespectsthe 2NF ifitisinthe firstnormal form
(1NF) and itsnon-keyattributes are not functionallydependent on part ofthe
key. Finally, the 1NF is respectedif eachelementofeachrow contains only one
element.
In orderto define the RDB schema, we use the data mining algorithm Fun
[18] to compute the FD given a spreadsheet, and then database techniques,
namely Maier’s algorithm[16], to compute the RDB schema in the 3NF.
We have expressed Fun as the Haskell fun function. Next, we execute
this function with our running example (the arguments propSch and propRel
correspondto the first and remaining lines ofthe spreadsheet, respectively).
∗ghcifun propSch propRel
ownerNo oName
clientNo cName
totalDays clientNo,cName
propertyNo pAddress,rentPerDay,ownerNo,oName
pAddress propertyNo,rentPerDay,ownerNo,oName
...
The FDsderived bythe Funalgorithm depend heavilyonthe quantityand
quality of the data. Thus, for small samples of data, or data that exhibits
too many or too few dependencies, the Fun algorithm may not produce the
desired FDs. For instance, in our running example and considering only the
data shown on Figure 1, the Fun algorithm does not induce the following FD
clientNo,propertyNo rentStart,rentFinish,total rent,totalDays.
3.1 Spreadsheet Formulas
Functional dependencies are the basis for defining the RDB schema. The Fun
algorithm, however, may compute redundant FDs which may have a negative
impact on the design ofthe RDB schema. In this section, we discuss character-
istics ofspreadsheets thatcanbe used to define a more precise set offunctional
dependencies.
Spreadsheets use formulas to define the values of some elements in terms
of other elements. For example, in the house renting spreadsheet, the column
totalDays is computed by subtracting the column rentFinish to rentStart, and
it is usually written as follows G3 = F3 - E3. This formula states that the
VB.NET PDF - View PDF with WPF PDF Viewer for VB.NET Tools Tab. Item. Name. Description. Ⅰ. Hand. Pan around the PDF document. Ⅱ. Select. Select text and image to copy and paste using Ctrl+C and Ctrl+V.
how to select text in pdf reader; search text in multiple pdf
85
values of F3 and E3 determine the value of G3, thus inducing the following
functional dependency: rentStart,rentFinish totalDays. Note also that
totalDays is the primary key of a FD produced by the Fun algorithm, namely
totalDays clientNo,cName.Primarykeys,however,mustbe constantsrather
than formulas. Thus, such FDs should be eliminated.
Formulas can have references to other formulas. Consider, for example, the
second formula of the running example I3 = G3 * H3, which defines the total
rent bymultiplying the number of days by the value ofthe rent. Because G3 is
definedby anotherformula, the values thatdetermine G3 also determine I3. As
aresult, the two formulasinduce the following FDs:
rentStart,rentFinish,rentPerDay total rent
rentStart,rentFinish totalDays
Functional dependencies induced by formulas are added to the ones computed
by the Fun algorithm. In genereal a spreadsheet formula of the form X
0
=
f(X
1
,...,X
n
)induces the following functional dependency: X
1
,...,X
n
X
0
.
Inspreadsheetsystems,formulasareusuallyintroducedbycopyingthemthrough
all the elements in a column, thus making the FD explicit in all the elements.
This may not always be the case and some elements can be defined otherwise
(e.g. byusing aconstantvalue ora differentformula). In thiscase,no functional
dependencyisinduced.
3.2 Computing the RDB Schema
Havingcomputedthefunctionaldependencies,wecannow constructtheschema
of the RDB.Maierin [16]defined analgorithmcalled synthesize thatreceives
asetof FDs and returns a relational database schema respecting the 3NF.
begin synthesize :
Input a set ofFDs F
Output a complete database schema for F
1. finda reduced, minimumannularcover G for F;
2. foreach CFD(X
1
,X
2
,...,X
n
) Y inG, construct a relational schema
R=X
1
X
2
...X
n
Y withdesignated keysK ={X
1
,X
2
,...,X
n
};
3. return the setofrelational schemas constructedin step 2.
end synthesize
This concise, but complex algorithm works as follows: To find a minimum
annular cover G for F we start by compute a minimum cover G
for F. G
is minimum if it has as few FDs as any equivalent set of FDs. Transform G
intoG issimple: just combine the FDs with equivalent left sides into compound
functional dependencies (CFDs) having the form (X
1
,X
2
,...,X
n
) Y where
X
1
,...,X
n
,Y are sets ofFDs andthe leftsetsare equivalent.
NowweneedtoreducethesetofCFDsandthisisachievedwhenalltheCFDs
into the set are reduced. A CFD is reduced if no left set contains any shiftable
attributes and the right side contains no extraneous attributes. An attribute is
70
shiftable ifit can be removed fromthe leftside ofthe FDand insertedinto the
right side without changing the equivalence of the set of FDs. An attribute is
extraneous if it can be removed from the FD without changing the equivalence
of the setof FDs.
We have implementedthisalgorithminHaskell as the synthesizefunction.
It gets as argument the functional dependencies (produced by the Fun) and
returnsa set of CFD. Next, we execute synthesize withthe FDsinducedby our
running example (setsare represented by the predefined Haskell lists).
∗ghcisynthesize◦fun propSch propRel
([ownerNo],[oName])[]
([clientNo],[cName])[]
([totalDays])[cName]
([propertyNo],[pAddress],[rentPerDay])[oName]
([rentStart,rentFinish,rentPerDay])[total rent]
([rentStart,rentFinish])[totalDays]
...
Each CFD defines several candidate keys for each table. However, to fully
characterise the RDBschema we need to chose the primarykeyfromthose can-
didates. To find such key we use a simple algorithm: first, we produce all the
possible tablesusing eachcandidate keyasa primarykey. Forexample, the sec-
ondCFDaboveexpandstotwopossible tableswiththesame attributes:one has
clientNo as primary key and in the other is cName the primary key. Next, we
choosethetablewhichhasthesmallestPK,sinceingenerala’good’tablehasthe
smallestkeyaspossible.Themore attributesthe keydeterminesthebest.A final
cleanup is necessary: we remove FD that all their attributes are already repre-
sentedinotherFDs. We alsomerge twoFDswheneverantecedents/consequents
are subsets. The final resultis listed bellow.
ownerNo oName
clientNo cName
propertyNo pAddress,rentPerDay,ownerNo
clientNo,propertyNo rentStart,rentFinish,total rent,totalDays
Asa final step, the setofforeignkeyshastobecomputedbydetectingwhich
primarykeys are referenced inthe consequentofthe FD.
Next, we show the the RDB schema derived for the house renting system
example. The RDBisrepresentedasa tuple oftables. A table isa mapbetween
the PK and the remaining attributes. This datatype is constrained by an in-
variant, defining the foreign keys, which will be explained in detail in the next
section.
(clientNo ×propertyNo rentStart×rentFinish ×total rent×totalDays×
clientNo cName×
(propertyNo pAddress×rentPerDay ×ownerNo×
ownerNo oName)
inv1
)
inv2
where
inv1 =π
ownerNo
◦ρ◦π
1
⊆δ ◦π
2
inv2 =π
clientNo
◦δ ◦π
1
⊆δ◦π
1
◦π
2
∧π
propertyNo
◦δ◦π
1
⊆δ ◦π
1
◦π
2
◦π
2
89
The tablesare indexed bya binary function thatrepresents the foreign keys
restriction. Thisnotationand operators are introduced in the next section.
4 Constraint-aware Rewriting
As we have explained before, the mapping between the spreadsheet and the
RDBmodelsisperformedthroughdatarefinementsusingthe2LTsystem.Thus,
before we presentthe datarefinementrulesto map spreadsheetsinto databases,
let us briefly describe data refinements and the 2LT system
3
.
4.1 Datatype Refinement
Data refinement theory provides an algebraic framework for calculating with
datatypes. Refining a datatype A to a datatype B can be captured by the fol-
lowing diagram:
A
to
B
from
where
to : A→B is an injective and total relation;
from :B →A a surjective function;
from ·to =id
A
(identityfunctionon A);
We will use A
to
from
Basa shortnotation to the above diagram.
Refinements can be composed, that is, if A
to
from
B and B
to
from
C then
A
to
·to
from·from
C. Also, transformation in specific parts of a datatype must
be propagated to the global datatype in which they are embedded, that is, if
A
to
from
B then FA
Fto
Ffrom
FB where F is a functor that models the context
of the transformation. A functor captures i) the embedding of local datatypes
inside global datatypes and ii) the lifting of value-level functions to and from
on the local datatypes to value-level transformations on the global datatypes.
In the particular case where the refinementworks in both directionswe have an
isomorphismA
∼
=
B.
A common example is that maps are the implementation for lists [9] –
A
seq2index
list
N A – where seq2index creates a map (or finite function,
here represented asMap) where the keysare the indexes ofthe elementsofthe
list. list just collects the elements in the map. For more details about data re-
finementsthe reader is referred to [17,19,23].
Consider now a RDB with two tables, A B and A C. Suppose that
the key of the first table is a foreign key to the key of the second one. This is
representedwiththe datatypeconstraint δ◦π
1
⊆δ◦π
2
,where π
1
and π
2
repre-
sentleft and right projection, respectively, and δ denotes the domain of a map.
3
2LT
stands for Two-Level Transformations. The tool is available at
http://code.google.com/p/2lt/.
58
Constraintsondatatypesaremodelledasbooleanfunctionswhichdistinguishes
between legal valuesand valuesthat violate the constraint. A data type A with
aconstraint φ is represented as A
φ
where φ : A → Bool is a total function So,
ourexample becomesasfollows:((AB)×(AC))
δ◦π
1
⊆δ◦π
2
.Furtherdetails
about constraintdata types can be found in [4,20].
4.2 Two-Level Transformations with Constraints
Thedatarefinementtheorypresentedintheprevioussectionwasimplementedas
arewriting systemnamed 2LT in Haskell [4,9]. We will now brieflyintroduce
this system.
Atype-safe representation of types and functions is constructed using gen-
eralised algebraic datatypes (GADTs) [26]. To represent types, the following
GADT is used:
dataType t where
String::Type String
[·]
::Type a →Type [a]
·· ::Type a →Type b →Typea b
·×·::Type a →Type b →Type (a,b)
Maybe::Type a →Type (Maybe a)
·
·
::Type a →PF (a →Bool)→Type a
...
Each refinement rule is encodedas a two-level rewriting rule:
type Rule =∀a . Type a →Maybe(View (Type a))
dataView a where View ::Rep a b →Type b→View (Typea)
dataRep a b=Rep{to =PF (a →b),from =PF (b →a)}
Althoughthe refinementare froma type a to a type b, this can notbe directed
encoded since the type b is onlyknown when the transformation completes, so
the type b is represented as a view of the type a. A view means that given a
function to which transforms a type a into a type b and a vice versa function
from it ispossible to construct b from a.
These functions are represented in a point-free style, that is, without any
variables. Its representation isaccomplished bythe following GADT:
dataPF a where
π
1
::PF ((a,b)→a)
π
2
::PF ((a,b)→b)
list2set
::PF ([a]→Set a)
ρ
::PF ((a b)→Set b)
δ
::PF ((a b)→Set a)
CompList
::PF ([(a,b)]→[(b,b)])
ListId
::PF ([(a,b)]→[(b,b)])
55
·
::PF (a →b)→PF ([a] →[b])
·
::PF (a →b)→PF (Set a →Set b)
·∧·
::PF (Pred a)→PF (Pred a)→PF (Pred a)
·◦·
::PF (b →c)→PF (a →b)→PF (a →c)
··
::PF (a →b)→PF (a →c)→PF (a →(b,c))
·×·
::PF (a →b)→PF (c→d)→PF ((a,c)→(b,d))
·⊆·
::PF (a →Set b)→PF (a →Set b)→PF (Pred a)
Tables2table ::PF ((a b,cd)→(a b,Maybe d))
Table2tables ::PF ((a b,Maybe d)→(a b,c d))
Table2sstable::PF ((a b)→[(a,b)])
Sstable2table ::PF ([(a,b)]→(a b))
...
To represent datatypes with constraints the following new Type constructor
is used:
·
·
::Typea →PF (a →Bool)→Typea
Itsfirstargumentisthetypetoconstraintandthe secondoneisthe PFfunction
representing the constraint.
Eachrefinementrule canonlybe appliedto aspecific datatype, forinstance,
arefinement on the type A can not be applied to the type A×B. To allow this
some rule-based combinators were created:
nop ::Rule
-- identity
::Rule→Rule →Rule
-- sequential composition
::Rule →Rule →Rule
-- left-biased choice
many ::Rule →Rule
-- repetition
once::Rule →Rule
-- arbitrarydepth rule application
Using combinators, rules can be combined in order to create a full rewrite sys-
tems.
5 From a Database to a Spreadsheet: The Rules
In ourapproachspreadsheetsarerepresentedbya productofspreadsheettables
(fromnow designated assstables).A sstable isaproductofrowsandeachrowis
itself a product of values. Although, we wish to transform a spreadsheet into a
relational database, we will introduce rules to map databases into spreadsheets
sincethe formerare arefinementofthelater.Thus, wewill use the RDBschema
constructed, asexplainedin Section 3.2, and refine itto a spreadsheet. Because
we do this data type migration within the 2LT system, we obtain for free the
functionthatwewillusetomigratethedataofthespreadsheetintothedatabase.
Next, we describe several date refinements needed to transform a relational
database into a spreadsheet. We also present a strategy to apply these refine-
ments in order to obtainthe desirable result.
Documents you may be interested
Documents you may be interested