Automated Conversion of Table-based Websites to Structured
Stylesheets Using Table Recognition and Clone Detection
Andy Y. Mao
James R. Cordy
Thomas R. Dean
School of Computing
Queen’s University
Kingston, Ontario, Canada
{mao, cordy, dean}@cs.queensu.ca
Abstract
Web standards such as XHTML and CSS are rapidly coming into practice
and have many advantages, including compatibility, consistency across
browsers, and increased ease of maintenance. Unfortunately large
numbers of existing websites still use the deprecated table-based
layout style in which page style is unique to each page. Existing
tools for automating the transition to stylesheets provide little
help, converting page-by-page using a flattened structure and local
inline styles rather than a common CSS stylesheet. This approach
ignores hierarchical structure and defeats the main purpose of moving
to the new standard, losing all of the advantages.
In this work we present an automated method for converting table-based
layout websites to standards-compliant modern CSS stylesheet-based
websites using a two-step process. Pages of the site are first
converted page-by-page using table recognition technology to preserve
hierarchical structure and layout semantics in local styles. Software
clone detection technology is then utilized to recognize common layout
styles in the pages and extract and minimize them to a common CSS
stylesheet for the site. The result is a maintainable, efficient
modern standards-compliant website with the same look and feel as the
original but with all the maintenance advantages of a custom
programmed new site.

Copyright © 2007 Andy Y. Mao, James R. Cordy and Thomas R. Dean.
Permission to copy is hereby granted provided the original copyright
notice is reproduced in copies made.
1 Introduction
Long before the maturity of World Wide Web standards, websites
implemented standard layouts and look-and-feel of pages using
table-based layouts that are copied from one page to another, often
because the original sites were generated by early website editors
such as Claris Home Page. Many of these websites are still alive and
actively maintained, and indeed a large number of popular websites
still use traditional table-based layouts. Now that web standards
designed for expressing and maintaining common layout and style such
as XHTML, DIV layout and separate CSS stylesheets have matured, it is
highly desirable to migrate existing legacy websites to the new
technology.
The use of a separate common CSS stylesheet for a site is an example
of the clear advantages offered by such a conversion. Suppose, for
example, that we wished to change an entire website from left-handed
logo form to right-handed, as shown in Figure 1. To make this change
to a traditional table-based layout site, every single page of the
site would have
to be hand edited to move table elements between columns one by one
using copy-paste. The amount of work and level of tedium in
implementing even this simple change would almost certainly lead to
errors and anomalies. By contrast, implementing this change in a
common CSS stylesheet version of the same site would involve only a
change to one style in the stylesheet file, leaving all pages
untouched and vastly reducing the effort and chances for error.

Figure 1: Left- to Right-handed Logo Example
In addition to the clear advantage of a consistent common style across
a site, web standards also offer many other advantages, including
compatibility with modern website editing and searching tools, greater
browser independence, and enhanced ease of maintenance. Ideally every
website should be redesigned to use these new standards from scratch,
but in practice the effort to do so for large websites with
substantial investment can be prohibitive.

Existing automation for migrating table-based legacy websites to the
new web standards such as that offered by Adobe Dreamweaver [1] is at
best cursory, preserving layout of individual pages separately by
absolute pixel placement and localized inline DIV styles, thus losing
all hierarchical structure and commonality of style. The result is a
conversion equivalent to per-page plotting of page elements (Figure
2), yielding a website that is actually less maintainable than the
original and defeating the whole purpose of moving to the new
standards.

In this paper we propose a more realistic and ambitious automated
conversion, leveraging table analysis and source transformation
technology already proven in the document recognition and software
reengineering domains. Our conversion recognizes and preserves
hierarchical structure and commonality of style across
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Table test</title>
<style type="text/css">
div { border:1px red solid;}
</style>
</head>
<body>
<div id="Layer1" style="position:absolute;
left:15px; top:18px; width:244px; height:22px;
z-index:1; vertical-align:middle">content 1</div>
<div id="Layer2" style="position:absolute;
left:286px; top:18px; width:294px; height:46px;
z-index:2; vertical-align:middle">content 2</div>
<div id="Layer3" style="position:absolute;
left:610px; top:18px; width:161px; height:22px;
z-index:3; vertical-align:middle">content 3</div>
<div id="Layer4" style="position:absolute;
left:15px; top:42px; width:244px; height:22px;
z-index:4; vertical-align:middle">content 4</div>
. . .
</body>
</html>
Figure 2: Example Dreamweaver Conversion
Positions are absolute, style attributes are embedded and all table
hierarchy is lost.

the entire site. The result of this automated conversion is a site
that preserves the look and feel of the pages of the original site
while preserving hierarchical layout structure. A common CSS style
file which is essentially identical to one that would be authored in a
disciplined hand-crafted migration is inferred from style similarity
(Figure 3).
Our method utilizes a four step approach, in which web pages are first
converted from HTML to XHTML using a source transformation based on
robust parsing [8]. The table structure of each page is then analyzed
using table recognition methods [22] to separate layout table
structure from intentional data tables and to elucidate implicit
hierarchy represented by row-spanning (ROWSPAN) and column-spanning
(COLSPAN) attributes as explicit nested tables. The augmented explicit
table layout structure is then page-by-page converted to the web
standard DIV-based layout with a separate CSS style file for each
page, preserving the original look and feel. Finally, software clone
detection technology [14] is utilized to recognize common styles and
synthesize a single minimized CSS stylesheet file for the site,
converting each page to use the common stylesheet. The entire process
architecture can be visualized as shown in Figure 4.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Table test</title>
</head>
<style>
#top_left {float:left; margin:auto; width:250px}
#top_left_container_1 {float:left; margin:0;
width:250px}
#top_left_content_1 {float:left; margin:auto;
width:250px; border:1px red solid;}
. . .
</style>
<body>
<div id="top_left">
<div id="top_left_container_1">
<div id="top_left_content_1">
content 1
</div>
<br clear="both"/>
</div>
<div id="top_left_container_1">
<div id="top_left_content_1">
content 4
</div>
<br clear="both"/>
</div>
</div>
. . .
</body>
</html>
Figure 3: Example Conversion by Our Process
Positions relative, style attributes in a separate stylesheet and
nesting hierarchy preserved.
The remainder of this paper is organized as follows. Section 2
outlines the phases of our process in detail and gives small examples
of each transformation on example HTML code. Section 3 demonstrates
the entire process using our experience in converting a real entire
table-based legacy website to modern XHTML/CSS standards. Section 4
relates our work to that of others, and Section 5 discusses
limitations and future extensions of our process. Finally, Section 6
summarizes and concludes the paper.

Figure 4: Conceptual Process
Our process consists of two major steps. In Step 1, table recognition
is used to infer and preserve hierarchical structure in a conversion
from tables to DIVs, separating style information into a stylesheet
for each page. In Step 2, clone detection is used to unify and
minimize styles into a single consistent stylesheet file for the
entire site.
2 Approach
The overall purpose of our approach is to provide an automated
transformation system that preserves the layout structure of the pages
of a web site specified by HTML table layout into a nested
hierarchical DIV structure while retaining the original look and feel,
and to recognize and extract DIV styles to a single unified,
minimized, maintainable CSS style file for the site. The first part is
achieved by converting each page separately, creating a CSS stylesheet
for it independent of the others. The second part uses clone detection
techniques to recognize and minimize common styles into a single
unified stylesheet used by all the pages. Figure 4 shows a conceptual
view of our method.
All steps of our approach are implemented as source transformations in
the TXL source transformation language [6]. Each phase consists of a
number of source transformations, strung together to achieve the
result (Figure 5). In this section we outline the details of these
transformations, using small examples to demonstrate the technique.
2.1 XHTML Conversion
In the first transformation, web pages are independently converted
from HTML to XHTML using a TXL grammar that utilizes robust parsing
[2] to correct for HTML exceptions to XML form. Robust parsing is a
method that attempts to parse each page as an XHTML document, adapting
to exceptions where for example closing tags are missing. The
exceptions are isolated into special nonterminal forms that are then
targeted for correction by TXL source transformation rules, resulting
in a valid XHTML page. Figure 6 shows an example of this
transformation.
2.2 Table Recognition
To preserve the original hierarchical layout semantics of the original
table layout pages into nested DIV sections, we must first understand
what the intended structure is. In HTML table layout, some of the
intended structure is encoded in the ROWSPAN and COLSPAN attributes of
table elements (Figure 7). Since DIV sections have no such
corresponding feature, we must first make this implicit substructure
explicit by transforming the original pages to eliminate ROWSPAN and
COLSPAN while retaining the layout semantics they imply.

In order to do this we have adapted ideas from the table recognition
literature in pattern analysis and machine intelligence research [22].
The methods we have adapted are called projection and partitioning.
The basic idea is that a table cell that spans two or more other cells
in a row or column implies a projected or nested structure on parallel
rows or columns that embeds their corresponding cells in a sub-table.
Table recognition methods such as Handley [9] and Hu et al. [11] use
this idea in analysis of higher level structure of tables in
documents.

In our case we have implemented this idea using a table partitioning
and nesting transformation which forms a part of our conversion to DIV
structures. To assist in the analysis, we compute an approximate
layout for each page by assigning position information to every table
cell using custom attributes. This information is used in the analysis
only; the final result is constrained only by the page's original
style attributes. The conversion proceeds by partitioning ROWSPAN and
COLSPAN structures into nested tables, separating them from parallel
unspanned rows or columns and reducing all table structures to simple
ones without any spans. Nesting of tables retains the relationship
between the elements so that layout is not lost. Figure 8 shows the
result of table partitioning the example of Figure 7.
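
As an illustration only, the position-assignment step described above
can be sketched in Python. This is not the TXL transformation itself:
the function name assign_positions, the (label, rowspan, colspan) cell
encoding and the grid representation are assumptions made for the
example. The sketch places each cell of the Figure 6 table on an
implied grid, skipping slots already covered by a span from an earlier
row.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: each cell is (label, rowspan, colspan),
# and each row lists its cells in source order.
Cell = Tuple[str, int, int]

def assign_positions(rows: List[List[Cell]]) -> Dict[str, Tuple[int, int, int, int]]:
    """Map each cell label to (row, col, rowspan, colspan) on an implied grid."""
    occupied = set()   # grid slots already covered by a cell or one of its spans
    positions = {}
    for r, row in enumerate(rows):
        c = 0
        for label, rowspan, colspan in row:
            # Skip slots covered by a ROWSPAN cell from an earlier row.
            while (r, c) in occupied:
                c += 1
            positions[label] = (r, c, rowspan, colspan)
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return positions

# The table of Figure 6: content 2 spans two rows, content 7 spans two columns.
figure6 = [
    [("content 1", 1, 1), ("content 2", 2, 1), ("content 3", 1, 1)],
    [("content 4", 1, 1), ("content 5", 1, 1)],
    [("content 6", 1, 1), ("content 7", 1, 2)],
]
positions = assign_positions(figure6)
for label, slot in positions.items():
    print(label, slot)
```

With positions of this kind available, a partitioning step can see that
content 5 sits in the third column even though it is only the second
cell of its row, which is the information needed to group it with the
cells parallel to the spanning entry.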
In some table layout sites, there could of course be a conflict
between rowspans as shown in Figure 9. Just as such cases cause
problems for table recognition algorithms, they lead to an ambiguity
for our method, and our prototype conversion system requires hand
intervention to handle such (relatively rare) cases.
2.3 Table Identification
While the same HTML "TABLE" feature is used, not every table in a web
page represents layout information; some tables are intended to
actually be data tables. As part of our conversion, we must identify
which tables should be converted to DIV structures and which not. In
our prototype, this decision is made on a very simple criterion: if
the HTML table structure has a table header (THEAD) or table footer
(TFOOT) tag (i.e., if it has labeled columns or rows), then the table
is assumed to represent a real data table and is not converted.
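
The criterion above is simple enough to sketch with Python's standard
html.parser module. The sketch is illustrative and is not the TXL
transformation used in the prototype; the class and function names are
hypothetical, and it assumes the header and footer tags meant are the
standard HTML thead and tfoot elements. A header or footer tag is
attributed to the innermost table that is open when it is seen.

```python
from html.parser import HTMLParser

class TableClassifier(HTMLParser):
    """Classify each <table>, in document order, as 'data' or 'layout'."""

    HEADER_TAGS = {"thead", "tfoot"}   # assumed data-table markers

    def __init__(self):
        super().__init__()
        self.open_tables = []   # indices of tables currently open (innermost last)
        self.kinds = []         # one entry per table encountered

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "table":
            self.open_tables.append(len(self.kinds))
            self.kinds.append("layout")
        elif tag in self.HEADER_TAGS and self.open_tables:
            self.kinds[self.open_tables[-1]] = "data"

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()

def classify_tables(html):
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return parser.kinds

# A layout table whose single cell contains a real data table.
page = (
    "<table><tr><td>"
    "<table><thead><tr><th>Year</th></tr></thead>"
    "<tr><td>2007</td></tr></table>"
    "</td></tr></table>"
)
kinds = classify_tables(page)
print(kinds)
```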
The identification of tables to be converted is done using another TXL
source transformation that marks layout tables to be converted to DIV
using a custom XML tag that also gives each table a unique name for
use in attaching CSS styles to it later. Figure 10 shows part of the
result of table identification on an example page.
Figure 5: Implementation Architecture
2.4 Conversion to DIVs with Local Styles
Following table partitioning and identification, identified layout
tables in pages are converted from table structures to DIV partitions
for each table element, each with its own individual local inline
style preserving the style attributes of the original table cell.
Figure 11 shows part of the result of the DIV conversion of an example
page. Relative positioning implied by table rows is maintained in the
result using the FLOAT="LEFT" style attribute.

Like all of our stages, the conversion to DIV is done using TXL source
transformation rules. Figure 12 shows the main transformation rule
replaceTableByDIV for identified tables. In the usual TXL style, this
one rule automatically searches to match and convert every identified
table in the input. As part of the transformation, it uses the
transformation subrule replaceTrByDIV to convert each row of the
table, and so on. The generated DIVs are uniquely identified by the
table id generated in the table identification step. These ids will
attach each DIV to its corresponding style in the next step.
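
For readers unfamiliar with TXL, the shape of this conversion can be
approximated in Python with the standard xml.etree.ElementTree module.
The sketch below is a stand-in, not the rule of Figure 12: the function
name table_to_div is hypothetical, the input is assumed to be
well-formed XHTML with all spans already eliminated, and nested tables
inside cells are carried over unchanged rather than converted. The id
scheme and the trailing br element follow the output shown in Figure
11.

```python
import xml.etree.ElementTree as ET

def table_to_div(table, table_id):
    """Convert one span-free <table> element into nested <div> elements."""
    # The table's own attributes are kept inline; the generated id wins.
    div = ET.Element("div", {**table.attrib, "id": table_id})
    for i, tr in enumerate(table.findall("tr"), start=1):
        row_id = f"{table_id}_tr{i}"
        row_div = ET.SubElement(div, "div", {"id": row_id})
        for j, td in enumerate(tr.findall("td"), start=1):
            cell_id = f"{row_id}_td{j}"
            cell_div = ET.SubElement(row_div, "div", {**td.attrib, "id": cell_id})
            cell_div.text = td.text      # leading text of the cell
            cell_div.extend(list(td))    # any child elements of the cell
        # Each row ends with a clearing break, as in Figure 11.
        ET.SubElement(row_div, "br", {"clear": "both"})
    return div

source = ET.fromstring(
    '<table float="left">'
    '<tr><td width="250">content 1</td></tr>'
    '<tr><td width="250">content 4</td></tr>'
    '</table>'
)
converted = table_to_div(source, "table1")
print(ET.tostring(converted, encoding="unicode"))
```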
A web-based human interface allows the operator of the conversion
process to choose more meaningful names for the generated DIVs at this
stage (Figure 13). As the operator enters new names, the interface
automatically changes the XHTML source of the generated DIVs to
correspond. Figure 14 shows the converted DIV example of Figure 11
after renaming.
2.5 Separation of CSS Style Files
Following the conversion to DIV form with inline local styles, another
transformation gathers and converts all styles in each page into an
individual CSS style file for the page, using the table ids of the
previous step. This is done
<html>
<body>
<table width=100% align=left>
<tr>
<td width=250>
<p>
Content 1 has two paragraphs.
<p>
This is the second one.
</td>
<td rowspan=2 width=300>
Content 2 is a row-spanning entry.
</td>
<td>
<p>
But Content 3 has one.
</td>
</tr>
<tr>
<td width=250>
Content 4 also has two paragraphs.
<p>
This is the second.
</td>
<td width=150>
Content 5.
</td>
</tr>
<tr>
<td>
Content 6.
</td>
<td colspan=2>
Content 7 is a column-spanning entry.
</td>
</tr>
</table>
</html>
<html>
<body>
<table width="100%" align="left">
<tr>
<td width="250">
<p>
Content 1 has two paragraphs.
</p>
<p>
This is the second one.
</p>
</td>
<td rowspan="2" width="300">
Content 2 is a row-spanning entry.
</td>
<td>
<p>
But Content 3 has one.
</p>
</td>
</tr>
<tr>
<td width="250">
Content 4 also has two paragraphs.
<p>
This is the second.
</p>
</td>
<td width="150">
Content 5.
</td>
</tr>
<tr>
<td>
Content 6.
</td>
<td colspan="2">
Content 7 is a column-spanning entry.
</td>
</tr>
</table>
</body>
</html>
Figure 6: Conversion to XHTML
Figure 7: Row- and Column-spans in Table Layout
The layout specified by the table in Figure 6. (Borders shown to make
the layout visible.)
using a TXL transformation that extracts the local style parameters
such as font and alignment from each DIV and creates a CSS style for
it, named using the DIV's unique table id. As part of this
transformation, the order of parameters in each generated style is
normalized by sorting into alphabetical order so that similar styles
are more easily detected in the next phase.

Figure 15 shows a portion of the corresponding extracted CSS style
file for the example of Figure 11. Following this step all pages of
the site have been converted to DIV layout, each page with its own
individual CSS style file.
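
The extraction and normalization described above can be sketched in a
few lines of Python. The function name extract_style and the dictionary
representation of a DIV's inline attributes are assumptions for
illustration; the actual transformation is written in TXL. The point of
the sketch is the sorting step: two DIVs whose attributes were written
in different orders produce textually identical style bodies.

```python
def extract_style(div_id, attributes):
    """Build a CSS rule for one DIV, with properties sorted by name."""
    lines = [f"#{div_id} {{"]
    for name in sorted(attributes):
        lines.append(f"    {name}: {attributes[name]};")
    lines.append("}")
    return "\n".join(lines)

# Same properties, different source order (values as in Figure 15).
rule_1 = extract_style(
    "top_left_content_1",
    {"width": "250", "float": "left", "widthi": "266", "margin": "auto"},
)
rule_2 = extract_style(
    "top_left_content_2",
    {"margin": "auto", "widthi": "266", "float": "left", "width": "250"},
)
print(rule_1)
```

After sorting, the two rules differ only in their selector line, which
is what allows a purely textual comparison to find them in the next
phase.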
2.6 Clone Detection on Styles
While the DIV conversion results in a completely migrated XHTML, DIV
and CSS-based web standard website, it still has the undesirable
property that each page has its own CSS style file. The remaining
problem is the integration of these styles to a single uniform
stylesheet for the entire website. To achieve this result we employ
clone detection technology [14] borrowed from our previous software
re-engineering work [7] to recognize similar styles across pages and
integrate them into a single global CSS stylesheet file.

The process of CSS style clone detection is
Figure 8: Table Partitioning to Eliminate Row and Column Spanning
Table partitioning converts COLSPAN and ROWSPAN attributes to
equivalent nested table structures, reducing all layout tables to
simple ones. (Borders shown to make the layout visible.)
Figure 9: Conflictual Rowspan Example
Our prototype is not yet able to automatically handle row partitioning
for this example, and hand assistance is required. (Borders shown to
make the layout visible.)
<tag id="table1">
<table float="left">
<tag id="table1_tr1">
<tr>
<tag id="table1_tr1_td1">
<td width="250" widthi="266">
content 1
</td>
</tag>
</tr>
</tag>
<tag id="table1_tr2">
<tr>
<tag id="table1_tr2_td1">
<td width="250" widthi="266">
content 4
</td>
</tag>
</tr>
</tag>
</table>
</tag>
Figure 10: Table Identification Example
XML custom tags ("<tag>") mark and uniquely name each component of
tables identified as layout tables.

achieved by two linked source transformations, one which works on the
pages' CSS files to detect and unify style clones and another that
<div id="table1" float="left">
<div id="table1_tr1">
<div id="table1_tr1_td1" width="250"
widthi="266">
content 1
</div>
<br clear="both"/>
</div>
<div id="table1_tr2">
<div id="table1_tr2_td1" width="250"
widthi="266">
content 4
</div>
<br clear="both"/>
</div>
</div>
<div id="table2" float="left">
<div id="table2_tr1">
<div id="table2_tr1_td1" width="300"
widthi="266">
content 2
</div>
<br clear="both"/>
</div>
</div>
Figure 11: DIV Conversion Example
Style attributes such as WIDTH remain inline at this stage. The WIDTHI
style attribute is an artifact of the layout stage of our process and
will be removed later.
rule replaceTableByDIV
replace [html_interesting_element]
<tag 'id=TableIDParam [stringlit]>
<table RptTableParams 
[repeat html_any_tag_parameter]>
RptTableContents 
[repeat html_table_content]
</table>
</tag>
construct TrtoDIV [repeat div_tag]
_ [replaceTrByDIV each RptTableContents]
by
<div 'id=TableIDParam RptTableParams>
TrtoDIV
</div>
end rule
Figure 12: Main TXL Transformation Rule for DIV Conversion
Figure 13: Interface for Hand Renaming

works on the pages themselves to update the style references in DIV
sections to refer to the new unified style names.
In the first transformation, all of the CSS files for individual pages
are concatenated into one merged file. A simple TXL pattern-matching
rule searches the merged file for exact clones of each CSS style.
Subsequent clones are marked with the name of the original style and a
table of equivalences is output as a list to a clone table file
(Figure 16). Once the clone table file has been output, the merged CSS
file is optimized by removing all marked clones to yield a minimal CSS
style file for the entire website.
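
A minimal Python sketch of this exact-clone step follows, assuming the
merged file has already been parsed into a mapping from style name to
its property dictionary. The function name detect_clones and this data
representation are hypothetical; the paper's implementation is a TXL
pattern-matching rule over the CSS text. The first style seen with a
given body is kept as the original, and every later style with the same
body is recorded as its clone.

```python
def detect_clones(styles):
    """Return (minimal_styles, clone_table) for a merged style mapping.

    styles maps each style name to a dict of CSS properties, in file order.
    clone_table is a list of (original_name, clone_name) pairs.
    """
    first_seen = {}    # normalized body -> name of the first style with it
    minimal = {}
    clone_table = []
    for name, properties in styles.items():
        body = tuple(sorted(properties.items()))   # order-independent key
        original = first_seen.setdefault(body, name)
        if original == name:
            minimal[name] = properties             # first occurrence: keep
        else:
            clone_table.append((original, name))   # later occurrence: clone
    return minimal, clone_table

# A few of the styles from Figure 15.
merged = {
    "top_left": {"float": "left", "margin": "auto"},
    "top_left_container_1": {"float": "left", "margin": "auto"},
    "top_left_content_1": {"float": "left", "margin": "auto",
                           "width": "250", "widthi": "266"},
    "top_left_content_2": {"float": "left", "margin": "auto",
                           "width": "250", "widthi": "266"},
    "top_middle_content": {"float": "left", "margin": "auto",
                           "width": "300", "widthi": "266"},
}
minimal, clone_table = detect_clones(merged)
for original, clone in clone_table:
    print(f'"{original}" -> "{clone}"')
```

The printed pairs have the same original-to-clone form as the entries
of Figure 16.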
<div id="top_left" float="left">
<div id="top_left_container_1">
<div id="top_left_content_1" width="250"
widthi="266">
content 1
</div>
<br clear="both"/>
</div>
<div id="top_left_container_2">
<div id="top_left_content_2" width="250"
widthi="266">
content 4
</div>
<br clear="both"/>
</div>
</div>
<div id="top_middle" float="left">
<div id="top_middle_container">
<div id="top_middle_content" width="300"
widthi="266">
content 2
</div>
<br clear="both"/>
</div>
</div>
Figure 14: DIV Conversion Example After Renaming

The second transformation is then run on every page of the site,
updating style references of each DIV according to the clone table.
Each style reference is looked up in the clone table and changed to
the name of the style of which it is a clone. The final result is a
website with a single merged, optimized CSS file used by all pages of
the site, as if the site had been hand crafted to the modern web
standard.
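
The lookup-and-rename step can be illustrated with a short Python
sketch. It assumes that style references are the id attributes of the
generated DIVs, as in Figures 11 and 14, and that the clone table is a
list of (original, clone) pairs; the function name resolve_clones and
the regular-expression approach are choices made for this example, not
the paper's TXL transformation.

```python
import re

def resolve_clones(page_source, clone_table):
    """Rewrite each id="..." reference to the style it is a clone of."""
    canonical = {clone: original for original, clone in clone_table}

    def rename(match):
        name = match.group(2)
        # Names that are not clones are left unchanged.
        return f'{match.group(1)}"{canonical.get(name, name)}"'

    return re.sub(r'(\bid=)"([^"]*)"', rename, page_source)

clone_table = [
    ("top_left", "top_left_container_1"),
    ("top_left_content_1", "top_left_content_2"),
]
page = (
    '<div id="top_left">\n'
    '<div id="top_left_container_1">\n'
    '<div id="top_left_content_2">content 4</div>\n'
    '</div>\n'
    '</div>'
)
resolved = resolve_clones(page, clone_table)
print(resolved)
```

As in Figure 3, several DIVs can end up carrying the same name after
this step, since all clones now refer to one shared style.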
2.7 Hand Tuning
The final step in the process is the hand tuning of the generated CSS
styles to exactly match minor details of the original look and feel.
Typically this involves adding a bit of extra margin space to the
styles for some DIV blocks and removing an occasional redundant
attribute. This step usually requires only a few minutes of web
programmer time to complete.
3 Experience
Our method has been tested on a number of
example websites with varying levels of table
layout complexity ranging from simple layout
#top_left {
float: left;
margin: auto;
}
#top_left_container_1 {
float: left;
margin: auto;
}
#top_left_container_2 {
float: left;
margin: auto;
}
#top_left_content_1 {
float: left;
margin: auto;
width: 250;
widthi: 266;
}
#top_left_content_2 {
float: left;
margin: auto;
width: 250;
widthi: 266;
}
#top_middle {
float: left;
margin: auto;
}
#top_middle_container {
float: left;
margin: auto;
}
#top_middle_content {
float: left;
margin: auto;
width: 300;
widthi: 266;
}
Figure 15: Example Extracted CSS Style File
As style attributes are extracted to a CSS style file for each DIV
converted page, they are removed from the DIVs in the page so that all
style information appears only in the style file.

to complex ROWSPAN and COLSPAN structures in order to validate our
table recognition algorithms and the ability of our method to preserve
look and feel. In addition, two real entire table-based legacy
websites, one with simple table layout and one with complex, one
originally generated using Claris Home Page and one with MS Front
Page, have been converted to test our clone detection and CSS
generation methods. This section outlines our experiences with some of
these examples, first with two simple layout sites and then two
complex.
"top_left" -> "top_left_container_1"
"top_left" -> "top_left_container_2"
"top_left" -> "top_middle"
"top_left" -> "top_middle_container"
"top_left" -> "top_right"
"top_left" -> "top_right_container_1"
"top_left" -> "top_right_container_2"
"top_left" -> "a_top_left"
"top_left" -> "a_top_left_container_1"
"top_left" -> "a_top_left_container_2"
"top_left" -> "a_top_middle"
"top_left" -> "a_top_middle_container"
"top_left" -> "a_top_right"
"top_left" -> "a_top_right_container_1"
"top_left" -> "a_top_right_container_2"
"top_left" -> "b_top_left"
"top_left" -> "b_top_left_container_1"
"top_left" -> "b_top_left_container_2"
"top_left" -> "b_top_middle"
"top_left" -> "b_top_middle_container"
"top_left" -> "b_top_right"
"top_left" -> "b_top_right_container_1"
"top_left" -> "b_top_right_container_2"
"top_left_content_1" -> "top_left_content_2"
"top_left_content_1" -> "a_top_left_content_1"
"top_left_content_1" -> "a_top_left_content_2"
"top_left_content_1" -> "b_top_left_content_1"
"top_left_content_1" -> "b_top_left_content_2"
"top_middle_content" -> "a_top_middle_content"
"top_middle_content" -> "b_top_middle_content"
Figure 16: Partial Example Clone Table
As well as an integrated CSS style file, clone detection generates a
clone equivalence table for use by the clone resolution
transformation.
3.1 Queen's School of Computing Home Page
The home page of the School of Computing's website was authored and is
maintained by hand in HTML using table layout, but with pre-existing
CSS styles that must be retained in the result, making it an
interesting different kind of challenge for our method. The layout is
relatively simple, involving no ROWSPANs in the tables. Figure 17
shows the result of converting the front page of this site using our
method, preserving existing style references (such as "class=wong")
while introducing our own for the newly generated DIVs. New styles
generated by our process are concatenated to the existing CSS style
file for the site, yielding an identical look and feel (Figure 18).
Figure 18: School of Computing Home Page Before and After Conversion
<div id="topcontainer">
<!--H1 -->
<div id="sep1">
<div id="topcontainercontent">
. . .
</div>
</div>
<div id="sep1">
<div id="leftsep1" class="wong">
<!-- -->
</div>
<div id="rightsep1">
. . .
</div>
<br clear="both"/>
</div>
<div id="sep1">
<div id="sep2bg">
<!-- -->
</div>
</div>
</div>
Figure 17: School of Computing Home Page
Following Conversion
3.2 IEEE Kingston Website
The IEEE Kingston Section website is small, consisting of only seven
HTML pages and 2,781 lines of code. It is a simple layout site, with
no ROWSPANs in its table structures, but uses a highly complex
hierarchical set of nested table structures to lay out its components.
Due to this complexity, conversion of the site initially generated
seven CSS files totalling 5,598 lines of style specifications, posing
a challenge for our clone detection and style minimization steps.
Clone detection found 769 cloned styles that could be minimized and
removed (see example Figure 19), reducing the final CSS style file for
#topnavitem1 {
float:  left;
margin:  auto;
widthi:  57;
}
#topnavitem2 {
float:  left;
margin:  auto;
widthi:  57;
}
#topnavitem3 {
float:  left;
margin:  auto;
widthi:  57;
}
#topnavitem4 {
float:  left;
margin:  auto;
widthi:  57;
}
#topnavitem5 {
float:  left;
margin:  auto;
widthi:  57;
}
#topnavitem6 {
float:  left;
margin:  auto;
widthi:  57;
}
#topnavitem7 {
float:  left;
margin:  auto;
widthi:  57;
}
. . .
#topnavitem1 {
float:  left;
margin:  auto;
widthi:  57;
}
Figure 19: A Portion of the CSS Style File for the IEEE Kingston
Website Before and After Clone Detection

the site to only 323 lines. Due to the large number of similar
generated styles, the renaming stage for this site used a significant
amount of human interaction time. This points to a potential
limitation of our method that may need to be addressed in future work.
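
To put the reported sizes in proportion, the short calculation below
derives the relative reduction in style-file length from the two line
counts given above (5,598 lines before clone detection, 323 after).
The variable names are illustrative only.

```python
lines_before = 5598   # total lines in the seven per-page CSS files
lines_after = 323     # lines in the final unified CSS file

reduction = 1 - lines_after / lines_before
print(f"Style lines removed: {lines_before - lines_after}")
print(f"Relative reduction: {reduction:.1%}")
```

By these figures the unified stylesheet is roughly one seventeenth the
size of the concatenated per-page styles.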
3.3 James Cordy’s Home Page
The home page of the second author's website uses one ROWSPAN in the
table layout