MapReduce Programming Model for .NET-based Cloud
Computing
Chao Jin and Rajkumar Buyya
Grid Computing and Distributed Systems (GRIDS) Laboratory
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
Email: {chaojin, raj}@csse.unimelb.edu.au
Abstract. Recently, many large-scale computer systems have been built to meet
the high storage and processing demands of compute- and data-intensive applications.
MapReduce is one of the most popular programming models designed to
support the development of such applications. It was initially created by Google
to simplify the development of large-scale web search applications in data
centers and has been proposed to form the basis of a ‘Data center computer’. This
paper presents a realization of MapReduce for .NET-based data centers, including
the programming model and the runtime system. The design and implementation
of MapReduce.NET are described and its performance evaluation is presented.
1 Introduction
Recently, several organizations have been building large-scale computer systems to meet
the increasing storage and processing requirements of compute- and data-intensive
applications. On the industry front, companies such as Google and its competitors
have constructed large-scale data centers to provide stable web search services
with fast response and high availability. On the academic front, many scientific research
projects increasingly rely on large-scale data sets and the powerful processing ability
provided by supercomputer systems, commonly referred to as e-Science [15].
These huge demands on data centers motivate the concept of Cloud Computing [9]
[12]. With clouds, IT-related capabilities can be provided as a service, which is accessible
through the Internet. Representative systems include Google App Engine, Amazon
Elastic Compute Cloud (EC2), Manjrasoft Aneka, and Microsoft Azure. The infrastructure
of Cloud Computing can automatically scale up to meet the requests of users. The
scalable deployment of applications is typically facilitated by Virtual Machine (VM)
technology.
With the increasing popularity of data centers, it is a challenge to provide a proper
programming model that supports convenient access to large-scale data
for performing computations while hiding all low-level details of the physical environment.
Among all the candidates, MapReduce is one of the most popular programming
models designed for this purpose. It was originally proposed by Google to handle large-scale
web search applications [8] and has proved to be an effective programming
model for developing data mining and machine learning applications in data centers.
In particular, it can improve the productivity of junior developers who lack the
required experience of distributed/parallel development. Therefore, it has been proposed
to form the basis of a ‘data center computer’ [5].
The .NET framework is the standard platform for Microsoft Windows applications
and it has been extended to support parallel computing applications. For example, the
parallel extension of .NET 4.0 supports the Task Parallel Library and Parallel LINQ,
while MPI.NET [6] implements a high-performance library for the Message Passing
Interface (MPI). Moreover, the Azure cloud service recently released by Microsoft
enables developers to create applications running in the cloud by using the .NET
Framework.
This paper presents a realization of MapReduce for the .NET platform, called
MapReduce.NET. It not only supports data-intensive applications, but also facilitates a much
wider variety of applications, even including some compute-intensive applications, such
as Genetic Algorithm (GA) applications. In this paper, we describe:
• MapReduce.NET: A MapReduce programming model designed for the .NET platform
  using the C# programming language.
• A runtime system of MapReduce.NET deployed in an Enterprise Cloud environment,
  called Aneka [12].
The remainder of this paper is organized as follows. Section 2 gives an overview
of MapReduce. Section 3 discusses related work. Section 4 presents the architecture
of MapReduce.NET, while Section 5 discusses the scheduling framework. Section 6
describes the performance evaluation of the system. Section 7 concludes.
2 MapReduce Overview
MapReduce is inspired by the map and reduce operations in functional languages,
such as Lisp. This model abstracts computation problems through two functions: map
and reduce. All problems formulated in this way can be parallelized automatically.
Essentially, the MapReduce model allows users to write map/reduce components
with functional-style code. These components are then composed as a dataflow graph to
explicitly specify their parallelism. Finally, the MapReduce runtime system schedules
these components to distributed resources for execution while handling many tough
problems: parallelization, network communication, and fault tolerance.
A map function takes a key/value pair as input and produces a list of key/value pairs
as output. The types of the output key and value can be different from those of the input:
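$$\mathit{map} :: (key_1, value_1) \Rightarrow \mathit{list}(key_2, value_2) \qquad (1)$$
A reduce function takes a key and its associated value list as input and generates a list
of new values as output:
$$\mathit{reduce} :: (key_2, \mathit{list}(value_2)) \Rightarrow \mathit{list}(value_3) \qquad (2)$$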
A MapReduce application is executed in a parallel manner through two phases. In
the first phase, all map operations can be executed independently of each other. In
the second phase, each reduce operation may depend on the outputs generated by any
number of map operations. All reduce operations can also be executed independently,
similar to map operations.
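As a minimal illustration of signatures (1) and (2), the following sketch expresses word count as a map function and a reduce function and runs the two phases sequentially on one machine. Python is used here only for brevity; the names `wc_map`, `wc_reduce` and `run_local` are hypothetical and are not part of MapReduce.NET.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def wc_map(key: str, value: str) -> List[Tuple[str, int]]:
    """map :: (key1, value1) -> list(key2, value2); key1 is a file name, value1 its text."""
    return [(word, 1) for word in value.split()]

def wc_reduce(key: str, values: Iterable[int]) -> List[int]:
    """reduce :: (key2, list(value2)) -> list(value3); here a single total per word."""
    return [sum(values)]

def run_local(inputs, map_fn, reduce_fn):
    # Phase 1: every map call is independent of the others.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Phase 2: every reduce call depends only on the values of its own key.
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

counts = run_local(
    [("a.txt", "to be or not to be"), ("b.txt", "be quick")],
    wc_map,
    wc_reduce,
)
```

Because neither phase shares state between calls, a runtime is free to distribute the map calls, and later the reduce calls, over many machines.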
3 Related Work
Since MapReduce was proposed by Google as a programming model for developing
distributed data-intensive applications in data centers, it has received much attention
from the computing industry and academia. Many projects are exploring ways to support
MapReduce on various types of distributed architecture and for a wider range of
applications. For instance, Hadoop [2] is an open source implementation of MapReduce
sponsored by Yahoo!. Phoenix [4] implemented the MapReduce model for the shared
memory architecture, while M. Kruijf and K. Sankaralingam implemented MapReduce
for the Cell B.E. architecture [11].
A team from the Yahoo! research group extended MapReduce by adding
a merge phase after reduce, called Map-Reduce-Merge [7], to perform join operations
on multiple related datasets. Dryad [10] supports an interface to compose a Directed
Acyclic Graph (DAG) for data-parallel applications, which can facilitate much more
complex components than MapReduce.
Other efforts focus on enabling MapReduce to support a wider range of applications.
MRPSO [1] utilizes the Hadoop implementation of MapReduce to parallelize a
compute-intensive application, called Particle Swarm Optimization. Researchers from
Intel are working on making MapReduce suitable for performing earthquake simulation,
image processing and general machine learning computations [14]. MRPGA [3]
is an extension of MapReduce for GA applications based on MapReduce.NET.
Data-Intensive Scalable Computing (DISC) [13] started to explore suitable programming
models for data-intensive computations by using MapReduce.
4 Architecture
MapReduce.NET resembles Google’s MapReduce, but with special emphasis on the
.NET and Windows platform. The design of MapReduce.NET aims to reuse as many
existing Windows components as possible. Fig. 1 illustrates the architecture of
MapReduce.NET. Its implementation is assisted by several component services from Aneka.
Aneka is a .NET-based platform for enterprise and public Cloud Computing [12]. It
supports the development and deployment of .NET-based Cloud applications in public
Cloud environments, such as Amazon EC2. We used Aneka to simplify the deployment
of MapReduce.NET in distributed environments. Each Aneka node consists of a
configurable container, hosting mandatory and optional services. The mandatory services
provide the basic capabilities required in a distributed system, such as communications
between Aneka nodes, security, and membership. Optional services can be installed to
support the implementation of different programming models in Cloud environments.
MapReduce.NET is implemented as an optional service of Aneka.
[Figure 1 shows the following components: applications (Machine Learning, Bioinformatics, Web Search); MapReduce.NET (Client, Scheduler, Executor); WinDFS (Distributed Store System) and CIFS/NTFS; Basic Distributed Services of Aneka (Membership, Failure Detector, Configuration); Windows machines.]
Fig. 1: Architecture of MapReduce.NET.
Besides Aneka, WinDFS provides a distributed storage service over the .NET platform.
WinDFS organizes the disk spaces on all the available resources as a virtual storage
pool and provides an object-based interface with a flat name space, which is used
to manage data stored in it. To process local files, MapReduce.NET can also directly
communicate with CIFS or NTFS. The remainder of this section presents details on the
programming model and runtime system.
Table 1: APIs of MapReduce.NET
    class Mapper
    {
        void Map(MapInput<K, V> input)
    }
    class Reducer
    {
        void Reduce(IReduceEnumerator input)
    }
4.1 MapReduce.NET APIs
The implementation of MapReduce.NET exposes APIs similar to Google MapReduce.
Table 1 illustrates the interface presented to users in the C# language. To define map/reduce
functions, users need to inherit from the Mapper or Reducer class and override the
corresponding abstract functions. To execute the MapReduce application, the user first needs to
create a MapReduceApp class and set it with the corresponding Mapper and Reducer
classes. Then, input files should be configured before starting the execution; they
can be local files or files in the distributed store.
The type of the input key and value to the Map function is object, which is the root
type of all types in C#. For the reduce function, the input is organized as a collection and
the data type is IEnumerator, which is an interface for supporting an iterative operation
on the collection. The data type of each value in the collection is also object.
With object, any type of data, including user-defined or system built-in types, can
be accepted as input. However, for user-defined types, users need to provide serialization
and deserialization methods. Otherwise, the default serialization and deserialization
methods will be invoked.
4.2 Runtime System
The execution of a MapReduce.NET application consists of 4 major phases: Map, Sort,
Merge and Reduce. The overall flow of execution is illustrated in Fig. 2. The execution
starts with the Map phase. It iterates over the input key/value pairs and invokes the map
function defined by users on each pair. The generated results are passed to the Sort and
Merge phases, which perform sorting and merging operations to group the values with
identical keys. The result is an array, each element of which is a group of values for one
key. Finally, the Reduce phase takes the array as input and invokes the reduce function
defined by users on each element of the array.
[Figure 2 shows the flow Input, Map (Mapper instances), Sort, Merge, Reduce (Reducer instances), Result.]
Fig. 2: Computation of MapReduce.NET.
The runtime system is based on the master-slave architecture, with the execution
of MapReduce.NET orchestrated by a scheduler. The scheduler is implemented as a
MapReduce.NET Scheduler service in Aneka, while all the 4 major phases are implemented
as a MapReduce.NET Executor service. With Aneka, the MapReduce.NET
system can be deployed in cluster or data center environments. Typically, it consists of
one master machine for the scheduler service and multiple worker machines for executor
services.
The 4 major phases are grouped into two tasks: Map task and Reduce task. The Map
task executes the first 2 phases: map and sort, while the Reduce task executes the last 2
phases: merge and reduce. The input data for the map function is split into m even-sized
pieces to be processed by m map tasks, which are evenly assigned to worker computers.
The intermediate results generated by map tasks are partitioned into r fragments, and
each fragment is processed by one reduce task.
[Figure 3: within the MapReduce.NET Executor, a Map task (invocation of Mapper instances, cache, sort and partition, in memory) reads input files from storage and writes intermediate files; a Reduce task (cache, merge and group, invocation of Reducer instances, in memory) reads intermediate files and writes output files.]
Fig. 3: Dataflow of MapReduce.NET.
The major phases on the MapReduce.NET executor are illustrated in Fig. 3.
Map Phase. The executor extracts each input key/value pair from the input file. For
each key/value pair, it invokes the map function defined by users. The result generated
by the map function is first buffered in memory. The memory buffer consists of
many buckets, each for a different partition. The partition of a generated result is
determined through a hash function, which may be defined by users. Then the result is
appended to the tail of the bucket of its partition. When the size of all the results buffered
in memory reaches a predefined maximal threshold, they are sent to the Sort phase
and then written to disk. This frees space for holding intermediate results for the
next round of map invocations.
Sort Phase. When the size of buffered results exceeds the maximal threshold, each
bucket is written to disk as an intermediate file. Before the buffered results are written
to disk, elements in each bucket are sorted in memory. They are written to disk in
sorted order, either ascending or descending. The sorting algorithm adopted is quick
sort.
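The buffering and spilling behaviour of the Map and Sort phases can be sketched as follows. This is an illustrative Python model under stated assumptions, not the MapReduce.NET code: the class name `MapBuffer` is hypothetical, the threshold counts pairs instead of bytes, sorted runs are kept in lists instead of being written to disk, and Python's built-in sort stands in for quick sort.

```python
import zlib
from typing import Dict, List, Tuple

Pair = Tuple[str, int]

class MapBuffer:
    """In-memory buckets, one per partition, spilled as sorted runs when full."""

    def __init__(self, num_partitions: int, max_pairs: int):
        self.num_partitions = num_partitions
        self.max_pairs = max_pairs  # stand-in for the byte-size threshold
        self.buckets: List[List[Pair]] = [[] for _ in range(num_partitions)]
        self.buffered = 0
        # spills[p] holds the sorted runs ("intermediate files") of partition p
        self.spills: Dict[int, List[List[Pair]]] = {p: [] for p in range(num_partitions)}

    def partition(self, key: str) -> int:
        # Deterministic hash; a user-defined function could replace this.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def add(self, key: str, value: int) -> None:
        # Append the map result to the tail of its partition's bucket.
        self.buckets[self.partition(key)].append((key, value))
        self.buffered += 1
        if self.buffered >= self.max_pairs:
            self.spill()

    def spill(self) -> None:
        # Sort each non-empty bucket by key, then record it as one run.
        for p, bucket in enumerate(self.buckets):
            if bucket:
                bucket.sort(key=lambda kv: kv[0])  # built-in sort, not quick sort
                self.spills[p].append(bucket)
        self.buckets = [[] for _ in range(self.num_partitions)]
        self.buffered = 0

buf = MapBuffer(num_partitions=2, max_pairs=4)
for word in "to be or not to be".split():
    buf.add(word, 1)
buf.spill()  # flush what remains after the last map call
```

Every pair with the same key lands in the same partition, so a single Reduce task later sees all values for that key.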
Merge Phase. To prepare inputs for the Reduce phase, we need to merge all the
intermediate files for each partition. First, the executor fetches intermediate files which
were generated in the Map phase from neighboring machines. Then, they are merged to
group values with the same key. Since all the key/value pairs in the intermediate files
are already in sorted order, we deploy a heap sort to achieve the group operation.
Each node in the heap corresponds to one intermediate file. Repeatedly, the key/value
pair on the top node is picked, and simultaneously the values associated with the same key
are grouped.
Reduce Phase. In our implementation, the Reduce phase is combined with the
Merge phase. During the process of heap sort, we combine all the values associated
with the same key and then invoke the reduce function defined by users to perform the
reduction operation on these values. All the results generated by the reduce function are
written to disk in the order in which they are generated.
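The combined Merge and Reduce phases can be approximated by a k-way merge over sorted runs using a binary heap, with one heap entry per intermediate file. The sketch below is a Python illustration under that assumption; the function name `merge_and_reduce` is hypothetical and the runs are in-memory lists, not files.

```python
import heapq
from typing import Callable, Iterator, List, Tuple

Pair = Tuple[str, int]

def merge_and_reduce(
    runs: List[List[Pair]],
    reduce_fn: Callable[[str, List[int]], List[int]],
) -> Iterator[Tuple[str, List[int]]]:
    # One heap entry per run: (current key, run index, position in run).
    heap = [(run[0][0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    current_key, values = None, []
    while heap:
        key, i, pos = heapq.heappop(heap)
        if key != current_key:
            # A new key starts: the previous group is complete, so reduce it.
            if current_key is not None:
                yield current_key, reduce_fn(current_key, values)
            current_key, values = key, []
        values.append(runs[i][pos][1])
        # Advance within the run the pair came from.
        if pos + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos + 1][0], i, pos + 1))
    if current_key is not None:
        yield current_key, reduce_fn(current_key, values)

runs = [[("be", 1), ("to", 1)], [("be", 1), ("or", 1)], [("not", 1), ("to", 1)]]
result = list(merge_and_reduce(runs, lambda k, vs: [sum(vs)]))
```

Reducing each group as soon as its key changes means that only one group of values is held in memory at a time, and the results come out in key order.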
4.3 Memory Management
Managing memory efficiently is critical for the performance of applications. On each
executor, the memory consumed by MapReduce.NET mainly includes memory buffers
for intermediate results, memory space for the sorting algorithm and buffers for input
and output files. The memory management is illustrated in Fig. 3.
The system administrator can specify a maximal value for the size of memory used
by MapReduce.NET. This size is normally determined by the physical configuration of
machines and the memory requirement of applications.
According to this maximal memory configuration, we set the memory buffer used
by intermediate results and input/output files. The default value for the read/write buffer of
each file is 16MB. The input and output files are from the local disk. Therefore, we use
the FileStream class to control the access to local files.
The memory buffer for intermediate results is implemented by using the MemoryStream
class, which is a stream in memory. All the results generated by map functions
are serialized and then appended to the tail of the stream in memory.
5 Scheduling Framework
This section describes the scheduling model for coordinating multiple resources to
execute MapReduce computations. The scheduling is managed by the MapReduce.NET
scheduler. After users submit MapReduce.NET applications to the scheduler, it maps
Map and Reduce tasks to different resources. During the execution, it monitors the
progress of each task and migrates tasks when some nodes are much slower than others
due to their heterogeneity or interference from dominating users.
Typically, a MapReduce.NET job consists of m Map tasks and r Reduce tasks. Each
Map task has an input file and generates r result files. Each Reduce task has m input
files which are generated by m Map tasks.
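The relationship between the m Map tasks and r Reduce tasks implies m × r intermediate files in total. A small Python sketch makes the bookkeeping explicit; the naming scheme `map-<i>-part-<j>` is purely hypothetical and is not taken from MapReduce.NET.

```python
def intermediate_files(m: int, r: int):
    """Return (outputs per Map task, inputs per Reduce task) as lists of file names."""
    # Hypothetical naming: map-<i>-part-<j> is written by Map task i for Reduce task j.
    map_outputs = {i: [f"map-{i}-part-{j}" for j in range(r)] for i in range(m)}
    reduce_inputs = {j: [f"map-{i}-part-{j}" for i in range(m)] for j in range(r)}
    return map_outputs, reduce_inputs

map_outputs, reduce_inputs = intermediate_files(m=3, r=2)
```

With m = 3 and r = 2, each Map task writes 2 files and each Reduce task reads 3, for 6 intermediate files overall.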
Normally the input files for Map tasks are available in WinDFS or CIFS prior to job
execution, thus the size of each Map input file can be determined before scheduling.
However, the output files are dynamically generated by Map tasks during execution,
hence the size of these output files is difficult to determine prior to job execution.
The system aims to be deployed in an Enterprise Cloud environment, which essentially
organizes idle resources within a company or department as a virtual super
computer. Normally, resources in Enterprise Clouds are shared by the owner of the
resources and the users of idle resources. The latter should not disturb the normal
usage of the resource owner. Therefore, with an Enterprise Cloud, besides facing the
traditional problems of distributed systems, such as complex communications and failures,
we have to face soft failure. Soft failure refers to a resource involved in MapReduce
execution having to quit computation due to domination by its owner.
Due to the above dynamic features of MapReduce.NET applications and Enterprise
Cloud environments, we did not choose a static scheduling algorithm. Instead,
the basic scheduling framework works like the work-stealing model. Whenever
a worker node is idle, a new Map or Reduce task is assigned to it for execution, with
special priority on taking advantage of data locality.
The scheduling algorithm starts with dispatching Map tasks as independent tasks.
The Reduce tasks, however, are dependent on the Map tasks. Whenever a Reduce task
is ready (i.e. all its inputs are generated by Map tasks), it will be scheduled according to
the status of resources. The scheduling algorithm aims to optimize the execution time,
which is achieved by minimizing the execution time of Map and Reduce tasks respectively.
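The pull-based assignment described in this section can be sketched as follows: an idle worker asks for work, a ready task with local input is preferred, and a Reduce task becomes ready only once all of its Map tasks have completed. This is a simplified Python model with hypothetical names (`Task`, `Scheduler`, `next_task`); it leaves out task migration and soft failures, and it is not the MapReduce.NET scheduler itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Task:
    name: str
    kind: str  # "map" or "reduce"
    input_hosts: Set[str] = field(default_factory=set)  # nodes holding the input
    depends_on: Set[str] = field(default_factory=set)   # Map tasks a Reduce task waits for

class Scheduler:
    def __init__(self, tasks: List[Task]):
        self.pending = list(tasks)
        self.done: Set[str] = set()

    def next_task(self, idle_node: str) -> Optional[Task]:
        """Called when a worker becomes idle; returns a ready task or None."""
        ready = [t for t in self.pending if t.depends_on <= self.done]
        if not ready:
            return None
        # Prefer a task whose input already resides on the idle node.
        local = [t for t in ready if idle_node in t.input_hosts]
        chosen = (local or ready)[0]
        self.pending.remove(chosen)
        return chosen

    def complete(self, task: Task) -> None:
        self.done.add(task.name)

tasks = [
    Task("m0", "map", {"nodeA"}),
    Task("m1", "map", {"nodeB"}),
    Task("r0", "reduce", depends_on={"m0", "m1"}),
]
s = Scheduler(tasks)
first = s.next_task("nodeB")    # m1: its input is local to nodeB
second = s.next_task("nodeB")   # m0: no local task left, take any ready one
blocked = s.next_task("nodeA")  # None: r0 still waits for m0 and m1
s.complete(first)
s.complete(second)
last = s.next_task("nodeA")     # r0 is now ready
```

Because workers pull tasks only when idle, faster nodes naturally take on more work, which suits the heterogeneous, shared machines of an Enterprise Cloud.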
6 Performance Evaluation
We have implemented the programming model and runtime system of MapReduce.NET
and deployed it on desktop machines of several student laboratories at Melbourne
University. This section evaluates its performance based on two benchmark applications:
Word Count (WC) and Distributed Sort (DS).
All the experiments were executed in an Enterprise Cloud consisting of 33 machines
located in 3 student laboratories. For distributed experiments, one machine was set as
master and the rest were configured as worker machines. Each machine has a single
Pentium 4 processor, 1GB memory, a 160GB hard disk (10GB is dedicated for WinDFS
storage), 1Gbps Ethernet network and runs Windows XP.
6.1 Sample Applications
The sample applications (WC and DS) are benchmarks used by Google MapReduce
and Hadoop. To implement the WC application, users just need to split words for each
text file in the map function and sum the number of appearances of each word in the
reduce function. For the DS application, users do not have to do anything within the
map and reduce functions, while MapReduce.NET performs sorting automatically.
The rest of this section presents the overhead of MapReduce.NET. First, we show
the overhead caused by the MapReduce programming model in a local execution. Then
the overhead of MapReduce.NET in a distributed environment is reported.
6.2 System Overhead
MapReduce can be regarded as a parallel design pattern, which trades performance to
improve the simplicity of programming. Essentially, the Sort and Merge phases of the
MapReduce runtime system introduce extra overhead. However, the sacrificed performance
cannot be overwhelming. Otherwise, it would not be acceptable. We evaluate the
overhead of MapReduce.NET with local execution. The input files are located on the
local disk and all 4 major phases of MapReduce.NET execute sequentially on a single
machine. This is called a local runner and can be used for debugging purposes.
For local execution, both sample applications were configured as follows:
• The WC application processes the example text files used by Phoenix [1] and the
  size of raw data is 1GB.
• The DS application sorts a number of records consisting of a key and a value, both
  of which are random integers. The input data includes 1,000 million records with
  1.48GB raw data.
The execution time is split into 3 parts: Map, Sort and Merge+Reduce. They correspond
to the time consumed by reading inputs and invoking map functions, the time
consumed by the sort phase (including writing intermediate results to disk) and the time
consumed by the Reduce tasks. In this section, we analyze the impact of the buffer size for
intermediate results on the execution time of applications. In particular, the experiments
were executed with different sizes of memory buffer for intermediate results. The size
of the memory buffer containing intermediate results was set to 128MB, 256MB and
512MB respectively and the results for both applications are shown in Fig. 4.
[Figure 4 contains two charts, "Cache Impacts on Word Count" and "Cache Impacts on Distributed Sort". In both, the x-axis is Cache Size (MB) with values 128, 256 and 512, the y-axis is Execution Time (Sec.), and the series are Map, Sort and Merge + Reduce.]
Fig. 4: Cache Impacts of MapReduce.NET.
First, we can see that different types of application have a different percentage
distribution for each part. For the WC application, the time consumed by the reduce and
merge phases can even be ignored. The reason is that the size of the results of WC is
comparatively small. In contrast, the reduce and merge phases of the DS application
incur a much larger percentage of the total time consumed.
Second, contrary to our expectation, increasing the size of the buffer for intermediate
results may not reduce the execution time for both applications. On the contrary, a larger
buffer increases the time consumed by sorting because sorting more intermediate results at
one time needs a deeper stack and more resources. A larger memory buffer generates
fewer intermediate files, but each is characterized by a larger size. The read/write buffer
of input/output files is configured per file, and the default value is 16MB. Therefore,
with a larger buffer for intermediate results in the Map and Sort phases, the Reduce
phase takes longer because the overall size of the input file buffers is smaller.
However, a larger memory buffer does not have a significant impact on the Map phase.
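One possible reading of this explanation can be checked with simple arithmetic. The Python sketch below assumes, purely for illustration, that the DS intermediate data is as large as its 1.48GB input, that there is a single partition, and that every full buffer produces exactly one intermediate file; none of these assumptions is stated in the measurements above, and the function name is hypothetical.

```python
import math

FILE_BUFFER_MB = 16  # default read/write buffer per file, as stated in the text

def reduce_side_read_buffer(intermediate_mb: float, sort_buffer_mb: int):
    """Return (number of intermediate files, total read buffer in MB)."""
    files = math.ceil(intermediate_mb / sort_buffer_mb)
    return files, files * FILE_BUFFER_MB

ds_mb = 1.48 * 1024  # assumption: intermediate data equals the 1.48GB DS input
table = {b: reduce_side_read_buffer(ds_mb, b) for b in (128, 256, 512)}
```

Under these assumptions, a 128MB buffer yields 12 files read through 192MB of file buffers in total, while a 512MB buffer yields 3 files read through only 48MB, which is consistent with the slower Reduce phase reported for larger buffers.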
6.3 Overhead Comparison with Hadoop
This section compares the overhead of MapReduce.NET with Hadoop, an open source
implementation of MapReduce in Java. Hadoop is supported by Yahoo! and aims to
be a general purpose distributed platform. We use the latest stable release of Hadoop
(version 0.18.3).
[Figure 5 contains two charts, "Performance Comparison of Word Count" and "Performance Comparison of Distributed Sort". Each compares Hadoop and MapReduce.NET; the y-axis is Execution Time (Sec.), and the series are Map, Sort and Merge + Reduce.]
Fig. 5: Overhead Comparison of Hadoop and MapReduce.NET.
To compare the overhead, we ran the local runner of Hadoop and MapReduce.NET
respectively with the same input size for both applications. The size of the buffer for
intermediate results was configured to be 128MB for both implementations. The configuration
of the WC and DS applications is the same as in Section 6.2. The JVM adopted in our
experiment is Sun JRE 1.6.0, while the version of the .NET framework is 2.0. The results
are shown in Fig. 5. MapReduce.NET performs better than Hadoop for both applications.
Specifically, both the Map and Merge+Reduce phases of MapReduce.NET consume
less time than those of Hadoop, but its Sort phase consumes more time.
The reasons for this are: (a) the deserialization and serialization operations performed
by MapReduce.NET are more efficient than those of Hadoop; (b) the Merge phase of Hadoop
involves more IO operations than that of MapReduce.NET. In particular, for the Map phase,
the major overhead of both applications consists of invoking the deserialization of raw
input data and the map functions, combined with disk read operations. According to our
experiments, however, we did not find a significant performance difference in disk IO
operations between JRE 1.6.0 and .NET 2.0 on Windows XP. In the Merge+Reduce
phase, the major overhead includes serialization, deserialization and reading and writing
disk. Hadoop splits the large input files into a number of small pieces (32 pieces for WC
and 49 pieces for DS) and each piece corresponds to a Map task. Then, Hadoop first has
to merge all the intermediate results for the same partition from multiple Map tasks prior
to starting the combined Merge and Reduce phase. MapReduce.NET does not incur
this extra overhead. Therefore it performs better than Hadoop in the Merge+Reduce
phase.
In the Sort phase, the sorting algorithm implemented by Hadoop is more efficient
than its corresponding implementation in MapReduce.NET. Both MapReduce.NET and
Hadoop implement the same sorting algorithm, hence identifying the source of the
difference in performance between the two implementations requires a deep investigation
involving the internals of the two virtual machines.
6.4 System Scalability
In this section, we evaluate the scalable performance of MapReduce.NET in a
distributed environment. Applications were configured as follows: