tions in the HTML that browsers generally ignore. You can download the
BeautifulSoup code from www.crummy.com.
port: A number that generally indicates which application you are contacting
when you make a socket connection to a server. As an example, web traffic
usually uses port 80 while e-mail traffic uses port 25.
scrape: When a program pretends to be a web browser and retrieves a web page
and then looks at the web page content. Often programs are following the
links in one page to find the next page so they can traverse a network of
pages or a social network.
socket: A network connection between two applications where the applications
can send and receive data in either direction.
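To make the "either direction" part of the socket definition concrete, here is a minimal sketch in Python 3. It assumes nothing about a real server: socket.socketpair() simply hands back two sockets that are already connected to each other, and the names left and right are made up for this illustration.

```python
import socket

# socketpair() returns two already-connected sockets. They stand in here
# for the two applications at either end of a network connection.
left, right = socket.socketpair()

# Data flows from left to right...
left.sendall(b'hello from left')
from_left = right.recv(1024)

# ...and, over the same connection, from right to left.
right.sendall(b'hello from right')
from_right = left.recv(1024)

left.close()
right.close()

print(from_left)
print(from_right)
```

A real program would normally create one socket and connect it to a host and port on another computer, but the send and receive calls look the same.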
spider: The act of a web search engine retrieving a page and then all the pages
linked from a page and so on until they have nearly all of the pages on the
Internet, which they use to build their search index.
12.9 Exercises
Exercise 12.1 Change the socket program socket1.py to prompt the user for the
URL so it can read any web page. You can use split('/') to break the URL into
its component parts so you can extract the host name for the socket connect
call. Add error checking using try and except to handle the condition where
the user enters an improperly formatted or non-existent URL.
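As a hint for the split('/') step, the sketch below shows one way the host name might be pulled out of a URL, written for Python 3. The function name host_from_url and the example.com address are made up for illustration, and the check for 'http:' is an assumption based on the socket program talking to port 80. It only covers badly formatted input; a host that does not exist would still have to be caught around the connect call.

```python
def host_from_url(url):
    """Return the host name from a URL like http://host/path, or None."""
    try:
        parts = url.split('/')
        # For 'http://host/path', parts is ['http:', '', 'host', 'path'],
        # so the host name sits at index 2.
        if parts[0] != 'http:' or parts[1] != '':
            return None
        host = parts[2]
    except (AttributeError, IndexError):
        # Not a string at all, or too few '/'-separated pieces.
        return None
    # An empty host (as in 'http://') is treated as improperly formatted.
    return host or None


print(host_from_url('http://www.example.com/page.html'))
print(host_from_url('not a url'))
```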
Exercise 12.2 Change your socket program so that it counts the number of
characters it has received and stops displaying any text after it has shown
3000 characters. The program should retrieve the entire document and count
the total number of characters and display the count of the number of
characters at the end of the document.
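The counting logic in this exercise can be tried out without a network connection. The sketch below, in Python 3, assumes the received data arrives as a sequence of already-decoded strings; the name limit_text and the stand-in data are invented for this illustration. In the real program each chunk would come from a call to recv.

```python
def limit_text(chunks, limit=3000):
    """Return (text to display, total characters seen) for the chunks."""
    shown = ''
    total = 0
    for chunk in chunks:
        # Keep adding text only until the display limit is reached.
        if len(shown) < limit:
            shown += chunk[:limit - len(shown)]
        # Count every character, whether or not it will be displayed.
        total += len(chunk)
    return shown, total


# Stand-in data: ten chunks of 512 characters each.
fake_chunks = ['a' * 512] * 10
shown, total = limit_text(fake_chunks)
print(shown)
print('Displayed:', len(shown), 'Total characters:', total)
```

Note that the loop keeps reading after the limit is reached, since the exercise asks for the count of the whole document.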
Exercise 12.3 Use urllib to replicate the previous exercise of (1) retrieving
the document from a URL, (2) displaying up to 3000 characters, and (3)
counting the overall number of characters in the document. Don't worry about
the headers for this exercise, simply show the first 3000 characters of the
document contents.
Exercise 12.4 Change the urllinks.py program to extract and count paragraph
(p) tags from the retrieved HTML document and display the count of the
paragraphs as the output of your program. Do not display the paragraph text,
only count them. Test your program on several small web pages as well as some
larger web pages.
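To see what counting paragraph tags looks like, here is a small sketch in Python 3. It deliberately swaps in the standard library's html.parser module rather than BeautifulSoup, so that it runs without installing anything, and it parses a made-up HTML string instead of retrieving a page. The class name ParagraphCounter and the sample text are assumptions for this illustration.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count the opening <p> tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case, so <P> matches too.
        if tag == 'p':
            self.count += 1


# Stand-in for HTML that would be retrieved from a web page.
sample = ('<html><body><p>One</p><p class="x">Two</p>'
          '<P>Three</P><a href="#">link</a></body></html>')

counter = ParagraphCounter()
counter.feed(sample)
print('Paragraphs:', counter.count)
```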
Exercise 12.5 (Advanced) Change the socket program so that it only shows data
after the headers and a blank line have been received. Remember that recv is
receiving characters (newlines and all), not lines.
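Because recv hands back whatever characters have arrived, the blank line that ends the headers can be split across two calls. The sketch below, in Python 3, shows one possible way to cope with that by accumulating data until the blank line appears. It assumes the headers end with the usual carriage return and newline pairs (b'\r\n\r\n'); the name body_chunks and the stand-in data are made up for this illustration.

```python
def body_chunks(chunks):
    """Yield only the data that follows the blank line ending the headers."""
    buffer = b''
    in_body = False
    for chunk in chunks:
        if in_body:
            # Headers are already behind us, so pass the data straight on.
            yield chunk
            continue
        # Still in the headers: accumulate, since the blank line may be
        # split across two or more chunks.
        buffer += chunk
        pos = buffer.find(b'\r\n\r\n')
        if pos != -1:
            in_body = True
            rest = buffer[pos + 4:]
            if rest:
                yield rest


# Stand-in for successive recv() results. The blank line is split
# between the first and second chunk on purpose.
fake = [b'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n',
        b'\r\nBut soft',
        b' what light']

body = b''.join(body_chunks(fake))
print(body.decode())
```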