Webbots, Spiders, and Screen Scrapers : (Record no. 59208)

MARC details
000 -LEADER
fixed length control field 11268nam a22005173i 4500
001 - CONTROL NUMBER
control field EBC3017639
003 - CONTROL NUMBER IDENTIFIER
control field MiAaPQ
005 - DATE AND TIME OF LATEST TRANSACTION
control field 20240729124046.0
006 - FIXED-LENGTH DATA ELEMENTS--ADDITIONAL MATERIAL CHARACTERISTICS
fixed length control field m o d |
007 - PHYSICAL DESCRIPTION FIXED FIELD--GENERAL INFORMATION
fixed length control field cr cnu||||||||
008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION
fixed length control field 240724s2012 xx o ||||0 eng d
020 ## - INTERNATIONAL STANDARD BOOK NUMBER
International Standard Book Number 9781593274320
Qualifying information (electronic bk.)
020 ## - INTERNATIONAL STANDARD BOOK NUMBER
Canceled/invalid ISBN 9781593273972
035 ## - SYSTEM CONTROL NUMBER
System control number (MiAaPQ)EBC3017639
035 ## - SYSTEM CONTROL NUMBER
System control number (Au-PeEL)EBL3017639
035 ## - SYSTEM CONTROL NUMBER
System control number (CaPaEBR)ebr10574793
035 ## - SYSTEM CONTROL NUMBER
System control number (OCoLC)795714370
040 ## - CATALOGING SOURCE
Original cataloging agency MiAaPQ
Language of cataloging eng
Description conventions rda
-- pn
Transcribing agency MiAaPQ
Modifying agency MiAaPQ
050 #4 - LIBRARY OF CONGRESS CALL NUMBER
Classification number TK5105.884 -- .S37 2012eb
082 0# - DEWEY DECIMAL CLASSIFICATION NUMBER
Classification number 025.04
100 1# - MAIN ENTRY--PERSONAL NAME
Personal name Schrenk, Michael.
245 10 - TITLE STATEMENT
Title Webbots, Spiders, and Screen Scrapers :
Remainder of title A Guide to Developing Internet Agents with PHP/CURL.
250 ## - EDITION STATEMENT
Edition statement 2nd ed.
264 #1 - PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
Place of production, publication, distribution, manufacture San Francisco :
Name of producer, publisher, distributor, manufacturer No Starch Press, Incorporated,
Date of production, publication, distribution, manufacture, or copyright notice 2012.
264 #4 - PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
Date of production, publication, distribution, manufacture, or copyright notice ©2012.
300 ## - PHYSICAL DESCRIPTION
Extent 1 online resource (394 pages)
336 ## - CONTENT TYPE
Content type term text
Content type code txt
Source rdacontent
337 ## - MEDIA TYPE
Media type term computer
Media type code c
Source rdamedia
338 ## - CARRIER TYPE
Carrier type term online resource
Carrier type code cr
Source rdacarrier
505 0# - FORMATTED CONTENTS NOTE
Formatted contents note Intro -- Brief Contents -- Contents In Detail -- Introduction -- Old-School Client-Server Technology -- The Problem with Browsers -- What to Expect from This Book -- Learn from My Mistakes -- Master Webbot Techniques -- Leverage Existing Scripts -- About the Website -- About the Code -- Requirements -- Hardware -- Software -- Internet Access -- A Disclaimer (This Is Important) -- PART I: Fundamental Concepts and Techniques -- 1: What's in It for You? -- Uncovering the Internet's True Potential -- What's in It for Developers? -- Webbot Developers Are in Demand -- Webbots Are Fun to Write -- Webbots Facilitate "Constructive Hacking" -- What's in It for Business Leaders? -- Customize the Internet for Your Business -- Capitalize on the Public's Inexperience with Webbots -- Accomplish a Lot with a Small Investment -- Final Thoughts -- 2: Ideas for Webbot Projects -- Inspiration from Browser Limitations -- Webbots That Aggregate and Filter Information for Relevance -- Webbots That Interpret What They Find Online -- Webbots That Act on Your Behalf -- Figure 2-3: An example pokerbot -- A Few Crazy Ideas to Get You Started -- Help Out a Busy Executive -- Save Money by Automating Tasks -- Protect Intellectual Property -- Monitor Opportunities -- Verify Access Rights on a Website -- Create an Online Clipping Service -- Plot Unauthorized Wi-Fi Networks -- Track Web Technologies -- Allow Incompatible Systems to Communicate -- Final Thoughts -- 3: Downloading Web Pages -- Think About Files, Not Web Pages -- Downloading Files with PHP's Built-in Functions -- Downloading Files with fopen() and fgets() -- Downloading Files with file() -- Introducing PHP/CURL -- Multiple Transfer Protocols -- Form Submission -- Basic Authentication -- Cookies -- Redirection -- Agent Name Spoofing -- Referer Management -- Socket Management -- Installing PHP/CURL -- LIB_http.
505 8# - FORMATTED CONTENTS NOTE
Formatted contents note Familiarizing Yourself with the Default Values -- Using LIB_http -- Learning More About HTTP Headers -- Examining LIB_http's Source Code -- Final Thoughts -- 4: Basic Parsing Techniques -- Content Is Mixed with Markup -- Parsing Poorly Written HTML -- Standard Parse Routines -- Using LIB_parse -- Splitting a String at a Delimiter: split_string() -- Parsing Text Between Delimiters: return_between() -- Parsing a Data Set into an Array: parse_array() -- Parsing Attribute Values: get_attribute() -- Removing Unwanted Text: remove() -- Useful PHP Functions -- Detecting Whether a String Is Within Another String -- Replacing a Portion of a String with Another String -- Parsing Unformatted Text -- Measuring the Similarity of Strings -- Final Thoughts -- Don't Trust a Poorly Coded Web Page -- Parse in Small Steps -- Don't Render Parsed Text While Debugging -- Use Regular Expressions Sparingly -- 5: Advanced Parsing with Regular Expressions -- Pattern Matching, the Key to Regular Expressions -- PHP Regular Expression Types -- PHP Regular Expressions Functions -- Resemblance to PHP Built-In Functions -- Learning Patterns Through Examples -- Parsing Numbers -- Detecting a Series of Characters -- Matching Alpha Characters -- Matching on Wildcards -- Specifying Alternate Matches -- Regular Expressions Groupings and Ranges -- Regular Expressions of Particular Interest to Webbot Developers -- Parsing Phone Numbers -- Where to Go from Here -- When Regular Expressions Are (or Aren't) the Right Parsing Tool -- Strengths of Regular Expressions -- Disadvantages of Pattern Matching While Parsing Web Pages -- Which Are Faster: Regular Expressions or PHP's Built-In Functions? -- Final Thoughts -- 6: Automating Form Submission -- Reverse Engineering Form Interfaces -- Form Handlers, Data Fields, Methods, and Event Triggers -- Form Handlers -- Data Fields -- Methods.
505 8# - FORMATTED CONTENTS NOTE
Formatted contents note Multipart Encoding -- Event Triggers -- Unpredictable Forms -- JavaScript Can Change a Form Just Before Submission -- Form HTML Is Often Unreadable by Humans -- Cookies Aren't Included in the Form, but Can Affect Operation -- Analyzing a Form -- Final Thoughts -- Don't Blow Your Cover -- Correctly Emulate Browsers -- Avoid Form Errors -- 7: Managing Large Amounts of Data -- Organizing Data -- Naming Conventions -- Storing Data in Structured Files -- Storing Text in a Database -- Storing Images in a Database -- Database or File? -- Making Data Smaller -- Storing References to Image Files -- Compressing Data -- Removing Formatting -- Thumbnailing Images -- Final Thoughts -- PART II: Projects -- 8: Price-Monitoring Webbots -- The Target -- Designing the Parsing Script -- Initialization and Downloading the Target -- Further Exploration -- 9: Image-Capturing Webbots -- Example Image-Capturing Webbot -- Creating the Image-Capturing Webbot -- Binary-Safe Download Routine -- Directory Structure -- The Main Script -- Further Exploration -- Final Thoughts -- 10: Link-Verification Webbots -- Creating the Link-Verification Webbot -- Initializing the Webbot and Downloading the Target -- Setting the Page Base -- Parsing the Links -- Running a Verification Loop -- Generating Fully Resolved URLs -- Downloading the Linked Page -- Displaying the Page Status -- Running the Webbot -- LIB_http_codes -- LIB_resolve_addresses -- Further Exploration -- 11: Search-Ranking Webbots -- Description of a Search Result Page -- What the Search-Ranking Webbot Does -- Running the Search-Ranking Webbot -- How the Search-Ranking Webbot Works -- The Search-Ranking Webbot Script -- Initializing Variables -- Starting the Loop -- Fetching the Search Results -- Parsing the Search Results -- Final Thoughts -- Be Kind to Your Sources -- Search Sites May Treat Webbots Differently Than Browsers.
505 8# - FORMATTED CONTENTS NOTE
Formatted contents note Spidering Search Engines Is a Bad Idea -- Familiarize Yourself with the Google API -- Further Exploration -- 12: Aggregation Webbots -- Choosing Data Sources for Webbots -- Example Aggregation Webbot -- Familiarizing Yourself with RSS Feeds -- Writing the Aggregation Webbot -- Adding Filtering to Your Aggregation Webbot -- Further Exploration -- 13: FTP Webbots -- Example FTP Webbot -- PHP and FTP -- Further Exploration -- 14: Webbots That Read Email -- The POP3 Protocol -- Logging into a POP3 Mail Server -- Reading Mail from a POP3 Mail Server -- Executing POP3 Commands with a Webbot -- Further Exploration -- Email-Controlled Webbots -- Email Interfaces -- 15: Webbots That Send Email -- Email, Webbots, and Spam -- Sending Mail with SMTP and PHP -- Configuring PHP to Send Mail -- Sending an Email with mail() -- Writing a Webbot That Sends Email Notifications -- Keeping Legitimate Mail out of Spam Filters -- Sending HTML-Formatted Email -- Further Exploration -- Using Returned Emails to Prune Access Lists -- Using Email as Notification That Your Webbot Ran -- Leveraging Wireless Technologies -- Writing Webbots That Send Text Messages -- 16: Converting a Website into a Function -- Writing a Function Interface -- Defining the Interface -- Analyzing the Target Web Page -- Using describe_zipcode() -- Final Thoughts -- Distributing Resources -- Using Standard Interfaces -- Designing a Custom Lightweight "Web Service" -- PART III: Advanced Technical Considerations -- 17: Spiders -- How Spiders Work -- Example Spider -- LIB_simple_spider -- harvest_links() -- archive_links() -- get_domain() -- exclude_link() -- Experimenting with the Spider -- Adding the Payload -- Further Exploration -- Save Links in a Database -- Separate the Harvest and Payload -- Distribute Tasks Across Multiple Computers -- Regulate Page Requests -- 18: Procurement Webbots and Snipers.
505 8# - FORMATTED CONTENTS NOTE
Formatted contents note Procurement Webbot Theory -- Get Purchase Criteria -- Authenticate Buyer -- Verify Item -- Evaluate Purchase Triggers -- Make Purchase -- Evaluate Results -- Sniper Theory -- Get Purchase Criteria -- Authenticate Buyer -- Verify Item -- Synchronize Clocks -- Time to Bid? -- Submit Bid -- Evaluate Results -- Testing Your Own Webbots and Snipers -- Further Exploration -- Final Thoughts -- 19: Webbots and Cryptography -- Designing Webbots That Use Encryption -- SSL and PHP Built-in Functions -- Encryption and PHP/CURL -- A Quick Overview of Web Encryption -- Final Thoughts -- 20: Authentication -- What Is Authentication? -- Types of Online Authentication -- Strengthening Authentication by Combining Techniques -- Authentication and Webbots -- Example Scripts and Practice Pages -- Basic Authentication -- Session Authentication -- Authentication with Cookie Sessions -- Authentication with Query Sessions -- Final Thoughts -- 21: Advanced Cookie Management -- How Cookies Work -- PHP/CURL and Cookies -- How Cookies Challenge Webbot Design -- Purging Temporary Cookies -- Managing Multiple Users' Cookies -- Further Exploration -- 22: Scheduling Webbots and Spiders -- Preparing Your Webbots to Run as Scheduled Tasks -- The Windows XP Task Scheduler -- Scheduling a Webbot to Run Daily -- Complex Schedules -- The Windows 7 Task Scheduler -- Non-calendar-based Triggers -- Final Thoughts -- Determine the Webbot's Best Periodicity -- Avoid Single Points of Failure -- Add Variety to Your Schedule -- 23: Scraping Difficult Websites with Browser Macros -- Barriers to Effective Web Scraping -- AJAX -- Bizarre JavaScript and Cookie Behavior -- Flash -- Overcoming Webscraping Barriers with Browser Macros -- What Is a Browser Macro? -- The Ultimate Browser-Like Webbot -- Installing and Using iMacros -- Creating Your First Macro -- Final Thoughts.
505 8# - FORMATTED CONTENTS NOTE
Formatted contents note Are Macros Really Necessary?.
588 ## - SOURCE OF DESCRIPTION NOTE
Source of description note Description based on publisher supplied metadata and other sources.
590 ## - LOCAL NOTE (RLIN)
Local note Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2024. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
650 #0 - SUBJECT ADDED ENTRY--TOPICAL TERM
Topical term or geographic name entry element Web search engines.
650 #0 - SUBJECT ADDED ENTRY--TOPICAL TERM
Topical term or geographic name entry element Internet programming.
650 #0 - SUBJECT ADDED ENTRY--TOPICAL TERM
Topical term or geographic name entry element Internet searching.
650 #0 - SUBJECT ADDED ENTRY--TOPICAL TERM
Topical term or geographic name entry element Intelligent agents (Computer software).
655 #4 - INDEX TERM--GENRE/FORM
Genre/form data or focus term Electronic books.
776 08 - ADDITIONAL PHYSICAL FORM ENTRY
Relationship information Print version:
Main entry heading Schrenk, Michael
Title Webbots, Spiders, and Screen Scrapers
Place, publisher, and date of publication San Francisco : No Starch Press, Incorporated, c2012
International Standard Book Number 9781593273972
797 2# - LOCAL ADDED ENTRY--CORPORATE NAME (RLIN)
Corporate name or jurisdiction name as entry element ProQuest (Firm)
856 40 - ELECTRONIC LOCATION AND ACCESS
Uniform Resource Identifier https://ebookcentral.proquest.com/lib/orpp/detail.action?docID=3017639
Public note Click to View

No items available.
