URLs.tgz Readme
===============

1. All the URLs listed in URLs.tgz were crawled from the Internet between
   Feb 13, 2003 and Mar 18, 2003 for research purposes.

2. Check your DISK SPACE before un-tarring the file; the un-tarred result
   will be more than 1.1GB!

3. The files in URLs.tgz are organized as the tree shown below.
   Use "gzip -dc URLs.tgz | tar xvf -" to extract the files:

   URLs.tgz
      |
     URLs
      |
      |---data_xxxx.yy (xxxx: starting page, yy: domain with no yy for U.S.)
            |
            |---all (all URLs from crawler)
            |    |
            |    |---fifo00000?.done (http URLs list, 100,000 urls in each)
            |    |
            |    |---non-http.url (URLs for other protocols: rtsp, mms...)
            |
            |---video (video URLs)
                 |
                 |---WindowMedia.txt.done (Windows Media URLs)
                 |
                 |---RealPlayer.txt.done (RealNetworks URLs)
                 |
                 |---QuickTime.txt.done (QuickTime URLs)

4. For any questions related to the crawling, please contact:
   Mark Claypool (claypool@cs.wpi.edu),
   Robert Kinicki (rek@cs.wpi.edu), or
   Mingzhe Li (lmz@cs.wpi.edu).
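
As a rough illustration of items 2 and 3 above, the Python sketch below checks free disk space, extracts the archive, and then walks the http URL lists. It is not part of the original distribution. The function names and the REQUIRED_BYTES threshold are hypothetical, and the code assumes that each fifo*.done file holds one URL per line, which this Readme does not state.

```python
import shutil
import tarfile
from pathlib import Path

# The Readme says the extracted data exceeds 1.1GB; this value is only a
# lower bound chosen for illustration, not an exact figure.
REQUIRED_BYTES = 1_100_000_000


def has_enough_space(dest, required=REQUIRED_BYTES):
    """Return True if the filesystem holding dest has at least `required` free bytes."""
    return shutil.disk_usage(dest).free >= required


def extract_archive(archive, dest, required=REQUIRED_BYTES):
    """Extract a gzip-compressed tar archive into dest after a free-space check.

    Returns the path of the top-level URLs directory described in item 3.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if not has_enough_space(dest, required):
        raise OSError(f"need at least {required} free bytes in {dest}")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest / "URLs"


def iter_http_urls(urls_root):
    """Yield each non-empty line of every data_*/all/fifo*.done file.

    Assumption: one URL per line; the Readme does not specify the file format.
    """
    for path in sorted(Path(urls_root).glob("data_*/all/fifo*.done")):
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield line
```

A call such as extract_archive("URLs.tgz", ".") followed by a loop over iter_http_urls("URLs") would then visit the http URLs one at a time without loading a whole 100,000-line file into memory. The same glob approach could be pointed at the video directory files if only the media URLs are of interest.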