{{tag>Bionic internet programmation BROUILLON}}

----

====== Mirroring websites with httrack ======

**HTTrack** is a well-known website copier.

<note warning>
Large sites (including the Ubuntu-fr forum and documentation) **must not** be mirrored automatically, or the site may block your IP address. Mirroring must follow a certain etiquette and should only be used when you genuinely need offline access to content: a mirroring run puts far more load on the target server than simply viewing a page. Ask the webmaster for permission before proceeding! And keep intellectual-property issues in mind.</note>


===== Installation =====
There are two versions of httrack:
  * The basic version: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>httrack]]**
  * The graphical version, which runs in your preferred web browser: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>webhttrack]]**.

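Once installed, both versions are run from a terminal; ''webhttrack'' opens its step-by-step wizard in your web browser. A quick sanity check of the command-line version:

<code>
httrack --version
</code>
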
===== Usage =====

The simplest invocation mirrors a single site:

<code>httrack --mirror http://website.com</code>

The reference below is excerpted from the ''httrack(1)'' man page.

  NAME
         httrack - offline browser : copy websites to a local directory
  SYNOPSIS
         httrack [ url ]... [ -filter ]... [ +filter ]... [ -O, --path ] [ -w, --mirror ] [ -W, --mirror-wizard ] [ -g, --get-files ] [ -i, --continue ]
         [ -Y, --mirrorlinks ] [ -P, --proxy ] [ -%f, --httpproxy-ftp[=N] ] [ -%b, --bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ]
         [ -mN, --max-files[=N] ] [ -MN, --max-size[=N] ] [ -EN, --max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN, --connection-per-second[=N] ]
         [ -GN, --max-pause[=N] ] [ -cN, --sockets[=N] ] [ -TN, --timeout[=N] ] [ -RN, --retries[=N] ] [ -JN, --min-rate[=N] ] [ -HN, --host-control[=N] ]
         [ -%P, --extended-parsing[=N] ] [ -n, --near ] [ -t, --test ] [ -%L, --list ] [ -%S, --urllist ] [ -NN, --structure[=N] ]
         [ -%D, --cached-delayed-type-check ] [ -%M, --mime-html ] [ -LN, --long-names[=N] ] [ -KN, --keep-links[=N] ] [ -x, --replace-external ]
         [ -%x, --disable-passwords ] [ -%q, --include-query-string ] [ -o, --generate-errors ] [ -X, --purge-old[=N] ] [ -%p, --preserve ]
         [ -%T, --utf8-conversion ] [ -bN, --cookies[=N] ] [ -u, --check-type[=N] ] [ -j, --parse-java[=N] ] [ -sN, --robots[=N] ] [ -%h, --http-10 ]
         [ -%k, --keep-alive ] [ -%B, --tolerant ] [ -%s, --updatehack ] [ -%u, --urlhack ] [ -%A, --assume ] [ -@iN, --protocol[=N] ]
         [ -%w, --disable-module ] [ -F, --user-agent ] [ -%R, --referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l, --language ] [ -%a, --accept ]
         [ -%X, --headers ] [ -C, --cache[=N] ] [ -k, --store-all-in-cache ] [ -%n, --do-not-recatch ] [ -%v, --display ] [ -Q, --do-not-log ]
         [ -q, --quiet ] [ -z, --extra-log ] [ -Z, --debug-log ] [ -v, --verbose ] [ -f, --file-log ] [ -f2, --single-log ] [ -I, --index ]
         [ -%i, --build-top-index ] [ -%I, --search-index ] [ -pN, --priority[=N] ] [ -S, --stay-on-same-dir ] [ -D, --can-go-down ] [ -U, --can-go-up ]
         [ -B, --can-go-up-and-down ] [ -a, --stay-on-same-address ] [ -d, --stay-on-same-domain ] [ -l, --stay-on-same-tld ] [ -e, --go-everywhere ]
         [ -%H, --debug-headers ] [ -%!, --disable-security-limits ] [ -V, --userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N] ] ...

  DESCRIPTION
         httrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting
         HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a
         page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can
         also update an existing mirrored site, and resume interrupted downloads.

  EXAMPLES
         httrack www.someweb.com/bob/
                mirror site www.someweb.com/bob/ and only this site

         httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*
                mirror the two sites together (with shared links) and accept any .jpg files on .com sites

         httrack www.someweb.com/bob/bobby.html +* -r6
                means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web

         httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
                runs the spider on www.someweb.com/bob/bobby.html using a proxy

         httrack --update
                updates a mirror in the current folder

         httrack
                will bring you to the interactive mode

         httrack --continue
                continues a mirror in the current folder

  OPTIONS
         General options:
         -O     path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles]) (--path <param>)

         Action options:
         -w     *mirror web sites (--mirror)
         -W     mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
         -g     just get files (saved in the current directory) (--get-files)
         -i     continue an interrupted mirror using the cache (--continue)
         -Y     mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)

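A minimal sketch combining the general and action options above (the URL and destination path are placeholders):

<code>
# mirror a site (-w) into a chosen project directory (-O)
httrack -w http://www.example.com -O /home/user/mirrors/example
</code>
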
         Proxy options:
         -P     proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
         -%f    *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
         -%b    use this local hostname to make/send requests (-%b hostname) (--bind <param>)

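If you reach the web through a proxy, ''-P'' takes ''proxy:port'', optionally with credentials (hypothetical proxy shown):

<code>
# mirror through an authenticating HTTP proxy
httrack -w http://www.example.com -P user:pass@proxy.example.com:3128
</code>
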
         Limits options:
         -rN    set the mirror depth to N (* r9999) (--depth[=N])
         -%eN   set the external links depth to N (* %e0) (--ext-depth[=N])
         -mN    maximum file length for a non-html file (--max-files[=N])
         -mN,N2 maximum file length for non html (N) and html (N2)
         -MN    maximum overall size that can be uploaded/scanned (--max-size[=N])
         -EN    maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
         -AN    maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])
         -%cN   maximum number of connections/seconds (*%c10) (--connection-per-second[=N])
         -GN    pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])

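The limit options combine freely; for example, to keep a test mirror small and polite (all values are illustrative):

<code>
# depth 3, at most ~100 MB in total, rate capped at 25 KB/s, stop after one hour
httrack -w http://www.example.com -r3 -M100000000 -A25000 -E3600
</code>
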
         Flow control:
         -cN    number of multiple connections (*c8) (--sockets[=N])
         -TN    timeout, number of seconds after which a non-responding link is shut down (--timeout[=N])
         -RN    number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N])
         -JN    traffic jam control, minimum transfer rate (bytes/seconds) tolerated for a link (--min-rate[=N])
         -HN    host is abandoned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N])

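A sketch of conservative flow-control settings for a slow or fragile server (values are illustrative):

<code>
# 2 connections, 30 s timeout, 2 retries, drop hosts that time out or are too slow
httrack -w http://www.example.com -c2 -T30 -R2 -H3
</code>
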
         Links options:
         -%P    *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
         -n     get non-html files near an html file (ex: an image located outside) (--near)
         -t     test all URLs (even forbidden ones) (--test)
         -%L    <file> add all URLs located in this text file (one URL per line) (--list <param>)
         -%S    <file> add all scan rules located in this text file (one scan rule per line) (--urllist <param>)

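''-%L'' and ''-%S'' read their input from plain text files, which makes batch mirroring easy (the file names are placeholders):

<code>
# urls.txt: one URL per line; rules.txt: one scan rule per line (e.g. +*.pdf)
httrack -%L urls.txt -%S rules.txt -O /home/user/mirrors/batch
</code>
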
         Build options:
         -NN    structure type (0 *original structure, 1+: see below) (--structure[=N])
                or user-defined structure (-N "%h%p/%n%q.%t")
         -%N    delayed type check, don't make any link test but wait for files download to start instead (experimental) (%N0 don't use, %N1 use for
                unknown extensions, * %N2 always use)
         -%D    cached delayed type check, don't wait for remote type during updates, to speed them up (%D0 wait, * %D1 don't wait)
                (--cached-delayed-type-check)
         -%M    generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)
         -LN    long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N])
         -KN    keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K4 original links, K3 absolute URI links,
                K5 transparent proxy link) (--keep-links[=N])
         -x     replace external html links by error pages (--replace-external)
         -%x    do not include any password for external password-protected websites (%x0 include) (--disable-passwords)
         -%q    *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string)
         -o     *generate output html file in case of error (404...) (o0 don't generate) (--generate-errors)
         -X     *purge old files after update (X0 keep delete) (--purge-old[=N])
         -%p    preserve html files as is (identical to -K4 -%F "") (--preserve)
         -%T    links conversion to UTF-8 (--utf8-conversion)

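The user-defined structure string shown above can be passed straight to ''-N''; a sketch that also keeps the original absolute links:

<code>
# store files as host/path/name.type and keep original links (-K4)
httrack -w http://www.example.com -N "%h%p/%n%q.%t" -K4
</code>
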
         Spider options:
         -bN    accept cookies in cookies.txt (0=do not accept, * 1=accept) (--cookies[=N])
         -u     check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
         -j     *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class, |4 don't parse .js, |8 don't be aggressive)
                (--parse-java[=N])
         -sN    follow robots.txt and meta robots tags (0=never, 1=sometimes, * 2=always, 3=always (even strict rules)) (--robots[=N])
         -%h    force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
         -%k    use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
         -%B    tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
         -%s    update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
         -%u    url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)
         -%A    assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>)
                can also be used to force a specific file type: --assume foo.cgi=text/html
         -@iN   internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])
         -%w    disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module <param>)

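A sketch combining a few of the spider options (the defaults are usually fine):

<code>
# accept cookies (-b1), use keep-alive (-%k), force IPv4 only (-@i4)
httrack -w http://www.example.com -b1 -%k -@i4
</code>
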
         Browser ID:
         -F     user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>)
         -%R    default referer field sent in HTTP headers (--referer <param>)
         -%E    from email address sent in HTTP headers (--from <param>)
         -%F    footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]") (--footer <param>)
         -%l    preferred language (-%l "fr, en, jp, *") (--language <param>)
         -%a    accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1") (--accept <param>)
         -%X    additional HTTP header line (-%X "X-Magic: 42") (--headers <param>)

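These identity fields are plain strings; for example (the user-agent value is only an illustration):

<code>
# send a custom user-agent and a preferred-language header
httrack -w http://www.example.com -F "Mozilla/5.0 (compatible; offline-copy)" -%l "fr, en, *"
</code>
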
         Log, index, cache:
         -C     create/use a cache for updates and retries (C0 no cache, C1 cache has priority, * C2 test update before) (--cache[=N])
         -k     store all files in cache (not useful if files on disk) (--store-all-in-cache)
         -%n    do not re-download locally erased files (--do-not-recatch)
         -%v    display on screen filenames downloaded (in realtime) - * %v1 short version - %v2 full animation (--display)
         -Q     no log - quiet mode (--do-not-log)
         -q     no questions - quiet mode (--quiet)
         -z     log - extra infos (--extra-log)
         -Z     log - debug (--debug-log)
         -v     log on screen (--verbose)
         -f     *log in files (--file-log)
         -f2    one single log file (--single-log)
         -I     *make an index (I0 don't make) (--index)
         -%i    make a top index for a project folder (* %i0 don't make) (--build-top-index)
         -%I    make a searchable index for this mirror (* %I0 don't make) (--search-index)

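A sketch of an unattended run that asks no questions, writes a single log file and builds an index:

<code>
# quiet mode (-q), single log file (-f2), build an index (-I)
httrack -w http://www.example.com -q -f2 -I -O /home/user/mirrors/example
</code>
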
         Expert options:
         -pN    priority mode: (* p3) (--priority[=N])
                p0 just scan, don't save anything (for checking links)
                p1 save only html files
                p2 save only non-html files
                *p3 save all files
                p7 get html files before, then treat other files
         -S     stay on the same directory (--stay-on-same-dir)
         -D     *can only go down into subdirs (--can-go-down)
         -U     can only go to upper directories (--can-go-up)
         -B     can both go up & down into the directory structure (--can-go-up-and-down)
         -a     *stay on the same address (--stay-on-same-address)
         -d     stay on the same principal domain (--stay-on-same-domain)
         -l     stay on the same TLD (e.g. .com) (--stay-on-same-tld)
         -e     go everywhere on the web (--go-everywhere)
         -%H    debug HTTP headers in logfile (--debug-headers)

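''-p0'' together with a small depth makes a handy link checker, since nothing is written to disk (a sketch):

<code>
# scan only, save nothing (for checking links)
httrack http://www.example.com -p0 -r2
</code>
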
         Guru options: (do NOT use if possible)
         -#X    *use optimized engine (limited memory boundary checks) (--fast-engine)
         -#0    filter test (-#0 *.gif www.bar.com/foo.gif) (--debug-testfilters <param>)
         -#1    simplify test (-#1 ./foo/bar/../foobar)
         -#2    type test (-#2 /foo/bar.php)
         -#C    cache list (-#C *.com/spider*.gif) (--debug-cache <param>)
         -#R    cache repair (damaged cache) (--repair-cache)
         -#d    debug parser (--debug-parsing)
         -#E    extract new.zip cache meta-data in meta.zip
         -#f    always flush log files (--advanced-flushlogs)
         -#FN   maximum number of filters (--advanced-maxfilters[=N])
         -#h    version info (--version)
         -#K    scan stdin (debug) (--debug-scanstdin)
         -#L    maximum number of links (-#L1000000) (--advanced-maxlinks)
         -#p    display ugly progress information (--advanced-progressinfo)
         -#P    catch URL (--catch-url)
         -#R    old FTP routines (debug) (--repair-cache)
         -#T    generate transfer ops. log every minute (--debug-xfrstats)
         -#u    wait time (--advanced-wait)
         -#Z    generate transfer rate statistics every minute (--debug-ratestats)

         Dangerous options: (do NOT use unless you exactly know what you are doing)
         -%!    bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections)
                (--disable-security-limits)
                IMPORTANT NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS. USE IT WITH EXTREME CARE.

         Command-line specific options:
         -V     execute system command after each file ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
         -%W    use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback <param>)

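''-V'' runs a shell command once per downloaded file, with ''$0'' replaced by the file name; a hypothetical use, checksumming each file as it arrives:

<code>
# record an MD5 checksum of every downloaded file
httrack -w http://www.example.com -V "md5sum \$0 >> checksums.txt"
</code>
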
         Details: Option N
         N0     Site-structure (default)
         N1     HTML in web/, images/other files in web/images/
         N2     HTML in web/HTML, images/other in web/images
         N3     HTML in web/, images/other in web/
         N4     HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif will be placed onto web/gif, for example)
         N5     Images/other in web/xxx and HTML in web/HTML
         N99    All files in web/, with random names (gadget!)
         N100   Site-structure, without www.domain.xxx/
         N101   Identical to N1 except that "web" is replaced by the site's name
         N102   Identical to N2 except that "web" is replaced by the site's name
         N103   Identical to N3 except that "web" is replaced by the site's name
         N104   Identical to N4 except that "web" is replaced by the site's name
         N105   Identical to N5 except that "web" is replaced by the site's name
         N199   Identical to N99 except that "web" is replaced by the site's name
         N1001  Identical to N1 except that there is no "web" directory
         N1002  Identical to N2 except that there is no "web" directory
         N1003  Identical to N3 except that there is no "web" directory (option set for g option)
         N1004  Identical to N4 except that there is no "web" directory
         N1005  Identical to N5 except that there is no "web" directory
         N1099  Identical to N99 except that there is no "web" directory

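For instance, ''-N1004'' from the table above puts the HTML at the top of the project and sorts the other files by extension, with no "web" directory (a sketch):

<code>
# HTML at the top of the mirror, other files grouped by extension (gif/, jpg/, ...)
httrack -w http://www.example.com -N1004 -O /home/user/mirrors/example
</code>
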
         Details: User-defined option N
         %n     Name of file without file type (ex: image)
         %N     Name of file, including file type (ex: image.gif)
         %t     File type (ex: gif)
         %p     Path [without ending /] (ex: /someimages)
         %h     Host name (ex: www.someweb.com)
         %M     URL MD5 (128 bits, 32 ascii bytes)
         %Q     query string MD5 (128 bits, 32 ascii bytes)
         %k     full query string
         %r     protocol name (ex: http)
         %q     small query string MD5 (16 bits, 4 ascii bytes)
         %s?    Short name version (ex: %sN)
         %[param] param variable in query string
         %[param:before:after:empty:notfound] advanced variable extraction

         Details: User-defined option N and advanced variable extraction
         %[param:before:after:empty:notfound]
                param    : parameter name
                before   : string to prepend if the parameter was found
                after    : string to append if the parameter was found
                notfound : string replacement if the parameter could not be found
                empty    : string replacement if the parameter was empty
                all fields, except the first one (the parameter name), can be empty

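A sketch using a query-string variable in a user-defined structure (it assumes the site addresses its pages as ''page.php?id=...''; the pattern is illustrative, built from the variable table above):

<code>
# name saved files after the "id" query parameter
httrack -w http://www.example.com -N "%h%p/%n-%[id].%t"
</code>
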
===== Command-line usage =====

Create a mirror:

<code>httrack --mirror http://www.monsite.com</code>

Update the current project:

<code>httrack --update</code>

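Both ''--update'' and ''--continue'' operate on the project in the current directory, so change into the mirror folder first (the path is illustrative):

<code>
cd /home/user/mirrors/example
httrack --update
</code>
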
Clean up the cache and log files:

<code>httrack --clean</code>

===== See also =====
  * [[http://www.httrack.com/|Official website]]

----