| httrack [23/10/2016, 02:26] – [Mirroring websites with httrack] fouadessahlaoui | httrack [26/07/2020, 21:20] (current version) – L'Africain |
|---|
| | {{tag>Bionic internet programmation BROUILLON}} |
| |
| | ---- |
| | |
| | ====== Mirroring websites with httrack ====== |
| | 
| | **HTTrack** is a well-known website copier (offline browser). |
| | |
| | <note warning> |
| | Large sites (including the Ubuntu-fr forum and documentation) **must not** be mirrored automatically, or the site may block your IP address. Mirroring must follow basic ethics and should only be used when there is a real need to access content offline. Mirroring a site demands far more hardware resources from it than simply displaying a web page. Ask the webmaster for permission before proceeding! Also keep intellectual property issues in mind.</note> |
| | |
| | |
| | ===== Installation ===== |
| | There are two versions of httrack: |
| | * The basic command-line version: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>httrack]]** |
| | * The graphical version, which runs in your preferred web browser: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>webhttrack]]**. |
| | |
| | |
| | ===== Usage ===== |
| | httrack --mirror http://website.com |
| | |
| | httrack(1) General Commands Manual httrack(1) |
| | |
| | |
| | |
| | NAME |
| | httrack - offline browser : copy websites to a local directory |
| | |
| | SYNOPSIS |
| | httrack [ url ]... [ -filter ]... [ +filter ]... [ -O, --path ] [ -w, --mirror ] [ -W, --mirror-wizard ] [ -g, --get-files ] [ -i, --continue ] [ -Y, |
| | --mirrorlinks ] [ -P, --proxy ] [ -%f, --httpproxy-ftp[=N] ] [ -%b, --bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ] [ -mN, --max-files[=N] ] [ |
| | -MN, --max-size[=N] ] [ -EN, --max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN, --connection-per-second[=N] ] [ -GN, --max-pause[=N] ] [ -cN, --sockets[=N] |
| | ] [ -TN, --timeout[=N] ] [ -RN, --retries[=N] ] [ -JN, --min-rate[=N] ] [ -HN, --host-control[=N] ] [ -%P, --extended-parsing[=N] ] [ -n, --near ] [ -t, |
| | --test ] [ -%L, --list ] [ -%S, --urllist ] [ -NN, --structure[=N] ] [ -%D, --cached-delayed-type-check ] [ -%M, --mime-html ] [ -LN, --long-names[=N] ] |
| | [ -KN, --keep-links[=N] ] [ -x, --replace-external ] [ -%x, --disable-passwords ] [ -%q, --include-query-string ] [ -o, --generate-errors ] [ -X, |
| | --purge-old[=N] ] [ -%p, --preserve ] [ -%T, --utf8-conversion ] [ -bN, --cookies[=N] ] [ -u, --check-type[=N] ] [ -j, --parse-java[=N] ] [ -sN, |
| | --robots[=N] ] [ -%h, --http-10 ] [ -%k, --keep-alive ] [ -%B, --tolerant ] [ -%s, --updatehack ] [ -%u, --urlhack ] [ -%A, --assume ] [ -@iN, --protocol[=N] |
| | ] [ -%w, --disable-module ] [ -F, --user-agent ] [ -%R, --referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l, --language ] [ -%a, --accept ] [ -%X, |
| | --headers ] [ -C, --cache[=N] ] [ -k, --store-all-in-cache ] [ -%n, --do-not-recatch ] [ -%v, --display ] [ -Q, --do-not-log ] [ -q, --quiet ] [ -z, |
| | --extra-log ] [ -Z, --debug-log ] [ -v, --verbose ] [ -f, --file-log ] [ -f2, --single-log ] [ -I, --index ] [ -%i, --build-top-index ] [ -%I, |
| | --search-index ] [ -pN, --priority[=N] ] [ -S, --stay-on-same-dir ] [ -D, --can-go-down ] [ -U, --can-go-up ] [ -B, --can-go-up-and-down ] [ -a, |
| | --stay-on-same-address ] [ -d, --stay-on-same-domain ] [ -l, --stay-on-same-tld ] [ -e, --go-everywhere ] [ -%H, --debug-headers ] [ -%!, |
| | --disable-security-limits ] [ -V, --userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N] ] [ |
| | |
| | DESCRIPTION |
| | httrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, |
| | and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" |
| | website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, |
| | and resume interrupted downloads. |
| | |
| | EXAMPLES |
| | httrack www.someweb.com/bob/ |
| | mirror site www.someweb.com/bob/ and only this site |
| | |
| | httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/* |
| | mirror the two sites together (with shared links) and accept any .jpg files on .com sites |
| | |
| | httrack www.someweb.com/bob/bobby.html +* -r6 |
| | means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web |
| | |
| | httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080 |
| | runs the spider on www.someweb.com/bob/bobby.html using a proxy |
| | |
| | httrack --update |
| | updates a mirror in the current folder |
| | |
| | httrack |
| | will bring you to the interactive mode |
| | |
| | httrack --continue |
| | continues a mirror in the current folder |
| | |
| | OPTIONS |
| | General options: |
| | -O path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles]) (--path <param>) |
| | |
| | |
| | Action options: |
| | -w *mirror web sites (--mirror) |
| | |
| | -W mirror web sites, semi-automatic (asks questions) (--mirror-wizard) |
| | |
| | -g just get files (saved in the current directory) (--get-files) |
| | |
| | -i continue an interrupted mirror using the cache (--continue) |
| | |
| | -Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks) |
| | |
| | |
| | Proxy options: |
| | -P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>) |
| | |
| | -%f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N]) |
| | |
| | -%b use this local hostname to make/send requests (-%b hostname) (--bind <param>) |
| | |
| | |
| | Limits options: |
| | -rN set the mirror depth to N (* r9999) (--depth[=N]) |
| | |
| | -%eN set the external links depth to N (* %e0) (--ext-depth[=N]) |
| | |
| | -mN maximum file length for a non-html file (--max-files[=N]) |
| | |
| | -mN,N2 maximum file length for non html (N) and html (N2) |
| | |
| | -MN maximum overall size that can be uploaded/scanned (--max-size[=N]) |
| | |
| | -EN maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N]) |
| | |
| | -AN maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N]) |
| | |
| | -%cN maximum number of connections/seconds (*%c10) (--connection-per-second[=N]) |
| | |
| | -GN pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N]) |
| | |
| | |
| | Flow control: |
| | -cN number of multiple connections (*c8) (--sockets[=N]) |
| | |
| | -TN timeout, number of seconds after a non-responding link is shutdown (--timeout[=N]) |
| | |
| | -RN number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N]) |
| | |
| | -JN traffic jam control, minimum transfer rate (bytes/seconds) tolerated for a link (--min-rate[=N]) |
| | 
| | -HN host is abandoned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N]) |
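
The limit and flow-control options above can be combined to keep a mirror gentle on the remote server. The following is a minimal sketch, not a recommended configuration: the URL, the destination directory and every numeric value are hypothetical and chosen only for illustration. The script prints the command (a dry run) instead of running it, so nothing is downloaded until you run the printed line yourself.

```shell
#!/bin/sh
# Dry-run sketch: build a rate-limited httrack command and print it.
# All values below are illustrative assumptions, not recommendations from the man page.
url="http://www.someweb.com/bob/"   # hypothetical start URL
dest="./bob-mirror"                 # hypothetical output directory (-O)

depth=3        # -rN: mirror depth
rate=25000     # -AN: maximum transfer rate in bytes/second
sockets=2      # -cN: number of simultaneous connections
per_sec=1      # -%cN: maximum number of connections per second
timeout=30     # -TN: seconds before a non-responding link is shut down
max_time=3600  # -EN: maximum mirror time in seconds (1 hour)

# -s2 keeps the documented default of always following robots.txt.
cmd="httrack --mirror $url -O $dest -r$depth -A$rate -c$sockets -%c$per_sec -T$timeout -E$max_time -s2"
printf '%s\n' "$cmd"
```

Keeping the values in named variables makes it easy to tighten one limit at a time and to review the final command before anything touches the network.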
| | |
| | |
| | Links options: |
| | -%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N]) |
| | |
| | -n get non-html files near an html file (ex: an image located outside) (--near) |
| | |
| | -t test all URLs (even forbidden ones) (--test) |
| | |
| | -%L <file> add all URL located in this text file (one URL per line) (--list <param>) |
| | |
| | -%S <file> add all scan rules located in this text file (one scan rule per line) (--urllist <param>) |
| | |
| | |
| | Build options: |
| | -NN structure type (0 *original structure, 1+: see below) (--structure[=N]) |
| | |
| | -or user defined structure (-N "%h%p/%n%q.%t") |
| | |
| | -%N delayed type check, don't make any link test but wait for files download to start instead (experimental) (%N0 don't use, %N1 use for unknown |
| | extensions, * %N2 always use) |
| | 
| | -%D cached delayed type check, don't wait for remote type during updates, to speed them up (%D0 wait, * %D1 don't wait) (--cached-delayed-type-check) |
| | |
| | -%M generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html) |
| | |
| | -LN long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N]) |
| | |
| | -KN keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K4 original links, K3 absolute URI links, K5 transparent |
| | proxy link) (--keep-links[=N]) |
| | |
| | -x replace external html links by error pages (--replace-external) |
| | |
| | -%x do not include any password for external password protected websites (%x0 include) (--disable-passwords) |
| | |
| | -%q *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string) |
| | 
| | -o *generate output html file in case of error (404..) (o0 don't generate) (--generate-errors) |
| | |
| | -X *purge old files after update (X0 keep delete) (--purge-old[=N]) |
| | |
| | -%p preserve html files as is (identical to -K4 -%F "" ) (--preserve) |
| | |
| | -%T links conversion to UTF-8 (--utf8-conversion) |
| | |
| | |
| | Spider options: |
| | -bN accept cookies in cookies.txt (0=do not accept,* 1=accept) (--cookies[=N]) |
| | |
| | -u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N]) |
| | 
| | -j *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class |4 don't parse .js |8 don't be aggressive) |
| | (--parse-java[=N]) |
| | |
| | -sN follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always, 3=always (even strict rules)) (--robots[=N]) |
| | |
| | -%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10) |
| | |
| | -%k use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive) |
| | |
| | -%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant) |
| | |
| | -%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack) |
| | |
| | -%u url hacks: various hacks to limit duplicate URLs (strip %%//%%, www.foo.com==foo.com..) (--urlhack) |
| | |
| | -%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>) |
| | |
| | can also be used to force a specific file type: --assume foo.cgi=text/html |
| | |
| | -@iN internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N]) |
| | |
| | -%w disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module <param>) |
| | |
| | |
| | Browser ID: |
| | -F user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>) |
| | |
| | -%R default referer field sent in HTTP headers (--referer <param>) |
| | |
| | -%E from email address sent in HTTP headers (--from <param>) |
| | |
| | -%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]") (--footer <param>) |
| | 
| | -%l preferred language (-%l "fr, en, jp, *") (--language <param>) |
| | 
| | -%a accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1") (--accept <param>) |
| | 
| | -%X additional HTTP header line (-%X "X-Magic: 42") (--headers <param>) |
| | |
| | |
| | Log, index, cache |
| | -C create/use a cache for updates and retries (C0 no cache, C1 cache takes priority, * C2 test update before) (--cache[=N]) |
| | |
| | -k store all files in cache (not useful if files on disk) (--store-all-in-cache) |
| | |
| | -%n do not re-download locally erased files (--do-not-recatch) |
| | |
| | -%v display on screen filenames downloaded (in realtime) - * %v1 short version - %v2 full animation (--display) |
| | |
| | -Q no log - quiet mode (--do-not-log) |
| | |
| | -q no questions - quiet mode (--quiet) |
| | |
| | -z log - extra infos (--extra-log) |
| | |
| | -Z log - debug (--debug-log) |
| | |
| | -v log on screen (--verbose) |
| | |
| | -f *log in files (--file-log) |
| | |
| | -f2 one single log file (--single-log) |
| | |
| | -I *make an index (I0 don't make) (--index) |
| | 
| | -%i make a top index for a project folder (* %i0 don't make) (--build-top-index) |
| | 
| | -%I make a searchable index for this mirror (* %I0 don't make) (--search-index) |
| | |
| | |
| | Expert options: |
| | -pN priority mode: (* p3) (--priority[=N]) |
| | |
| | -p0 just scan, don't save anything (for checking links) |
| | |
| | -p1 save only html files |
| | |
| | -p2 save only non html files |
| | |
| | -*p3 save all files |
| | |
| | -p7 get html files before, then treat other files |
| | |
| | -S stay on the same directory (--stay-on-same-dir) |
| | |
| | -D *can only go down into subdirs (--can-go-down) |
| | |
| | -U can only go to upper directories (--can-go-up) |
| | |
| | -B can both go up&down into the directory structure (--can-go-up-and-down) |
| | |
| | -a *stay on the same address (--stay-on-same-address) |
| | |
| | -d stay on the same principal domain (--stay-on-same-domain) |
| | |
| | -l stay on the same TLD (eg: .com) (--stay-on-same-tld) |
| | |
| | -e go everywhere on the web (--go-everywhere) |
| | |
| | -%H debug HTTP headers in logfile (--debug-headers) |
| | |
| | |
| | Guru options: (do NOT use if possible) |
| | -#X *use optimized engine (limited memory boundary checks) (--fast-engine) |
| | |
| | -#0 filter test (-#0 *.gif www.bar.com/foo.gif ) (--debug-testfilters <param>) |
| | |
| | -#1 simplify test (-#1 ./foo/bar/../foobar) |
| | |
| | -#2 type test (-#2 /foo/bar.php) |
| | |
| | -#C cache list (-#C *.com/spider*.gif) (--debug-cache <param>) |
| | |
| | -#R cache repair (damaged cache) (--repair-cache) |
| | |
| | -#d debug parser (--debug-parsing) |
| | |
| | -#E extract new.zip cache meta-data in meta.zip |
| | |
| | -#f always flush log files (--advanced-flushlogs) |
| | |
| | -#FN maximum number of filters (--advanced-maxfilters[=N]) |
| | |
| | -#h version info (--version) |
| | |
| | -#K scan stdin (debug) (--debug-scanstdin) |
| | |
| | -#L maximum number of links (-#L1000000) (--advanced-maxlinks) |
| | |
| | -#p display ugly progress information (--advanced-progressinfo) |
| | |
| | -#P catch URL (--catch-url) |
| | |
| | -#R old FTP routines (debug) (--repair-cache) |
| | |
| | -#T generate transfer ops. log every minute (--debug-xfrstats) |
| | 
| | -#u wait time (--advanced-wait) |
| | 
| | -#Z generate transfer rate statistics every minute (--debug-ratestats) |
| | |
| | |
| | Dangerous options: (do NOT use unless you exactly know what you are doing) |
| | -%! bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections) (--disable-security-limits) |
| | |
| | IMPORTANT NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS |
| | 
| | USE IT WITH EXTREME CARE |
| | |
| | |
| | Command-line specific options: |
| | -V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>) |
| | |
| | -%W use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback <param>) |
| | |
| | |
| | Details: Option N |
| | -N0 Site-structure (default) |
| | |
| | -N1 HTML in web/, images/other files in web/images/ |
| | |
| | -N2 HTML in web/HTML, images/other in web/images |
| | |
| | -N3 HTML in web/, images/other in web/ |
| | |
| | -N4 HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif will be placed onto web/gif, for example) |
| | |
| | -N5 Images/other in web/xxx and HTML in web/HTML |
| | |
| | -N99 All files in web/, with random names (gadget !) |
| | |
| | -N100 Site-structure, without www.domain.xxx/ |
| | |
| | -N101 Identical to N1 except that "web" is replaced by the site's name |
| | 
| | -N102 Identical to N2 except that "web" is replaced by the site's name |
| | 
| | -N103 Identical to N3 except that "web" is replaced by the site's name |
| | 
| | -N104 Identical to N4 except that "web" is replaced by the site's name |
| | 
| | -N105 Identical to N5 except that "web" is replaced by the site's name |
| | 
| | -N199 Identical to N99 except that "web" is replaced by the site's name |
| | 
| | -N1001 Identical to N1 except that there is no "web" directory |
| | 
| | -N1002 Identical to N2 except that there is no "web" directory |
| | 
| | -N1003 Identical to N3 except that there is no "web" directory (option set for g option) |
| | 
| | -N1004 Identical to N4 except that there is no "web" directory |
| | 
| | -N1005 Identical to N5 except that there is no "web" directory |
| | 
| | -N1099 Identical to N99 except that there is no "web" directory |
| | |
| | Details: User-defined option N |
| | %n Name of file without file type (ex: image) |
| | %N Name of file, including file type (ex: image.gif) |
| | %t File type (ex: gif) |
| | %p Path [without ending /] (ex: /someimages) |
| | %h Host name (ex: www.someweb.com) |
| | %M URL MD5 (128 bits, 32 ascii bytes) |
| | %Q query string MD5 (128 bits, 32 ascii bytes) |
| | %k full query string |
| | %r protocol name (ex: http) |
| | %q small query string MD5 (16 bits, 4 ascii bytes) |
| | %s? Short name version (ex: %sN) |
| | %[param] param variable in query string |
| | %[param:before:after:empty:notfound] advanced variable extraction |
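
To make the placeholders above concrete, the sketch below splits the man page's own example URL into the pieces named by `%r`, `%h`, `%p`, `%N`, `%n` and `%t`, then assembles them according to a hypothetical pattern `%h%p/%n.%t`. It only mimics the documented meaning of each placeholder with POSIX shell parameter expansion. It is not httrack's implementation, and it ignores query strings (`%q`, `%Q`, `%k`).

```shell
#!/bin/sh
# Illustration only: derive the documented placeholder values from a URL.
# Variable names are hypothetical; they simply mirror the placeholder letters.
url="http://www.someweb.com/someimages/image.gif"

r="${url%%://*}"          # %r protocol name
rest="${url#*://}"        # URL without the protocol
h="${rest%%/*}"           # %h host name
full_path="/${rest#*/}"   # path plus file name
p="${full_path%/*}"       # %p path without the ending /
N="${full_path##*/}"      # %N file name including the file type
n="${N%.*}"               # %n file name without the file type
t="${N##*.}"              # %t file type

# Assemble the pieces as the pattern "%h%p/%n.%t" would name them.
local_path="$h$p/$n.$t"
printf '%s\n' "$local_path"
```

With a user-defined structure, the same pattern string would be passed to httrack as `-N "%h%p/%n.%t"`.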
| | |
| | Details: User-defined option N and advanced variable extraction |
| | %[param:before:after:empty:notfound] |
| | |
| | param : parameter name |
| | 
| | before : string to prepend if the parameter was found |
| | 
| | after : string to append if the parameter was found |
| | 
| | notfound : string replacement if the parameter could not be found |
| | 
| | empty : string replacement if the parameter was empty |
| | 
| | all fields, except the first one (the parameter name), can be empty |
| | |
| | |
| | |
| | ===== Command-line usage ===== |
| | 
| | Create a mirror: |
| | |
| | <code>httrack --mirror http://www.monsite.com</code> |
| | |
| | Update the current project: |
| | |
| | <code>httrack --update</code> |
| | |
| | Clean up the cache and log files: |
| | |
| | <code>httrack --clean</code> |
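
The mirror and update commands above can be wrapped in a small helper that decides which one applies. This is a hedged sketch: it assumes (this page does not say so) that an existing httrack project contains an `hts-cache` directory, and the project folder name is hypothetical. The helper only prints the command it would run.

```shell
#!/bin/sh
# Dry-run sketch: choose between a first mirror and an update.
# Assumption (not from this page): an existing project holds an "hts-cache" directory.
url="http://www.monsite.com"
project="./monsite-project"     # hypothetical project folder

choose_command() {
    # $1 = project folder, $2 = start URL
    if [ -d "$1/hts-cache" ]; then
        # Existing project: update it from inside its own folder.
        printf 'cd %s && httrack --update\n' "$1"
    else
        # No project yet: create the mirror in that folder.
        printf 'httrack --mirror %s -O %s\n' "$2" "$1"
    fi
}

choose_command "$project" "$url"
```

Printing rather than executing keeps the decision visible and avoids launching a download against a site you have not yet been given permission to mirror.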
| | |
| | ===== See also ===== |
| | * [[http://www.httrack.com/|Official site]] |
| | |
| | ---- |