{{tag>Bionic internet programmation BROUILLON}}

----

====== Mirroring websites with httrack ======

**HTTrack** is a well-known website copier.

<note warning>
Large sites (including the Ubuntu-fr forum and documentation) **must not** be mirrored automatically, at the risk of having your IP address blocked by the site. Mirroring should follow basic etiquette and be used only when you genuinely need offline access to content. A mirroring run demands far more resources from the target server than simply displaying a web page. Ask the webmaster for permission before proceeding! Intellectual-property issues should also be kept in mind.</note>


===== Installation =====
There are two versions of httrack:
  * The basic version: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>httrack]]**
  * The graphical version, which uses your preferred web browser: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>webhttrack]]**.

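For a command-line installation, the equivalent apt commands are as follows (a minimal sketch; the package names come from the links above):

<code>
sudo apt install httrack      # command-line version
sudo apt install webhttrack   # graphical version, used through your web browser
</code>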

===== Usage =====

A basic mirror of a site:

<code>httrack --mirror http://website.com</code>

For reference, the httrack(1) man page follows:

NAME
       httrack - offline browser : copy websites to a local directory

SYNOPSIS
       httrack [ url ]... [ -filter ]... [ +filter ]... [ -O, --path ] [ -w, --mirror ] [ -W, --mirror-wizard ] [ -g, --get-files ] [ -i, --continue ]
       [ -Y, --mirrorlinks ] [ -P, --proxy ] [ -%f, --httpproxy-ftp[=N] ] [ -%b, --bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ] [ -mN, --max-files[=N] ]
       [ -MN, --max-size[=N] ] [ -EN, --max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN, --connection-per-second[=N] ] [ -GN, --max-pause[=N] ] [ -cN, --sockets[=N] ]
       [ -TN, --timeout[=N] ] [ -RN, --retries[=N] ] [ -JN, --min-rate[=N] ] [ -HN, --host-control[=N] ] [ -%P, --extended-parsing[=N] ] [ -n, --near ]
       [ -t, --test ] [ -%L, --list ] [ -%S, --urllist ] [ -NN, --structure[=N] ] [ -%D, --cached-delayed-type-check ] [ -%M, --mime-html ] [ -LN, --long-names[=N] ]
       [ -KN, --keep-links[=N] ] [ -x, --replace-external ] [ -%x, --disable-passwords ] [ -%q, --include-query-string ] [ -o, --generate-errors ]
       [ -X, --purge-old[=N] ] [ -%p, --preserve ] [ -%T, --utf8-conversion ] [ -bN, --cookies[=N] ] [ -u, --check-type[=N] ] [ -j, --parse-java[=N] ]
       [ -sN, --robots[=N] ] [ -%h, --http-10 ] [ -%k, --keep-alive ] [ -%B, --tolerant ] [ -%s, --updatehack ] [ -%u, --urlhack ] [ -%A, --assume ]
       [ -@iN, --protocol[=N] ] [ -%w, --disable-module ] [ -F, --user-agent ] [ -%R, --referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l, --language ]
       [ -%a, --accept ] [ -%X, --headers ] [ -C, --cache[=N] ] [ -k, --store-all-in-cache ] [ -%n, --do-not-recatch ] [ -%v, --display ] [ -Q, --do-not-log ]
       [ -q, --quiet ] [ -z, --extra-log ] [ -Z, --debug-log ] [ -v, --verbose ] [ -f, --file-log ] [ -f2, --single-log ] [ -I, --index ] [ -%i, --build-top-index ]
       [ -%I, --search-index ] [ -pN, --priority[=N] ] [ -S, --stay-on-same-dir ] [ -D, --can-go-down ] [ -U, --can-go-up ] [ -B, --can-go-up-and-down ]
       [ -a, --stay-on-same-address ] [ -d, --stay-on-same-domain ] [ -l, --stay-on-same-tld ] [ -e, --go-everywhere ] [ -%H, --debug-headers ]
       [ -%!, --disable-security-limits ] [ -V, --userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N] ] ...

DESCRIPTION
       httrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images,
       and other files from the server to your computer. HTTrack arranges the original site's relative link structure. Simply open a page of the "mirrored"
       website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored
       site, and resume interrupted downloads.

EXAMPLES
       httrack www.someweb.com/bob/
              mirror site www.someweb.com/bob/ and only this site

       httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*
              mirror the two sites together (with shared links) and accept any .jpg files on .com sites

       httrack www.someweb.com/bob/bobby.html +* -r6
              means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web

       httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
              runs the spider on www.someweb.com/bob/bobby.html using a proxy

       httrack --update
              updates a mirror in the current folder

       httrack
              will bring you to the interactive mode

       httrack --continue
              continues a mirror in the current folder

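Building on the examples above, a hedged sketch (www.example.com and the paths are placeholders) that mirrors a site into a chosen directory while filtering with scan rules:

<code>
# mirror into ./mymirror, additionally accept .png/.jpg from anywhere on the site,
# and exclude everything under /private
httrack http://www.example.com/ -O ./mymirror "+*.png" "+*.jpg" "-*/private/*"
</code>
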
OPTIONS
   General options:
       -O     path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles]) (--path <param>)

   Action options:
       -w     *mirror web sites (--mirror)

       -W     mirror web sites, semi-automatic (asks questions) (--mirror-wizard)

       -g     just get files (saved in the current directory) (--get-files)

       -i     continue an interrupted mirror using the cache (--continue)

       -Y     mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)

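As an illustration of the action options, a minimal sketch of -w with an explicit project path (URL and path are placeholders):

<code>
httrack -w http://www.example.com/ -O /home/user/websites/example
</code>
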
   Proxy options:
       -P     proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)

       -%f    *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])

       -%b    use this local hostname to make/send requests (-%b hostname) (--bind <param>)

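A sketch of mirroring through an authenticating HTTP proxy (host, port and credentials are placeholders):

<code>
httrack http://www.example.com/ -O ./example -P user:password@proxy.example.org:3128
</code>
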
   Limits options:
       -rN    set the mirror depth to N (* r9999) (--depth[=N])

       -%eN   set the external links depth to N (* %e0) (--ext-depth[=N])

       -mN    maximum file length for a non-html file (--max-files[=N])

       -mN,N2 maximum file length for non html (N) and html (N2)

       -MN    maximum overall size that can be uploaded/scanned (--max-size[=N])

       -EN    maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])

       -AN    maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])

       -%cN   maximum number of connections/seconds (*%c10) (--connection-per-second[=N])

       -GN    pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])

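The limit options are the polite way to mirror; a sketch with illustrative values only:

<code>
# depth 3, at most 25 000 bytes/s, at most 2 connections per second
httrack http://www.example.com/ -O ./example -r3 -A25000 -%c2
</code>
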
   Flow control:
       -cN    number of multiple connections (*c8) (--sockets[=N])

       -TN    timeout, number of seconds after a non-responding link is shutdown (--timeout[=N])

       -RN    number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N])

       -JN    traffic jam control, minimum transfer rate (bytes/seconds) tolerated for a link (--min-rate[=N])

       -HN    host is abandoned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N])

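A sketch of tuning flow control for an unreliable connection (values are illustrative):

<code>
# 4 sockets, 30 s timeout, 2 retries, abandon hosts that time out
httrack http://www.example.com/ -O ./example -c4 -T30 -R2 -H1
</code>
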
   Links options:
       -%P    *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])

       -n     get non-html files near an html file (ex: an image located outside) (--near)

       -t     test all URLs (even forbidden ones) (--test)

       -%L    <file> add all URLs located in this text file (one URL per line) (--list <param>)

       -%S    <file> add all scan rules located in this text file (one scan rule per line) (--urllist <param>)

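For batch jobs, -%L reads start URLs from a file; a sketch assuming a hypothetical urls.txt:

<code>
# urls.txt contains one URL per line
httrack -%L urls.txt -O ./batch-mirror
</code>
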
   Build options:
       -NN    structure type (0 *original structure, 1+: see below) (--structure[=N])
              or user-defined structure (-N "%h%p/%n%q.%t")

       -%N    delayed type check, don't make any link test but wait for files download to start instead (experimental) (%N0 don't use, %N1 use for unknown
              extensions, * %N2 always use)

       -%D    cached delayed type check, don't wait for remote type during updates, to speed them up (%D0 wait, * %D1 don't wait) (--cached-delayed-type-check)

       -%M    generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)

       -LN    long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N])

       -KN    keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K4 original links, K3 absolute URI links, K5 transparent
              proxy link) (--keep-links[=N])

       -x     replace external html links by error pages (--replace-external)

       -%x    do not include any password for external password protected websites (%x0 include) (--disable-passwords)

       -%q    *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string)

       -o     *generate output html file in case of error (404..) (o0 don't generate) (--generate-errors)

       -X     *purge old files after update (X0 keep delete) (--purge-old[=N])

       -%p    preserve html files 'as is' (identical to '-K4 -%F ""') (--preserve)

       -%T    links conversion to UTF-8 (--utf8-conversion)

   Spider options:
       -bN    accept cookies in cookies.txt (0=do not accept, *1=accept) (--cookies[=N])

       -u     check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])

       -j     *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class |4 don't parse .js |8 don't be aggressive)
              (--parse-java[=N])

       -sN    follow robots.txt and meta robots tags (0=never, 1=sometimes, *2=always, 3=always (even strict rules)) (--robots[=N])

       -%h    force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)

       -%k    use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive)

       -%B    tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)

       -%s    update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)

       -%u    url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)

       -%A    assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>)
              can also be used to force a specific file type: --assume foo.cgi=text/html

       -@iN   internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])

       -%w    disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module <param>)

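A sketch of a cautious spider configuration (URL is a placeholder):

<code>
# always obey robots.txt (even strict rules), accept cookies, IPv4 only
httrack http://www.example.com/ -O ./example -s3 -b1 -@i4
</code>
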
   Browser ID:
       -F     user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>)

       -%R    default referer field sent in HTTP headers (--referer <param>)

       -%E    from email address sent in HTTP headers (--from <param>)

       -%F    footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]") (--footer <param>)

       -%l    preferred language (-%l "fr, en, jp, *") (--language <param>)

       -%a    accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1") (--accept <param>)

       -%X    additional HTTP header line (-%X "X-Magic: 42") (--headers <param>)

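A sketch that identifies the crawl honestly to the server (user-agent string and contact address are placeholders):

<code>
httrack http://www.example.com/ -O ./example -F "httrack mirror (contact: me@example.org)" -%l "fr, en, *"
</code>
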
   Log, index, cache:
       -C     create/use a cache for updates and retries (C0 no cache, C1 cache is prioritary, *C2 test update before) (--cache[=N])

       -k     store all files in cache (not useful if files on disk) (--store-all-in-cache)

       -%n    do not re-download locally erased files (--do-not-recatch)

       -%v    display on screen filenames downloaded (in realtime) - *%v1 short version - %v2 full animation (--display)

       -Q     no log - quiet mode (--do-not-log)

       -q     no questions - quiet mode (--quiet)

       -z     log - extra infos (--extra-log)

       -Z     log - debug (--debug-log)

       -v     log on screen (--verbose)

       -f     *log in files (--file-log)

       -f2    one single log file (--single-log)

       -I     *make an index (I0 don't make) (--index)

       -%i    make a top index for a project folder (*%i0 don't make) (--build-top-index)

       -%I    make a searchable index for this mirror (*%I0 don't make) (--search-index)

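For unattended (cron-style) runs, a sketch combining the quiet and logging options:

<code>
# no questions asked, one single log file
httrack http://www.example.com/ -O ./example -q -f2
</code>
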
   Expert options:
       -pN    priority mode: (*p3) (--priority[=N])

       -p0    just scan, don't save anything (for checking links)

       -p1    save only html files

       -p2    save only non html files

       -p3    *save all files

       -p7    get html files before, then treat other files

       -S     stay on the same directory (--stay-on-same-dir)

       -D     *can only go down into subdirs (--can-go-down)

       -U     can only go to upper directories (--can-go-up)

       -B     can both go up&down into the directory structure (--can-go-up-and-down)

       -a     *stay on the same address (--stay-on-same-address)

       -d     stay on the same principal domain (--stay-on-same-domain)

       -l     stay on the same TLD (eg: .com) (--stay-on-same-tld)

       -e     go everywhere on the web (--go-everywhere)

       -%H    debug HTTP headers in logfile (--debug-headers)

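A sketch of a scan-only run for checking links, staying on the same principal domain (nothing is saved with -p0):

<code>
httrack http://www.example.com/ -O ./linkcheck -p0 -d
</code>
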
   Guru options: (do NOT use if possible)
       -#     *use optimized engine (limited memory boundary checks) (--fast-engine)

       -#     filter test (-#0 '*.gif' 'www.bar.com/foo.gif') (--debug-testfilters <param>)

       -#     simplify test (-#1 ./foo/bar/../foobar)

       -#     type test (-#2 /foo/bar.php)

       -#     cache list (-#C '*.com/spider*.gif') (--debug-cache <param>)

       -#     cache repair (damaged cache) (--repair-cache)

       -#     debug parser (--debug-parsing)

       -#     extract new.zip cache meta-data in meta.zip

       -#     always flush log files (--advanced-flushlogs)

       -#FN   maximum number of filters (--advanced-maxfilters[=N])

       -#     version info (--version)

       -#     scan stdin (debug) (--debug-scanstdin)

       -#     maximum number of links (-#L1000000) (--advanced-maxlinks)

       -#     display ugly progress information (--advanced-progressinfo)

       -#     catch URL (--catch-url)

       -#     old FTP routines (debug) (--repair-cache)

       -#     generate transfer ops. log every minute (--debug-xfrstats)

       -#     wait time (--advanced-wait)

       -#     generate transfer rate statistics every minute (--debug-ratestats)

   Dangerous options: (do NOT use unless you exactly know what you are doing)
       -%!    bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections) (--disable-security-limits)
              IMPORTANT NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS.
              USE IT WITH EXTREME CARE.


   Command-line specific options:
       -V     execute system command after each file ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)

       -%W    use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback <param>)

   Details: Option N
       -N0    Site-structure (default)

       -N1    HTML in web/, images/other files in web/images/

       -N2    HTML in web/HTML, images/other in web/images

       -N3    HTML in web/, images/other in web/

       -N4    HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif will be placed onto web/gif, for example)

       -N5    Images/other in web/xxx and HTML in web/HTML

       -N99   All files in web/, with random names (gadget!)

       -N100  Site-structure, without www.domain.xxx/

       -N101  Identical to N1 except that "web" is replaced by the site's name

       -N102  Identical to N2 except that "web" is replaced by the site's name

       -N103  Identical to N3 except that "web" is replaced by the site's name

       -N104  Identical to N4 except that "web" is replaced by the site's name

       -N105  Identical to N5 except that "web" is replaced by the site's name

       -N199  Identical to N99 except that "web" is replaced by the site's name

       -N1001 Identical to N1 except that there is no "web" directory

       -N1002 Identical to N2 except that there is no "web" directory

       -N1003 Identical to N3 except that there is no "web" directory (option set for g option)

       -N1004 Identical to N4 except that there is no "web" directory

       -N1005 Identical to N5 except that there is no "web" directory

       -N1099 Identical to N99 except that there is no "web" directory

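For example, a sketch selecting structure type 1 (HTML in web/, other files in web/images/):

<code>
httrack http://www.example.com/ -O ./example -N1
</code>
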
   Details: User-defined option N
          %n  Name of file without file type (ex: image)
          %N  Name of file, including file type (ex: image.gif)
          %t  File type (ex: gif)
          %p  Path [without ending /] (ex: /someimages)
          %h  Host name (ex: www.someweb.com)
          %M  URL MD5 (128 bits, 32 ascii bytes)
          %Q  query string MD5 (128 bits, 32 ascii bytes)
          %k  full query string
          %r  protocol name (ex: http)
          %q  small query string MD5 (16 bits, 4 ascii bytes)
          %s? Short name version (ex: %sN)
          %[param]  param variable in query string
          %[param:before:after:empty:notfound]  advanced variable extraction

   Details: User-defined option N and advanced variable extraction
          %[param:before:after:empty:notfound]

       param    : parameter name
       before   : string to prepend if the parameter was found
       after    : string to append if the parameter was found
       notfound : string replacement if the parameter could not be found
       empty    : string replacement if the parameter was empty
       All fields, except the first one (the parameter name), can be empty



===== Command-line usage =====

Create a mirror:

<code>httrack --mirror http://www.monsite.com</code>

Update the current project:

<code>httrack --update</code>

Clean up the cache and log files:

<code>httrack --clean</code>
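
Putting it together, a typical session might look like this (URL and paths are placeholders):

<code>
# initial mirror into a project directory
httrack http://www.example.com/ -O ~/mirrors/example -r3 -%c2

# later, refresh the same project from inside its directory
cd ~/mirrors/example && httrack --update
</code>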

===== See also =====
  * [[http://www.httrack.com/|Official website]]

----