Mirror Configuration¶
The [mirror] section of the configuration file contains general options for how Bandersnatch should operate. This includes settings like the source repository to mirror, how to store mirrored files, and the kinds of files to include in the mirror.
The only option that is currently required is directory; all other options are optional.
Examples¶
These examples only show [mirror] options; a complete configuration may include mirror filtering plugins and/or options for a storage backend.
Minimal¶
A basic configuration showing some of the more common options:
[mirror]
; base destination path for mirrored files
directory = /srv/pypi
; upstream package repository to mirror
master = https://pypi.org
; parallel downloads - keep low to avoid overwhelming upstream
workers = 3
; per-request time limit
timeout = 15
; global time limit - applied to aiohttp coroutines
global-timeout = 18000
; continue syncing when an error occurs
stop-on-error = false
This will mirror index files and package release files from PyPI and store the mirror in /srv/pypi. Add configuration for mirror filtering plugins to optionally filter what packages are mirrored in a variety of ways.
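Because the file uses INI syntax, its values can be inspected with Python's standard configparser module. The following is a minimal sketch for checking a configuration by hand, assuming the text of the example above; it is not a description of how Bandersnatch itself loads its settings.

```python
import configparser

# The same options as the minimal example above, inlined for illustration.
CONFIG_TEXT = """
[mirror]
; base destination path for mirrored files
directory = /srv/pypi
master = https://pypi.org
workers = 3
timeout = 15
global-timeout = 18000
stop-on-error = false
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
mirror = parser["mirror"]

# Typed accessors convert the raw strings stored in the file.
directory = mirror.get("directory")
workers = mirror.getint("workers")
stop_on_error = mirror.getboolean("stop-on-error")

print(directory, workers, stop_on_error)
```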
Alternative Download Source¶
It is possible to download metadata from one repository, but package release files from another:
[mirror]
directory = /srv/pypi
; Project and package metadata received from this repository
master = https://pypi.org
; Package distribution artifacts downloaded from here if possible
download-mirror = https://pypi-mirror.example.com/
This will download release files from https://pypi-mirror.example.com if possible and fall back to PyPI if a download fails. See download-mirror. Add download-mirror-no-fallback to download release files exclusively from download-mirror.
If you are using a download source that does not have HTTPS support, you can supply this configuration:
[mirror]
...
allow-non-https = true
...
Note: It is not recommended to use this option in production environments as it may expose you to security vulnerabilities. Always ensure your PyPI server is running over https:// in production.
Index Files Only¶
It is possible to mirror just index files without downloading any package release files:
[mirror]
directory = /srv/pypi-filtered
master = https://pypi.org
simple-format = ALL
release-files = false
root_uri = https://files.pythonhosted.org/
This will mirror index files for projects and versions allowed by your mirror filters, but will not download any package release files. File URLs in index files will use the configured root_uri. See release-files and root_uri.
Option Reference¶
directory¶
The directory where mirrored files are stored. This option is always required.
- Type: folder path
- Required: yes
The exact interpretation of this value depends on the configured storage backend. For the default filesystem backend, the directory used should meet the following requirements:
- The filesystem must be case-sensitive.
- The filesystem must support large numbers of sub-directories.
- The filesystem must support large numbers of files (inodes).
storage-backend¶
The storage backend used to save data and metadata when mirroring packages.
- Type: string
- Required: no
- Default: filesystem
See also
Available storage backends are documented at Storage options for bandersnatch.
simple-format¶
The Simple Repository API index file formats to generate.
- Type: one of HTML, JSON, or ALL
- Required: no
- Default: ALL
PEP 691 – JSON-based Simple API for Python Package Indexes extended the Simple Repository API to support both HTML and JSON. Bandersnatch generates project index files in both formats by default. Set this option to restrict index files to a single data format.
simple-format index files describes the generated folder structure and file names.
release-files¶
Mirror package release files. Release files are the uploaded sdist and wheel files for mirrored projects.
- Type: boolean
- Required: no
- Default: true
Disabling this will mirror repository index files and/or project metadata without downloading any associated package files. release-files folder structure describes the folder structure for mirrored package release files.
Note
If release-files = false, you should also specify the root_uri option.
json¶
Save copies of JSON project metadata downloaded from PyPI.
- Type: boolean
- Required: no
- Default: false
When enabled, this saves copies of all JSON project metadata downloaded from PyPI’s JSON API. These files are used by the bandersnatch verify subcommand.
json API metadata files describes the folder structure generated by this option. The format of the saved JSON is not standardized and is specific to Warehouse.
Note
This option does not affect the generation of simple repository API index files in JSON format (simple-format).
root_uri¶
A base URL to generate absolute URLs for package release files.
- Type: URL
- Required: no
- Default: dynamic; see description
Bandersnatch creates index files containing relative URLs by default. Setting this option generates index files with absolute URLs instead, using the specified string for the base URL.
If release-files is disabled and this option is unset, Bandersnatch uses a default value of https://files.pythonhosted.org/.
Note
This is generally not necessary, but was added for the official internal PyPI mirror, which requires serving packages from https://files.pythonhosted.org.
diff-file¶
File location to write a list of all new or changed files during a mirror operation.
- Type: file or folder path
- Required: no
- Default: none
If set, Bandersnatch creates a plain-text file at the specified location containing a list of all files created or updated during the last mirror/sync operation. The files are listed as absolute paths, one per line.
This is useful when mirroring to an offline network where it is required to only transfer new files to the downstream mirror. The diff file can be used to copy new files to an external drive, sync the list of files to an SSH destination such as a diode, or send the files through some other mechanism to an offline system.
If the specified path is a directory, Bandersnatch will use the file name “mirrored-files” within that directory.
The file will be overwritten on each mirror operation unless diff-append-epoch is enabled.
Example Usage¶
The diff file can be used with rsync for copying only new files:
rsync -av --files-from=/srv/pypi/mirrored-files / /mnt/usb/
It can also be used with 7zip to create split archives for transfers:
7za a -i@"/srv/pypi/mirrored-files" -spf -v100m path_to_new_zip.7z
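Where rsync or 7zip are not available, the same list can be consumed from a script. The following is a minimal sketch, assuming the diff file holds one absolute path per line as described above; the function name and its mirror_root and dest_root parameters are hypothetical and not part of Bandersnatch.

```python
import shutil
from pathlib import Path


def copy_listed_files(diff_file: Path, mirror_root: Path, dest_root: Path) -> int:
    """Copy every file named in diff_file below dest_root.

    Each listed path is re-created relative to mirror_root, so the
    destination keeps the same layout as the mirror directory.
    """
    copied = 0
    for line in diff_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        source = Path(line)
        # relative_to raises ValueError if a path lies outside mirror_root.
        target = dest_root / source.relative_to(mirror_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        copied += 1
    return copied
```

For example, `copy_listed_files(Path("/srv/pypi/mirrored-files"), Path("/srv/pypi"), Path("/mnt/usb"))` would copy the new files to a mounted drive, using the same paths as the rsync example.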
diff-append-epoch¶
Append the current epoch time to the file name for diff-file.
- Type: boolean
- Required: no
- Default: false
For example, the configuration:
[mirror]
; ...
diff-file = /srv/pypi/new-files
diff-append-epoch = true
Will generate diff files with names like /srv/pypi/new-files-1568129735. This can be used to track diffs over time by creating a new diff file each time Bandersnatch runs.
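The naming scheme can be illustrated with a short sketch. It only reproduces the pattern shown above (the configured path, a hyphen, then the epoch time in whole seconds); the function name and the now parameter are hypothetical, and this is not Bandersnatch's own code.

```python
import time
from typing import Optional


def diff_file_name(diff_file: str, append_epoch: bool, now: Optional[float] = None) -> str:
    """Return the diff file name, optionally suffixed with the epoch time."""
    if not append_epoch:
        return diff_file
    # Truncate to whole seconds, matching names like new-files-1568129735.
    epoch = int(time.time() if now is None else now)
    return f"{diff_file}-{epoch}"


print(diff_file_name("/srv/pypi/new-files", True, now=1568129735))
```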
hash-index¶
Group generated project index folders by the first letter of their normalized project name.
- Type: boolean
- Required: no
- Default: false
Enabling this changes the way generated index files are organized. Project folders are grouped into subfolders alphabetically as shown here: hash-index index files. This has the effect of splitting up a large /web/simple directory into smaller subfolders, each containing a subset of the index files. This can improve file system efficiency when mirroring a very large number of projects, but requires a web server capable of translating Simple Repository API URLs into file paths.
Warning
It is recommended to set this to false for full pip/pypi compatibility.
The path structure created by this option is incompatible with the Simple Repository API. Serving the generated web/simple/ folder directly will not work with pip. hash-index should only be used with a web server that can translate request URIs into alternative filesystem locations.
Requests for subfolders of /web/simple must be rewritten using the first letter of the requested project name:
Requested path:
/simple/someproject/index.html
Translated path:
/simple/s/someproject/index.html
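The translation can be expressed as a small function. This is a sketch of the mapping only, assuming the project name in the request is already normalized; the function name is hypothetical, and a real deployment would perform this step in the web server.

```python
def hashed_simple_path(request_path: str) -> str:
    """Insert the first letter of the project name after /simple/."""
    prefix = "/simple/"
    if not request_path.startswith(prefix):
        return request_path
    rest = request_path[len(prefix):]
    project, sep, remainder = rest.partition("/")
    if not project or not sep:
        # The root index (/simple/ or /simple/index.html) is not moved.
        return request_path
    return f"{prefix}{project[0]}/{project}/{remainder}"


print(hashed_simple_path("/simple/someproject/index.html"))
# -> /simple/s/someproject/index.html
```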
Example Apache RewriteRule Configuration¶
Configuration like the following is required to use the hash-index option with an Apache web server:
RewriteRule ^([^/])([^/]*)/$ /mirror/pypi/web/simple/$1/$1$2/
RewriteRule ^([^/])([^/]*)/([^/]+)$ /mirror/pypi/web/simple/$1/$1$2/$3
Example NGINX rewrite Configuration¶
Configuration like the following is required to use hash-index with an NGINX web server:
rewrite ^/simple/([^/])([^/]*)/$ /simple/$1/$1$2/ last;
rewrite ^/simple/([^/])([^/]*)/([^/]+)$ /simple/$1/$1$2/$3 last;
master¶
The URL of the Python package repository server to mirror.
- Type: URL
- Required: no
- Default: https://pypi.org
Bandersnatch requests metadata for projects and packages from this repository server, and downloads package release files from the URLs specified in the received metadata. The default value mirrors packages from PyPI.
The URL must use the https: protocol.
Note
The specified server must support PyPI’s JSON API for Bandersnatch to mirror any projects.
See also
Bandersnatch can download package release files from an alternative source by configuring a download-mirror.
proxy¶
Use an HTTP or SOCKS proxy server.
- Type: URL
- Required: no
- Default: none
The proxy server is used when sending requests to a repository server set by the master or download-mirror option. The URL scheme must be one of http, https, socks4, or socks5.
If this configuration option is not set, Bandersnatch uses the first URL found in the following environment variables, checked in order: SOCKS5_PROXY, SOCKS4_PROXY, SOCKS_PROXY, HTTPS_PROXY, HTTP_PROXY, ALL_PROXY.
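The lookup order can be sketched as follows. This is an illustration of the documented precedence rather than Bandersnatch's implementation; the function name is hypothetical, and only the upper-case variable names listed above are checked because the handling of other spellings is not stated here.

```python
import os
from typing import Mapping, Optional

# Environment variables in the documented order of precedence.
PROXY_ENV_VARS = (
    "SOCKS5_PROXY",
    "SOCKS4_PROXY",
    "SOCKS_PROXY",
    "HTTPS_PROXY",
    "HTTP_PROXY",
    "ALL_PROXY",
)


def proxy_from_environment(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty proxy URL found, or None."""
    for name in PROXY_ENV_VARS:
        value = environ.get(name, "").strip()
        if value:
            return value
    return None
```

Passing a plain dictionary instead of os.environ makes the precedence easy to check without changing the real environment.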
See also
HTTP proxies are supported through the aiohttp library. The aiohttp manual has more details on what connection types are supported: https://docs.aiohttp.org/en/stable/client_advanced.html#proxy-support
SOCKS proxies are supported through the aiohttp_socks library: aiohttp-socks.
timeout¶
The network request timeout to use for all connections, in seconds. This is the maximum allowed time for individual web requests.
- Type: number, in seconds
- Required: no
- Default: 10
Note
It is recommended to set this to a relatively low value, e.g. 10 - 30 seconds. Temporary problems then fail quickly and can be retried, instead of leaving a process hanging indefinitely while TCP takes a long time to recover.
global-timeout¶
The maximum runtime of individual aiohttp coroutines, in seconds.
- Type: number, in seconds
- Required: no
- Default: 1800
Note
It is recommended to set this to a relatively high value, e.g. 3,600 - 18,000 (1 - 5 hours). This supports coroutines mirroring large package files on slow connections.
download-mirror¶
Download package release files from an alternative repository server.
- Type: URL
- Required: no
- Default: none
By default, Bandersnatch downloads packages from the URL supplied in the master server’s JSON response. When this option is set to a repository URL, Bandersnatch tries to download release files from that repository first, and falls back to the URL supplied by the master server if that is unsuccessful (unable to get content or checksum mismatch).
This is useful to sync most of the files from an existing, nearby mirror - for example, when creating a new mirror identical to an existing one for the purpose of load sharing.
download-mirror-no-fallback¶
Disable the fallback behavior for download-mirror.
- Type: boolean
- Required: no
- Default: false
When set to true, Bandersnatch only downloads package distribution artifacts from the repository set in download-mirror and ignores file URLs received from the master server.
Warning
This could lead to more failures than expected and is not recommended for most scenarios.
cleanup¶
Enable cleanup of legacy simple directories with non-normalized names.
- Type: boolean
- Required: no
- Default: false
Bandersnatch versions prior to 4.0 used directories with non-normalized package names for compatibility with older versions of pip. Enabling this option checks for and removes these directories.
workers¶
The number of worker threads used for parallel downloads.
- Type: number, 1 ≤ N ≤ 10
- Required: no
- Default: 3
Use 1 - 3 workers to avoid overloading the PyPI master (and maybe your own internet connection). If you see timeouts and have a slow connection, try lowering this setting.
Official servers located in data centers could feasibly run up to 10 workers. Anything beyond 10 is considered unreasonable.
verifiers¶
The number of concurrent consumers used for verifying metadata.
- Type: number
- Required: no
- Default: 3
See also
This option is used by the bandersnatch verify subcommand.
stop-on-error¶
Stop mirror/sync operations immediately when an error occurs.
- Type: boolean
- Required: no
- Default: false
When disabled (stop-on-error = false), Bandersnatch continues syncing after an error occurs, but will mark the sync as unsuccessful. When enabled, Bandersnatch will stop all syncing as soon as possible if an error occurs. This can be helpful when debugging the cause of an unsuccessful sync.
compare-method¶
The method used to compare existing files with upstream files.
- Type: one of hash, stat
- Required: no
- Default: hash
- hash: compare by computing a checksum of the local file content. This is slower than stat, but more reliable. The hash algorithm is specified by digest_name.
- stat: compare by using file size and change time. This can reduce IO workload when frequently verifying a large number of files.
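The trade-off between the two methods can be seen in a short sketch. It illustrates the general technique only and is not Bandersnatch's implementation: the function names are hypothetical, and the stat variant assumes the file's modification time (st_mtime) is the timestamp being compared.

```python
import hashlib
import os


def file_digest(path: str, digest_name: str = "sha256", chunk_size: int = 64 * 1024) -> str:
    """Hash the whole file in chunks; cost grows with file size."""
    digest = hashlib.new(digest_name)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_by_hash(path: str, expected_hexdigest: str, digest_name: str = "sha256") -> bool:
    """Reliable but slow: every byte of the local file is read."""
    return file_digest(path, digest_name) == expected_hexdigest.lower()


def matches_by_stat(path: str, expected_size: int, expected_mtime: float) -> bool:
    """Cheap: only file metadata is read, never the content.

    Assumption: timestamps are compared at whole-second precision.
    """
    info = os.stat(path)
    return info.st_size == expected_size and int(info.st_mtime) == int(expected_mtime)
```

The stat check cannot detect content changes that leave size and timestamp untouched, which is why the hash method is described as more reliable.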
digest_name¶
The algorithm used to compute file hashes when compare-method is set to hash.
- Type: one of sha256, md5
- Required: no
- Default: sha256
keep_index_versions¶
Store previous versions of generated index files.
- Type: number
- Required: no
- Default: 0 (do not keep previous index versions)
This can be used as a safeguard against upstream changes generating blank index.html files.
By default or when set to 0, no prior versions are stored and index.html is the latest version.
When enabled by setting a value > 0, Bandersnatch stores the most recently generated versions of each index file, up to the configured number of versions. Prior versions are stored under versions/index_<serial>_<timestamp>.html and the current index.html is a symlink to the latest version.
log-config¶
Provide a custom logging configuration file.
- Type: file path
- Required: no
- Default: none
The file must be a Python logging.config module configuration file in INI format, as used with logging.config.fileConfig. The specified configuration replaces Bandersnatch’s default logging configuration.
See also
Refer to Configuration file format for the logging configuration file format.
Sample Alternative Logging Configuration¶
[loggers]
keys=root,file
[handlers]
keys=root,file
[formatters]
keys=common
[logger_root]
level=NOTSET
handlers=root
[logger_file]
level=INFO
handlers=file
propagate=1
qualname=bandersnatch
[formatter_common]
format=%(asctime)s %(name)-12s: %(levelname)s %(message)s
[handler_root]
class=StreamHandler
level=DEBUG
formatter=common
args=(sys.stdout,)
[handler_file]
class=handlers.TimedRotatingFileHandler
level=DEBUG
formatter=common
delay=False
args=('/repo/bandersnatch/banderlogfile.log', 'D', 1, 0)
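Before pointing log-config at a new file, it can be useful to confirm that the file parses. The following is a minimal sketch using the standard library's logging.config.fileConfig; the helper name is hypothetical, and the arguments Bandersnatch itself passes to fileConfig are not stated here.

```python
import logging
import logging.config


def load_logging_config(path: str) -> None:
    """Apply an INI-format logging configuration file.

    fileConfig raises an exception if the file is missing or a
    referenced section, handler, or formatter is invalid, so a clean
    return means the file is at least structurally usable.
    """
    # Assumption: keep loggers created before this call enabled.
    logging.config.fileConfig(path, disable_existing_loggers=False)
```

Note that a handler such as TimedRotatingFileHandler opens its log file when the configuration is loaded, so the target directory in args must exist and be writable.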
Folder Structures¶
simple-format index files¶
Folder structure of generated index files for simple-format:
<mirror directory>/
└── web/
    ├── packages/...
    └── simple/
        ├── index.html
        ├── index.v1_html
        ├── index.v1_json
        ├── someproject/
        │   ├── index.html
        │   ├── index.v1_html
        │   └── index.v1_json
        ├── anotherproject/
        │   ├── index.html
        │   ├── index.v1_html
        │   └── index.v1_json
        └── ...
This path structure is compatible with the Simple Repository API.
If simple-format is set to HTML, Bandersnatch will only create index.html and index.v1_html. If simple-format is set to JSON, it will only create index.v1_json.
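The relationship between the option value and the generated files can be summarized in a short lookup. This is only a restatement of the rules above as code, with hypothetical names; it assumes the value is written in upper case as in the examples.

```python
# Index files written per project folder for each simple-format value.
INDEX_FILES = {
    "HTML": ("index.html", "index.v1_html"),
    "JSON": ("index.v1_json",),
}


def index_file_names(simple_format: str) -> tuple:
    """Return the index file names generated for a simple-format value."""
    if simple_format == "ALL":
        return INDEX_FILES["HTML"] + INDEX_FILES["JSON"]
    if simple_format not in INDEX_FILES:
        raise ValueError(f"unsupported simple-format: {simple_format!r}")
    return INDEX_FILES[simple_format]


print(index_file_names("ALL"))
```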
release-files folder structure¶
Package release files are distributed into subdirectories based on their checksum:
<mirror directory>/
└── web/
    ├── packages/
    │   ├── 1a/
    │   │   └── 70/
    │   │       └── e63223f8116931d365993d4a6b7ef653a4d920b41d03de7c59499962821f/
    │   │           └── click-8.1.6-py3-none-any.whl
    │   ├── 8b/
    │   │   ├── 3a/
    │   │   │   └── b569b932cf737b525eb4c7a2b615ec07b102dff64f1d8a0fe52a48b911fc/
    │   │   │       └── diff-2023.12.5.tar.gz
    │   │   └── e2/
    │   │       └── 4823d9f02d2743a02e2c236f98b96b52f7a16b2bedc0e3148322dffbd06f/
    │   │           └── black-24.1.0-cp39-cp39-win_amd64.whl
    │   ├── 31/
    │   │   ├── 5f/
    │   │   │   └── ...
    │   │   └── 7a/
    │   │       └── ...
    │   └── ...
    └── simple/
        ├── click/
        ├── diff/
        ├── black/
        ├── ...
        └── index.html
By default, generated index files contain relative links into the web/packages/ directory.
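The split visible in the tree (two characters, two characters, then the remainder of the checksum) can be reproduced with a short sketch. The function name is hypothetical, and which checksum algorithm produces the digest is not stated in this section, so the sketch accepts any hexadecimal digest string.

```python
from pathlib import PurePosixPath


def release_file_path(hexdigest: str, filename: str) -> PurePosixPath:
    """Build the path of a release file relative to the mirror directory."""
    digest = hexdigest.lower()
    # <first 2 chars>/<next 2 chars>/<remaining chars>/<file name>
    return PurePosixPath("web/packages", digest[:2], digest[2:4], digest[4:], filename)


print(release_file_path(
    "1a70e63223f8116931d365993d4a6b7ef653a4d920b41d03de7c59499962821f",
    "click-8.1.6-py3-none-any.whl",
))
```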
json API metadata files¶
Folder structure of saved PyPI project metadata when json is enabled:
<mirror directory>/
└── web/
    ├── json/
    │   ├── someproject
    │   ├── anotherproject
    │   └── ...
    ├── pypi/
    │   ├── someproject/
    │   │   └── json
    │   ├── anotherproject/
    │   │   └── json
    │   └── ...
    ├── packages/
    │   └── ...
    └── simple/
        └── ...
The files web/json/someproject and web/pypi/someproject/json both contain the JSON metadata for a PyPI project with the normalized name “someproject”.
hash-index index files¶
When hash-index is enabled, project index folders are grouped by the first letter of their name - for example:
<mirror directory>/
└── web/
    └── simple/
        ├── b/
        │   ├── boto3/
        │   │   └── index.html
        │   └── botocore/
        │       └── index.html
        ├── c/
        │   ├── charset-normalizer/
        │   │   └── index.html
        │   ├── certifi/
        │   │   └── index.html
        │   └── cryptography/
        │       └── index.html
        ├── t/
        │   └── typing-extensions/
        │       └── index.html
        ├── ...
        └── index.html
The content of the index files themselves is unchanged.
Default Configuration File¶
Bandersnatch loads default values from a configuration file inside the package.
src/bandersnatch/defaults.conf¶
; [ Default Config Values ]
; Bandersnatch loads this file prior to loading the user config file.
; The values in this file serve as defaults and are overridden if also
; specified in a user config.
[mirror]
storage-backend = filesystem
master = https://pypi.org
proxy =
download-mirror =
download-mirror-no-fallback = false
allow-non-https = false
json = false
release-files = true
hash-index = false
simple-format = ALL
compare-method = hash
digest_name = sha256
keep-index-versions = 0
cleanup = false
stop-on-error = false
timeout = 10
global-timeout = 1800
workers = 3
verifiers = 3
; dynamic default: this URI used if `release-files = false`
; root_uri = https://files.pythonhosted.org
root_uri =
diff-file =
diff-append-epoch = false
log-config =
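The layering of defaults and user settings can be illustrated with Python's configparser, where values read later replace values read earlier. This is a sketch of the general pattern with shortened, inlined file contents; it is not taken from Bandersnatch's source.

```python
import configparser

# A shortened stand-in for the packaged defaults file.
DEFAULTS = """
[mirror]
master = https://pypi.org
workers = 3
timeout = 10
"""

# A user configuration that sets the required option and one override.
USER_CONFIG = """
[mirror]
directory = /srv/pypi
timeout = 15
"""

parser = configparser.ConfigParser()
parser.read_string(DEFAULTS)     # defaults are loaded first
parser.read_string(USER_CONFIG)  # user values override matching defaults
mirror = parser["mirror"]

print(mirror.get("master"), mirror.getint("workers"), mirror.getint("timeout"))
```

Options missing from the user file, such as master and workers here, keep their default values, while timeout takes the user's value.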
An annotated example configuration is also included. You can use this file as a reference or as the basis for your own configuration.
src/bandersnatch/example.conf¶
[mirror]
; The directory where the mirror data will be stored.
directory = /srv/pypi
; Save JSON metadata into the web tree:
; URL/pypi/PKG_NAME/json (Symlink) -> URL/json/PKG_NAME
json = false
; Save package release files
release-files = true
; Cleanup legacy non PEP 503 normalized named simple directories
cleanup = false
; The PyPI server which will be mirrored.
; master = https://test.pypi.org
; scheme for PyPI server MUST be https
master = https://pypi.org
; The network socket timeout to use for all connections. This is set to a
; somewhat aggressively low value: rather fail quickly temporarily and re-run
; the client soon instead of having a process hang infinitely and have TCP not
; catching up for ages.
timeout = 10
; The global-timeout sets aiohttp total timeout for its coroutines
; This is set incredibly high by default as aiohttp coroutines need to be
; equipped to handle mirroring large PyPI packages on slow connections.
global-timeout = 1800
; Number of worker threads to use for parallel downloads.
; Recommendations for worker thread setting:
; - leave the default of 3 to avoid overloading the pypi master
; - official servers located in data centers could run 10 workers
; - anything beyond 10 is probably unreasonable and avoided by bandersnatch
workers = 3
; Whether to hash package indexes
; Note that package index directory hashing is incompatible with pip, and so
; this should only be used in an environment where it is behind an application
; that can translate URIs to filesystem locations. For example, with the
; following Apache RewriteRule:
; RewriteRule ^([^/])([^/]*)/$ /mirror/pypi/web/simple/$1/$1$2/
; RewriteRule ^([^/])([^/]*)/([^/]+)$ /mirror/pypi/web/simple/$1/$1$2/$3
; OR
; following nginx rewrite rules:
; rewrite ^/simple/([^/])([^/]*)/$ /simple/$1/$1$2/ last;
; rewrite ^/simple/([^/])([^/]*)/([^/]+)$ /simple/$1/$1$2/$3 last;
; Setting this to true would put the package 'abc' index in simple/a/abc.
; Recommended setting: the default of false for full pip/pypi compatibility.
hash-index = false
; Format for simple API to be stored in
; Since PEP691 we have HTML and JSON
simple-format = ALL
; Whether to stop a sync quickly after an error is found or whether to continue
; syncing but not marking the sync as successful. Value should be "true" or
; "false".
stop-on-error = false
; The storage backend that will be used to save data and metadata while
; mirroring packages. By default, use the filesystem backend. Other options
; currently include: 'swift'
storage-backend = filesystem
; Advanced logging configuration. Uncomment and set to the location of a
; python logging format logging config file.
; log-config = /etc/bandersnatch-log.conf
; Generate index pages with absolute urls rather than relative links. This is
; generally not necessary, but was added for the official internal PyPI mirror,
; which requires serving packages from https://files.pythonhosted.org
; root_uri = https://example.com
; Number of consumers which verify metadata
verifiers = 3
; Number of prior simple index.html to store. Used as a safeguard against
; upstream changes generating blank index.html files. Prior versions are
; stored under as "versions/index_<serial>_<timestamp>.html" and the current
; index.html will be a symlink to the latest version.
; If set to 0 no prior versions are stored and index.html is the latest version.
; If unset defaults to 0.
; keep_index_versions = 0
; Configure an option to compare whether a file is identical. By default the
; "hash" method is used which reads local file content and computes hashes,
; which is slow but more reliable; when "stat" method is used, file size and
; change time are used to compare, which is useful to reduce IO workload when
; verifying a lot of files frequently.
; Possible values are: hash (default), stat
compare-method = hash
; Configure to download packages from an alternative mirror.
; By default bandersnatch downloads packages from the server in the "url"
; value of json response from master server. This option asks bandersnatch
; to try to download from the configured PyPI mirror first, and fallback to
; "url" value if it was not successful (unable to get content or checksum
; mismatch). It is useful to sync most of the files from an existing, nearby
; mirror, for example when setting up a new server sitting next to an existing
; one for the purpose of load sharing.
; Downloading only from the mirror site without fallback is also possible,
; but be aware this could lead to more failures than expected and is not
; recommended for most scenarios.
; download-mirror = https://pypi-mirror.example.com/
; download-mirror-no-fallback = False
; vim: set ft=cfg:
; Configure a file to write out the list of files downloaded during the mirror.
; This is useful for situations when mirroring to offline systems where a process
; is required to only sync new files to the downstream mirror.
; The file will be named as set in diff-file, and overwritten unless the
; diff-append-epoch setting is set to true. If this is true, the epoch date will
; be appended to the filename (i.e. /path/to/diff-1568129735)
; diff-file = /srv/pypi/mirrored-files
; diff-append-epoch = true