User:Bob/Data dumps

Wikimedia provides public dumps of our wikis' content:

  • for archival/backup purposes
  • for offline use
  • for academic research
  • for republishing (don't forget to follow the license terms)
  • for fun!

The timezone of the file dates is UTC.

The dump files can be downloaded from http://download.wikimedia.org/

Schedule

Starting January 23, 2006, dumps will be run approximately once a week. Because the whole process takes more than a week for all databases, they do not all become available at the same time.

Note that the larger databases such as enwiki, dewiki, and jawiki can take a long time to run, especially when compressing the full edit history. If you see it stuck on one of these for a few hours, or up to nine days, don't worry -- it's not dead, it's just a lot of data.

The download site at http://download.wikimedia.org/ shows the status of each dump: whether it is in progress, when it was last dumped, and so on.

What's available?

  • Page content
  • Page-to-page link lists (pagelinks, categorylinks, imagelinks tables)
  • Image metadata (image, oldimage tables)
  • Misc bits (interwiki, site_stats tables)

What's not available?

  • User data: passwords, e-mail addresses, preferences, watchlists, etc.
  • Deleted page content

At the moment uploaded files are dealt with separately and somewhat less regularly, but we intend to make upload dumps more regularly again in the future.

Format

The main page data is provided in the same XML wrapper format that Special:Export produces for individual pages. It's fairly self-explanatory to look at, but there is some documentation at Help:Export.
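Because the dumps are a single large XML stream, it usually makes sense to read them incrementally rather than load them whole. The following is a minimal Python sketch of that idea, not an official tool: the function names are hypothetical, the sample document (including its namespace URI) is hand-made for illustration, and the element names `page`, `title`, `revision` and `text` are assumed to match what Special:Export produces.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, [revision texts]) for each <page> in an export stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        texts = []
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                for part in child:
                    if local_name(part.tag) == "text":
                        texts.append(part.text or "")
        yield title, texts
        elem.clear()  # drop the finished page's subtree to limit memory use


# Hand-made sample in the general shape of an export file (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2006-01-23T00:00:00Z</timestamp>
      <text xml:space="preserve">Hello, world.</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Matching on the local tag name keeps the sketch independent of the exact export namespace version. Note that one whole page, with all of its revisions, is still held in memory at a time, which can matter for full-history dumps.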

Three sets of page data are produced for each dump, depending on what you need:

  • pages-articles.xml
    • Contains current version of all article pages, templates, and other pages
    • Excludes discussion pages ('Talk:') and user "home" pages ('User:')
    • Recommended for republishing of content.
  • pages-meta-current.xml
    • Contains current version of all pages, including discussion and user "home" pages.
  • pages-meta-history.xml
    • Contains complete text of every revision of every page (can be very large!)
    • Recommended for research and archives.
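To make the difference between the first two files concrete, here is a small Python sketch that approximates the pages-articles selection by title prefix. It is an assumption-laden illustration: it only knows the two prefixes named above ('Talk:' and 'User:'), whereas the real dump process may select pages by namespace and exclude other discussion namespaces as well. The names `EXCLUDED_PREFIXES` and `in_articles_subset` are hypothetical.

```python
# Prefixes named in the list above; a real filter may need more of them.
EXCLUDED_PREFIXES = ("Talk:", "User:")


def in_articles_subset(title):
    """Return True if a page title would plausibly appear in pages-articles.xml."""
    return not title.startswith(EXCLUDED_PREFIXES)


titles = ["Main Page", "Talk:Main Page", "User:Bob", "Template:Stub"]
kept = [t for t in titles if in_articles_subset(t)]
print(kept)
```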

The XML itself contains complete, raw text of every revision, so in particular the full history files can be extremely large; en.wikipedia.org would run upwards of six hundred gigabytes raw. Currently we are compressing these XML streams with bzip2 (.bz2 files) and additionally for the full history dump SevenZip (.7z files).

SevenZip's LZMA compression produces significantly smaller files for the full-history dumps, but doesn't do better than bzip2 for our other files.

Several of the tables are also dumped with mysqldump should anyone find them useful; the gzip-compressed SQL dumps (.sql.gz) can be read directly into a MySQL database but may be less convenient for other database formats.
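If you want to inspect one of the .sql.gz files without loading it into MySQL, it can be read as a compressed text stream. The Python sketch below counts INSERT statements per table; it assumes each statement starts at the beginning of a line with `INSERT INTO`, which is how mysqldump output is typically laid out. The function name and the demo data are made up for illustration.

```python
import gzip
import os
import re
import tempfile

INSERT_RE = re.compile(r"^INSERT INTO `?(\w+)`?")


def count_insert_statements(path):
    """Count INSERT statements per table in a gzip-compressed SQL dump."""
    counts = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = INSERT_RE.match(line)
            if match:
                table = match.group(1)
                counts[table] = counts.get(table, 0) + 1
    return counts


# Demo on a tiny hand-made file standing in for a real table dump.
demo_sql = (
    "CREATE TABLE `site_stats` (ss_row_id int);\n"
    "INSERT INTO `site_stats` VALUES (1);\n"
    "INSERT INTO `interwiki` VALUES ('en','http://en.wikipedia.org/wiki/$1');\n"
    "INSERT INTO `interwiki` VALUES ('fr','http://fr.wikipedia.org/wiki/$1');\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sql.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write(demo_sql)
    counts = count_insert_statements(demo_path)
print(counts)
```

A single INSERT statement may carry many rows, so this counts statements, not rows.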

What happened to the SQL dumps?

In mid-2005 we upgraded the Wikimedia sites to MediaWiki 1.5, which uses a very different database layout than earlier versions. SQL dumps of the 'cur' and 'old' tables are no longer available because those tables no longer exist.

We don't provide direct dumps of the new 'page', 'revision', and 'text' tables either because aggressive changes to the backend storage make this extra difficult: much data is in fact indirection pointing to another database cluster, and deleted pages which we cannot reproduce may still be present in the raw internal database blobs. The XML dump format provides forward and backward compatibility without requiring authors of third-party dump processing or statistics tools to reproduce our every internal hack. If required, you can use the mwdumper tool (see below) to produce SQL statements compatible with the version 1.4 schema from an XML dump.

Tools

Note:

The page import methods mentioned below don't automatically rebuild the auxiliary tables such as the links tables. The non-private auxiliary tables are provided as gzipped SQL dumps which can be imported directly into MySQL.

See also Meta's notes on rebuilding link tables.

importDump.php

MediaWiki 1.5 and above includes a command-line script 'importDump.php' which can be used to import an XML page dump into the database. This requires first configuring and installing MediaWiki. It's also relatively slow; to import a large Wikipedia data dump into a fresh database you should consider mwdumper, below.

As an example invocation, if you have an XML file called temp.xml:

 php maintenance\importDump.php < maintenance\temp.xml

mwdumper

mwdumper is a standalone program for filtering and converting XML dumps. It can produce output as another XML dump as well as SQL statements for inserting data directly into a database in MediaWiki's 1.4 or 1.5 schema.

Future versions of mwdumper will include support for creating a database and configuring a MediaWiki installation directly, but currently it just produces raw SQL which can be piped to MySQL. The program is written in Java and has been tested with Sun's 1.5 JRE and GNU's GCJ 4. Source is in our CVS; a precompiled .jar is available at http://download.wikimedia.org/tools/

Be sure to review the README.txt file, which is also provided; it explains the required invocation options. A friendlier wiki version of the README, with a few additional hints, is available at http://www.mediawiki.org/wiki/MWDumper

bzip2

For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux/Unix/Mac OS X systems these days. For Windows you may need to obtain it separately from the link below.


mwdumper can read the .bz2 files directly, but importDump.php requires piping, like so:

 bzip2 -dc pages_current.xml.bz2 | php importDump.php
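The same streaming approach works from a script, so the dump never has to be decompressed to disk first. This Python sketch counts pages in a .bz2 dump by reading the decompressed text line by line; the function name is hypothetical and the demo file is a tiny stand-in written on the spot, not a real dump.

```python
import bz2
import os
import tempfile


def count_pages(path):
    """Count <page> opening tags in a bzip2-compressed XML dump, streaming."""
    count = 0
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            # A literal "<page>" can only be markup; in page text the angle
            # brackets are escaped as &lt; and &gt;.
            count += line.count("<page>")
    return count


# Demo: write a small compressed stand-in file, then count its pages.
demo_xml = (
    "<mediawiki>\n"
    "  <page>\n    <title>A</title>\n  </page>\n"
    "  <page>\n    <title>B</title>\n  </page>\n"
    "</mediawiki>\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as out:
        out.write(demo_xml)
    page_count = count_pages(demo_path)
print(page_count)
```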

7-Zip

For the .7z files, you can use 7-Zip or p7zip to decompress. These are available as free software:

Something like

 7za e -so pages_current.xml.7z | php importDump.php

will expand the current pages and pipe them to the importDump.php PHP script.

Perl importing script

This is a script Tbsmith made to import only pages in certain categories. It works with MediaWiki 1.5.

The script
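As a rough illustration of category-based selection (a Python sketch of the general idea, not Tbsmith's Perl script), pages can be chosen by looking for category links in their wikitext. The regular expression and function names below are assumptions made for this example. The sketch only sees categories written directly in the page text with the English 'Category:' prefix, so categories added through templates or written with a localized prefix are missed.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def categories_of(wikitext):
    """Return the set of category names linked directly in the wikitext."""
    return {name.strip() for name in CATEGORY_LINK.findall(wikitext)}


def select_pages(pages, wanted):
    """Return titles of (title, text) pairs that belong to any wanted category."""
    wanted = set(wanted)
    return [title for title, text in pages if categories_of(text) & wanted]


pages = [
    ("Bzip2", "A compressor. [[Category:Data compression]]"),
    ("Paris", "A city. [[Category:Capitals in Europe|Paris]]"),
    ("Sandbox", "No categories here."),
]
selected = select_pages(pages, {"Data compression"})
print(selected)
```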

Producing your own dumps

MediaWiki 1.5 and above includes a command-line maintenance script, dumpBackup.php, which can be used to produce XML dumps directly, with or without page history. mwdumper can be used to make filtered dumps (like pages-articles.xml); this is also built into dumpBackup.php in the latest CVS.

The program which manages our multi-database dump process is available in our source repository, but likely would require customization for use outside Wikimedia's cluster setup.

Where to go for help

If you have trouble with the dump files, you can:

  • Ask in #wikimedia-tech on irc.freenode.net, although help is not always available
  • Ask on wikitech-l on http://mail.wikimedia.org/

Alternatively, if you have a specific bug to report:

French speakers can also see fr:Wikipédia:Requêtes XML

What about BitTorrent?

BitTorrent is not currently used to distribute Wikimedia dumps, at least not officially. Some torrents of dumps do exist, however. If you have started torrenting dumps, leave a note here.

  • Torrentspy search -- currently showing one wikipedia and one wikipedia-fr... 00:00, 2 June 2006 (UTC)

See also

  • xml2sql - a tool for converting an XML dump to an SQL dump.