For my lexicographic research, I chose to use Wiktionary because it’s one of the very few online dictionaries that allow you to extract information from their databases. Even better, the Wikimedia Foundation regularly makes dumps of each of its projects available to the public free of charge. These dumps are available as large XML files, so in order to use them, you’ll first need to import the data into an SQL database.
Getting everything up and running locally (or on a remote server), however, isn’t very straightforward. Here’s a step-by-step guide to getting Wiktionary installed and accessible locally for your own research purposes.
Important note: If you’re thinking you’ll just go ahead and extract the information you need from the XML file, I suggest you abandon that idea. Not only is the file difficult to manipulate, it’s HUGE. The French Wiktionary “pages” XML file, once decompressed, weighs in at 1.5 GB and is therefore much too large to simply open and work with.
***Parts of this tutorial are adapted from Dave Shaw’s instructions.***
What you’ll need
These are direct links to the files and software you’ll need to get Wiktionary installed on your own MySQL server.
- A dump of the Wiktionary database you’re interested in. This is an XML file.
- The MWDumper software as well as Java if it’s not already installed
- A Web server with MySQL installed (e.g. MAMP for Mac, WAMP for Windows, or the AMP supported by your Linux distro)
1. Download the Wiktionary XML dump
The first step is to find and download the Wiktionary dump for the language you’re interested in. The dump link above will give you a list of all the recent dumps by the Wikimedia Foundation. You’re going to need to look for the language code, followed by the string “wiktionary”, to identify the dump for your language. For instance, say you want to download the Spanish Wiktionary dump. The language code for Spanish is “es”, so you need to look for “eswiktionary”.
Once you’ve located and clicked the link to the dump for your language, you’ll be faced with numerous links to compressed files, such as pages-meta-history.xml.7z, pages-meta-history.xml.bz2, and page_restrictions.sql.gz. You’ll find information for each one of these dumps next to the links. The file you’re probably most interested in is called pages-articles.xml.bz2 as it contains all of the article content for the given dictionary. Download it and extract it to a folder on your server.
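As a rough sketch of the naming scheme described above, the following shell function builds a download address from a language code. The function name `dump_url` is made up for illustration, and the `https://dumps.wikimedia.org/<code>wiktionary/latest/` layout is an assumption about how the dump site is organized, so check the listing for your language before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the assumed URL of the latest pages-articles
# dump for a Wiktionary language code (e.g. "es", "fr").
dump_url() {
    lang_code="$1"
    project="${lang_code}wiktionary"
    printf 'https://dumps.wikimedia.org/%s/latest/%s-latest-pages-articles.xml.bz2\n' \
        "$project" "$project"
}

# Example usage (commented out so that nothing is downloaded by accident):
#   curl -O "$(dump_url es)"
#   bunzip2 -k eswiktionary-latest-pages-articles.xml.bz2   # -k keeps the archive
```

Printing the address first makes it easy to compare against the link shown on the dump page before starting a large download.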
2. Setting up the MySQL server
I’m going to assume you have your server up and running (if this is a local installation, just install whatever AMP you want to use). MySQL should already be installed and ready to use. Most installations come with phpMyAdmin and this is what we’ll be using to set up a database for Wiktionary.
First, log in to phpMyAdmin by going to your server’s address (e.g. http://localhost/phpmyadmin).
- Click on the Databases tab.
- Enter “wiktionary” at the bottom of the screen under Create new database.
- Select “utf8_bin” as your collation and click the Create button.
- You should now see “wiktionary” in the column on the left. Click on it.
- Click on the Privileges tab.
- Click on Add a new User.
- Type in a username (e.g. wiktionary) and a password in the appropriate fields. Make sure that Grant all privileges on database “wiktionary” is selected under Database for user.
- Click Go at the bottom of the screen.
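If you would rather not click through phpMyAdmin, the same database and user can probably be created from the command line. The sketch below only prints the SQL; the function name `setup_sql`, the `localhost` host part, and the example password are assumptions for illustration, not part of the original instructions.

```shell
#!/bin/sh
# Hypothetical helper: print SQL that creates a utf8_bin database and a
# user with all privileges on it. Arguments: database, user, password.
# Assumes the values contain no quotes or other special characters.
setup_sql() {
    db="$1"
    user="$2"
    pass="$3"
    cat <<EOF
CREATE DATABASE IF NOT EXISTS $db CHARACTER SET utf8 COLLATE utf8_bin;
CREATE USER '$user'@'localhost' IDENTIFIED BY '$pass';
GRANT ALL PRIVILEGES ON $db.* TO '$user'@'localhost';
EOF
}

# Example usage (commented out; needs an account allowed to create users):
#   setup_sql wiktionary wiktionary 'choose-a-password' | mysql -u root -p
```

Reviewing the printed statements before piping them into mysql is a cheap way to catch a mistyped database or user name.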
Now that the database and user are created, we need to create tables.
- Go to this link and copy all the text to your clipboard.
- In phpMyAdmin, still under your wiktionary database, click on the SQL tab.
- Paste the copied text into the text box and click Go.
The tables have been created, but before you import the data, you still need to tweak a few settings.
- Still under the wiktionary database, click on the page table in the left hand pane.
- Click on the Operations tab at the top.
- Under Table options, set the Collation to “utf8_bin” if it isn’t already.
- Click Go under Table Options.
- Repeat these steps for tables “revision” and “text”.
You may also choose to disable indexing on these tables to speed up the import process (otherwise, it can take a few hours). To do this, click on the Structure tab for each of the tables listed above (i.e. page, revision, and text) and at the bottom of the screen, under Indexes, delete the indexes by clicking the X next to each one. See section 4 below for how to add them again.
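The same clean-up can be sketched as SQL instead of clicks. The helper below (the name `drop_index_sql` is hypothetical) prints one DROP INDEX statement per index; the index names are the ones re-created in section 4 of this post, and your schema may contain others, so compare against the Structure tab before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print a DROP INDEX statement for each index name
# given after the table name. Nothing is executed against the database.
drop_index_sql() {
    table="$1"
    shift
    for index in "$@"; do
        printf 'DROP INDEX %s ON %s;\n' "$index" "$table"
    done
}

# Index names taken from the re-enabling section of this post.
drop_index_sql page name_title page_random page_len
drop_index_sql revision rev_page_id rev_timestamp page_timestamp user_timestamp usertext_timestamp
```

The printed statements can be pasted into the SQL tab in phpMyAdmin. Primary keys are deliberately left out of the list.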
3. Importing the Data
Now we’re ready to import the data. Download the MWDumper to the same folder as your Wiktionary download if you haven’t already. You may also choose to compile your own via the source. Make sure you have Java installed on your system (see link under “What you’ll need” above). Open a terminal window or command line and navigate to the folder containing the Wiktionary XML file and MWDumper, then run the following command:
java -jar mwdumper.jar --format=sql:1.5 "WIKTIONARY FILENAME" | "PATH TO MYSQL" -u wiktionary -p wiktionary --default-character-set=utf8
Replace “WIKTIONARY FILENAME” with the actual filename of the Wiktionary dump you downloaded and replace “PATH TO MYSQL” with the path to the mysql executable in your MySQL installation (e.g. if you’re using MAMP, /Applications/MAMP/Library/bin/mysql). -u wiktionary specifies that the username to use is “wiktionary,” so change that if you used something else. The second “wiktionary” in the command refers to the database name, so change that as well if you used something else. Press enter once everything is set and you should be prompted for your password. Once you’ve entered your password, hit enter again and off you go. It will most likely take some time to finish the import, but once it does, it will drop you back to the prompt. That’s it. You now have a local database copy of Wiktionary in whatever language you chose to install.
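Because a single wrong hyphen or path is enough to make the import fail, it can help to print the finished command and read it over before running it. This is only a sketch: the function name `import_command` and its argument order are invented for illustration, and the options are simply the ones used in this post.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the import pipeline so the hyphens
# and paths can be checked first.
# Arguments: dump file, path to the mysql executable, user name, database name.
import_command() {
    printf 'java -jar mwdumper.jar --format=sql:1.5 "%s" | "%s" -u %s -p %s --default-character-set=utf8\n' \
        "$1" "$2" "$3" "$4"
}

# Example usage:
#   import_command eswiktionary-latest-pages-articles.xml \
#       /Applications/MAMP/Library/bin/mysql wiktionary wiktionary
```

Both options should start with two plain hyphens; if the printed line looks right, copy it into the terminal and run it.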
***Additional instructions, including optional MySQL queries, can be found here. Thanks to Gyuri at ProofreadBot for the link.***
4. Re-enabling Indexes
If you removed the indexes on the tables prior to importing the data, you’ll want to re-enable them once everything has been imported. For each of the following tables, go to the corresponding SQL tab and paste and run the commands below.
For the page table:
CREATE UNIQUE INDEX /*i*/name_title ON /*_*/page (page_namespace,page_title);
CREATE INDEX /*i*/page_random ON /*_*/page (page_random);
CREATE INDEX /*i*/page_len ON /*_*/page (page_len);
For the revision table:
CREATE UNIQUE INDEX /*i*/rev_page_id ON /*_*/revision (rev_page, rev_id);
CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp);
CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp);
CREATE INDEX /*i*/user_timestamp ON /*_*/revision (rev_user,rev_timestamp);
CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp);
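The /*i*/ and /*_*/ markers in these statements appear to be prefix placeholders carried over from MediaWiki’s schema file; MySQL reads them as ordinary comments, so they are harmless. If you find them distracting, a small filter can strip them before you paste. This is a sketch, and the function name `strip_placeholders` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: remove the /*i*/ and /*_*/ comment markers from SQL
# read on standard input and write the cleaned statements to standard output.
strip_placeholders() {
    sed -e 's|/\*i\*/||g' -e 's|/\*_\*/||g'
}

# Example usage:
#   echo 'CREATE INDEX /*i*/page_len ON /*_*/page (page_len);' | strip_placeholders
```

The cleaned statements behave the same as the originals when no table prefix is in use.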
Could you give me a step-by-step process for compiling/running enwiktionary.xml in cmd?
Could you also give the step-by-step procedure for using the Wiktionary database on Android?
Hi Salman,
I’m not sure I understand either of your requests. This post has nothing to do with the Android platform and an XML file is simply structured data.
Hey Yves
Lucas again. The script works perfectly now; however, on trying to set up a local copy of Wiktionary I get this error: “The filename, directory name or volume label syntax is incorrect”. Can’t figure it out.
Here’s what I entered:
java -jar mwdumper.jar --format=sql:1.5 "wict.xml" | "C:\xampp\mysql" -u wiktionary -p wiktionary --default-character-set=utf8
I’m using XAMPP. Any ideas where I’m going wrong?
Lucas
Possibly: enter the entire path to your Wiktionary file (c:\…\wict.xml) and drop the quotation marks from the paths.
That got us a little further down the track, but now we run into another error:
java -jar mwdumper.jar --format=sql:1.5 (C:\Users\Lucas\wict.xml) | (C:\xampp\mysql) -u wiktionary -p wiktionary --default-character-set=utf8
It says “-u was unexpected at this time”. We played around with the brackets but couldn’t get it to work. Sorry to be a continuing bother, but do you have any idea what this error is about?
lucas
Path shouldn’t be enclosed in anything (at least, not on Mac, but I’m assuming it’s the same for Windows):
java -jar mwdumper.jar --format=sql:1.5 C:\Users\Lucas\wict.xml | C:\xampp\mysql -u wiktionary -p wiktionary --default-character-set=utf8
Note: if you copy-paste the above command, make sure you have double hyphens in front of format and default-character-set; WordPress tends to drop them.
Hi
the link that you give to copy the data to create the tables in
“Now that the database and user are created, we need to create tables.
Go to this link and copy all the text to your clipboard.
In phpMyAdmin, still under your wiktionary database, click on the SQL tab.
Paste the copied text into the text box and click Go.”
does not seem to work. Could you please paste the info directly in this page? thanks
Hi,
Sorry for the slow response.
I’ve updated the link to a local txt file with Wiktionary’s MySQL table structure. It may not be identical to the original structure, but it seems to work. Please note that the DB engine is set to InnoDB, but that it can easily be changed to something else by replacing the corresponding strings (e.g. MyISAM).
Regards,
Yves
GREAT THANKS
While re-enabling indexes I am getting an error: Duplicate key name: name_title.
Any suggestions?
Hi,
I was able to get the jar to run by compiling a new one from the source (mirrored) on GitHub at:
https://github.com/wikimedia/mediawiki-tools-mwdumper
I’m on Windows. I have the latest MySQL, a Wiktionary db, a wiktionary user with the same pw, etc… but don’t have phpMyAdmin since MySQL now has its own management GUI.
Short version: MWDumper runs all the way through ~5.7m pages… but it seems none of the rows are getting written into the db. But MWDumper isn’t throwing any errors, so I’m not sure how to proceed.
Any help appreciated. Happy to share the updated jar if it’s of interest to anyone.
Hi,
To be honest, I haven’t worked with MySQL and Wiktionary in quite some time. It doesn’t look like the mwdumper has a verbose mode (just Quiet to hide its default output), so I’m not sure how to get more info out of the tool.
If it’s an incompatibility issue, then I might suggest running it with a different schema (e.g. --format=mysql:1.4) or creating the required tables with Wikimedia’s official SQL script, then running MWDumper again.
Sorry I couldn’t be of more help.