For my lexicographic research, I chose to use Wiktionary because it’s one of the very few online dictionaries that allow you to extract information from their database. Better still, the Wikimedia Foundation regularly makes dumps of each of its projects available to the public free of charge. These dumps are available as large XML files, so in order to use them, you’ll first need to import the data into an SQL database.
Getting everything up and running locally (or on a remote server), however, isn’t very straightforward. Here’s a step-by-step guide to getting Wiktionary installed and accessible locally for your own research purposes.
Important note: If you’re thinking you’ll just go ahead and extract the information you need from the XML file, I suggest you abandon that idea. Not only is the file difficult to manipulate, it’s HUGE. The French Wiktionary “pages” XML file, once decompressed, weighs in at 1.5 GB and is therefore much too large to simply open and work with.
***Parts of this tutorial are adapted from Dave Shaw’s instructions.***
What you’ll need
These are direct links to the files and software you’ll need to get Wiktionary installed on your own MySQL server.
- A dump of the Wiktionary database you’re interested in. This is an XML file.
- The MWDumper software, as well as Java if it’s not already installed.
- A Web server with MySQL installed (e.g. MAMP for Mac, WAMP for Windows, or the AMP stack supported by your Linux distro).
1. Download the Wiktionary XML dump
The first step is to find and download the Wiktionary dump for the language you’re interested in. The dump link under “What you’ll need” above will give you a list of all the recent dumps published by the Wikimedia Foundation. You’re going to need to look for the language code, followed by the string “wiktionary”, to identify the dump for your language. For instance, say you want to download the Spanish Wiktionary dump. The language code for Spanish is “es”, so you need to look for “eswiktionary”.
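The naming rule above can be sketched in a few lines of Python. The helper names dump_name and latest_articles_url are made up for illustration, and the URL layout (a "latest" folder under each dump name) is an assumption about how the dump site is organized, so check it against the listing page before relying on it. Hyphenated language codes are deliberately rejected rather than guessed at.

```python
DUMPS_HOST = "https://dumps.wikimedia.org"  # assumed base URL of the dump site


def dump_name(lang_code: str) -> str:
    """Return the dump identifier for a language code, e.g. 'es' -> 'eswiktionary'."""
    code = lang_code.strip().lower()
    # Only plain ASCII letters/digits are handled in this sketch.
    if not (code.isascii() and code.isalnum()):
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"{code}wiktionary"


def latest_articles_url(lang_code: str) -> str:
    """Build the (assumed) URL of the most recent pages-articles dump."""
    name = dump_name(lang_code)
    return f"{DUMPS_HOST}/{name}/latest/{name}-latest-pages-articles.xml.bz2"


print(dump_name("es"))            # eswiktionary
print(latest_articles_url("fr"))
```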
Once you’ve located and clicked the link to the dump for your language, you’ll be faced with numerous links to compressed files, such as pages-meta-history.xml.7z, pages-meta-history.xml.bz2, and page_restrictions.sql.gz. You’ll find information for each one of these dumps next to the links. The file you’re probably most interested in is called pages-articles.xml.bz2 as it contains all of the article content for the given dictionary. Download it and extract it to a folder on your server.
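If you don’t have an archive tool handy, the extraction step can also be done with Python’s standard library. This is only a sketch: extract_bz2 is a name I made up, and it simply writes the decompressed file next to the archive. It streams the data in chunks, so the whole dump never has to fit in memory.

```python
import bz2
import shutil
from pathlib import Path


def extract_bz2(archive: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Decompress e.g. foo.xml.bz2 to foo.xml in the same folder; return the new path."""
    if archive.suffix != ".bz2":
        raise ValueError(f"not a .bz2 file: {archive}")
    target = archive.with_suffix("")  # drops the trailing ".bz2"
    # Copy chunk by chunk so a multi-gigabyte dump is never held in memory.
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return target


# Example call (the filename is a placeholder for whatever you downloaded):
# extract_bz2(Path("eswiktionary-latest-pages-articles.xml.bz2"))
```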
2. Setting up the MySQL server
I’m going to assume you have your server up and running (if this is a local installation, just install whatever AMP stack you want to use). MySQL should already be installed and ready to use. Most installations come with phpMyAdmin and this is what we’ll be using to set up a database for Wiktionary.
First, log in to phpMyAdmin by going to your server’s address (e.g. http://localhost/phpmyadmin).
- Click on the Databases tab.
- Enter “wiktionary” at the bottom of the screen under Create new database.
- Select “utf8_bin” as your collation and click the Create button.
- You should now see “wiktionary” in the column on the left. Click on it.
- Click on the Privileges tab.
- Click on Add a new User.
- Type in a username (e.g. wiktionary) and a password in the appropriate fields. Make sure that Grant all privileges on database “wiktionary” is selected under Database for user.
- Click Go at the bottom of the screen.
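If you’d rather not click through phpMyAdmin, the same setup can be expressed as three SQL statements. The sketch below just builds them as strings so you can paste them into the SQL tab or a MySQL prompt. The function name setup_statements, the 'localhost' host, and the CHANGE_ME password placeholder are my own assumptions, not something phpMyAdmin generates.

```python
import re

# Keep names to a conservative character set so they are safe to interpolate.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def setup_statements(database="wiktionary", user="wiktionary", host="localhost"):
    """Return SQL that creates the database and a user with full rights on it."""
    for name in (database, user, host):
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsupported name: {name!r}")
    return [
        f"CREATE DATABASE `{database}` CHARACTER SET utf8 COLLATE utf8_bin;",
        # CHANGE_ME is a placeholder: substitute your own password before running.
        f"CREATE USER '{user}'@'{host}' IDENTIFIED BY 'CHANGE_ME';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{user}'@'{host}';",
    ]


for statement in setup_statements():
    print(statement)
```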
Now that the database and user are created, we need to create tables.
- Go to this link and copy all the text to your clipboard.
- In phpMyAdmin, still under your wiktionary database, click on the SQL tab.
- Paste the copied text into the text box and click Go.
The tables have been created, but before you import the data, you still need to tweak a few settings.
- Still under the wiktionary database, click on the page table in the left hand pane.
- Click on the Operations tab at the top.
- Under Table options, set the Collation to “utf8_bin” if it isn’t already.
- Click Go under Table Options.
- Repeat these steps for tables “revision” and “text”.
You may also choose to remove the indexes on these tables to speed up the import process (otherwise, it can take a few hours). To do this, click on the Structure tab for each of the tables listed above (i.e. page, revision, and text) and at the bottom of the screen, under Indexes, delete the indexes by clicking the X next to each one. Leave the PRIMARY key in place, since step 4 only recreates the secondary indexes. See step 4 below for how to add them again.
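As an alternative to clicking each X, you can drop the indexes from the SQL tab. The sketch below generates the statements for the indexes that step 4 of this guide recreates. The text table is left out because step 4 lists no indexes for it. Your copy of the schema may name its indexes differently, so treat the list as an assumption and compare it with what the Structure tab shows.

```python
# Index names taken from the CREATE INDEX statements in step 4 of this guide.
INDEXES = {
    "page": ["name_title", "page_random", "page_len"],
    "revision": [
        "rev_page_id",
        "rev_timestamp",
        "page_timestamp",
        "user_timestamp",
        "usertext_timestamp",
    ],
}


def drop_index_statements(indexes=INDEXES):
    """Return one DROP INDEX statement per secondary index, table by table."""
    return [
        f"DROP INDEX {index} ON {table};"
        for table, names in indexes.items()
        for index in names
    ]


for statement in drop_index_statements():
    print(statement)
```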
3. Importing the Data
Now we’re ready to import the data. Download MWDumper to the same folder as your Wiktionary download if you haven’t already. You may also choose to compile your own from the source. Make sure you have Java installed on your system (see link under “What you’ll need” above). Open a terminal window or command line and navigate to the folder containing the Wiktionary XML file and MWDumper, then run the following command:
java -jar mwdumper.jar --format=sql:1.5 "WIKTIONARY FILENAME" | "PATH TO MYSQL" -u wiktionary -p wiktionary --default-character-set=utf8
Replace “WIKTIONARY FILENAME” with the actual filename of the Wiktionary dump you downloaded, and replace “PATH TO MYSQL” with the path to the mysql program in your MySQL installation (e.g. if you’re using MAMP, /Applications/MAMP/Library/bin/mysql). -u wiktionary specifies that the username to use is “wiktionary,” so change that if you used something else. The second “wiktionary” in the command refers to the database name, not the password (-p on its own just tells MySQL to prompt for one), so change that as well if you used something else. Press enter once everything is set and you should be prompted for your password. Once you’ve entered your password, hit enter again and off you go. It will most likely take some time to finish the import, but once it does, it will drop you back to the prompt. That’s it. You now have a local database copy of Wiktionary in whatever language you chose to install.
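If you plan to repeat the import for several languages, the same pipeline can be driven from Python. This is a rough sketch under a few assumptions: build_commands and run_import are names I made up, mwdumper.jar sits in the current folder, and the long options are written with two plain hyphens. The mysql client should still prompt for the password in your terminal, just as it does in the shell version.

```python
import subprocess


def build_commands(dump_file, mysql_path, user="wiktionary",
                   database="wiktionary", jar="mwdumper.jar"):
    """Return the two argument lists that make up the import pipeline."""
    dumper = ["java", "-jar", jar, "--format=sql:1.5", dump_file]
    # "-p" with no value attached makes mysql ask for the password;
    # the argument after it is the database name.
    mysql = [mysql_path, "-u", user, "-p", database,
             "--default-character-set=utf8"]
    return dumper, mysql


def run_import(dump_file, mysql_path, **options):
    """Pipe MWDumper's SQL output into mysql; return both exit codes."""
    dumper_cmd, mysql_cmd = build_commands(dump_file, mysql_path, **options)
    dumper = subprocess.Popen(dumper_cmd, stdout=subprocess.PIPE)
    mysql = subprocess.Popen(mysql_cmd, stdin=dumper.stdout)
    dumper.stdout.close()  # so MWDumper stops if mysql exits early
    mysql_status = mysql.wait()
    dumper_status = dumper.wait()
    return dumper_status, mysql_status


# Example call (both paths are placeholders for your own setup):
# run_import("eswiktionary-latest-pages-articles.xml",
#            "/Applications/MAMP/Library/bin/mysql")
```

Both exit codes should be 0 after a clean import; anything else means one side of the pipe failed and the database is probably incomplete.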
(Additional instructions, including optional MySQL queries, can be found here. Thanks to Gyuri at ProofreadBot for the link.)
4. Re-enabling Indexes
If you removed the indexes on the tables prior to importing the data, you’ll want to re-enable them once everything has been imported. For each of the following tables, go to the corresponding SQL tab and paste and run the commands below.
For the page table:
CREATE UNIQUE INDEX /*i*/name_title ON /*_*/page (page_namespace,page_title);
CREATE INDEX /*i*/page_random ON /*_*/page (page_random);
CREATE INDEX /*i*/page_len ON /*_*/page (page_len);
For the revision table:
CREATE UNIQUE INDEX /*i*/rev_page_id ON /*_*/revision (rev_page, rev_id);
CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp);
CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp);
CREATE INDEX /*i*/user_timestamp ON /*_*/revision (rev_user,rev_timestamp);
CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp);