Wiktionary now requires HTTPS connections

in Programming

Last Friday (June 12, 2015, to be precise), Wikimedia announced that all of its sites would be switching to secure connections by default. Consequently, any tool that uses one of Wikimedia’s APIs must connect through HTTPS; otherwise, it will fail to connect.
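In practice this means any hard-coded http:// endpoint needs to become https://. A minimal sketch (in Python for illustration; Wikparser itself is PHP, and the URL below is an example, not taken from Wikparser’s code):

```python
def force_https(url: str) -> str:
    """Rewrite an http:// URL to https://, leaving https URLs untouched."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url

# Example: an old-style Wiktionary API call upgraded to HTTPS.
api = force_https("http://en.wiktionary.org/w/api.php?action=parse&page=cat")
print(api)  # https://en.wiktionary.org/w/api.php?action=parse&page=cat
```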

The version of Wikparser currently available on GitHub has been updated with the fix (thanks to quuuit for pushing the update to the repo).

If you’re still having issues retrieving information with the parser, it might be due to an SSL verification failure. Disabling verification might fix the problem (see Vassilis’s comment on the Wikparser page), but there are other issues to be aware of if you choose to do so.
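To see what “disabling verification” amounts to, here’s a sketch in Python terms (Wikparser itself is PHP, where the equivalent is a stream-context or cURL option): an unverified SSL context skips certificate checks, which works around verification failures but leaves the connection open to man-in-the-middle attacks.

```python
import ssl

# An SSL context that performs no certificate or hostname checks.
insecure = ssl.create_default_context()
insecure.check_hostname = False
insecure.verify_mode = ssl.CERT_NONE

# urllib.request.urlopen(url, context=insecure) would then connect
# without verifying the server's certificate -- use with caution.
```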


Wikparser 0.3

in Programming

I never expected to revisit my Wiktionary parser, but I’ve received enough feedback from people to do just that.

Briefly, Wikparser is a small app written in PHP that can extract specific lexical information (POS, definitions, gender, synonyms, hypernyms) from either Wiktionary’s API or a local MySQL copy of its database.

Much of the code has been rewritten, cleaned up, and improved. This new version of Wikparser (0.3) is available on GitHub and is, of course, open to anyone interested in modifying it further. For information on how to use it and how to add language support, have a look at this page, which has also been updated. Most changes are within the code itself, but I have added new functionality to the software, most notably:

  • It should be a little bit faster. Not a whole lot faster, mind you, but I’ve cleaned up some of the functions and improved the MySQL querying.
  • It can now extract gender.
  • It now works natively with Spanish and German. For the time being, support for these languages is considered partial. If you’re a native speaker, don’t hesitate to contact me with any suggestions regarding improving support for these languages or any other languages.

The parser still only provides basic output (no XML or JSON yet), but despite its limited functionality, it seems to have proven useful to some.

Polylexical.com Updated

in Info, Lexicography, Programming

After many years of related research, I have finally updated Polylexical.com with additional data and search parameters. Most importantly, all NN and N à N compounds (729 and 319 items respectively) are tagged with information such as headedness and semantic relations. Details regarding the features and parameters contained in the database can be found by clicking on “Renseignements” at Polylexical.com.

My hope is that this database will prove useful to other researchers working on French compounds.

Wikparser now on GitHub

in Lexicography, Programming

I’ve migrated the Wikparser code to GitHub under a GPL licence. Any changes I make in the future will be pushed to the Wikparser GitHub repository.

The parser currently supports both English and French versions of Wiktionary, but additional languages may be added with minimal effort. Instructions on how to do so can be found here under Section 4. I will also be adding these instructions to the repository in the near future. If you happen to use the Wikparser to work in an unsupported language, please feel free to add the appropriate parameters and regex strings to the language.config.php file so that others may also benefit from this work.
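The per-language parameters are essentially regular expressions that locate section headers in the raw wikitext. As a rough illustration (the dictionary keys and patterns below are hypothetical sketches, not the actual contents of language.config.php), an English Wiktionary part-of-speech header and its French equivalent can be matched like this:

```python
import re

# Hypothetical per-language header patterns; the real ones live in
# Wikparser's language.config.php and may differ in detail.
pos_header = {
    "en": r"^===\s*(Noun|Verb|Adjective)\s*===\s*$",        # e.g. ===Noun===
    "fr": r"^===\s*\{\{S\|([^|}]+)\|fr\}\}\s*===\s*$",      # e.g. === {{S|nom|fr}} ===
}

match = re.match(pos_header["en"], "===Noun===")
print(match.group(1))  # Noun
```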

Wiktionary Parsers

in Lexicography, Programming

Although functionally limited, the small Wiktionary parser I developed a few years ago has served its purpose. It was meant to provide a fast and easy way to extract specific information for a given word, which would then be included in a separate database. Its output is crude; its abilities meagre. I may, in the future, work on expanding its functionality, but given that I have other projects to work on, I can’t be sure whether this will ever happen.

What follows are links to other similar resources for those of you who require more functionality from a Wiktionary parser. My thanks to Gyuri at ProofreadBot for many of these links.

I would also like to emphasize that all MediaWiki projects offer a very robust API with output in a number of different formats, though JSON seems to be the only future-proof structure.
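For instance, here is a sketch of building a MediaWiki API request that returns JSON (the action and parameters are standard MediaWiki API options; the sample response below is hand-written and truncated, not real API output):

```python
import json
from urllib.parse import urlencode

# Standard MediaWiki API parameters: parse a page and return its
# wikitext as JSON.
params = {
    "action": "parse",
    "page": "dictionary",
    "prop": "wikitext",
    "format": "json",
}
url = "https://en.wiktionary.org/w/api.php?" + urlencode(params)
print(url)

# A hand-written sample of the response shape, decoded with json.loads:
sample = '{"parse": {"title": "dictionary", "pageid": 1261}}'
data = json.loads(sample)
print(data["parse"]["title"])  # dictionary
```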

Wiktionary Parser Update

in Lexicography, Programming

The Wiktionary parser (Wikparser), a small tool able to extract specific information from Wiktionary entries, has been updated to version 0.2c. It can now be used to extract hypernyms.

Unfortunately, hypernyms are under-represented in Wiktionary. If you’re working on English, you’d be better served by WordNet.

Polylexical.com – A Database of French Compounds

in Lexicography

My doctoral work is primarily on the semantics of compounding in French. In order to conduct my research, I needed access to a repository of compounds that allowed for specific types of search queries (e.g. all plural N-N compounds), but none met my needs. Mathieu-Colas’s database of French compounds, probably the most well-known such online resource, only allows for the most basic of queries; the MorboComp project currently in development at the University of Bologna remains closed to the public. I therefore decided to develop my own database of compounds and to make it available to everyone at Polylexical.com.

The database contains over 10,000 French nominal compounds retrieved from the French version of Wiktionary. In its current form, Polylexical.com only contains basic information for each entry, such as gender and number, as well as the parts of speech of its constituents. As I continue my doctoral work, additional information will be added to the database, such as the “centricity” of the compounds and what I’m calling the Semantic Reliability Index, a numerical rating based on the recurring patterns in the data. I’m sure that even in its current state, however limited it may be, this resource will prove useful to other researchers.

Feedback is always welcome. You may use the form available on the site or e-mail me at yves.bourque(a)mail.utoronto.ca.

Wikparser – A Wiktionary Text Parser

in Lexicography, Programming

Perhaps you need a quick and easy way to extract specific information from a Wiktionary entry, such as a word’s part of speech or definitions. While Wikimedia’s API allows you to retrieve text-only entries for a given word, it doesn’t allow you to target a specific subsection of those entries. This is where Wikparser comes in.

The current version of Wikparser (0.2), written in PHP, allows you to extract a word’s parts of speech, definitions, and synonyms. It currently supports French and English natively, but was written so as to allow for support for other languages to be easily added.
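The basic idea can be sketched in a few lines (shown here in Python rather than Wikparser’s PHP, and using toy wikitext written for illustration, not real Wiktionary output): given a raw entry, isolate the subsection you care about with a regular expression.

```python
import re

# Toy wikitext, hand-written for illustration.
wikitext = """===Noun===
# A reference work listing words.

====Synonyms====
* [[wordbook]]
* [[lexicon]]

====Translations====
"""

# Capture everything between the Synonyms header and the next header
# (or end of string), then pull out the linked words.
m = re.search(r"====Synonyms====\n(.*?)(?=\n====|\Z)", wikitext, re.S)
synonyms = re.findall(r"\[\[(.*?)\]\]", m.group(1))
print(synonyms)  # ['wordbook', 'lexicon']
```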

Click here for more information, instructions, as well as a link to download Wikparser.


Installing a local copy of Wiktionary (MySQL)

in Lexicography

For my lexicographic research, I chose to use Wiktionary because it’s one of the very few online dictionaries that allows you to extract information from its database. Even better, however, is that the Wikimedia Foundation regularly makes dumps of each of its projects available to the public free of charge. These dumps are available as large XML files, so in order to use them, you’ll first need to import the data into an SQL database.

Getting everything up and running locally (or on a remote server), however, isn’t very straightforward. Here’s a step-by-step guide to getting Wiktionary installed and accessible locally for your own research purposes.

Important note: If you’re thinking you’ll just go ahead and extract the information you need from the XML file, I suggest you abandon that idea. Not only is the file difficult to manipulate, it’s HUGE. The French Wiktionary “pages” XML file, once decompressed, weighs in at 1.5 GB and is therefore much too large to simply open and work with.
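(If you ever do need to peek at the raw XML without a database, the only workable approach is to stream it rather than load it whole. A minimal sketch, noting that real dumps wrap every tag in an XML namespace such as {http://www.mediawiki.org/xml/export-0.10/}, which is omitted in this toy sample:)

```python
import xml.etree.ElementTree as ET
from io import StringIO

# A tiny MediaWiki-style sample; a real dump would be a multi-gigabyte
# file object opened with open(path).
sample = StringIO("""<mediawiki>
  <page><title>chat</title></page>
  <page><title>chien</title></page>
</mediawiki>""")

titles = []
# iterparse yields elements as they are completed, so the whole file
# never sits in memory at once.
for event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the element once processed

print(titles)  # ['chat', 'chien']
```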

Parts of this tutorial are adapted from Dave Shaw’s instructions.

What you’ll need

These are direct links to the files and software you’ll need to get Wiktionary installed on your own MySQL server.

  1. A dump of the Wiktionary database you’re interested in. This is an XML file.
  2. The MWDumper software, as well as Java if it’s not already installed.
  3. A Web server with MySQL installed (e.g. MAMP for Mac, WAMP for Windows, or the AMP stack supported by your Linux distro).

1. Download the Wiktionary XML dump

The first step is to find and download the Wiktionary dump for the language you’re interested in. The previous link will give you a list of all the recent dumps by the Wikimedia Foundation. Look for the language code followed by the string “wiktionary” to identify the dump for your language. For instance, say you want to download the Spanish Wiktionary dump. The language code for Spanish is “es”, so you need to look for “eswiktionary”.

Once you’ve located and clicked the link to the dump for your language, you’ll be faced with numerous links to compressed files, such as pages-meta-history.xml.7z, pages-meta-history.xml.bz2, and page_restrictions.sql.gz. You’ll find information for each of these dumps next to the links. The file you’re probably most interested in is called pages-articles.xml.bz2, as it contains all of the article content for the given dictionary. Download it and extract it to a folder on your server.

2. Setting up the MySQL server

I’m going to assume you have your server up and running (if this is a local installation, just install whatever AMP stack you want to use). MySQL should already be installed and ready to use. Most installations come with phpMyAdmin, and this is what we’ll be using to set up a database for Wiktionary.

First, log in to phpMyAdmin by going to your server’s address (e.g. http://localhost/phpmyadmin).

  1. Click on the Databases tab.
  2. Enter “wiktionary” at the bottom of the screen under Create new database.
  3. Select “utf8_bin” as your collation and click the Create button.
  4. You should now see “wiktionary” in the column on the left. Click on it.
  5. Click on the Privileges tab.
  6. Click on Add a new User.
  7. Type in a username (e.g. wiktionary) and a password in the appropriate fields. Make sure that Grant all privileges on database “wiktionary” is selected under Database for user.
  8. Click Go at the bottom of the screen.

Now that the database and user are created, we need to create tables.

  1. Go to this link and copy all the text to your clipboard.
  2. In phpMyAdmin, still under your wiktionary database, click on the SQL tab.
  3. Paste the copied text into the text box and click Go.

The tables have been created, but before you import the data, you still need to tweak a few settings.

  1. Still under the wiktionary database, click on the page table in the left hand pane.
  2. Click on the Operations tab at the top.
  3. Under Table options, set the Collation to “utf8_bin” if it isn’t already.
  4. Click Go under Table Options.
  5. Repeat these steps for tables “revision” and “text”.

You may also choose to disable indexing on these tables to speed up the import process (otherwise, it can take a few hours). To do this, click on the Structure tab for each of the tables listed above (i.e. page, revision, and text) and, at the bottom of the screen under Indexes, delete the indexes by clicking the X next to each one. See below on how to add them again.

3. Importing the Data

Now we’re ready to import the data. Download the MWDumper to the same folder as your Wiktionary download if you haven’t already. You may also choose to compile your own via the source. Make sure you have Java installed on your system (see link under “What you’ll need” above). Open a terminal window or command line and navigate to the folder containing the Wiktionary XML file and MWDumper, then run the following command:

java -jar mwdumper.jar --format=sql:1.5 "WIKTIONARY FILENAME" | "PATH TO MYSQL" -u wiktionary -p wiktionary --default-character-set=utf8

Replace “WIKTIONARY FILENAME” with the actual filename of the Wiktionary dump you downloaded and replace “PATH TO MYSQL” with the path to your MySQL installation (e.g. if you’re using MAMP, /Applications/MAMP/Library/bin/mysql). -u wiktionary specifies that the username to use is “wiktionary,” so change that if you used something else. The second “wiktionary” in the command refers to the database name, so change that as well if you used something else. Press Enter once everything is set and you should be prompted for your password. Once you’ve entered your password, hit Enter again and off you go. The import will most likely take some time, but once it finishes, it will drop you back to the prompt. That’s it. You now have a local database copy of Wiktionary in whatever language you chose to install.

(Additional instructions, including optional MySQL queries, can be found here. Thanks to Gyuri at ProofreadBot for the link.)

4. Re-enabling Indexes

If you removed the indexes on the tables prior to importing the data, you’ll want to re-enable them once everything has been imported. For each of the following tables, go to the corresponding SQL tab and paste and run the commands below.

For the page table:

CREATE UNIQUE INDEX /*i*/name_title ON /*_*/page (page_namespace,page_title);
CREATE INDEX /*i*/page_random ON /*_*/page (page_random);
CREATE INDEX /*i*/page_len ON /*_*/page (page_len);

For the revision table:

CREATE UNIQUE INDEX /*i*/rev_page_id ON /*_*/revision (rev_page, rev_id);
CREATE INDEX /*i*/rev_timestamp ON /*_*/revision (rev_timestamp);
CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp);
CREATE INDEX /*i*/user_timestamp ON /*_*/revision (rev_user,rev_timestamp);
CREATE INDEX /*i*/usertext_timestamp ON /*_*/revision (rev_user_text,rev_timestamp);

Matching UTF-8 encoded characters with regular expressions

in Programming

Working with accented characters (or any Unicode, non-Latin character, for that matter) often poses problems when trying to match them using regular expression functions such as preg_match or preg_replace in PHP.

The \w expression is meant to match any word character, but it won’t match é or ï in a Unicode (UTF-8) encoded string. For example, let’s take the following Wiktionary definition in raw form (succès):

# {{absolument}} [[réussite|Réussite]] de ce que l’on [[espérait]] et la [[gloire]] qui s'[[ensuit]].

Say we want to strip away the first half of those double bracketed words (i.e. “[[réussite|”).

$string = "# {{absolument}} [[réussite|Réussite]] de ce que l'on [[espérait]] et la [[gloire]] qui s'[[ensuit]].";

$pattern = "/\[\[[\s\w]*?\|/";

$results = preg_replace($pattern, "", $string);

Unfortunately, the pattern won’t match, despite the fact that it’s technically correct (“match any double bracket followed by a space or word character until you find a pipe character”). The problem is the accented character. One solution is to use the u modifier. This tells the preg function that the pattern should be treated as UTF-8. The trick, however, is to remember that it goes at the end of the pattern, after the trailing slash (or parenthesis) and not inside the pattern following the characters you’re looking to match. Though it looks and works like a switch, it’s not actually preceded by any such identifier.

$pattern = "/\[\[[\s\w]*?\|/u";

// or

$pattern = "(\[\[[\s\w]*?\|)u";

Either form of the pattern should now match any accented characters it encounters, provided that everything is truly encoded in UTF-8, such as the data stored in Wikipedia and Wiktionary.
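For comparison, here is the same substitution sketched in Python 3, where the re module treats string patterns as Unicode-aware by default, so no modifier is needed:

```python
import re

string = ("# {{absolument}} [[réussite|Réussite]] de ce que l'on "
          "[[espérait]] et la [[gloire]] qui s'[[ensuit]].")

# In Python 3, \w matches accented characters out of the box, so
# "[[réussite|" is stripped without any special flag.
result = re.sub(r"\[\[[\s\w]*?\|", "", string)
print(result)
```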