Wiktionary Text Parser

*** Update 1. Wikparser will now extract synonyms from entries. ***
*** Update 2. Wikparser will now extract hypernyms from entries. ***
*** Update 3. Wikparser is now available on GitHub. ***
*** Update 4. Wikparser rewritten. New languages and functionality added. ***

Not many web-based dictionaries lend themselves to secondary research projects. English has WordNet, but if you’re interested in working with other languages, your options are limited.

Because I work primarily with the French lexicon (mostly on compounds and collocations), I needed access to a lexicographic repository that would allow me to extract and manipulate specific information, such as lexical categories, definitions, synonyms, etc. Although Wiktionary (a Wikimedia project) is not a perfect lexicographic resource, it is available in many different languages and, more importantly, it offers an API that allows for direct connections to its database. The results, however, cannot be automatically parsed. For instance, you can’t tell the API that all you want are definitions. It is only able to return the raw data for the entire entry. That’s why I wrote Wikparser.

What is it?

This small tool extracts specific information from a Wiktionary entry. It is intended for research purposes only: the output is plain text, meant for storage or further processing. It’s not meant to be pretty.

What does it do?

The current version is able to extract lexical categories (POS, for part of speech), definitions, synonyms, hypernyms, and gender.

What languages are supported?

Currently, English and French are fully supported. Spanish and German parsing is also supported, but functionality is slightly more limited. You can add support for other languages by following the instructions at the bottom of this page.

How do I use it?

You have two options:

  1. You may use the parser hosted on this site.
  2. You may download the software (PHP) and run it on your own web server.

Option 2 is recommended, but let me start by explaining how it works. Please keep in mind that I’m calling this version a 0.2 release.

1. Instructions

The script is called by pointing your browser or other software (e.g. Google Refine) to URL/wikparser.php. The following parameters must be submitted as GET values (* indicates obligatory parameters):

  • *word: any string
  • *query: pos for parts of speech; def for definitions; syn for synonyms; hyper for hypernyms
  • lang: Use whatever language code Wiktionary uses. Script currently supports English (en), French (fr), Spanish (es), and German (de) natively [default: en]. To add support for other languages, see below.
  • count: number of items to return [default: 100]
  • source: local if you are running a MySQL copy of Wiktionary locally; api if you wish to use Wiktionary’s API [default: api]
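As a rough sketch of how the parameters above fit together, here is a small Python helper that assembles a request URL. The base URL is a placeholder (it depends on where the script is installed) and the function name build_query_url is made up for illustration; neither is part of Wikparser.

```python
from urllib.parse import urlencode

# Placeholder (assumption): replace with the location of your own installation.
BASE_URL = "http://localhost/wikparser/wikparser.php"

# Query types listed in the parameter description above.
VALID_QUERIES = {"pos", "def", "syn", "hyper"}


def build_query_url(word, query, lang="en", count=100, source="api", base_url=BASE_URL):
    """Assemble a request URL from the GET parameters described above."""
    if not word:
        raise ValueError("word is obligatory")
    if query not in VALID_QUERIES:
        raise ValueError(f"query must be one of {sorted(VALID_QUERIES)}")
    params = {"word": word, "query": query, "lang": lang, "count": count, "source": source}
    # urlencode percent-encodes non-ASCII words (e.g. accented French entries).
    return f"{base_url}?{urlencode(params)}"


print(build_query_url("table", "def"))
print(build_query_url("chien", "syn", lang="fr", count=5))
```

The optional parameters are spelled out explicitly here even when they match the defaults, which makes the resulting URLs easier to read in logs.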

Here are a few examples using the script hosted on this server:

Both of the queries above return results by using Wiktionary’s API by default.

If you opt for option 1 (i.e. using the software hosted here, as in the above examples), keep the following in mind: it’s slow. Because the Wiktionary API only allows a shorter delay between connections for registered bot accounts, requests will eventually time out if you run the script too quickly. According to Wiktionary’s documentation, an appropriate delay between calls is roughly 5 seconds (lower if you’ve registered a bot account). In order to not have my IP banned, I have set the delay to exactly 5 seconds. If you attempt to reconnect any faster than that, you will receive an error message.

Clearly, this is not ideal if you’re looking to gather a lot of data. For example, for every 1000 words requested, the script will take nearly 90 minutes to complete. Of course, if you want to use your own machine to connect to the Wiktionary API, you might get away with a lower delay, thus increasing the speed of your data collection.
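If you script your requests, one way to respect the delay is to space them out on the client side. The following Python generator is only a sketch of that idea: the name throttled and its parameters are my own, and the clock and sleep arguments exist purely so the behaviour can be checked without actually waiting.

```python
import time


def throttled(items, delay=5.0, clock=time.monotonic, sleep=time.sleep):
    """Yield items one at a time, keeping successive yields at least `delay` seconds apart."""
    last = None
    for item in items:
        if last is not None:
            # Only sleep for whatever part of the delay has not already elapsed.
            wait = delay - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


# Typical use (each iteration would send one request to the parser):
#
#     for word in throttled(["table", "chair", "lamp"], delay=5.0):
#         send_request(word)   # send_request is a hypothetical function of yours
```

Because the wait is computed from the time of the previous request, time spent processing a response counts toward the 5 seconds rather than being added on top of it.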

2. Using the parser locally

Requirements:

  1. Apache or some other web server platform
  2. PHP 5 (script will not work with PHP 4)
  3. cURL (uncomment “extension=php_curl.dll” in your php.ini file if it isn’t already enabled)
  4. Optional: if you wish to increase the speed of your queries, you’ll want to have a Wiktionary dump running on MySQL locally. Some changes must be made to certain files. Instructions on how to do this are below.

Steps:

  1. Download Wikparser 0.3 from GitHub (Updated 22 Oct. 2014)
  2. Extract files into a directory on your web server.
  3. Point your browser to the file and set parameters via the URL

If you’re only interested in extracting data for English or French words (and to some extent Spanish and German), that’s all there is to it. Feel free to set the delay between requests to whatever you’d like, but know that the Wiktionary API will eventually return errors if the delay is too short.

3. Downloading and installing a local copy of Wiktionary

This requires a bit of work. I might make a post in the future detailing the process (Done), but in the meantime, you can follow this tutorial by Dave Shaw. If you’re not interested in an English copy, you can download the dump for your language via MediaWiki’s backup index. Once you’ve created the tables, I suggest you set the collation to ‘utf8_bin.’

Important: When using the mwdumper.jar tool to import the XML file into your MySQL database, you should add “--default-character-set=utf8” at the end of the command if you’re using any language other than English. Click here for more info on using the mwdumper tool. Please note that importing the Wiktionary XML into a MySQL database takes some time (a few hours). You can speed up the import dramatically by removing all indexes from the tables after you’ve created the database using the appropriate SQL queries. Simply follow the tutorial linked above, but before running the mwdumper, go into the database and remove all indexes from the page, revision, and text tables. Once you’re done importing, you can re-create the indexes by running the corresponding SQL index queries for each table, copied from the original script.

If you’ve succeeded in installing a local copy of the Wiktionary database, you now need to edit the conc.php file in the classes folder according to the database information you set up (the four variables that need to be modified are labelled in the file). You can then use wikparser.php as you would otherwise, except that now you must set the source parameter to ‘local’ (e.g. …/wikparser.php?word=big&query=pos&source=local).

If installing Wiktionary locally doesn’t seem worth the trouble, well, trust me, it is. You can easily query 10 words per second, if not more. This means that extracting data for 10,000 words takes about 17 minutes. If you respect the 5-second wait between requests to the Wiktionary API, however, this same task would take nearly 14 hours.
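For anyone who wants to redo the arithmetic with their own numbers, here is the calculation as a few lines of Python. The two constants come straight from the figures quoted in this post; the function names are only for illustration, and the API estimate counts waiting time only, not the time each request itself takes.

```python
API_DELAY_SECONDS = 5.0        # wait between Wiktionary API requests (from the text)
LOCAL_WORDS_PER_SECOND = 10.0  # query rate against a local database (from the text)


def api_duration_seconds(n_words, delay=API_DELAY_SECONDS):
    """Minimum time spent waiting when each request is followed by a fixed delay."""
    return n_words * delay


def local_duration_seconds(n_words, rate=LOCAL_WORDS_PER_SECOND):
    """Time to process n_words at a steady local query rate."""
    return n_words / rate


print(f"API, 1,000 words:    {api_duration_seconds(1_000) / 60:.1f} minutes")    # 83.3
print(f"API, 10,000 words:   {api_duration_seconds(10_000) / 3600:.1f} hours")   # 13.9
print(f"Local, 10,000 words: {local_duration_seconds(10_000) / 60:.1f} minutes")  # 16.7
```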

4. Adding support for other languages

In order to add support for other languages, you must first determine the language code used by Wiktionary. It’s usually the standard two-letter code, but you can always check by going to wiktionary.org and selecting the language you’re interested in. Then look at the first few letters of the URL:

http://tr.wiktionary.org/ : tr for Turkish
http://vi.wiktionary.org/ : vi for Vietnamese
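If you have a list of Wiktionary URLs and want to pull the codes out automatically, something like the following Python sketch would do. The function name wiktionary_lang_code is mine; it simply assumes the code is the first label of a &lt;code&gt;.wiktionary.org hostname, as in the two examples above.

```python
from urllib.parse import urlparse


def wiktionary_lang_code(url):
    """Return the language code from a URL of the form http(s)://<code>.wiktionary.org/..."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Expect exactly <code>.wiktionary.org; "www" is the portal, not a language.
    if len(parts) == 3 and parts[1:] == ["wiktionary", "org"] and parts[0] != "www":
        return parts[0]
    raise ValueError(f"not a language-specific Wiktionary URL: {url!r}")


print(wiktionary_lang_code("http://tr.wiktionary.org/"))  # tr
print(wiktionary_lang_code("http://vi.wiktionary.org/"))  # vi
```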

Now open the language.config.php file in the root of the Wiktionary Parser. You’ll see a PHP switch. You must add a new case (or modify one of the ones included if you don’t care about keeping English or French functionality) for the language you want to work with. You’ll see the following:

case "INSERT LANGUAGE CODE HERE":
 $langParameters = array(
  "langCode" => "",
  "langHeader" => "",
  "langSeparator" => "",
  "defHeader" => "",
  "defTag" => "",
  "synHeader" => "",
  "hyperHeader" => "",
  "genderPattern" => "",
  "posMatchType" => "",
  "posPattern" => "",
  "posArray" => "",
  "posExtraString" => "",
 );

For instance, if you’re working with Turkish, you would insert tr between the case quotes. As for the rest, you’ll need to actually have a look at the output generated by the Wiktionary API (the output is identical for a local copy of the database). You’ll need to call the API with a word and look at the output to identify each one of the parameters above. Here’s an example of the output for the word abuelo via the Wiktionary API using the Spanish language code (es):

http://es.wiktionary.org/w/api.php?action=parse&prop=wikitext&page=abuelo&format=xmlfm
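When you need to look at many entries, it can be easier to fetch the raw wikitext from a script than from the browser. Below is a minimal Python sketch of the same API call. It is not how Wikparser does it (that is PHP and cURL); it swaps in format=json for xmlfm so the result can be parsed, uses https (Wikimedia sites now require secure connections, as noted in the comments), and the User-Agent string is a placeholder you should replace with your own contact details.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder (assumption): identify your own script and contact address here.
USER_AGENT = "wiktionary-pattern-check/0.1 (you@example.com)"


def api_url(word, lang):
    """Build the action=parse request for the raw wikitext of one entry."""
    params = {"action": "parse", "prop": "wikitext", "page": word, "format": "json"}
    return f"https://{lang}.wiktionary.org/w/api.php?{urlencode(params)}"


def wikitext_from_response(body):
    """Pull the wikitext out of a JSON response; return None if the API reported an error."""
    data = json.loads(body)
    if "error" in data:  # e.g. the page does not exist
        return None
    return data["parse"]["wikitext"]["*"]


def fetch_wikitext(word, lang):
    """Fetch the raw wikitext for `word` from the `lang` edition of Wiktionary."""
    request = Request(api_url(word, lang), headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return wikitext_from_response(response.read().decode("utf-8"))


print(api_url("abuelo", "es"))
```

Calling fetch_wikitext("abuelo", "es") should return the same raw text you see in the browser, which you can then scan for headers and tags.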

You’ll need to scan multiple words to determine what patterns to use for whatever language you’re interested in. It can be tricky, as the raw data is messy and inconsistent. You’ll often find identifiers that differ from one entry to the next. Once you’ve figured out how Wiktionary encodes its data for that language, you can begin to fill in the parameters. Not all parameters need to be set for the parser to work; if you’re only interested in extracting synonyms, then only synHeader requires a value. One by one:

  1. langCode: The string that identifies the language within Wiktionary (e.g. en, de, tr, etc.)
  2. langHeader: The string that identifies the section for whatever language you’re working with. Wiktionary will often list multiple languages on a page for a given word (table, for instance, is valid in both English and French). It’s important to identify the string that starts a language section so that info from another language isn’t parsed. Ex. ==English==.
  3. langSeparator: The string that separates each language on a given page. Sometimes it’s a simple string (e.g. “----” in the English Wiktionary), but in other cases you might have to use a partial string. For example, the French Wiktionary wraps languages within “== {{=fr=}} ==”, so we can assume that each new language section will begin with “== {{=”. This is therefore the langSeparator for French entries.
  4. defHeader: The string that begins the definitions section. Not always present (e.g. English). In German, all definitions fall under the {{Bedeutungen}} string.
  5. defTag: Definitions are usually preceded by some non-alphanumeric character (e.g. in English by “# ” (notice the space)). This differs between languages, however.
  6. synHeader: String that identifies the synonyms section (e.g. English: ====Synonyms====).
  7. hyperHeader: String that identifies the hypernyms section (e.g. English: ====Hypernyms====).
  8. genderPattern: A regular expression that captures a word’s gender. Patterns used are often inconsistent, so you’ll need to go through a few pages to make sure you’ve identified all possible strings.
  9. posMatchType: Either “array” or “preg”. This is how the parts of speech will be identified. If, like for English, there is a limited number of possibilities, you can simply store them in an array and set this variable to “array”. If the parts of speech vary greatly (like they do for French), then you’ll want to use a regular expression and set this variable to “preg”.
  10. posPattern: If the parts of speech vary greatly, you’ll need to write a regular expression in order to identify them. If you’re unfamiliar with regular expressions, have a look at this quick guide, which also has a link to a tutorial.
  11. posArray: If the parts of speech do not vary and are limited in number, you can store them all in this array and set the posMatchType variable to “array.”
  12. posExtraString: When using regular expressions to match POS, you often need to add unrelated strings in order to capture the correct entry (e.g. in German, POS is preceded by {{Wortart|). Add this string here to have the parser strip it from the output.
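To make the roles of langHeader, langSeparator, defHeader and defTag more concrete, here is a simplified Python illustration of the idea. It is not the actual Wikparser code (which is PHP and handles many more edge cases); the function names and the sample entry are invented for the example.

```python
def language_section(wikitext, lang_header, lang_separator):
    """Return the text between lang_header and the next lang_separator (or end of page)."""
    start = wikitext.find(lang_header)
    if start == -1:
        return None  # no section for this language on the page
    # Search for the separator only after the header: in French the header
    # itself begins with the separator string.
    body_start = start + len(lang_header)
    end = wikitext.find(lang_separator, body_start)
    return wikitext[body_start:] if end == -1 else wikitext[body_start:end]


def definitions(section, def_tag, def_header="", count=100):
    """Collect lines that start with def_tag, optionally only after def_header."""
    if def_header:
        pos = section.find(def_header)
        if pos == -1:
            return []
        section = section[pos + len(def_header):]
    found = [line[len(def_tag):].strip()
             for line in section.splitlines() if line.startswith(def_tag)]
    return found[:count]


# Invented sample entry, loosely shaped like an English Wiktionary page.
SAMPLE = """==English==
===Noun===
# An item of furniture with a flat top.
#: ''Set the table.''
# A grid of data.
----
==French==
===Noun===
# Table (furniture).
"""

english = language_section(SAMPLE, "==English==", "----")
print(definitions(english, "# "))
```

Note how the trailing space in the “# ” tag matters: it is what keeps the example line beginning with “#:” out of the results.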

Once these parameters are set, you should be able to call the script with the new language code set to the lang parameter.
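Before settling on a posMatchType, it can help to try both modes against a few entries. The Python sketch below only illustrates the difference between “array” and “preg” as described in the list above; the function find_pos, the heading check used in array mode, and the sample strings are my own assumptions, and the regular expression is written in Python syntax (no PHP-style delimiters).

```python
import re


def find_pos(section, match_type, pos_array=(), pos_pattern="", pos_extra_string=""):
    """Return the parts of speech found in a language section, without duplicates."""
    if match_type == "array":
        # Keep each known POS that appears as a heading, e.g. ===Noun===.
        hits = [p for p in pos_array
                if re.search(r"=+\s*" + re.escape(p) + r"\s*=+", section)]
    elif match_type == "preg":
        # Take every regex match, then strip the helper string from each one.
        hits = [m.group(0).replace(pos_extra_string, "")
                for m in re.finditer(pos_pattern, section)]
    else:
        raise ValueError('match_type must be "array" or "preg"')
    return list(dict.fromkeys(hits))  # de-duplicate while keeping order


# Invented snippets shaped like English and German entries.
print(find_pos("===Noun===\n# x\n===Verb===\n# y", "array",
               pos_array=["Noun", "Verb", "Adjective"]))
print(find_pos("=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===", "preg",
               pos_pattern=r"\{\{Wortart\|[^|}]+", pos_extra_string="{{Wortart|"))
```

The second call shows why posExtraString exists: the pattern has to include {{Wortart| to find the right spot, but you only want Substantiv in the output.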

That’s it for now. If you have any questions/suggestions/fixes, please write me at ybourque((at))gmail.com

13 Comments

  1. Hi,
    first of all, big thanks for wikparser! I’ve followed your instructions, but always get the result “No such word for specified language.” when calling wikparser.php with the parameters “?lang=en&query=def&word=table”, or any other pretty common word.. I’ve checked the phpinfo() of my web server, it says cURL is enabled and the php version is 5. How can I fix this?
    Thanks!

  2. Yves Bourque

    Hi,

    Well, the error is being generated within the Wiktionary extraction class, more specifically when trying to discard everything that isn’t part of the specified language. If I were to guess, I’d say you’re not actually connecting to Wiktionary (your initial hunch about cURL is probably right). My suggestion would be to print $wikitext to screen immediately after the “$wikitext = $this->get_wikitext_from_wiktionary($word);” in the get_wikitext function (in class.wikiextract.php). See if there is anything in the variable (should be all the text from Wiktionary for a given word). You could also alter the cURL function to post errors and see what comes up there.

    The parser hosted on this site still returns correct values, so the format of the Wiktionary API hasn’t changed (i.e. ‘table’ still returns information).

  3. Hey Yves. Great script; however, it seems to omit anything written in italics. Any ideas on how to fix?

    Lucas

  4. Yves Bourque

    Hi Lucas,

    Could you give me a word and the text being omitted as an example (along with the type of query)?

    Thanks.

  5. I’m not at my computer at the moment, but I was trying to grab the definition for “cats” and nothing showed. I tried a variety of other plurals and nothing except 2 vertical bars showed up. There were also some other seemingly random definitions that weren’t parsed either; the one thing they all had in common was that they were all in italics.

    Hope this helped

    Lucas

  6. Yves Bourque

    Lucas,

    Thanks for this. Yes, some definitions were being discarded because they were entirely enclosed in brackets, which the parser usually discards. Unfortunately, Wiktionary is pretty messy and inconsistent in how it tags information. To fix the issue, ad-hoc rules are needed. Instead, I’ve updated the code to allow for the original, unmodified definition to be retained (brackets and all) whenever the cleaned up version would be blank/empty.

    Thanks again for bringing this to my attention.
    Yves

  7. Vassilis

    Hi Yves, thanks indeed for the Wikiparser!

    When used for the English language, it recently started returning “No such word for specified language.”, although it had been working fine until then. I followed your debugging instructions, provided to Skizzo above, and managed to solve the problem by
    – changing the cURL from “http…” to “https…” (line 57 of class.wikiextract.php).
    – and setting the CURLOPT_SSL_VERIFYPEER to “false”

    I’m amateur in web-programming, so I may well be wrong: but in case the issue was due to some recent change in wiktionary, there may be a need to update the Wikiparser accordingly.

    Hope that’s useful!

    Vassilis

  8. Yves Bourque

    Hi Vassilis,

    Thanks for this! I was actually aware of the issue (someone had pushed an update to the github files), but I just hadn’t had a chance to post anything about it yet. The problem stems from a recent change to Wikimedia’s sites: they all require secure connections now. Your fix is correct!

    I don’t think the SSL option is necessary, however. Changing the HTTPS setting in the parser is sufficient if your cURL installation contains the appropriate files:
    https://snippets.webaware.com.au/howto/stop-turning-off-curlopt_ssl_verifypeer-and-fix-your-php-config/

  9. I followed the instructions but it does not work.

  10. Hi,
    Is there a sql file so I can create tables?
    Thanks.

  11. Yves Bourque

    You can use the contents of this file to create the Wiktionary tables.

  12. roberto mariani

    Hello Yves,

    I found this amazing resource on wiktionary. I am sure you could adapt wikparser
    quickly to offer these data to the users.

    https://en.wiktionary.org/wiki/User:Matthias_Buchmeier/it-en-f

    Here, you will find the python code offered by CLIPS.

    http://www.clips.ua.ac.be/pages/pattern-it

    Hope this helps to improve your wikparser.

    Best regards
    Roberto

  13. I’m getting Bitninja security errors… it’s asking me to enter a captcha, which doesn’t load because my script (personal, only a handful of requests a day) is making the request remotely.

    I’m using JSON to pull from your hosted script because I don’t have a heavy footprint, but I’m being singled out by IP, it seems.

    It just stopped working today. I’ve been happily using it for around a year now without any issues until now. Are there changes you can make on your end to make it work again?
