WordNet XML Interchange Specification

Adam Pease, Bill Black, Piek Vossen

DRAFT January 23, 2006


This document provides a specification for an interchange file format that will be used by the participants in the “Arabic WordNet with Ontology” project, and circulated to other WordNet builders. Many fields in WordNets for languages that have non-Latin character sets will be in Unicode.

DTD

<!DOCTYPE wordnet [
	<!ELEMENT item EMPTY>
		<!ATTLIST item id ID #REQUIRED>
		<!ATTLIST item offset CDATA #IMPLIED>
		<!ATTLIST item lexfile CDATA #IMPLIED>
		<!ATTLIST item name CDATA #REQUIRED>
		<!ATTLIST item type (synset|term) #REQUIRED>
		<!ATTLIST item headword (yes|no) #IMPLIED>
		<!ATTLIST item POS (noun|verb|adjective|adverb) #IMPLIED>
		<!ATTLIST item source CDATA #REQUIRED>
		<!ATTLIST item gloss CDATA #REQUIRED>
		<!ATTLIST item authorshipid IDREF #REQUIRED>
	<!ELEMENT link EMPTY>
		<!ATTLIST link type (antonym|hyponym|instance hyponym|meronym|
			entailment|cause|also see|derived from|
			attribute|relational adj|similar to|verb group|
			participle|member holonym|substance holonym|
			part holonym|member meronym|substance meronym|
			part meronym|derivationally related|
			domain topic|member topic|domain region|
			member region|domain usage|member usage|pertainym|same|
			equivalent|subsuming|instance|antiequivalent|antisubsuming|
			antiinstance) #REQUIRED>
		<!ATTLIST link id1 IDREF #REQUIRED>
		<!ATTLIST link id2 IDREF #REQUIRED>
		<!ATTLIST link authorshipid IDREF #REQUIRED>
	<!ELEMENT word EMPTY>
		<!ATTLIST word value CDATA #REQUIRED>
		<!ATTLIST word synsetid IDREF #REQUIRED>
		<!ATTLIST word wordid ID #REQUIRED>
		<!ATTLIST word frequency CDATA #REQUIRED>
		<!ATTLIST word corpus CDATA #REQUIRED>
		<!ATTLIST word authorshipid IDREF #REQUIRED>
	<!ELEMENT form EMPTY>
		<!ATTLIST form value CDATA #REQUIRED>
		<!ATTLIST form root (yes|no) #REQUIRED>
		<!ATTLIST form tense (past|present|future) #IMPLIED>
		<!ATTLIST form number (singular|dual|plural) #IMPLIED>
		<!ATTLIST form person (1|2|3) #IMPLIED> 
		<!ATTLIST form gender (masculine|feminine|neuter) #IMPLIED>
		<!ATTLIST form case (nominative|genitive|partitive) #IMPLIED>
		<!ATTLIST form wordid IDREF #REQUIRED> 
		<!ATTLIST form authorshipid IDREF #REQUIRED>
	<!ELEMENT verbFrame EMPTY>
		<!ATTLIST verbFrame frame CDATA #REQUIRED>
		<!ATTLIST verbFrame synsetid IDREF #REQUIRED>
		<!ATTLIST verbFrame authorshipid IDREF #REQUIRED>
	<!ELEMENT author EMPTY>
		<!ATTLIST author authorshipid ID #REQUIRED>
		<!ATTLIST author author CDATA #REQUIRED>
		<!ATTLIST author date CDATA #REQUIRED>
		<!ATTLIST author score CDATA #IMPLIED>
		<!ATTLIST author comment CDATA #IMPLIED>
		<!ATTLIST author covering (yes|no) #IMPLIED>
]>

DTD Reference

<!ELEMENT item EMPTY>

The item element is the central element in the schema. It holds the synset or term and its basic information.


<!ATTLIST item id ID #REQUIRED>

The id attribute is a unique identifier for the synset or term. Ideally, it should be persistent across versions. However, some specification is needed to cover cases where the item is changed significantly enough to warrant creation of a new id. Changing a gloss shouldn't prompt creation of a new id, but deciding to subdivide a synset should.

The id should be of the form word_POS_sensenum_languageID, where languageID is the standard ISO two-letter language code.
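The id convention can be sketched in Python as follows. This is only an illustration: the helper names make_item_id and parse_item_id are hypothetical, and the handling of multi-word entries (spaces replaced by underscores) and the POS spelling (taken from the POS attribute values) are assumptions that the specification does not state.

```python
from typing import NamedTuple


class ItemId(NamedTuple):
    word: str
    pos: str
    sense: int
    lang: str


def make_item_id(word: str, pos: str, sense: int, lang: str) -> str:
    """Join the four parts with underscores (assumption: spaces become underscores)."""
    return "_".join([word.replace(" ", "_"), pos, str(sense), lang])


def parse_item_id(item_id: str) -> ItemId:
    """Split from the right so that underscores inside the word are preserved."""
    word, pos, sense, lang = item_id.rsplit("_", 3)
    return ItemId(word, pos, int(sense), lang)


print(make_item_id("put", "verb", 1, "en"))
print(parse_item_id("hot_dog_noun_1_en"))
```

Splitting from the right keeps multi-word entries intact, since only the last three underscores are treated as field separators.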


<!ATTLIST item offset CDATA #IMPLIED>

The offset is a byte offset in a WordNet .DAT file. Since WordNet uses the byte offset as a unique id within versions, this is needed for compatibility reasons.


<!ATTLIST item lexfile CDATA #IMPLIED>

The lexfile attribute of the item element should be a two-digit number from 00 to 40, as described at <http://wordnet.princeton.edu/man/lexnames.5WN.html#sect4>. New lexicographer files may be added in the future, so a number greater than 40 is not necessarily an error.
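A lenient check along these lines might look like the sketch below. The function name check_lexfile and the three result labels are hypothetical; the only facts taken from the text are the two-digit shape and the current upper bound of 40.

```python
import re

KNOWN_MAX_LEXFILE = 40  # upper bound quoted in this specification


def check_lexfile(value: str) -> str:
    """Classify a lexfile value as 'ok', 'unknown' or 'malformed'."""
    if not re.fullmatch(r"[0-9]{2}", value):
        return "malformed"  # not a two-digit number
    if int(value) > KNOWN_MAX_LEXFILE:
        return "unknown"    # possibly a newly added file, so not an error
    return "ok"


for sample in ("03", "41", "7", "noun"):
    print(sample, check_lexfile(sample))
```

Reporting out-of-range numbers separately from malformed values lets a consumer warn about them without rejecting the file.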

<!ATTLIST item name CDATA #REQUIRED>

This is a human-readable name for the item. Where the item is a synset containing multiple words, this should be the first word in the synset.


<!ATTLIST item type (synset|term) #REQUIRED>

Whether the item is a WordNet synset or a formal ontology term.


<!ATTLIST item headword (yes|no) #IMPLIED>

Whether the item is an adjective “headword”. See <http://wordnet.princeton.edu/man/wngloss.7WN> for a further description.


<!ATTLIST item POS (noun|verb|adjective|adverb) #IMPLIED>

The part of speech of the item. This is omitted when the item is a formal ontology term.


<!ATTLIST item source CDATA #REQUIRED>

The product from which this item comes. Typically, this would be a particular WordNet or ontology. The values form an enumerated set, but one that is growing rapidly, so they are not listed as part of the schema; they should follow the naming of WordNets given in the “resource name” column of <http://www.globalwordnet.org/gwa/wordnet_table.htm>.


<!ATTLIST item verbFrame IDREF #IMPLIED>

The verbFrame attribute of item is one of those specified at <http://wordnet.princeton.edu/man/wninput.5WN.html#sect4>


<!ATTLIST item gloss CDATA #REQUIRED>

The glossary text for the item. Note that the glossary text is of course in the language of the particular synset, so English WordNet has English glosses, ItalWordNet has Italian glosses etc.


<!ATTLIST item authorshipid IDREF #REQUIRED>

A pointer to information about who created this item.



<!ELEMENT link EMPTY>

This element covers links between items.


<!ATTLIST link type (antonym|hyponym|instance hyponym|meronym|
	entailment|cause|also see|derived from|
	attribute|relational adj|similar to|verb group|
	participle|member holonym|substance holonym|
	part holonym|member meronym|substance meronym|
	part meronym|derivationally related|
	domain topic|member topic|domain region|
	member region|domain usage|member usage|pertainym|same|
	equivalent|subsuming|instance|antiequivalent|antisubsuming|
	antiinstance) #REQUIRED>

The type attribute of the link element takes the values “antonym”, “hyponym”, “instance hyponym”, “meronym”, “entailment”, “troponym”, “cause”, “also see”, “derived from”, “attribute”, “relational adj”, “similar to”, “verb group”, “participle”, “member holonym”, “substance holonym”, “part holonym”, “member meronym”, “substance meronym”, “part meronym”, “derivationally related”, “domain topic”, “member topic”, “domain region”, “member region”, “domain usage”, “member usage” and “pertainym” for links between WordNet senses in the same version of WordNet. The value “same” should be used to link equivalent senses in different WordNet versions. The types “equivalent”, “subsuming”, “instance”, “antiequivalent”, “antisubsuming”, and “antiinstance” should be used to link SUMO terms and WordNet senses. Note that the part of speech of the first argument restricts the allowable link types as follows:


link type                                       noun  verb  adjective  adverb
antonym                                         x     x     x          x
derived from (adjective)                                               x
similar to                                                  x
participle (of verb)                                        x
pertainym (pertains to noun)                                x
attribute                                       x           x
hyponym                                         x     x
instance hyponym                                x
entailment                                            x
cause                                                 x
also see                                              x     x
verb group                                            x
member holonym                                  x
substance holonym                               x
part holonym                                    x
member meronym                                  x
substance meronym                               x
part meronym                                    x
derivationally related (form)                   x     x
domain topic (Domain of synset - TOPIC)         x     x     x          x
member topic (Member of this domain - TOPIC)    x
domain region (Domain of synset - REGION)       x     x     x          x
member region (Member of this domain - REGION)  x
domain usage (Domain of synset - USAGE)         x     x     x          x
member usage (Member of this domain - USAGE)    x
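The part-of-speech restrictions in the table above can be written down as a lookup, as in the following sketch. The names ALLOWED_POS and link_allowed are hypothetical. Treating link types that the table does not mention (for example “same” or “equivalent”) as unrestricted is an assumption, not something the specification states.

```python
ALL_POS = {"noun", "verb", "adjective", "adverb"}

# Allowed part of speech of the first argument (id1), transcribed from the table.
ALLOWED_POS = {
    "antonym": ALL_POS,
    "derived from": {"adverb"},
    "similar to": {"adjective"},
    "participle": {"adjective"},
    "pertainym": {"adjective"},
    "attribute": {"noun", "adjective"},
    "hyponym": {"noun", "verb"},
    "instance hyponym": {"noun"},
    "entailment": {"verb"},
    "cause": {"verb"},
    "also see": {"verb", "adjective"},
    "verb group": {"verb"},
    "member holonym": {"noun"},
    "substance holonym": {"noun"},
    "part holonym": {"noun"},
    "member meronym": {"noun"},
    "substance meronym": {"noun"},
    "part meronym": {"noun"},
    "derivationally related": {"noun", "verb"},
    "domain topic": ALL_POS,
    "member topic": {"noun"},
    "domain region": ALL_POS,
    "member region": {"noun"},
    "domain usage": ALL_POS,
    "member usage": {"noun"},
}


def link_allowed(link_type: str, first_arg_pos: str) -> bool:
    """Return True if the table permits this link type for the given POS."""
    allowed = ALLOWED_POS.get(link_type)
    if allowed is None:
        return True  # assumption: types absent from the table are unrestricted
    return first_arg_pos in allowed


print(link_allowed("entailment", "verb"))
print(link_allowed("entailment", "noun"))
```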





<!ATTLIST link id1 IDREF #REQUIRED>

The id of the first argument to the link.


<!ATTLIST link id2 IDREF #REQUIRED>

The id of the second argument to the link.


<!ATTLIST link authorshipid IDREF #REQUIRED>

A pointer to information about who created this item.
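Since id1 and id2 are references, a consumer may want to confirm that each one resolves to an item. The sketch below uses Python's standard xml.etree.ElementTree module. The wrapping wordnet root element, the sample ids, and the function name dangling_links are assumptions made for illustration, and required attributes not relevant to the check are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second link points at an item that does not exist.
SAMPLE = """<wordnet>
  <item id="item1" name="put" type="synset"/>
  <item id="item2" name="Putting" type="term"/>
  <link type="equivalent" id1="item1" id2="item2"/>
  <link type="hyponym" id1="item1" id2="item9"/>
</wordnet>"""


def dangling_links(root):
    """List (link type, attribute, value) for references with no matching item."""
    item_ids = {item.get("id") for item in root.iter("item")}
    problems = []
    for link in root.iter("link"):
        for attr in ("id1", "id2"):
            if link.get(attr) not in item_ids:
                problems.append((link.get("type"), attr, link.get(attr)))
    return problems


root = ET.fromstring(SAMPLE)
print(dangling_links(root))
```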



<!ELEMENT word EMPTY>

Information about a particular word that is part of a (possibly singleton) synset.


<!ATTLIST word value CDATA #REQUIRED>

The word, which should be a root form. Root forms in English would be singular forms for nouns and infinitive forms (without the “to”) for verbs.


<!ATTLIST word synsetid IDREF #REQUIRED>

The synset that the word belongs to.


<!ATTLIST word wordid ID #REQUIRED>

The id of the word, so that the form table can provide additional forms of the word.


<!ATTLIST word frequency CDATA #REQUIRED>

The frequency of appearance of the word in a primary corpus.


<!ATTLIST word corpus CDATA #REQUIRED>

The name of the corpus from which frequency information for this word is derived.


<!ATTLIST word authorshipid IDREF #REQUIRED>

A pointer to information about who created this item.



<!ELEMENT form EMPTY>

The part of speech forms included in the form table should generally be those forms that are exceptions to the rules provided in the MORPHY system <http://wordnet.princeton.edu/man/morphy.7WN.html> for English, or exceptions to any regularly derived form for other languages.


<!ATTLIST form value CDATA #REQUIRED>

A particular word form.


<!ATTLIST form root (yes|no) #REQUIRED>

Whether the given form of the word is considered its root form.


<!ATTLIST form tense (past|present|future) #IMPLIED>

The tense of the particular word form. This is optional, as not all languages mark tense on word forms; Chinese, for example, does not.


<!ATTLIST form number (singular|dual|plural) #IMPLIED>

The grammatical number of the particular word form.


<!ATTLIST form person (1|2|3) #IMPLIED>

The grammatical person of the particular word form.


<!ATTLIST form gender (masculine|feminine|neuter) #IMPLIED>

The grammatical gender of the word form. This is optional, as many languages do not inflect word forms for gender. English has no grammatical gender, Romance languages have masculine and feminine but not neuter, and so on.


<!ATTLIST form case (nominative|genitive|partitive) #IMPLIED>

The grammatical case of the word form. The partitive case appears in few languages, Finnish among them.


<!ATTLIST form wordid IDREF #REQUIRED>

A reference to the root word that this form is derived from.


<!ATTLIST form authorshipid IDREF #REQUIRED>

A pointer to information about who created this item.
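The relation between word and form elements can be followed through wordid, as in this sketch. The wrapping wordnet root element, the sample ids, and the function name forms_by_word are assumptions, and attributes not needed for the lookup are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: one word with an irregular past-tense form.
SAMPLE = """<wordnet>
  <word value="go" synsetid="item1" wordid="word1"/>
  <form value="go" root="yes" tense="present" wordid="word1"/>
  <form value="went" root="no" tense="past" wordid="word1"/>
</wordnet>"""


def forms_by_word(root):
    """Map each word value to the (form value, tense) pairs that refer to it."""
    words = {w.get("wordid"): w.get("value") for w in root.iter("word")}
    table = defaultdict(list)
    for form in root.iter("form"):
        word_value = words[form.get("wordid")]
        table[word_value].append((form.get("value"), form.get("tense")))
    return dict(table)


print(forms_by_word(ET.fromstring(SAMPLE)))
```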



<!ELEMENT verbFrame EMPTY>

A relation between a verb synset and a verb “frame” which is a minimal pattern of linguistic usage.


<!ATTLIST verbFrame frame CDATA #REQUIRED>

The frame attribute of verbFrame is one of the verb frames specified at <http://wordnet.princeton.edu/man/wninput.5WN.html#sect4>.

<!ATTLIST verbFrame synsetid IDREF #REQUIRED>

The synset that the frame is valid for.



<!ELEMENT author EMPTY>

Information about the authorship of a synset, link, etc.


<!ATTLIST author authorshipid ID #REQUIRED>

The unique identifier for this authorship record. Authorship is kept in a separate element because many items are likely to have been authored by the same person on a given day, at least for initial versions of any particular WordNet.


<!ATTLIST author author CDATA #REQUIRED>

The name of the author of the information.


<!ATTLIST author date CDATA #REQUIRED>

The date on which the information was created.


<!ATTLIST author score CDATA #IMPLIED>

If the item or link was created automatically, a score indicating the confidence of the automatic process in its result. This attribute should not normally be present for information authored by a human. Note that although scores appear to form a total order, scores produced by different automatic processes are not comparable.


<!ATTLIST author comment CDATA #IMPLIED>

A comment string relevant to this data item.


<!ATTLIST author covering (yes|no) #IMPLIED>

If the item is a synset, whether it covers all synonyms. That is, for a WordNet item (rather than a formal term), a value of “yes” means that the human author of the item has determined that all possible synonyms of this word sense have been entered and linked.

Example


<author authorshipid="auth1" author="Christiane Fellbaum" date="19990101"/>

<author authorshipid="auth2" author="Adam Pease" date="20050101"/>


<item id="item1" offset="474859483" name="put" type="synset" POS="verb"
source="Princeton WN" gloss="To place an object at a location."
authorshipid="auth1"/>


<item id="item2" name="Putting" type="term"
source="SUMO" gloss="To place an object at a location."
authorshipid="auth2"/>


<link type="hyponym" id1="item1" id2="item2" authorshipid="auth2"/>

<link type="equivalent" id1="item1" id2="item2" authorshipid="auth2"/>


<word value="put" synsetid="item1" wordid="word1" frequency="200"
corpus="Brown Corpus" authorshipid="auth1"/>


<form value="putting" root="no" tense="present" wordid="word1"
authorshipid="auth1"/>
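
A consumer of a file of this shape might resolve each authorshipid to its author record, as in the following sketch. It uses Python's standard xml.etree.ElementTree module on a small document modelled on the example. The wrapping wordnet root element, the id values, and the helper name credit are assumptions made for illustration and are not fixed by this specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the example; ids are illustrative only.
SAMPLE = """<wordnet>
  <author authorshipid="auth1" author="Christiane Fellbaum" date="19990101"/>
  <author authorshipid="auth2" author="Adam Pease" date="20050101"/>
  <item id="item1" name="put" type="synset" POS="verb" source="Princeton WN"
        gloss="To place an object at a location." authorshipid="auth1"/>
  <item id="item2" name="Putting" type="term" source="SUMO"
        gloss="To place an object at a location." authorshipid="auth2"/>
  <link type="equivalent" id1="item1" id2="item2" authorshipid="auth2"/>
</wordnet>"""

root = ET.fromstring(SAMPLE)

# Index the author records by their identifier.
authors = {a.get("authorshipid"): a for a in root.iter("author")}


def credit(element):
    """Return 'name (date)' for the author record an element points to."""
    record = authors[element.get("authorshipid")]
    return f'{record.get("author")} ({record.get("date")})'


for item in root.iter("item"):
    print(item.get("name"), "-", credit(item))
```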