This document suggests some implementation-format-independent naming conventions for controlled vocabularies (CVs) and ontologies. Metadata annotation elements are not covered here; these are addressed in a separate <<Metadata Annotations for Representational Units and Representational Artifacts>> document [1]. These recommendations have been developed to guide the work of the Metabolomics Standards Initiative (MSI) [2] Ontology Working Group (OWG), the Proteomics Standards Initiative (PSI) Ontology WG [3] and the Ontology for Biomedical Investigations (OBI, previously ‘FuGO’) WG, a larger multi-domain collaborative effort [4].

Recommendations on implementation-dependent realisations of these naming conventions in OBO and OWL will be available in the near future.

The key words “MUST,” “MUST NOT,” “REQUIRED,” “SHALL,” “SHALL NOT,” “SHOULD,” “SHOULD NOT,” “RECOMMENDED,” “MAY,” and “OPTIONAL” are to be interpreted as described in the RFC-2119 document [5].

Sections in brackets […] are notes for the editor only. Please ignore.

1.1 Authority

[add]

1.2 Scope

These naming conventions address lexical, syntactic and semantic issues in naming representational units (mainly class names and property names) in representational artifacts, ranging from simple glossaries through taxonomies and controlled vocabularies to formal ontologies at the top end of the semantic complexity scale.

1.3 Target audience

This document is addressed to all biologists and ontologists who are involved in the creation, administration and review of symbolic representational artifacts (RAs) such as taxonomies, controlled vocabularies and DL ontologies.

1.4 What is a Naming Convention

(In part from ISO/IEC 11179-5:2005(E))

A naming convention (NC) describes what is known about how names for administered items are formulated in a consistent manner. It may be simply descriptive, e.g. where no registration authority has control over the formulation of names for a specific context. This NC is prescriptive in that it specifies how names 'should' be formulated. An NC can also enforce the exclusion of irrelevant facts about administered items.

The NC reference or specification document (like this one) shall cover the following aspects:

the name and scope of the NC (the scope specifies the range within which the NC is in effect; it may be as broad or narrow as the responsible registration authority determines appropriate);
the authorities that establish the names;
rules governing the source and content of the terms used in a name, e.g. terms derived from data models, terms picked according to usage frequency in a certain domain, etc.;
uniqueness prescriptions, documenting how to prevent homonyms;
lexical prescriptions unifying term appearance (reducing redundancy and increasing precision), covering controlled term lists, synonym handling, name length, character set and specific language requirements;
syntactic prescriptions covering required consistent term orders within a name (relative, absolute or in combination);
semantic prescriptions, documenting if and how names convey meaning, e.g. in word order or adjectives used in compound names.

There are diverse NC documents available, e.g. [6, 7], but most naming conventions do not sufficiently serve the needs of applications such as text mining [8].

1.4.1 How does one profit from applying naming conventions

A rigorous, formal and logically consistent way of naming RUs within RAs eases:

· Indexing and Categorisation of RUs

· Integrated tool access across different ontologies

· Ontology alignment (mapping), difference detection and merging (e.g. through PROMPT)

· Consistent visualisation

· Unified understanding of meaning to humans as well as web agents

· Avoidance of masked redundant content

The overall benefit is easier access to different ontologies through a unified mechanism and thereby better exploitation of the available ontological resources, e.g. in ontology libraries.

2 (Meta-) Reference Terminology

First, we would like to clarify the terminology used to talk about the different idioms which are the subject of this text.

2.1 Peculiarities in getting familiar with modelling (meta-)terminologies

When the structures of RAs and RUs are explained, the problem is that they cannot easily be introduced in a simple, serially ordered manner (as the nature of text demands), because each idiom relates closely to all the others and some of the idioms are even fractal. We therefore cannot expect immediate understanding of everything mentioned on a first serial reading of this text. Understanding will rather come holistically: you might have to read the whole text more than once, and while doing so your understanding (your internal conceptualisation) of each chapter will gradually build up and renew itself. Do not worry if you do not get it the first time. There will always be words that you might not understand immediately. At the highest level of abstraction there will even be words that you cannot fully understand, e.g. ‘thing’.

Another issue concerns the completeness of such a description. If you were to write a book that contains all information about writing this book itself (again a fractal approach), this would be a never-ending, incrementally nested task and such a book could never be finished. So not everything (e.g. some words from the meta-terminology) can or should be described; otherwise we are likely to get stuck in what can be called the ‘Meta-Ether’, the little brother of ‘Analysis-Paralysis’.

2.2 Basic entities and ‘levels of reality’

We introduce a common reference terminology to harmonize cross-domain understanding of the things that are talked about.

For a more formal clarification, have a look at the ‘Terminology for Ontologies’ paper [9].

We start out from a distinction of three levels on which entities can exist:

Level 1 - Reality: The objects, processes, qualities, states, etc. in reality;

Level 2 - Mental Concepts: Cognitive representations of this reality on the part of researchers and others;

Level 3 - Representational Artifacts: Concretizations of these cognitive representations in (for example textual or graphical) representational artifacts.

An ENTITY is anything which exists, including objects, processes, qualities and states on all three levels (thus also including representations, models, beliefs, protocols, documents, observations, etc.).

A REPRESENTATION is any model (for example an idea, image, record, or description) which refers to (is of or about), or is intended to refer to, some entity or entities external to the representation. Note that any representation, like any model, by definition always leaves out many aspects of its target; hence it can always be expanded and is never complete in covering all aspects of the target.

A COMPOSITE REPRESENTATION is a representation built out of constituent sub-representations as their parts, in the way in which paragraphs are built out of sentences and sentences out of words.

The constituent sub-representations are called KR idioms or REPRESENTATIONAL UNITS (RU); examples are: icons, names, simple word forms, or letters, but also classes and properties. If we take the graph-theoretic concretisation of the Gene Ontology as an example, then the representational units here are the nodes of the graph, which are intended to refer to corresponding entities in reality. But the composite representation refers, through its graph structure, also to the relations between these entities, so that there is reference to entities in reality both at the level of single units and at the structural level.

A COGNITIVE REPRESENTATION (Level 2) is a representation whose representational units are ideas, thoughts, conceptual models or beliefs in the mind of some cognitive subject.

A REPRESENTATIONAL ARTIFACT (RA, Level 3) is a representation that is fixed in some medium in such a way that it can serve to make the cognitive representations existing in the minds of separate subjects (mental concepts) publicly accessible in some enduring fashion. Examples are: a text, a diagram, a list, a controlled vocabulary, a schema, knowledge representations (KR, also called representational models) or ontologies. RAs can serve to convey more or less adequately the underlying cognitive representations and can be correspondingly more or less intuitive or understandable. RAs vary in terms of formality and semantic expressivity (text has a high expressivity but a low formality; DL has lower expressivity but is much more formal).

2.3 Naming representational units (RU)

We recommend using the term 'class' (this is the same as 'type' or 'kind') to refer to the RU that models an ontological 'universal'. A 'concept' is the representation of a universal in the researcher's head, i.e. his or her idea of the meaning of an entity, which is liable to change over time and with experience [10]. “There are no valid parsers for concepts!” and an ontology should model reality, not the representation of reality in some head, so this term is better avoided. Each class is represented through a 'class name' (a string that designates the class for humans), a unique identifier and a definition in natural language. Each class can have properties (in Protégé Frames also called slots) associated with it. These properties are constrained by facets. Properties which have values (ranges) of simple datatypes (e.g. integer, string, boolean) are called 'attributes' or 'datatype properties'. Properties which have classes or instances as their values are called 'relations' or 'object properties'. The group of classes a property is associated with is called its 'domain'.

An 'instance' is the representation of a 'particular' of a universal in reality. A 'particular' instantiates a universal and an instance (called an individual in OWL) instantiates a class.

[Here graphic: Andrew, Ontogenesis…]

[Cite papers: Interpretation continuum, What are the differences…, DAG]

2.4 Naming representational artefacts (RA)

We can sort the different types of RAs according to their formality and semantic expressivity. Lassila and McGuinness have presented an ontology spectrum that presents various levels of formalization (2001 Deborah L. McGuinness. Ontologies come of age. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors, Spinning the semantic web: bringing the world wide web to its full potential. MIT press, 2002. Available on-line at http://www-ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm).

The most often cited types of RAs will be described here, highlighting their relations to each other and their differences.

2.4.1 Terminology or Vocabulary

Any set of symbols or terms (in most cases words or word compositions) used for communication, which can be interpreted by the addressee in the way intended by the addresser. 'Interpreted' means that the terms are felt to be descriptive, in the sense that perceiving them induces some kind of understanding or conceptual model, which ideally overlaps as much as possible with the conceptual model of the addresser. In this sense a terminology is the medium for exchanging knowledge models. Language-related terminologies consist of words suitable for describing a domain of interest.

Key characteristic (primary intrinsic quality, or quale): Intended meaning

Implementation formalisms: Any text.

2.4.2 Semi-structured data

Semi-structured data are usually considered to be RAs that contain free-text fragments, structured in accordance with some schema. Typical sorts of semi-structured RAs are forms and tables, which have some strict structure (fields, parts, etc.), while the content of the specific parts of the document is still free text.

Key characteristic: combination of RA and free text

Implementation formalisms: Tables, spreadsheets, RDB, Forms

2.4.3 Controlled Vocabulary

Any terminology which is maintained by some registration authority or standardisation body (which can be very small, e.g. a single project or working group), in the sense that the terms used are controlled by a group. “Controlled” means that the sense and/or the appearance of the terms are defined in a consistent manner and that the authority has the power to enforce these. Each term should have at least a unique identifier. The term "CV" does not say anything about the structure of the terminology or RA, i.e. a CV can be a simple list of terms or an ontology. Formal statements about the relationships between the terms can be made but do not have to be. A CV does not have to state anything about the meaning of its terms, but usually informal definitions are provided for each term. All terms should have unambiguously defined and non-redundant meanings. Usually homonyms (terms that have different, context-dependent meanings) are resolved and synonyms (different terms that refer to the same meaning) are captured.

Def agreed by Barry, taken from semeda: a controlled vocabulary is a set of nodes each of which is associated with an identifier, term, definition, and an optional set of synonyms.

Key characteristic: A standards body enumerates and defines the terms explicitly for unified usage.
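
The node structure named in the definition above (identifier, term, definition, optional synonyms) can be sketched as a small record type. This is only an illustrative sketch: the names CVTerm and ControlledVocabulary and the identifier "EX:0001" are hypothetical and are not taken from any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CVTerm:
    """One node of a controlled vocabulary (hypothetical structure)."""
    identifier: str
    term: str
    definition: str
    synonyms: tuple = ()


class ControlledVocabulary:
    """A registry that enforces unique identifiers for its terms."""

    def __init__(self):
        self._by_id = {}

    def add(self, node):
        # The controlling authority rejects a second node with the same identifier.
        if node.identifier in self._by_id:
            raise ValueError(f"duplicate identifier: {node.identifier}")
        self._by_id[node.identifier] = node

    def lookup(self, name):
        """Return all nodes whose term or synonyms match the given name."""
        return [node for node in self._by_id.values()
                if name == node.term or name in node.synonyms]


cv = ControlledVocabulary()
cv.add(CVTerm("EX:0001", "mass spectrometer",
              "An instrument that measures mass-to-charge ratios of ions.",
              synonyms=("MS instrument",)))
print([node.term for node in cv.lookup("MS instrument")])
```

Looking up a synonym returns the node that carries the preferred term, which is one simple way of capturing synonyms as described above.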

2.4.4 Glossary

A glossary is a simple list of terms in a particular domain of knowledge with definitions and explanations in natural language which explain the meanings of newly introduced or uncommon terms.

2.4.5 Dictionary

Any list of words whose entries refer to entries in another list. In contrast to a thesaurus, a dictionary usually defines words [needs work].

2.4.6 Graph

A graph G consists of two sets N and E. N is a non-empty set of nodes, and E is a set of edges, an edge being a pair of nodes from N. G is directed if its edges are directed. The node from which a directed edge originates is called the source and the one in which it terminates is the target. A path in a directed graph is a sequence of nodes x0, x1, ..., xn (n > 0) where every two adjacent nodes xi and xi+1 (0 ≤ i ≤ n − 1) are source and target, respectively, of some edge. The path is direct if n = 1; indirect otherwise. The path is called a cycle if x0 and xn are the same node. A graph is acyclic if it has no cycles.
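
The definitions of this section can be illustrated with a minimal sketch. The class name DirectedGraph, its method names and the node labels are assumptions made for illustration only.

```python
class DirectedGraph:
    """A directed graph G = (N, E); each edge is a (source, target) pair."""

    def __init__(self, nodes, edges):
        self.nodes = set(nodes)
        if not self.nodes:
            raise ValueError("N must be a non-empty set of nodes")
        self.edges = set(edges)
        for source, target in self.edges:
            if source not in self.nodes or target not in self.nodes:
                raise ValueError("every edge must be a pair of nodes from N")

    def is_path(self, sequence):
        """True if adjacent nodes are source and target of some edge (n > 0)."""
        if len(sequence) < 2:
            return False
        return all((a, b) in self.edges
                   for a, b in zip(sequence, sequence[1:]))

    def is_direct(self, sequence):
        """A path is direct if it consists of exactly one edge (n = 1)."""
        return self.is_path(sequence) and len(sequence) == 2

    def is_cycle(self, sequence):
        """A path is a cycle if its first and last node are the same."""
        return self.is_path(sequence) and sequence[0] == sequence[-1]


g = DirectedGraph({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("c", "a")})
print(g.is_path(["a", "b", "c"]))        # True
print(g.is_direct(["a", "b"]))           # True
print(g.is_cycle(["a", "b", "c", "a"]))  # True
```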

2.4.7 Hierarchy

A hierarchy is a nested set of symbols or terms (in most cases words or word compositions). In a hierarchy the principle used to build the nested structure is not specified; it can be any transitive relation (e.g. part-of, is-a, …) or even multiple relations at the same time. The term refers to the graphical structure and does not specify the semantics behind the parent-child relationship. In this sense nested XML elements are hierarchical when displayed as such, but the meaning of 'B being nested in A' is not defined within the XML. The meaning of a hierarchy is specified by whatever the hierarchical relationship means.

There are single-parent hierarchies (mono-hierarchies) and multiple-parent hierarchies (poly-hierarchies or directed acyclic graphs, DAGs), in which one term can be found under more than one parent. Multiple parenthood is a well-established practice for profiting from multiple inheritance of properties.

Key characteristic: Graph structure

2.4.8 Taxonomy, Meronymy

When the relation used to build the hierarchy is one transitive relation only, i.e. the nested (child) term stands in an 'is-a' or ‘part-of’ relationship to its parent term throughout, we speak of a taxonomy (from the Greek verb τάσσειν or tassein = "to classify" and νόμος or nomos = law, science, cf. "economy"). Taxonomy was once only the science of classifying living organisms.

A taxonomy is a hierarchy (usually a collection of controlled vocabulary terms) built according to one intrinsic property of the items to be taxonomized (e.g. whole-part, genus-species, type-instance). Some taxonomies allow poly-hierarchies, which means that a term can have multiple parents. If a term has children in one place in a taxonomy, then it has the same children in every other place where it appears.

A taxonomy is a directed acyclic graph satisfying the following conditions [6]:

(1) The nodes in the graph are classes.

(2) An edge between x and y represents a direct taxonomic (IS-A) relationship from x to y. x is called a child (or subclass or subcategory) of y and y a parent (or superclass) of x. A class–relationship–class triple (x, IS-A, y), called a relation, can also be used to represent the edge between x and y.

(3) A taxonomic (IS-A) relationship holds between class x and y (i.e., (x, IS-A, y) ) if (a) x is a child of y, or (b) there exists a class z such that the two relations (x, IS-A, z) and (z, IS-A, y) hold. If (x, IS-A, y) holds, x is called a descendant of y and y an ancestor of x; in such cases, x is more specific than y (or is subsumed by y) and y is more general than x.

(4) There is one and only one class, called the root of the taxonomy, which has no parents. Every class except the root has at least one parent.

(5) The classes x1, x2, ..., xn (n > 1) are called siblings if they all have the same parent.
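
Conditions (2) to (5) can be sketched in a few helper functions. The function names and the example classes are assumptions for illustration. For poly-hierarchies the sketch treats two classes as siblings when they share at least one parent, which is one possible reading of condition (5).

```python
def parents_of(relations):
    """Map each class to the set of its direct parents (direct IS-A edges)."""
    parents = {}
    for child, parent in relations:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())
    return parents


def ancestors(cls, parents):
    """All y for which (cls, IS-A, y) holds, following condition (3)."""
    found = set()
    stack = list(parents[cls])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def roots(parents):
    """Classes without parents; condition (4) requires exactly one."""
    return {cls for cls, direct in parents.items() if not direct}


def siblings(cls, parents):
    """Other classes sharing at least one parent with cls (assumption)."""
    return {other for other, direct in parents.items()
            if other != cls and direct & parents[cls]}


# Hypothetical direct IS-A edges, written as (child, parent).
is_a = [("spinach", "vegetable"), ("carrot", "vegetable"),
        ("vegetable", "plant")]
parents = parents_of(is_a)

print(roots(parents))
print(ancestors("spinach", parents))
print(siblings("spinach", parents))
```

With these edges, 'plant' is the only root, 'spinach' is a descendant of both 'vegetable' and 'plant', and 'carrot' is its only sibling.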

The difference between a classification and a taxonomy is that a taxonomy classifies in a structure according to one defined relation between the entities, whereas a classification uses more arbitrary (or extrinsic) grounds. As an example of intrinsic grounds, spinach is a vegetable and not every vegetable is spinach, so spinach is a subclass of vegetable. The decision to place spinach in the class vegetable is based upon data intrinsic to the entities, so this would be a piece of taxonomy (a taxonomy with a subclass hierarchy). A classification of vegetables according to the sortal “Do I like to eat it” would be based on an extrinsic property. This would lead to a classification, not a taxonomy. A taxonomic relation is a relation between entities in the taxonomy (the is_a relation in most cases), whereas a classification relates the entities to something that is external.

When the relation used to build the taxonomy is of 'part-of' type, then we call such a taxonomy a Meronymy. For example, 'finger' is a meronym of 'hand' because a finger is part of a hand.

2.4.9 Folksonomy

A collection of terms allocated to resources by end users in order to categorise or index them in a way that these end users consider useful is called a folksonomy. Terms in such 'democratic' folksonomies are typically added in a fast, pragmatic, decentralized and uncontrolled manner, without necessarily making the underlying structures or principles explicit. The process of folksonomic annotation of data (in most cases websites) is intended to make a body of information increasingly easy to search, discover and navigate by human users. A well-developed folksonomy is accessible as a shared vocabulary that is both originated by and familiar to its primary users. Part of the appeal of folksonomies is their independence from search engine censorship (which is currently applied by all major software companies, e.g. Symantec, eBay and Google).

2.4.10 Thesaurus (Structured Vocabulary)

A thesaurus is an associatively networked list of words and their descriptions in natural language. The terms refer to each other through different, often informal, relations. A thesaurus does not need to have a taxonomic structure. Usually it is a list of controlled terms that refer to each other verbally. The relationships vary in their level of detail (they can be simple ‘synonymous’, ‘broader_than’ or even ‘related_to’ relations). A formal definition of a thesaurus designed for indexing (according to wiki) is: "A list of every important term (single-word or multi-word) in a given domain of knowledge and a set of related terms for each term in the list."

2.4.11 Directed acyclic graph, DAG

A DAG is a directed graph with no directed cycles; that is, for any node, there is no non-empty directed path starting and ending at that node. The best-known example of a DAG is the Gene Ontology controlled vocabulary.
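
Whether a set of directed edges forms a DAG can be tested mechanically. The following sketch assumes Python 3.9 or later, whose standard library provides graphlib; the function name is_dag and the example terms are chosen for illustration only.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(edges):
    """Return True if the directed (source, target) edges contain no cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    try:
        # Consuming static_order() raises CycleError if a directed cycle exists.
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return False
    return True


print(is_dag([("finger", "hand"), ("hand", "arm")]))  # True
print(is_dag([("a", "b"), ("b", "a")]))               # False
```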

2.4.12 Object model

An object model (OM) is a platform- and implementation-independent object-oriented RA used as an interface to some model, service or program. An OM is a hierarchical classification scheme, but it does not always have to be a taxonomy. An OM can be automatically translated into a concrete implementation, e.g. an XML schema. It consists of a collection of nested and encapsulated objects through which an agent (human or software program) can generate, examine and manipulate data. The basis for nesting objects is usually the is_a relation, which then only allows property inheritance (provided that properties have been formulated). Non-formal relations, such as dependencies, associations, or any other relation regarded as useful for software development, can be used to connect classes in an OM. A graphical representation language for displaying, developing and sharing OMs is the Unified Modeling Language (UML).

2.4.13 Ontology

Ontologies were mentioned as ‘categories’ in Aristotle's Metaphysics, but the word 'ontology' itself was first established in the 17th century. The Encyclopaedia Britannica defines ontology as “the theory or study of being as such; i.e., of the basic characteristics of all reality”. This is a philosophy-centred definition. With the dawn of information technology the field has exploded, and the meaning of the term has shifted within it.

“Ontology” is the buzzword used on the internet when discussing the Semantic Web. The Web Ontology Working Group at W3C emphasises that an ontology is a machine-readable set of definitions that creates a taxonomy of classes and subclasses and the relationships between them.

The word 'ontology' was introduced to the biological community mainly through the Gene Ontology, an effort that in fact built a taxonomic CV. This has created much confusion over what an ontology is. An ontology resembles both a kind of taxonomy plus definitions and a kind of knowledge representation language that allows additional relations to be captured, not just the one used to build the taxonomic structure. In an ontology one can specify the relation which is used to build the hierarchy alongside many others. A clear border between a rich “taxonomy” and a “simple ontology” is nevertheless hard to define.

The fundamental difference between a classification and an ontology is in the richness of information available formally. Both provide a list or structure of classes, but a classification stops at that point, whereas an ontology also provides further information on the classes such as definitions and properties like attributes and relations.

Ontology is defined in the DIP Glossary as “The formalization of a terminology (set of terms and possibly their interrelations) used in some domain of discourse. An ontology represents consensual knowledge about a domain of discourse (in form of terms and possible interrelation among them) in a formal way that can be shared between agents and makes this knowledge accessible by machines. …” The most popular definition of an ontology from the Semantic Web and AI perspective is the one provided in [11], http://ksl-web.stanford.edu/KSL_Abstracts/KSL-92-71.html : “An ontology is an explicit specification of a conceptualization”, where “a conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose.”

Ontologies can be considered as RAs intended to represent knowledge in the most formal and re-usable way possible. Formal ontologies (as considered in AI) are represented in logical formalisms (like OWL) which allow automatic inference over them or over datasets aligned to them. We would describe an ontology as a CV expressed in a formal representation language, which makes it possible to formally capture a defined semantics. The best-known representation languages used to structure ontologies are OWL (DL semantics) and OBO. Ontology representation languages differ in their semantic expressivity, but each has a defined syntax, semantics and grammar. Ontologies are rich enough to express meanings as formal and hence computer-accessible models through the use of defined, related RUs. The use of any one of the following semantic idioms is usually regarded as making a CV an ontology: object properties, cardinalities, restrictions and axioms.

Def agreed by Barry, taken from semeda: In ontologies, nodes from a CV (each of which is associated with an identifier, term, definition, and an optional set of synonyms) are linked by directed edges, thus forming a graph. This graph represents a counterpart structure on the side of entities (classes, universals) in reality, and its edges represent the relations (e.g. is-a or part-of) which hold between these entities. If a node has a parent node in the is-a hierarchy, then we say that the corresponding class is subsumed by this parent node.

Another, rather pessimistic, definition [12] states that an ontology is “a common language to express human confusion”.

2.4.14 Knowledgebase

From “Data, Information, and Process Integration with Semantic Web Services (DIP)”, http://dip.semanticweb.org and https://bscw.dip.deri.ie/bscw/bscw.cgi/0/3016:

Knowledgebase (KB) is a term with a wide usage and multiple meanings. It can be seen as a dataset described through some representational artifact bearing formal semantics. A KB, like an ontology, is represented with respect to a knowledge representation (or just a logical) formalism, which usually allows automatic inference. It can include multiple axioms, definitions, rules, facts and statements.

In short, a knowledgebase is the ontology in use: when it is instantiated or its classes are used to annotate data. In this sense the ontology is the TBox (terminological box) and the data annotated through the TBox is called the ABox (assertional box). Both together make a knowledgebase.
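
The split into a terminological and an assertional part can be made concrete with a toy sketch. The dictionaries tbox and abox and both function names are assumptions for illustration; a real knowledgebase would use a formal language such as OWL together with a reasoner.

```python
# TBox: terminological knowledge (classes and their direct superclasses).
tbox = {
    "mammal": set(),
    "dog": {"mammal"},
}

# ABox: assertional knowledge (instances annotated with TBox classes).
abox = {
    "fido": "dog",
}


def superclasses(cls, tbox):
    """Return cls together with all of its direct and indirect superclasses."""
    result = {cls}
    for parent in tbox[cls]:
        result |= superclasses(parent, tbox)
    return result


def instance_of(individual, cls, tbox, abox):
    """Infer class membership by combining an ABox assertion with the TBox."""
    return cls in superclasses(abox[individual], tbox)


print(instance_of("fido", "mammal", tbox, abox))  # True
```

The assertion that 'fido' is a mammal is stored in neither part alone; it follows only when both parts are combined, which is the sense in which both together make a knowledgebase.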

3 Depicting representational units within text

In general, word processing tools offer the following possibilities for encoding metadata about RU types in text styles: [to be added: Formatting convention when using ontological RUs in literature – see OBO RO paper][13]

Underlined, Italics, Bold (these three can only be applied in word processing software), UPPERCASE, lowercase, S p a c e d, demarking affixes, e.g.

prefixes like > : # or _, e.g. _prefixed

circumfixes like " ' * #, e.g. *circumfixed*.

A more direct way is to explicitly state the type of an RU as a prefix, e.g. class:some_class, property:some_property.

Another possibility is to put the term in XML-style elements which state of what type the RU is, e.g. <class>some_class</class>. This is not recommended.
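
The explicit type-prefix notation (class:some_class, property:some_property) is easy to process mechanically. A minimal sketch, in which the function name parse_typed_name and the set RU_TYPES are hypothetical:

```python
# Only the two prefixes used as examples in the text are accepted (assumption).
RU_TYPES = {"class", "property"}


def parse_typed_name(text):
    """Split 'type:name' into (type, name); reject unknown type prefixes."""
    ru_type, separator, name = text.partition(":")
    if not separator or ru_type not in RU_TYPES or not name:
        raise ValueError(f"not a type-prefixed RU name: {text!r}")
    return ru_type, name


print(parse_typed_name("class:some_class"))        # ('class', 'some_class')
print(parse_typed_name("property:some_property"))  # ('property', 'some_property')
```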

We recommend the following formattings for depicting different RUs:

Universal: BOLD_UPPERCASE, e.g. DOG

Particular: ‘normal lowercase’ (with single apostrophes), e.g. ‘fido’

Class: bold_lowercase, e.g. dog

All simple word terms, e.g. a preferred name: normal (with double quotation marks), e.g. “canis”

Instance: ‘bold_lowercase’ (with single apostrophes), e.g. ‘fido’

Properties between classes: italic_lowercase e.g. is_a

Properties between all other RU types: bold_italic_lowercase, e.g. instance_of

Three kinds of binary relations can be distinguished according to their domain and range types [13]: cc (between classes), ci (between a class and an instance) and ii (between instances).

An example:

The universal DOG is represented as an ontological class dog with the preferred term name “canis”. The class dog in the representational artefact is instantiated through the instance ‘fido’, which models the particular ‘fido’ that sits on your lap.

dog is_a mammal. The dog has_mood . ‘fido’ has_mood “sleepy”.

4 General principles for creating sound RUs

· Become acquainted with the capabilities and limitations of

1. the representation formalism you use

2. its implementation language

3. the ontology engineering tool of your choice.

· Save often! Always save to a new version number including the date. Protégé-OWL is not yet completely stable. 'Undo' is difficult and bugs occasionally corrupt ontologies beyond retrieval.

· Don’t get into 'analysis paralysis'! You will not get it right the first time! Sometimes one has to throw things away and start again. Do not get into ‘naïve euphoria’ either. Not every fancy, just-built piece of representation is an ontology worth bothering others with.

· Don’t get stuck in the ‘Meta-Ether’. Do not try to capture all possible metadata. Only formalize what is of immediate use for the project's outcome. You have to stop capturing metadata at some level, and therefore not everything can be defined.

· Don’t confuse the three levels of reality. Always be aware of the level you are modeling. Try to consistently model ‘reality only’ and do not mix it with ‘models of reality’ within your model. The ontology is your model of reality; therefore do not try to model an experiment AND the description of an experiment (i.e. a protocol) unless you really need the 'protocolness'.

· Avoid overloaded term names. The use of overloaded terms such as “experiment”, “method”, “technique” and “instructions” has to be avoided. They are ambiguous and have too many meanings across diverse domains. In this example, a series of events or actions should be represented as a single atomic “Protocol” or a collection of atomic “Protocols” rather than by using any of the terms above.

4.1 Modularisations

WordNet 2.0, http://www.cogsci.princeton.edu/cgi-bin/webwn defines a module as a self-contained component (unit or item) that is used in combination with other components. This is also the case for RUs and for RAs built out of RUs. Build your RA in such a way that you have clearly separated, orthogonal modules that relate to each other. These modules correspond to upper-level classes in your ontology.

[add]

4.2 Univocity

Names of RUs (including the ones for relations) should have the same meaning on every occasion of use and refer to the same universals and kinds of entities in reality. Each name should refer to exactly one RU, and each RU should represent exactly one entity in reality (a universal in the case of a class). This principle of univocity excludes homonyms, terms that are used as names of more than one RU. For example, if you use the term ‘cell’ as a name of the class representing (the type of) cells as found in all organisms, the same term should not be used as a name for a more specialized class representing (the type of) cells as found only in plants. Likewise, the term ‘part of’ should not be used to name more than one relation, e.g., partonomy, set membership, etc.

Furthermore:

Don’t confuse universals with ways of getting to know types

Don’t confuse universals with ways of talking about types

Don’t confuse universals with data about types
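
The part of the univocity principle that excludes homonyms lends itself to an automated check. The following sketch assumes that the names of each RU are available as a mapping from RU identifier to a list of names; the function name find_homonyms and the identifiers are hypothetical.

```python
from collections import defaultdict


def find_homonyms(names_by_ru):
    """Return every name that is attached to more than one RU.

    names_by_ru maps an RU identifier to the names used for that RU.
    """
    rus_by_name = defaultdict(set)
    for ru_id, names in names_by_ru.items():
        for name in names:
            rus_by_name[name].add(ru_id)
    return {name: sorted(ids) for name, ids in rus_by_name.items()
            if len(ids) > 1}


# Hypothetical identifiers: 'cell' names both the general and the plant class.
names_by_ru = {
    "EX:0001": ["cell"],
    "EX:0002": ["cell", "plant cell"],
    "EX:0003": ["part of"],
}
print(find_homonyms(names_by_ru))  # {'cell': ['EX:0001', 'EX:0002']}
```

Such a check only detects identical strings. Whether two differently named RUs in fact represent the same universal still has to be judged by a domain expert.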

4.3 Positivity

Complements of classes such as ‘non-mammal’ or ‘non-membrane’ are not necessarily themselves classes and don’t designate genuine universals. Similarly, do not represent the absence of an NMR magnet as the presence of the non-existence of an NMR magnet, e.g.: 'NMR magnet' has_status "absent". Which universals exist is not a function of our biological knowledge. Be aware that terms such as ‘unknown’ or ‘untypified’ or ‘unlocalized’ do not designate genuine universals. The positivity recommendation may need to be weakened; sometimes it can make sense to have e.g. an "ex-vivo" role or a “non-living_organism”.

4.4 Objectivity – Intrinsic and extrinsic characteristics

No distinction without a difference. A child class must differ from its parent class in a distinctive way. A child class must share all the properties of its parent classes (inheritance principle) and have additional ones that the parents do not have. Each class must be defined by a formula which states the necessary and sufficient conditions for being an instance of the corresponding universal. The sibling classes under a given parent class should have differentiae which are really distinct. This means that the universals of these classes have at least distinct (ideally non-overlapping = single inheritance) extensions. The distinction between each pair of siblings must be explicitly represented (opposition principle).

To characterize classes, formulate intrinsic properties (properties that are inherent to the universal represented by the RU) rather than extrinsic ones (properties that are asserted from outside, e.g. accession numbers). ‘Intrinsic’ describes a characteristic or property of some thing or action which is essential and specific to that thing or action, and which is wholly independent of any other object, action or consequence. A characteristic which is not essential or inherent is extrinsic (from http://en.wikipedia.org/wiki/Intrinsic).

4.5 Try to avoid multiple parenthood at the beginning

No class in the hierarchy should have more than one superclass when starting to build an ontology.

Sometimes a class seems to have multiple valid parent classes, because:

• The word represents a complex concept.

• The word is a homonym (has more than one meaning).

• The discussion brings related concepts to light.

Refrain from using multiple parenthood at the beginning, because multiple parenthood resulting in multiple inheritance can generate subtle but systematic ambiguity in the meaning of the formal is_a and part_of relations used [14, 15]. Do not press the is_a relation into service to mean a variety of different things (see univocity principle). Otherwise a relaxed reading of is_a can lead to assertions of is_a relations which erroneously cross the divide between different ontological categories.

Domain experts should build single-parenthood taxonomies of their views of reality. Other domain experts build the same for theirs, and only later will all these taxonomies be ‘multidimensionally’ aligned within OBO. Secure common nodes will then result, which make consistent (!) multiple inheritance possible.

There are however many opinions on this issue. The above statements represent the ‘realist’ perspective on things and we might discuss this matter further, when we feel there is a real need for multiple parenthood.

[Alan Rectors Normalisation and untangling practices have to be discussed here, too…]

5 Naming Classes

Each class representing a universal in a representational artifact is labelled with a human readable class name. Class names should be short, easy to remember and self-explanatory. The human readable class name should be used as default browser key or display key when navigating through the class hierarchy and should therefore be as intuitive as possible to the ontology engineer building the ontological structure. However this class name will not necessarily be used as the main search attribute by the end-users or agents when they are searching for classes. For this a short and intuitive class name should be captured as preferred synonym, which would be a less explicit term of highest usage frequency found in the domain literature, i.e. the term with the highest user acceptance.

5.1 Class name precision

Class names should be precise, concise and linguistically correct (i.e. they should conform to the rules of the language used). Often terms for RUs are not precise, i.e. they do not capture the intended meaning. Imprecise terms are especially problematic in the absence of good definitions. For example, the term “anatomic_structure, system or substance” does not give us any clue as to whether the scope of the adjective “anatomic” is restricted to “structure” or extends also to “system” and “substance”. This ambiguity can lead to problems like the following: if “anatomic” is restricted to “structure” only, then “drug” and “chemical” would be classified under this class, since these are clearly substances. If it is not restricted, “drug” and “chemical” could not be classified under this class.

Avoid overloaded and highly ambiguous words and morphemes. A sensible balance has to be found between conciseness and unambiguity on one side, and intuitiveness and usage frequency within the domain on the other side.

For the preferred name, avoid adding many semantically equivalent words, because this distracts and slows down the perception of the intended meaning [16].

The class should represent and be named after the intrinsic, underlying nature of the universal to be represented, not according to extrinsic properties or roles a class can play in a particular context. Embodying the whole meaning of the class - with all its relationships to other classes - in its name is in most cases neither possible nor recommended. Keep semantics in the definitions and formalize it explicitly as properties and axioms. For example, a class “distinct_identifiable_physical_part” should be just called “physical_part”. For the human-preferred name readability should have higher priority than constraining interpretation through the class names. For the class name that is used for OE, it is the other way round.

Epistemological statements (using meta-level jargon) don't belong in class names, so avoid calling the class “instrument” “instrument_class” or the relation “has_part” “has_part_relation”. Since each class 'A' implicitly means 'the class A', prefixes or suffixes involving “_class” must be avoided. The same applies to suffixes like "_entity" and "_type".
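The check for meta-level suffixes can be automated. The following is a minimal sketch only: the function name strip_meta_suffix and the suffix list are hypothetical and simply mirror the examples given above. A flagged name still needs human review, since a suffix may occasionally be a legitimate part of a name.

```python
# Hypothetical suffix list, taken from the examples in the text
# ("_class", "_entity", "_type") plus "_relation" from "has_part_relation".
META_SUFFIXES = ("_class", "_entity", "_type", "_relation")


def strip_meta_suffix(name):
    """Return the name without a trailing meta-level suffix, if it has one."""
    for suffix in META_SUFFIXES:
        # Only strip when something remains in front of the suffix.
        if name.lower().endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


# strip_meta_suffix("instrument_class")  -> "instrument"
# strip_meta_suffix("has_part_relation") -> "has_part"
```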

5.1.1 Avoid linguistic ellipses and apocopes

Be explicit and try to avoid ellipses and apocopes, because what you leave out or regard as implicitly clear is not necessarily known to others, and certainly not to computers. Ellipses and apocopes are rhetorical figures of speech used in vernacular language: omissions of sentence parts, words or word parts that are normally required by strict grammatical or lexical rules but not by sense. The missing words are implied by the context in human language. Ellipsis usage often points to slang words, which should be avoided or captured as synonyms, e.g. "chemo" for "chemotherapy".

(The aposiopesis is a special form of rhetorical ellipsis (wiki). A typical example is: “NMR detects Receptor, and the Receptor the transmitter”, in which the second instance of the word “detects” is implied rather than explicit.)

The Plant Ontology used to use 'cell' to mean 'plant cell' in this way, which led to problems when they had to extend the ontology to deal with bacteria in plants. They have now changed the definition and name of their former 'cell' to ‘plant cell’ and created a broader ‘cell’ class. The general rule is, for every expression 'E': 'E' means: E. The term ‘E’ means what the word ‘E’ means, but the word ‘E’ may mean different things...

Sometimes hyphen usage is a hint of ellipsis usage. This should be avoided, e.g. "bio- and genetechnology" would be "biotechnology and genetechnology" and then probably be modelled as two separate classes, "biotechnology" and "genetechnology".

Confusingly we sometimes use the same general terms to refer both to universals and collections of particulars. Consider:

· HIV is an infectious retrovirus

· HIV is spreading very rapidly through Asia

This, however, could also be regarded as an ellipsis: the first "HIV" stands for "HIV-Virus", the second for "HIV-Disease".

5.2 Synonyms

One definition of synonymy, as proposed by ISO 1087-1:2000: A synonym is a “… relation between or among terms in a given language representing the same concept, with a note to the effect that terms which are interchangeable in all contexts are called synonyms; if they are interchangeable only in some contexts, they are called quasi-synonyms.“ [I don’t think ‘quasi-synonym’ should exist, see next chapter]. The number of synonyms for a class is not limited.

Should the same text string be used as a synonym for more than one class? How do we handle Homonyms? [???]. If you edit or delete a class name, the old name can still be a valid synonym, e.g. if you change "respiration" to "cellular_respiration", think of keeping "respiration" as a synonym (but in this case make it a superclass…). This helps other users to find "familiar" classes. 'Jargon' type phrases, abbreviations and acronyms are synonymous with the full name as long as they are not used in any other sense elsewhere. Translations of the class name into other languages are sometimes captured as synonyms, too. We recommend capturing translations in a different element; OWL, for example, allows the 'lang' attribute to be set, e.g. for the rdfs:label annotation property.

5.2.1 Avoid different sorts of Synonyms

As we saw above, some ontologists perceive synonyms as not always 'synonymous' in the strictest sense of the word, as they feel these do not always mean exactly the same as the class they are attached to. Some ‘synonyms’ seem to be broader or narrower in meaning than the class name; a synonym may be a related phrase or an alternative wording or spelling, or may use a different system of nomenclature. Having a single, broad relationship between a class and its synonyms is adequate for most search purposes, but for applications such as semantic matching, the inclusion of a more formal relationship set can be valuable. For this, synonym types are sometimes introduced. However, we do not recommend capturing such ‘synonym types’ as the GO style guide suggests. Capture only exact synonyms. Thesaurus information should be kept in thesaurus (SKOS) semantics and not be called a synonym, e.g. use “broader tag” instead of “broader synonym”.

5.2.2 Property synonyms

One should also capture object property synonyms (see section 4.1 of http://www.w3.org/TR/owl-guide).

5.3 Acronyms and Abbreviations

Ideally, abbreviations in names should be avoided and acronyms resolved. Names for RUs should be explicit, e.g. "number_of_residues" should be used instead of a totally unintuitive "n_res". Abbreviations and Acronyms can have different meanings in other domains, e.g. "Ca" for calcium could be mistaken for "CA", which means cancer in many other fields. You will be surprised how many different resolutions and meanings an acronym can have, e.g. try “NMR” in the tool http://www.acronymfinder.com/ and you will get about 20 meanings:

News Media Representative, Nielsen Media Research, National Museum of Racing and Hall of Fame (Saratoga Springs, NY), National Monuments Record (UK), Not My Rip, No Moves Received (online gaming), Network Measurement Report, Non-conforming Material Report, N-Modular Redundant (reliability), No Mail Receptacle (USPS), No Maintenance Requirement and New Mobile RAPCON.

When an acronym is commonly used with very high frequency in everyday language in place of its full name (then called an anacronym), for example “laser”, it can be used within a class name, while its resolved name should be listed as synonym.

Top level classes should never have abbreviations or acronyms in their names; however, there are bottom level classes in which an acronym or abbreviation could be used. In these cases of compound terms on the bottom level, the acronym should be unambiguous and be resolved in at least one of the synonyms. When an abbreviation is well known, unambiguous, appears to be needed and recurs in many RU names, then use the abbreviation. NMR and HIV are border cases. Use your judgment here. Only the main-focus acronyms that are found frequently in the ontology can stay as they are. Resolving e.g. “NMR” as “nuclear_magnetic_resonance_spectroscopy” in each RU within an NMR ontology makes too many terms unnecessarily long and hard to read.

Do not allow abbreviations which coincide with expressions that have other meanings ('chronic olfactory lung disorder' should never be abbreviated to 'cold'). If acronyms can’t be avoided, capitalize them. There is no clear policy on when to spell out abbreviations, so use your common sense.

5.4 Registered Product- and Company-names

Proprietary names should be captured as they are, as long as this is not prohibited by ‘allowed character’ rules for the element used to represent them. They are allowed to break typographical conventions, e.g. there can be an "AVANCE_II_spectrometer" (starting with a capital letter) and there can be a CamelCase brand name like “SampleJet”.

Since product names often get very cryptic (e.g. a Bruker NMR magnet has the product name “US_2”), we recommend a convention that renders these more understandable: Use the company name as prefix, the product name as infix and the product type (superclass) as headword/suffix, e.g. use “Bruker_US_2_NMR_magnet” instead of “US_2”.
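The prefix / infix / headword convention can be applied mechanically. The following is a minimal sketch; the function name product_class_name is hypothetical, and it assumes the underscore as word separator as recommended elsewhere in this document.

```python
def product_class_name(company, product, product_type):
    """Compose company name + product name + product type into one class name.

    Whitespace inside each part is replaced by single underscores;
    the capitalisation of proprietary names is left untouched.
    """
    parts = (company, product, product_type)
    return "_".join("_".join(part.split()) for part in parts)


# product_class_name("Bruker", "US 2", "NMR magnet") -> "Bruker_US_2_NMR_magnet"
```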

[parsers, add ]

5.5 Lexical properties of class names

5.5.1 Capitalisation

Names should be in lower case letters throughout, except for acronyms, which are capitalised (if their use in class names can't be avoided), and proprietary names, which are written as such. Acronyms and brand names can break the convention rules unless RDF field restrictions prevent this. E.g. there can be an "NMR_instrument" (starting with a capital letter) and there can be a CamelCase brand name like “SampleJet”. Other KR domains (semantic web / OWL, Protégé group) capitalise the first letter of class names, while properties start with lower case letters.

Internal capitalization is, however, enforced by some computer systems and mandated by the coding standards of many programming languages, e.g. Java coding style dictates that UpperCamelCase be used for classes and lowerCamelCase for instances and members. So unless you plan to use auto-generated Java classes or any MDA approaches to convert the ontology into software code, avoid CamelCase.
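The lower-case rule with its two exceptions can be checked automatically if the permitted acronyms and proprietary names are listed explicitly. The following is a minimal sketch; the function name capitalisation_issues and the idea of an explicit exception list are assumptions made for illustration.

```python
def capitalisation_issues(name, exceptions=()):
    """Return the words of an underscore-delimited name that contain
    upper-case letters and are not listed as permitted exceptions
    (acronyms or proprietary names)."""
    allowed = set(exceptions)
    return [
        word
        for word in name.split("_")
        if word and word != word.lower() and word not in allowed
    ]


# capitalisation_issues("NMR_instrument", exceptions=["NMR"]) -> []
# capitalisation_issues("Sample_temperature")                 -> ["Sample"]
```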

5.5.2 Character set

Terms designating RUs should consist mainly of alphabetic characters, numerals and underscores. Whether you will be allowed to use the space as word delimiter depends on the way the implementation handles the strings for the representational unit in question. Avoid special characters where possible. Avoid character combinations that may have a special meaning in regular expressions, programming languages or XML. This recommendation is largely dependent on what the parsers for the implementation format for the specific RU can handle.
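A simple character check can flag names that leave the recommended character set. The following is a minimal sketch; it assumes that "alphabetic characters, numerals and underscores" means ASCII letters, digits and "_", which may be stricter than a given implementation format requires. The names ALLOWED_CHARACTER and disallowed_characters are hypothetical.

```python
import re

# Assumption: ASCII letters, digits and the underscore are allowed.
ALLOWED_CHARACTER = re.compile(r"[A-Za-z0-9_]")


def disallowed_characters(name):
    """Return the distinct characters of name that lie outside the
    allowed set, in order of first appearance."""
    found = []
    for character in name:
        if not ALLOWED_CHARACTER.fullmatch(character) and character not in found:
            found.append(character)
    return found


# disallowed_characters("NMR_magnet")   -> []
# disallowed_characters("sample<pH>*")  -> ["<", ">", "*"]
```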

5.5.3 Character and word formattings

No accents, subscripts or superscripts are allowed (e.g. cm3 replaces cm³ and CO2 replaces CO₂). The names of chemical elements from the periodic table should be written out in full and should not be abbreviated with their symbols (use hydrogen, copper and zinc rather than H, Cu and Zn). Greek symbols should be spelled out, e.g. "alpha" instead of "α". Temperature designations like 37 °C can be represented as 37C or, better, be represented formally through a proper units ontology.

Full stops, exclamation marks and question marks do not belong in class names.
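Two of the formatting rules above, flattening superscripts and subscripts and spelling out Greek letters, can be sketched with the Python standard library. This is an illustration under assumptions, not a complete normaliser: the function name spell_out is hypothetical, accents are deliberately left untouched (see the remarks on umlauts in section 5.5.5), and only plain Greek letters produce a clean result.

```python
import unicodedata


def spell_out(name):
    """Replace superscript/subscript characters by their plain form and
    Greek letters by their lower-case Unicode letter name."""
    result = []
    for character in name:
        unicode_name = unicodedata.name(character, "")
        if unicode_name.startswith(("GREEK SMALL LETTER ", "GREEK CAPITAL LETTER ")):
            # e.g. "GREEK SMALL LETTER ALPHA" -> "alpha"
            result.append(unicode_name.split(" LETTER ", 1)[1].lower())
        elif unicodedata.decomposition(character).startswith(("<super>", "<sub>")):
            # Compatibility normalisation maps e.g. superscript three to "3".
            result.append(unicodedata.normalize("NFKC", character))
        else:
            result.append(character)
    return "".join(result)


# spell_out("cm\u00b3")       -> "cm3"
# spell_out("CO\u2082")       -> "CO2"
# spell_out("\u03b1_helix")   -> "alpha_helix"
```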

5.5.4 Punctuation

Various kinds of punctuation connect name parts, including separators such as spaces, hyphens, and grouping symbols such as parentheses. These may have:

a) No semantic meaning. A naming rule may state that word separators will consist of one blank space or exactly one special character (for example the underscore) regardless of semantic relationships of parts. Such a rule simplifies name formation.

b) Semantic meaning. Separators can convey semantic meaning by, for example, assigning a different separator between words in the qualifier term from the separator that separates words in the other part terms. In this way, the separator identifies the qualifier term clearly as different from the rest of the name. For example, in the data element name “Cost_Budget-Period_Total_Amount” the separator between words in the qualifier term is a hyphen; other name parts are separated by underscores.[???]

Other languages, e.g. Asian languages, form words using two characters which, separately, have different meanings, but when joined together have a third meaning unrelated to their parts. This may pose a problem in the interpretation of a name because ambiguity may be created by the juxtaposition of characters. A possible solution is to use one separator to distinguish when two characters form a single word, and another when they are individual words.

5.5.4.1 Word separators

Class name terms should be delimited by the "_" (underscore) separator. The underscore substitutes for the space character. Whether you will be allowed to use the space as word delimiter depends on the way the implementation handles the strings for the representational unit in question. Under the OBO umbrella one can find the "MyClass", "My Class", "My-Class", "My_Class", "My_class" and "my class" conventions, sometimes even within one ontology. One convention is not necessarily better or worse than another as long as it is used consistently within the ontology. Java programmers, for example, use the "MyClass" (CamelCase) convention, because that is the standard for naming Java classes, whereas text miners use the "My class" convention, because it is easier to tokenize with natural language processing tools. The CamelCase convention has difficulty capturing class names like “Sample_pH”, which would then read “SamplePH”. XML based languages don't like the space as a separator, so check how your parser copes with it in the (meta-) RU which captures the name for the RU.
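The “Sample_pH” problem can be made concrete with a small round trip. The following is a minimal sketch; both function names are hypothetical, and from_camel_case is a deliberately naive inverse that only splits before an upper-case letter following a lower-case letter or digit.

```python
import re


def to_camel_case(name):
    """Join underscore-delimited words, upper-casing the first letter of each."""
    return "".join(word[:1].upper() + word[1:] for word in name.split("_") if word)


def from_camel_case(name):
    """Naive inverse: insert "_" before an upper-case letter that follows
    a lower-case letter or a digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)


# to_camel_case("Sample_pH")    -> "SamplePH"
# from_camel_case("SamplePH")   -> "Sample_PH"   (the lower-case "p" is lost)
```

The underscore form survives conversion to CamelCase, but the reverse direction cannot recover the original spelling, which is one reason to keep the underscore-delimited name as the primary form.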

5.5.4.2 Hyphens, dash and slash

The hyphen should be avoided as a word separator; it should be used only as in normal written English. Java will interpret the hyphen as a minus. Using the hyphen as separator would also cause ambiguity between hyphens required by English, e.g. “copper-based_compound”, and hyphens used to restrict or refine the meaning of a name, e.g. in homonym resolution: "bow-boat_part" and "bow-the_weapon", as is still done in some ontologies. In general we recommend avoiding overloading the hyphen or equivalent characters with meanings, and using the hyphen only as it is used in natural language.

The hyphen has many meanings which we take for granted, but which have to be assigned more explicitly to be processed by computers. When using the hyphen one should be aware that its meanings can conflict: it can mark a general, undefined "somehow-related-to" relationship, it can mark a closer semantic binding as in “copper-based_compound” and can encode substantiation as in "abdomen-sonography", but it can also mark a divergence in meaning between the two words, as in "black-white". In “bio- and genetechnology” it encodes an apocope, standing for the morpheme “technology”. Sometimes the hyphen encodes different logical connectors like "and" or "or", and it can be used to separate syllables when breaking a word in two at the end of a line. In sentences it can of course also set off additional thoughts squeezed into a sentence, as in “Enzymes – except Prions – are useful Proteins”. The hyphen also marks numerical, spatial or temporal ranges as in “1–4 telephone calls”, “Bremen–Hamburg” and “25.09.–28.12”, or is used as a minus or to indicate an omission as in “the PC is worth 300,–”.

We need to differentiate between the hyphen and the dash. There are two kinds of dashes: the n-dash (en dash) and the m-dash (em dash). The n-dash is so called because it is the same width as the letter "n". The m-dash is longer, the width of the letter "m". We use the n-dash for numerical ranges, as in "6–10 years". When we need a dash as a form of parenthetical punctuation in a sentence, we use the m-dash.

The slash "/" means OR or AND in most cases and should be avoided in class names as should logical connectives in general.

5.5.5 Specific language requirements

Consistency is required if encountering this special case.

Where there are differences in the accepted spelling between British and US usage, use the US form, e.g. polymerizing, signaling rather than polymerising, signalling.

A common source of misspelled tags is the translation from other alphabets or characters. For example, the Umlaut, commonly used in German, is usually represented by the Latin-1 character set. Since this character set is often unavailable, Germans frequently represent an Umlaut character by means of a longhand encoding, such as "ue" for "ü". Consistency is required in these special cases to avoid mixture of "ü"s and "ue"s.

5.5.6 Wordform and tense

Names for RUs should be in the singular form throughout. This prevents redundancy and misclassifications, e.g. creating a class "experiments" (plural) and then "experiment" as its subclass deeper in the hierarchy (true only if the idiom used is checked to keep a unique string, e.g. :NAME field in Protege). If you want to import legacy XML or generate XML feeds from the ontology you have to use the singular form anyway, since this is the expected convention for XML tags.

Class names are always nouns, so use "randomisation" instead of "randomise" if you intend to model a class, use "randomises" if you model a property. Nouns are the most concrete part of speech. Verbs can be converted to nouns. Adjectives and adverbs, however, seldom convey meanings captured via atomic classes. They correspond more to properties [section needs work ???].

Class and property names (verbs) should be uniformly captured in present tense.

Sometimes a time perspective is indicated within class or property names, e.g. “to_be_measured”, “measuring”, “measurement_taken”. Class names should be normalized consistently into the present tense form or, better, be tense-less nominals, e.g. “measurement”.

5.5.6.1 Plurals and sets

If you have to capture plurals, you have several possibilities, e.g. “protocols”, “set_of_protocols”, “protocol_set” and “protocol_collection”. The last form is recommended (just add the “_collection” postfix to the singular class name), because it is easier to spot (also for textmining). It is preferred over “collection_of_x” because it is placed alphabetically directly beneath its singular form within the hierarchy. The “X_set” convention has to be avoided, since the word “set” is highly ambiguous. Use plurals sparingly and only if you really think you will need them for the application. Creating for each singular x a plural container of the form “x_collection” creates a lot of classes which we might not use at all. An instance of 'protocol' is a protocol and an instance of 'protocol_collection' is a collection/set of protocols. Be aware of the difference: each class 'A' in an ontology has the implicit meaning 'the class A'.

[Refine, (Chebi comment)]

NOTE: The realist distinguishes set theory ‘classes’ (not the type we use the word class for here) from collections/sets for ontological classes (types): Both classes and collections are marked by granularity, but collections are timeless. A set theory class endures through time and survives the turnover in its instances. A set theory class is not determined by its instances (as a state is not determined by its citizens and as an organism is not determined by its molecules). A collection/set is determined by its members. It is an abstract structure, existing outside time and space. The set of human beings existing at t is (timelessly) a different entity from the set of human beings existing at t' because of births and deaths.

5.5.7 Word order (Syntactic issues):

Rules for compound term names should be investigated, e.g.:

a) The object class term shall occupy the first (leftmost) position in the name.

b) Qualifier terms shall precede the part qualified. The order of qualifiers shall not be used to differentiate names.

c) Descriptive property terms shall occupy the next position.

d) Terms designating the parentclass shall occupy the last position.

e) If a word in the name is redundant i.e. with a word in the property term, one occurrence should be deleted.

f) Do not put the type of the RU you model (i.e. '_class' or '_property') at the end of the class name.

5.5.8 Word length and word compositions

Names for RUs that show up in the hierarchy (i.e. the browser or display key), and that should be read quickly for orientation purposes, should be at least four characters long and as short as possible so as to be easily readable and understandable. Avoid creating human readable or preferred names that look like full sentences. Ideally, short and maximally intuitive names are to be preferred. Names are useful only if they are in fact used.

[see JacobKoehler paper."intelligibility of GO terms" + DILS paper].

Word compositions longer than five words and very complex morphemes should be avoided. When class names are made out of more words, try to use words that are already defined in higher hierarchy levels of the ontology. Build compound names out of simpler ones from the ontology in a consistent LEGO-like approach. Consistent means that the binding operators (words used to connect the other parts of the class name) are used in the same sound manner throughout the ontology. ‘Recycle’ words whenever possible.
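The two thresholds mentioned in this section (at least four characters, not more than five words) lend themselves to a simple automated check. The following is a minimal sketch; the function name length_issues is hypothetical and it assumes underscore-delimited names. Well-known short names will be flagged and need a human decision.

```python
def length_issues(name, min_characters=4, max_words=5):
    """Return a list of human-readable complaints about the length of a name."""
    issues = []
    if len(name) < min_characters:
        issues.append("shorter than %d characters" % min_characters)
    words = [word for word in name.split("_") if word]
    if len(words) > max_words:
        issues.append("more than %d words" % max_words)
    return issues


# length_issues("pH")                                 -> ["shorter than 4 characters"]
# length_issues("sample_temperature_in_autosampler")  -> []
```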

A formal class name can be given to a class, i.e. a name for the class that is formally controlled through linguistic rules and axioms. Examples are OBOL-normalized names, which adhere to defined principles of word/morpheme/affix order and form, or class names that use a controlled natural language (CNL) such as KANT, ClearTalk or Attempto Controlled English (ACE). CNLs are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity. CNLs can improve readability for human readers and improve computational processing of the text.

5.5.8.1 Compound vs. atomic names for representational units

Sometimes one encounters rather long names for RUs, which encode a lot of semantics within the name. These complex names are compositions of many words and therefore are called compound terms. They often consist of a noun phrase, like "sample_temperature_in_autosampler" embedding a prepositional term (localizational property like "in_autosampler").

[Compositionality – see Chris Mungall's OBOL , see Okren]

Avoid working with too few resources for expressing relations. GO, for example, captures relations implicitly and indirectly within class names by constructing class names that contain syntactic operators such as 'site_of', 'within', 'extrinsic_to', 'space', 'region', and so on. This is a result of a lack of relations (e.g. an asserted ‘location’ relation). It then simulates assertions of location by means of 'is_a' and 'part_of' statements involving such composites, for example in:

extracellular region is_a cellular component

extrinsic to membrane part_of membrane

When the representational formalism allows properties to be formalized and the atomic components are already present, these classes can be refactored / dissected / decomposed into more primitive existing classes (atoms) and attributes or relations between them (in OWL-speak: you build a named/defined class from primitive classes and restrictions). This is encouraged for OWL ontologies. When only an is_a hierarchy (without properties) is provided, compound names should be kept in the long form to capture what the user really wants to express, and one has to keep the semantics within the class name. As long as one is working with CVs, one should aim to be reasonably descriptive, even at the risk of some verbal redundancy or longer names. That is why one often finds rather long class names in taxonomic CVs.

When word combinations with genitive, dative or accusative case occur, variants are possible, e.g. combination into one single word, as in Breaking_off_the_experiment → experiment_breakoff, or connection with a hyphen, as in NMR_of_Hydrogen → Hydrogen-NMR.

According to DIN 12/1993, when new terms are created out of existing already defined class names the following types of multi-word terms can be distinguished (B. Schaeder, Fachlexicographie: Fachwissen und seine Repraesentation in Woerterbuechern, 1994, Tübingen):

Determinative term linkage:

A second term occurs additionally, as a feature in the content of the original term, whereby the latter is restricted. The resulting multi-word term is a subterm. E.g. fast_NMR.

Disjunctive term linkage:

The new multi-word term encompasses the scope of both constituent terms. E.g. GC_MS.

Integrating term linkage:

Objects associated to terms are combined into the next higher whole. E.g. sponsor-investigator.

Conjunctive term integration:

The new term merges the contents of both constituent terms, and is their next common subterm. E.g. investigator_study.

[To be evaluated…]

5.5.8.2 Splitting and merging classes

Simple (sometimes hyphen separated) and bimorphemic compound terms like "histology-result" should only be atomised into "histology" and "result" when the occurring morphemes themselves represent single important classes which are of use in other multi-word creations. E.g. for a clinical trial the atomic morphemes "ethics" and "commission" are not important, so a multi-word term like "ethics_commission" can stay as it is and needs to be defined only once.

The standard procedure for refactoring / splitting a class is to obsolete the original class and add a suitable comment directing annotators to the new classes (see Metadata Annotation document on http://msi-ontology.sourceforge.net/). Classes are merged in cases where two classes have exactly the same meaning in all contexts (i.e. are synonymous). Usually this situation arises when one class exists, and another wording of the same concept is added as a new class instead of as a synonym, either because a curator didn't find the old class or didn't know it meant the same thing.

5.5.9 Affixes (prefix, suffix, infix and circumfix)

The word stem should be used to formalize class names; affixes to names should be avoided where possible and in any case be used consistently. When an ontology has many terms starting with the same prefix, for example “sample_number”, “sample_origin”, it suggests the need for transforming the postfixes into properties of a [prefix]-class when building the ontology. If subclasses are named using the class name and a further descriptive morpheme, this should be done in a consistent way throughout the subclasses. For example, a class "receptor" can have two subclasses named “catecholamine_receptor” and “peptide_receptor” (naming them just “catecholamine” and “peptide” would be bad practice, since ellipses have to be avoided and “peptide” designates a completely different class anyway). So there should not be the names “catecholamine_receptor” and “peptide” side by side. If one prefixes a "receptor" subclass name in the form xy_receptor, e.g. "adrenaline_receptor" (having the ligand as the prefix xy), one can't integrate in a consistent way receptors that are named according to their succeeding signal transduction module (and not the ligand), e.g. "G-protein_coupled_receptor". Infixes, circumfixes, articles, conjunctions and possessive forms of words should be used consistently, but be avoided when possible.
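Names that share a prefix, such as “sample_number” and “sample_origin”, can be found automatically as candidates for refactoring into a class with properties. The following is a minimal sketch; the function name shared_prefixes is hypothetical, it assumes underscore-delimited names, and it only looks at the first word of each name.

```python
from collections import defaultdict


def shared_prefixes(names, min_count=2):
    """Group names by their first word and return the groups that
    contain at least min_count names, as {prefix: [remainders]}."""
    groups = defaultdict(list)
    for name in names:
        prefix, separator, remainder = name.partition("_")
        if separator and remainder:
            groups[prefix].append(remainder)
    return {
        prefix: remainders
        for prefix, remainders in groups.items()
        if len(remainders) >= min_count
    }


# shared_prefixes(["sample_number", "sample_origin", "receptor"])
#   -> {"sample": ["number", "origin"]}
```

Each reported prefix is only a hint: whether “number” and “origin” should become properties of a “sample” class remains a modelling decision.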

5.5.10 Logical connectives

According to the realist view on ontologies, logical connectives such as "and", "or" and "not" should not be used within names for RUs, because they will be formalised as constraints and axioms later (and hence will allow for reasoning). 'rabbit or whale' does not designate a special universal of mammal. In general, OWL allows you to build named/defined classes and label them accordingly.
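Logical connectives in names can be detected by comparing whole words, so that names which merely contain the letters "and" or "or" are not flagged. The following is a minimal sketch; the function name connectives_in is hypothetical, and the connective list contains only the three examples named above.

```python
CONNECTIVES = {"and", "or", "not"}


def connectives_in(name):
    """Return the logical connectives that occur as whole words in a
    name delimited by underscores or spaces."""
    words = name.lower().replace(" ", "_").split("_")
    return [word for word in words if word in CONNECTIVES]


# connectives_in("rabbit_or_whale") -> ["or"]
# connectives_in("android_part")    -> []
```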

5.5.11 "Taboo" words and Character combinations

Where possible, words from the metalevel (the representation formalism / KR language) should not be used within names for RUs. The use of database or ontology language keywords, for example "Model", "Class", "KIF", "Clips" and "OWL", and of XML style tags or characters designating tags or regular expressions should be avoided when possible, because you never know whether all parsers you might need to use will handle these. This also helps to avoid parser problems when translations into other formats have to be made.

Other words and morphemes to be avoided are highly ambiguous ones, e.g. “set” and “setting” are among the most ambiguous words in English. "Set" alone has over 20 different meanings (it can, for example, refer to the process of setting parameters or to a plurality of parameters).

Avoid anything that is related to XML or regular expressions in your class name, since it might cause problems in other parsers you might want to use later.

6 Class definitions (temporary and formal ones)

Class definitions should provide the context and meaning of the class in a way that eases its interpretation. The definition should contain important keywords that describe the class's inherent attributes and relations to other classes in natural language. In reality, however, proper definitions cannot be created for all universals, especially at the root level of the ontology (e.g. it is hard to define “thing”). A class should be given a humanly intelligible definition only when the necessary and sufficient conditions for being an instance of the corresponding universal are really understood. Before that, do not make up pseudo-definitions (e.g. circular definitions), but provisionally collect the necessary conditions in the comment field. Proofread your definitions carefully to eliminate typos and double spaces. As with class names, avoid using abbreviations that may be ambiguous. Keep in mind that definitions will also aid textmining approaches, so be formal and consistent. If you refer to other classes, use their real natural language names and avoid the ‘artificial’ underscore delimiter.

In practice one would first capture non-formal definitions as they come from the domain experts, from glossaries or from a Google “define:” search. These are captured with their provenance (meta-) data in a “tempdef” field. Then one creates a second definition which is more formal and standardized according to the principles listed below. [combine with following chapter:]

You can use different tools to help you gather initial informal definitions. The most useful are:

http://www.medbioworld.com/advice/dict.html

http://www.pharma-lexicon.com/

Google, define:

WIKI

….

6.1 General rules for creating sound normalized definitions

1. Each definition refers to only one class.

2. Definitions should be as brief as possible, but as complex as necessary. Definitions should be as clear and concise as possible in order to convey the essence, "Das Wesen" (Silesius) of the universal to the user of the ontology.

3. The definition should be written at the same level of specificity as the class itself.

4. Definitions should begin with an upper-case letter, may consist of more than one sentence if necessary, and always end with a period (full stop).

5. Definitions should define classes and their referred universals and not the words used to refer to classes (class names), so in definitions avoid terms like ‘class’, 'descriptor', 'name', etc. that refer to RUs and not to the universals in reality. E.g. the definition of 'eye' is 'organ of sight', not 'is name of organ of sight', nor ‘class or concept describing an organ of sight’. Avoid using acronyms within definitions.

6. The definition should explain the characteristics (or properties) that distinguish members of this class from others (the upper class and siblings). A formal definition should be clear, concise, and unambiguous (i.e. you could look at something and say whether or not it belongs to the entity type).

7. Definitions with too many words like 'and', 'or', or 'where' in them should be viewed with suspicion.

8. Definitions should use simple, easy to understand words that are meaningful to most of the users. In the best case all terms in the definition can be found as classes in higher levels of the ontology and are thus defined.

9. A definition should be positive and not negative. Definitions like ‘all animals that are not a mammal’ or ‘all non-membrane proteins’, which do not designate natural kinds, are not helpful, since complements of universals are not necessarily themselves universals.

10. The formal rules for definitions laid down by Aristotle should be applied. When A is_a B, the definition of ‘A’ takes the form: An A is a B which C, e.g. “A human being is a mammal which is rational”. Essence = Genus + Differentiae. Definitions should start in the following way: “A [class described] is a [superclass], which/that [most relevant intrinsic properties (attributes and relations to other classes)]. It…. [Enter]”. When using the word “it”, make sure you always refer to the described class only. If a class has more than one parent, i.e. multiple parenthood cannot be avoided, mention all parent classes in the definition.

11. The definition should be free from words sharing the same root as the thing being defined (to be represented) and should not contain the class name itself. Avoid circularity in definitions like these:

An A is an A which is B (person = person with identity documents)

An A is the B of an A (heptolysis = the causes of heptolysis)

12. Each definition should reflect the position in the hierarchy to which a defined RU belongs. The position of a RU within the hierarchy enriches its own definition by incorporating automatically the definitions of all RUs above it. The entire information content of the hierarchy can then be translated cleanly into a computer representation.

13. The definition must be correct in all possible contexts in which the class is used, so that the class and all its synonyms are intersubstitutable with its definition in such a way that the result is both grammatically correct and truth preserving.

14. Include some examples of well known prototypical instances or subclasses of the class.
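Some of the rules above are mechanical enough to be checked automatically. The following is a minimal sketch covering rules 4, 5, 10 and 11 plus the proofreading advice on double spaces and underscores. The function name check_definition, the regular expression and the short list of metalevel terms are assumptions made for illustration; such a check can only flag candidates for human review.

```python
import re

# Words that refer to RUs rather than to universals (rule 5); a partial list.
METALEVEL_TERMS = ("class", "descriptor", "name")

# Opening required by rule 10: "A <class> is a <superclass> ..."
GENUS_FORM = re.compile(r"^(?:an?|the)\s+(.+?)\s+is\s+(?:an?|the)\s+(.*)$",
                        re.IGNORECASE | re.DOTALL)


def check_definition(class_name, definition):
    """Return a list of rule violations found in a definition (empty if none)."""
    problems = []
    text = definition.strip()
    if not text[:1].isupper():
        problems.append("does not begin with an upper-case letter")
    if not text.endswith("."):
        problems.append("does not end with a period")
    if "  " in text:
        problems.append("contains a double space")
    if "_" in text:
        problems.append("contains an underscore delimiter")
    for term in METALEVEL_TERMS:
        if re.search(rf"\b{term}\b", text, re.IGNORECASE):
            problems.append(f"uses the metalevel term '{term}'")
    match = GENUS_FORM.match(text)
    if match is None:
        problems.append("does not follow 'A <class> is a <superclass> which ...'")
        body = text
    else:
        body = match.group(2)
    # Rule 11: the class name must not reappear in the defining part.
    natural_name = class_name.replace("_", " ")
    if re.search(rf"\b{re.escape(natural_name)}\b", body, re.IGNORECASE):
        problems.append("repeats the class name (circular)")
    return problems
```

Applied to the examples from the rules, “A human being is a mammal which is rational.” passes, while “A person is a person with identity documents.” is reported as circular.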

Additionally have a look at the following paper by Jacob Koehler:

http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=1482721&blobtype=pdf

[Do we need definitions for particulars that we currently represent as classes, e.g. do brand names of instrument vendors need definitions???]

In the future definitions might be autogenerated through semantic conversions. Furthermore proper definitions can serve for quality control using textmining [17]. Automated inference of class definitions is already available from the Obol page. Note that these are automated, highly experimental and subject to change: Obol [http://www.fruitfly.org/~cjm/obol]

6.2 Property definitions

Object properties (relations) should have a definition as follows:

“The (name of relation)-relation indicates a (class name from one relationship) that is (nature of relation) for a (class name from other relationship).” For example, the definition for the property ‘storage (of material)’ might read: “A storage-relation indicates a material that is stored in a facility.” [??? refine]

7 Unique identifiers

[refine]

Following the decentralized web paradigm, every single RU (class or relation) should be versioned independently rather than versioning the ontology as a whole. Therefore it is necessary to consider conventions for unique identifiers for RUs. If a set of modular ontologies is held together only by the class name strings, then every time somebody wants to change a name, fix a spelling error, etc., a global change is needed that is intrinsically unreliable or, if the ontologies are distributed, requires a major organisational effort. When the identifiers are formal ID numbers and human-readable class names are kept as labels, the label can be changed without disturbing the linkages. Hence versioning becomes easier when using unique formal identifiers for RUs in representational artifacts. Some ontology editors, like Protégé-2000, construct identifiers out of the ontology name and numbers automatically.

A unique identifier MUST NOT be deleted once used. IDs should be conserved at all times so that, even if a term is ‘defunct’ or has a new ID, someone searching using the old ID can find it.

As a rule of thumb, while user-friendly names for RUs should not cause problems for human processing, their IDs should not cause problems for machine processing. Always remember that an ID is associated with a definition and a universal rather than with the preferred class name.
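The separation of stable IDs from changeable labels can be illustrated with a small in-memory registry. This is a sketch only: the class IdRegistry, the ID pattern (prefix, underscore, seven digits) and the labels used below are hypothetical and not prescribed by this document. The registry deliberately offers no way to delete an ID.

```python
class IdRegistry:
    """Toy registry in which IDs are permanent and labels may change."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._next = 1
        self._entries = {}  # ID -> {"label", "obsolete", "replaced_by"}

    def add(self, label):
        """Register a new RU and return its freshly minted ID."""
        ru_id = f"{self.prefix}_{self._next:07d}"  # hypothetical ID pattern
        self._next += 1
        self._entries[ru_id] = {"label": label, "obsolete": False,
                                "replaced_by": None}
        return ru_id

    def rename(self, ru_id, new_label):
        """Change the human-readable label; the ID and all links stay intact."""
        self._entries[ru_id]["label"] = new_label

    def retire(self, ru_id, replaced_by=None):
        """Mark an RU as defunct without deleting its ID."""
        entry = self._entries[ru_id]
        entry["obsolete"] = True
        entry["replaced_by"] = replaced_by

    def lookup(self, ru_id):
        """Return a copy of the entry, also for retired IDs."""
        return dict(self._entries[ru_id])
```

After a rename or a retirement, a lookup with the old ID still succeeds, so anyone searching with it can find the current label or the replacing ID.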

7.1 Life science Identifier (LSID: http://lsid.sourceforge.net/)

The LSID concept introduces a straightforward approach to naming and identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of naming schemes in use today. Almost every public, internal, or department-level data store today has its own way of naming individual data resources, making integration between different data sources a tedious, never-ending chore for informatics developers and researchers. By defining a simple, common way to identify and access biologically significant data, whether that data is stored in files, relational databases, in applications, or in internal or public data sources, LSID provides a naming standard underpinning for wide-area science and interoperability. A LSID conforms to the URN standards defined by the IETF. Every LSID consists of up to five parts: the Network Identifier (NID); the root DNS name of the issuing authority; the namespace chosen by the issuing authority; the object id unique in that namespace; and finally an optional revision id for storing versioning information. Each part is separated by a colon to make LSIDs easy to parse. Here are a few examples:

urn:lsid:pdb.org:1AFT:1 → This is the first version of the 1AFT protein in the Protein Data Bank.

urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434 → References a PubMed article

urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2 → Refers to the second version of an entry in GenBank

LSIDs name and refer to one unchanging data object each. Unlike the familiar URLs of the World-Wide-Web, LSIDs are location independent. This means that a program or a user can be certain that what they are dealing with is exactly the same data if the LSID of any object is the same as the LSID of another copy of the object obtained elsewhere. The problem with URLs is that they always point to a particular web server (which may not always be in service) and worse, that the contents referred to by a URL often change.

A universal naming scheme simplifies the processing of data from a variety of sources, because the application does not need to have specific, hard-coded support for each naming scheme. This allows cross-referencing between data sources to be done implicitly using URIs. One such effort currently underway is the Life Sciences Identifier (LSID) project. An example looks like this: urn:lsid:uniprot.org:uniprot:P49841. This LSID names a protein record in UniProt that is referred to as P49841. It consists of parts separated by colons: a prefix “urn:lsid:”; the authority name; the authority-specific data namespace; and the namespace-specific object identifier (here “P49841”).
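Because the parts are colon-separated, an LSID can be split with very little code. The sketch below follows the structure described above (prefix, authority, namespace, object identifier, optional revision). The names Lsid and parse_lsid are hypothetical, and the check is far less strict than a full implementation of the LSID specification would be.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing 'urn:lsid:' prefix: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty authority, namespace or object id: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)
```

For urn:lsid:uniprot.org:uniprot:P49841 this yields the authority "uniprot.org", the namespace "uniprot", the object identifier "P49841" and no revision.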

8 Namespace

Each RA has a unique string associated with it, the 'namespace' (NC) of that RA. The NC serves as an identifier for all the terms in one RA and designates their origin unambiguously. When using a NC-designated RU it is clear where the RU comes from (in which ontology it 'lives') and therefore in which context and how it must be interpreted. Using the NC together with a RU's identifier ensures that any RU on the web can be referred to unambiguously. By maintaining different namespaces for different ontologies it is possible for one ontology to reference RUs (classes, properties and individuals) in other RAs in an unambiguous manner and without causing name clashes. E.g. the OBI ontology at the moment refers to and makes use of the Dublin Core (DC) ontology for annotating its RUs. It refers to these DC classes by importing the RA over the web and referring to each DC RU through the NC "http://purl.org/dc/elements/1.1/".

To ensure that namespaces are unique they usually are Uniform Resource Identifiers (URIs). As in the OWL language the class names are also part of a URI, they may not contain spaces or special characters. In practice the namespace URI is a URL where the ontology can be found on the internet, e.g.:

For the NMR.owl : http://msi-workgroups.sourceforge.net/ontologies/msi/NMR.owl

For the OBI-ontology: http://obi.sourceforge.net/ontology/OBI.owl

To get the corresponding namespace from a URI, just append the “#” character to the URI.

For better readability however one can internally substitute the full namespace with a short intuitive prefix, which should be the same as for the class ID, e.g. “obi” or “nmr”.
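The relation between ontology URI, namespace and short prefix can be sketched as follows. The two URIs are the ones listed above; the prefix table, the function names and the class name "investigation" used in the usage note are assumptions for illustration only.

```python
def namespace_of(ontology_uri):
    """Derive the namespace by appending '#' to the ontology URI."""
    return ontology_uri if ontology_uri.endswith("#") else ontology_uri + "#"


# Short prefixes mapped to full namespaces (illustrative table).
PREFIXES = {
    "obi": namespace_of("http://obi.sourceforge.net/ontology/OBI.owl"),
    "nmr": namespace_of(
        "http://msi-workgroups.sourceforge.net/ontologies/msi/NMR.owl"),
}


def expand(prefixed_name):
    """Expand a name such as 'obi:SomeClass' to its full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or not local or prefix not in PREFIXES:
        raise ValueError(f"cannot expand {prefixed_name!r}")
    return PREFIXES[prefix] + local
```

With this table, the hypothetical name "obi:investigation" expands to http://obi.sourceforge.net/ontology/OBI.owl#investigation.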

9 Location of webaccessible repository

There is no formal convention for determining the location of an ontology given its URI, but it is generally recommended that ontologies are made available on the web at a location that corresponds to their URI, e.g. the FuGO ontology should be available at http://fugo.sourceforge.net/ontology/

The NC does not necessarily point to a valid URL; this is only a good practice recommendation. To share the RA with others and let them import RUs from your RA, you need to provide a stable web-accessible link to the ontology. It is suggested to create a symbolic link in the main directory of the workgroup.

If the file name or the physical location of the latest version of the ontology changes, the symbolic link can be updated and there is no need to notify everybody who uses the ontology, e.g. the OBO webmasters.

[refine:]

Physical positions for the obi.owl file:

http://svn.sourceforge.net/viewvc/*checkout*/fugo/trunk/ontology/OBI.owl?view=checkout

This is good for downloading the source. This always grabs the latest version, which is extremely useful for bulk-download software that currently uses OBO

http://svn.sourceforge.net/viewvc/*checkout*/fugo/trunk/ontology/OBI.owl?revision=44

This is the owl file itself, but the revision specific one

http://svn.sourceforge.net/viewvc/fugo/trunk/ontology/OBI.owl

This is the general svn page from where to download revisions and Diffs.

https://svn.sourceforge.net/svnroot/fugo/trunk/ontology/OBI.owl

This is the convention MSI uses for importing the NMR.owl.

The date on which a RA was frozen or its version number can be used to construct URIs for the RA versions.

Ontology URI: http://www.example.com/nmr-ontology

Ontology version URIs: http://www.example.com/nmr-ontology_061004

http://www.example.com/nmr-ontology_061126
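Constructing such version URIs from the freeze date is a one-liner. The sketch below assumes that the six digits in the examples above encode the date as yymmdd; the function name version_uri is hypothetical.

```python
from datetime import date


def version_uri(ontology_uri, frozen_on):
    """Append the freeze date (formatted as yymmdd) to the ontology URI."""
    return f"{ontology_uri}_{frozen_on:%y%m%d}"
```

Under that assumption, version_uri("http://www.example.com/nmr-ontology", date(2006, 10, 4)) reproduces the first version URI listed above.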

Since SourceForge (SF) is sometimes very slow, a more quickly accessible website would be better.

I would also suggest a simpler public URL that more closely mimics the OBI namespace URI - http://obi.sourceforge.net/ontology/OBI.owl

As mentioned this can be created as a symbolic link to the physical addresses above or to a faster accessible position.

The current OBO library system allows the specification of separate "source" and "download" metadata tags.

A symlink (also called soft link or symbolic link) is a special file that points to another file or directory; on Unix it is created with the ln -s command. So, can we just create a softlink:

create softlink:

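ln -s ~/svn.sourceforge.net/svnroot/obi/trunk/ontology/OBI.owl OBI.owl

The "~" stands for your home directory. Note that ln -s expects the target first and the link name second, and that the target must be a file system path, not a URL.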

Make the softlink:

http://obi.sourceforge.net/ontology/OBI.owl

maps to real file position https://svn.sourceforge.net/svnroot/fugo/trunk/ontology/OBI.owl

[check, refine]

10 Ontology Imports

To reference another web-based ontology, the full ontology has to be imported into the active one. Then we can start “binning” classes, e.g. from our domain-dependent / community-specific ontology into more general OBI or BFO ones.

[Look at the WIKI site : link ???]

11 Properties (Attributes and Relations)

See RO ontology (Ref).

Always formulate properties on the most general level possible.

Avoid blurred non-ontological and non-implementable relations like associated_with if you plan reasoning applications. A relation like annotates is not ontological in this sense, as it links classes not to other classes in nature, but rather to terms in a vocabulary that we ourselves have constructed. Avoid capturing closely related or even synonymous relations, e.g. derives_from and develops_from.

11.1 Assigning "key-properties" to top level classes

The explicit allocation of class key-properties (the ones that define the essence of a class A and discriminate it within its superclass B) fosters consistent taxonomisation of lower level classes. Because these properties are inherited, all subclasses at all sublevels can be checked immediately for consistency with all superclasses at any higher level (this is a feature of the Protégé frames visualisation in the ‘properties view’, not the ‘logic view’). It is not enough to capture these properties in the definitions only, because the GUI tools do not pass them on to the leaf classes as they do for formally assigned properties. Explicitly formalised properties help to constrain the interpretation of their domain classes and all subclasses, which is exactly what is needed to provide the context for classification. These key properties help to keep track of the intended (otherwise implicit) context, all the way downstream to the leaf nodes.

Consider, for example, the classification time_independent_study is_a ,...., is_a unfolding_through_time. If a key-property has_timeline had been assigned to the top level class “unfolding_through_time” (or process), the ‘properties view’ of the tab would immediately show this (inherited) property at the leaf node “time_independent_study”. Having this information visually accessible makes it easier to decide whether the classification is valid: seeing the has_timeline property associated with “time_independent_study” feels counterintuitive at first and prompts a closer look at the classification or the definition. However, since a “time_independent_study” is not the same as a “study_without_timeline”, the classification is correct in this case.

Possible key-properties for a “process”-class could be starts_at, has_object_participant, induced_through. Key-properties for the “object” top level class could be has_position, has_mass, ….
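The inheritance of key-properties down the taxonomy can be illustrated with a toy is_a table. This is a sketch, not a description of how any ontology editor works internally: the dictionaries and the function name are assumptions, and "hypothetical_intermediate_class" is a placeholder for the intermediate classes that are left open in the example above.

```python
# Child -> parent (is_a); None marks a top level class.
PARENT = {
    "time_independent_study": "hypothetical_intermediate_class",
    "hypothetical_intermediate_class": "unfolding_through_time",
    "unfolding_through_time": None,
}

# Key-properties assigned explicitly at the level where they are introduced.
KEY_PROPERTIES = {
    "unfolding_through_time": ["has_timeline"],
}


def inherited_key_properties(class_name):
    """Collect the key-properties of a class and of all its superclasses."""
    properties = []
    current = class_name
    while current is not None:
        properties.extend(KEY_PROPERTIES.get(current, []))
        current = PARENT[current]
    return properties
```

Calling inherited_key_properties("time_independent_study") returns ["has_timeline"], which is the kind of information a curator would use to double-check the classification of the leaf class.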

12 Ontology file names and versions

A file-naming convention helps to capture basic metadata in filenames and provides a simple versioning mechanism for files which our community members may upload into the file repositories. Any recommendations tackling this issue depend not only on the way files are stored and versioned (e.g. whether svn/cvs is used), but also on what kind of file-related metadata is stored within the ontology itself (e.g. OWL can capture further data in its metadata sections, or an external annotation ontology like RA_metadata.owl (link ???) can be imported to provide descriptors for such RA-related metadata).

In general you would only capture the really necessary information in the filename, usually the ones that you would need to unambiguously identify the file and important file handling metadata.

Use consistent version naming. A good practice is to align the version number with the year and month. Name each new publicly available version with the prefix “v.” followed by the single-digit year and the month, e.g. a version checked in for deployment in February 2006 would be “v.6.2”. The disadvantage here is that you cannot state anything about the scale of the advancements achieved between successive versions.
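As a minimal sketch, such a version label can be derived from the release date. The function name version_label is hypothetical. Note that a single-digit year repeats every ten years, so the label is only unambiguous within one decade.

```python
from datetime import date


def version_label(released):
    """Build a label such as 'v.6.2' from the last digit of the year and the month."""
    return f"v.{released.year % 10}.{released.month}"
```

For a release date in February 2006, version_label(date(2006, 2, 15)) gives "v.6.2", matching the example above.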

When no automatic update and versioning system is used, RA files and directories should be named according to the following syntax (if svn is used, the ShortRAname is enough):

ShortRAname[_Authority_Version_Date].ext

E.g.: NMR_MSI_v6-9_060920.owl

ShortRAname is a short descriptive name of the RA.

Authority comprises the name of the RA's engineering authority or the organization. Separate author and organization with a dash if both are featured.

Version comprises the version number. Start the version number with a "v"; use "-" instead of "." in the version numbering (like "v6-2" instead of "v6.2").

Date comprises the date the file is released. For the date reference, the parts changing less should come first, as this eases alphabetical sorting according to the date: use "yymmdd".

Ext is the proper extension for the representation language, separated by a "." (dot). There should only be one dot in the entire filename and that should be right before the file extension. "ext" is the standard file extension by which this file can be associated with an appropriate application that will handle it. This is generally 2 to 4 lower-case alphanumeric characters.

Allowed characters: The file name may contain upper and lower case text, numerals, "-" (dash) and "_" (underscore). [allowed unix filename characters ??? ]. Spaces, parentheses, or other commonly used characters, such as "~", "&", or "#" will cause the file to be rejected. Use underscores as separators.
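The file name syntax above lends itself to an automated check. The following sketch encodes one reading of the convention as a regular expression: fields are separated by underscores, may themselves contain only letters, digits and dashes, and the date must be a valid yymmdd date. The pattern and the function name parse_ra_filename are assumptions, not part of the convention.

```python
import re
from datetime import datetime

FILENAME_RE = re.compile(
    r"(?P<name>[A-Za-z0-9-]+)"            # ShortRAname
    r"(?:_(?P<authority>[A-Za-z0-9-]+)"   # Authority, e.g. author-organization
    r"_(?P<version>v[0-9]+(?:-[0-9]+)*)"  # Version, e.g. v6-9
    r"_(?P<date>[0-9]{6}))?"              # Date as yymmdd
    r"\.(?P<ext>[a-z0-9]{2,4})"           # Extension after the only dot
)


def parse_ra_filename(filename):
    """Split a file name into its fields or raise ValueError if it is invalid."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a valid RA file name: {filename!r}")
    fields = match.groupdict()
    if fields["date"] is not None:
        # Raises ValueError if the six digits are not a real calendar date.
        datetime.strptime(fields["date"], "%y%m%d")
    return fields
```

The example NMR_MSI_v6-9_060920.owl is accepted and split into its five fields; the short form NMR.owl is accepted with the optional fields set to None; names containing spaces or additional dots are rejected.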

A similar convention is practiced at the W3C for their published work (e.g. note their page header information http://www.w3.org/TR/2004/REC-webont-req-20040210/ ).

13 Contributions

This document has been drafted by Daniel Schober and has received input from members of the MSI Ontology WG, the OBO WG and the OBI WG, in particular from:

- Luisa Montecchi-Palazzi, Frank Gibson (PSI)

- Chris Mungall (OBO)

- Barry Smith (cBIO, OBO)

- Waclaw Kusnierczyk, Andrew Spears (IFOMIS)

- Gilberto Fragoso (OBI)

- Philippe Rocca-Serra and Susanna-Assunta Sansone (MSI)

- Susanna Sansone (EBI)

14 References

1. D Schober: Metadata Annotations for Representational Units and Representational Artifacts. 2006.

2. O Fiehn, B Kristal, B van Ommen, LW Sumner, SA Sansone, C Taylor, N Hardy, R Kaddurah-Daouk: Establishing reporting standards for metabolomic and metabonomic studies: a call for participation. Omics 2006, 10:158-63.

3. H Hermjakob: The HUPO Proteomics Standards Initiative - Overcoming the Fragmentation of Proteomics Data. Proteomics 2006, 6:34-38.

4. PL Whetzel, RR Brinkman, HC Causton, L Fan, D Field, J Fostel, G Fragoso, T Gray, M Heiskanen, T Hernandez-Boussard, et al: Development of FuGO: an ontology for functional genomics investigations. Omics 2006, 10:199-204.

5. S Bradner: Key words for use in RFCs to Indicate Requirement Levels. Internet Engineering Task Force 1997, March.

6. S Zhang, O Bodenreider: Law and order: assessing and enforcing compliance with ontological modeling principles in the Foundational Model of Anatomy. Comput Biol Med 2006, 36:674-93.

7. SH Brown, M Lincoln, S Hardenbrook, ON Petukhova, ST Rosenbloom, P Carpenter, P Elkin: Derivation and evaluation of a document-naming nomenclature. J Am Med Inform Assoc 2001, 8:379-90.

8. O Tuason, L Chen, H Liu, JA Blake, C Friedman: Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput 2004:238-49.

9. B Smith, W Kusnierczyk, D Schober, W Ceusters: Towards a Reference Terminology for Ontology Research and Development in the Biomedical Domain. In: KR-MED 2006; 2006.

10. LI Morrow, MF Duffy: The representation of ontological category concepts as affected by healthy aging: normative data and theoretical implications. Behav Res Methods 2005, 37:608-25.

11. TR Gruber: A translation approach to portable ontologies. Knowledge Acquisition 1993, 2:199-220.

12. S Brenner: Life sentences: Ontology recapitulates philology. Genome Biol 2002, 3:COMMENT1006.

13. B Smith, W Ceusters, B Klagges, J Kohler, A Kumar, J Lomax, C Mungall, F Neuhaus, AL Rector, C Rosse: Relations in biomedical ontologies. Genome Biol 2005, 6:R46.

14. B Smith, J Köhler, A Kumar: On the Application of Formal Principles to Life Science Data: A Case Study in the Gene Ontology. In: DILS 2004: Data Integration in the Life Sciences. Lecture Notes in Computer Science; 2004. 124-139.

15. J Bouaud, B Bachimont, J Charlet, P Zweigenbaum: Acquisition and structuring of an ontology within conceptual graphs. In: Proceedings 2nd International Conference on Conceptual Structures: Workshop on Knowledge Acquisition using Conceptual Graph Theory. Lecture Notes in Computer Science; 1994. 1-25.

16. G Vigliocco, DP Vinson, S Siri: Semantic similarity and grammatical class in naming actions. Cognition 2005, 94:B91-100.

17. J Kohler, K Munn, A Ruegg, A Skusa, B Smith: Quality control for terms and definitions in ontologies and taxonomies. BMC Bioinformatics 2006, 7:212.

***** NOTE: This document is a work in progress *****

Comments and ideas are welcomed and should be sent to: schober@ebi.ac.uk