Naming Conventions for
Controlled Vocabularies (CVs) and Ontologies
- Implementation Independent -
MSI Ontology WG: http://msi-ontology.sourceforge.net/
OBI Ontology WG: http://obi.sourceforge.net/
PSI Ontology WGs: http://psidev.sourceforge.net/
Table of contents
1.4 What is a Naming Convention
1.4.1 How does one profit from applying naming conventions
2 (Meta-) Reference Terminology
2.1 Peculiarities in getting familiar with modelling (meta-)terminologies.
2.2 Basic entities and ‘levels of reality’
2.3 Naming representational units (RU)
2.4 Naming representational artefacts (RA)
2.4.1 Terminology or Vocabulary
2.4.10 Thesaurus (Structured Vocabulary)
2.4.11 Directed acyclic graph, DAG
3 Depicting representational units within text
4 General principles for creating sound RUs
4.4 Objectivity – Intrinsic and extrinsic characteristics
4.5 Try to avoid multiple parenthood at the beginning
5.1.1 Avoid linguistic ellipses and apocopes
5.2.1 Avoid different sorts of Synonyms
5.3 Acronyms and Abbreviations
5.4 Registered Product- and Company-names
5.5 Lexical properties of class names
5.5.3 Character and word formattings
5.5.4.2 Hyphens, dash and slash.
5.5.5 Specific language requirements
5.5.7 Word order (Syntactic issues):
5.5.8 Word length and word compositions
5.5.8.1 Compound vs. atomic names for representational units
5.5.8.2 Splitting and merging classes
5.5.9 Affixes (prefix, suffix, infix and circumfix)
5.5.11 "Taboo" words and Character combinations.
6 Class definitions (temporary and formal ones)
6.1 General rules for creating sound normalized definitions
7.1 Life science Identifier (LSID: http://lsid.sourceforge.net/)
9 Location of webaccessible repository
11 Properties (Attributes and Relations)
11.1 Assigning "key-properties" to top level classes
12 Ontology file names and versions
This document suggests some
implementation-format independent naming conventions for controlled
vocabularies (CVs) and ontologies. Metadata annotation elements are not covered here; these are addressed
in a separate <<Metadata Annotations for Representational Units and
Representational Artifacts>> document [1]. These recommendations have been developed to
guide the work of the Metabolomics Standards Initiative (MSI) [2] Ontology Working Group (OWG), the Proteomics
Standard Initiative (PSI) Ontology WG [3] and the Ontology for Biomedical Investigation
(OBI, previously ‘FuGO’) WG, a larger multi-domain collaborative effort [4].
Recommendations on implementation-dependent realisations of these naming conventions in OBO and
OWL will be available in the near future.
The key words “MUST,” “MUST NOT,” “REQUIRED,” “SHALL,” “SHALL NOT,”
“SHOULD,” “SHOULD NOT,” “RECOMMENDED,” “MAY,” and “OPTIONAL” are to be
interpreted as described in the RFC-2119 document [5].
Sections in brackets […] are notes for the
editor only. Please ignore.
[add]
These naming conventions address lexical, syntactic and semantic issues in naming representational units
(mainly class names and property names) in representational artifacts, ranging from simple glossaries through
taxonomies and controlled vocabularies up to formal ontologies at the top end
of the semantic complexity scale.
This document is addressed to all biologists
and ontologists who are involved in the creation, administration and
review of symbolic representational artifacts (RAs) such as taxonomies, controlled vocabularies
and DL ontologies.
(In part from:
c035347_ISO_IEC_11179-5_2005(E)-1.zip )
A naming convention (NC) describes what is known about how names for
administered items are formulated in a consistent manner. It may be simply
descriptive, e.g., where no registration authority has control over the
formulation of names for a specific context. This NC is prescriptive in
that it specifies how names 'should' be formulated. An NC can also enforce the
exclusion of irrelevant facts about administered items.
The NC reference or specification
document (like this one) shall cover the following aspects:
There are diverse NC documents available, e.g. [6, 7], but most naming conventions are
not sufficient to serve all needs, e.g. those of text mining [8].
A rigorous, formal and logically consistent way
of naming RUs within RAs eases
· Indexing and categorisation of RUs
· Integrated tool access across different ontologies
· Ontology alignment (mapping), difference detection and merging (e.g. through PROMPT)
· Consistent visualisation
· Unified understanding of meaning by humans as well as web agents
· Avoidance of masked redundant content
The overall benefit is easier access to
different ontologies through a unified mechanism and thereby better exploitation of the
available ontological resources, e.g. in ontology libraries.
First we would like
to clarify the terminology used to talk about the different idioms that are
the subject of this text.
When the structures of
RAs and RUs are explained, the problem is that they cannot easily be
introduced in a simple, serially ordered manner (as the nature of text demands),
because each idiom relates heavily to all the others and some of the idioms are
even fractal. So we cannot expect immediate understanding of everything
mentioned when reading this text serially. Understanding will rather come
holistically, in the sense that you might have to read the whole text once more
and, while doing so, your understanding (your internal conceptualisation) of
each chapter will build up and renew gradually. Do not worry if you do not
get it the first time. There will always be words which you might not
understand immediately. At the highest level of abstraction there will even
be words that you cannot fully understand, e.g. ‘thing’.
Another issue concerns the completeness of such a description. If you
were to write a book that contains all information about writing this book
itself (again a fractal approach), this would be a never-ending, incrementally
nested task and such a book could never be finished. So not everything (e.g.
some words from the meta-terminology) can or shall be described; otherwise we
are likely to get stuck in what can be called the ‘Meta-Ether’, the little brother of ‘Analysis-Paralysis’.
We introduce a common reference terminology to harmonize cross domain
understanding of the things that are talked about.
For a more formal clarification have
a look at the ‘Terminology for Ontologies’ paper [9]:
We start out from a distinction of three levels on which entities can
exist:
Level 1 - Reality: The objects, processes, qualities, states, etc. in reality;
Level 2 - Mental Concepts: Cognitive representations of this reality on the part
of researchers and others;
Level 3 - Representational Artifacts: Concretizations of these cognitive
representations in (for example textual or graphical) representational
artifacts.
An ENTITY is anything which exists, including objects, processes, qualities and
states on all three levels (thus also including representations, models,
beliefs, protocols, documents, observations, etc.).
A REPRESENTATION is any model (for example an idea, image, record, or description) which refers to (is of or about), or is intended to refer to, some
entity or entities external to the representation. Note that any representation, like any model, by definition always leaves out many
aspects of its target; hence it can always be expanded and is never complete in covering all
aspects of the target.
A COMPOSITE REPRESENTATION is
a representation built out of constituent
sub-representations as its parts, in the way in which paragraphs are built out
of sentences and sentences out of words.
The constituent sub-representations
are called KR idioms or REPRESENTATIONAL
UNITS (RU); examples are: icons, names, simple word forms, or letters, but
also classes and properties. If we take the graph-theoretic concretisation of
the Gene Ontology as an example, then the representational units here are the
nodes of the graph, which are intended to refer to corresponding entities in
reality. But the composite representation refers, through its graph structure,
also to the relations between these entities, so that there is reference to
entities in reality both at the level of single units and at the structural
level.
A COGNITIVE REPRESENTATION
(Level 2) is a representation whose representational
units are ideas, thoughts, conceptual models or beliefs in the mind of some cognitive
subject.
A REPRESENTATIONAL ARTIFACT (RA, Level 3) is a representation that is fixed in some medium in such a way that it can
serve to make the cognitive representations existing in the minds of separate
subjects (mental concepts) publicly accessible in some enduring fashion. Examples are: a
text, a diagram, a list, a controlled vocabulary, a schema, knowledge representations (KR, also
called representational models) or ontologies. RAs can serve to convey more or less adequately
the underlying cognitive representations and can be correspondingly more or
less intuitive or understandable. RAs vary in terms of formality and semantic
expressivity (text has a high expressivity but a low formality; DL has lower
expressivity but is much more formal).
We recommend using the
term 'class' (this is the same as 'type' or 'kind') to
refer to the RU that models an ontological 'universal'. A 'concept' is the representation of a
universal in the researcher's head: his or her idea of the meaning of an entity, which
is subject to change over time and with experience [10]. “There are no valid parsers for concepts!”
An ontology should model reality, not the representation of reality in someone's
head, so it is better to avoid this term. Each class is represented through a 'class
name' (a string that designates the class for humans), a unique
identifier and a definition in natural language. Each class can have properties
(in Protégé Frames also called slots) associated with it. These properties are
constrained by facets: properties
which have values (ranges) of simple datatypes (e.g. integer, string, boolean)
are called 'attributes' or 'datatype properties'. Properties
which have classes or instances as their values are called 'relations'
or 'object properties'. The group of classes a property is associated
with is called its 'domain'.
An 'instance'
is the representation of a 'particular' of a universal in reality. A
'particular' instantiates a universal and an instance (called an individual
in OWL) instantiates a class.
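The vocabulary of this section can be made concrete with a minimal sketch. The Python names used here (OntologyClass, Property, Instance), the identifier EX:0001, the definition text and the properties has_age and has_owner are hypothetical and chosen for illustration only; they do not correspond to any particular ontology tool or API.

```python
from dataclasses import dataclass, field

# Simple datatypes: a property with one of these as its range is an 'attribute'.
SIMPLE_DATATYPES = {"integer", "string", "boolean"}


@dataclass
class OntologyClass:
    """An RU that models a universal."""
    identifier: str   # unique identifier (hypothetical scheme)
    name: str         # human-readable class name
    definition: str   # natural-language definition


@dataclass
class Property:
    """A property with a domain (associated classes) and a range."""
    name: str
    domain: list      # names of the classes the property is associated with
    range: str        # a simple datatype, or the name of a class

    def kind(self) -> str:
        # Datatype ranges give attributes; class or instance ranges give relations.
        return "attribute" if self.range in SIMPLE_DATATYPES else "relation"


@dataclass
class Instance:
    """An RU that models a particular; it instantiates a class."""
    name: str
    instance_of: OntologyClass
    values: dict = field(default_factory=dict)


dog = OntologyClass("EX:0001", "dog", "Illustrative definition text.")
has_age = Property("has_age", domain=["dog"], range="integer")
has_owner = Property("has_owner", domain=["dog"], range="person")
fido = Instance("fido", instance_of=dog, values={"has_age": 7})

print(has_age.kind())    # attribute
print(has_owner.kind())  # relation
```

The sketch keeps the instance separate from the class it instantiates, mirroring the distinction between particulars and universals.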
[Here graphic: Andrew, Ontogenesis…]
[Cite papers:
Interpretation continuum, What are the differences…, DAG]
We can sort the different types of RAs according
to their formality and semantic expressivity.
Lassila and McGuinness have
presented an ontology spectrum that presents various levels of formalization
(2001 Deborah L. McGuinness. Ontologies come of age. In Dieter Fensel,
Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors, Spinning the
semantic web: bringing the world wide web to its full potential. MIT press,
2002. Available on-line at
http://www-ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm).
The most often cited types of RAs will be described here, highlighting
their relations to each other and their differences.
Any set of symbols or terms (in most cases
words or word compositions) used for communication, which can be interpreted by
the addressee in the way intended by the addresser. Interpreted means it is felt to be descriptive
in the sense that the perception of the terms induces some kind of
understanding or conceptual model, which ideally has as much overlap as possible with the
conceptual model of the addresser. In this sense a terminology is the medium
for exchanging knowledge models. Language-related terminologies consist of
words suitable for describing a domain of interest.
Key characteristic
(primary intrinsic quality, or quale): Intended meaning
Implementation
formalisms: Any text.
Semi-structured data are usually considered to
be RAs that contain free-text fragments,
structured in accordance with some schema. Typical sorts of semi-structured RAs
are forms and tables, which have some strict structure (fields, parts, etc.),
but the content of the specific parts of the document is still free text.
Key characteristic:
combination of RA and free text
Implementation
formalisms: Tables, spreadsheets, RDB, Forms
Any terminology which is taken care
of by some registration authority or standardisation body (which can be very small, e.g. a single project or
working group), in the sense that
the terms used are controlled by a group. “Controlled” means the sense
and/or the appearances of the terms are defined in a consistent manner and the
authority has the power to enforce these. Each term should have at least a
unique identifier. The word "CV" does not say anything about the
structure of the terminology or RA, i.e. a CV can be a simple list of terms or
an ontology. No formal statement about the relationships between the terms has
to be made, but one can be made. A CV does not have to state anything about the
meaning of its terms, but usually informal definitions are provided for each
term. All terms should have unambiguously defined and
non-redundant meanings. Usually homonyms (terms that have different,
context-dependent meanings) are resolved and synonyms (different terms that refer to
the same meaning) are captured.
Def agreed by Barry, taken from semeda: a controlled
vocabulary is a set of nodes each of which is associated with an identifier,
term, definition, and an optional set of synonyms.
Key characteristic: A standards
body enumerates and defines the terms explicitly for unified usage.
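The node structure named in this definition (identifier, term, definition, optional synonyms) can be sketched as follows. This is only an illustration under stated assumptions: the class names CVTerm and ControlledVocabulary, the identifier EX:0001, the definition text and the case-insensitive matching of labels are hypothetical choices, not part of any CV standard.

```python
from dataclasses import dataclass, field


@dataclass
class CVTerm:
    identifier: str                       # unique identifier
    term: str                             # the controlled term itself
    definition: str                       # informal natural-language definition
    synonyms: list = field(default_factory=list)


class ControlledVocabulary:
    """Keeps identifiers unique and lets every label point to one node only."""

    def __init__(self):
        self._by_id = {}
        self._by_label = {}   # lowercased term or synonym -> identifier

    def add(self, node: CVTerm) -> None:
        if node.identifier in self._by_id:
            raise ValueError(f"duplicate identifier: {node.identifier}")
        labels = [node.term, *node.synonyms]
        for label in labels:
            # A label that already designates another node would be a homonym.
            if label.lower() in self._by_label:
                owner = self._by_label[label.lower()]
                raise ValueError(f"'{label}' already refers to {owner}")
        self._by_id[node.identifier] = node
        for label in labels:
            self._by_label[label.lower()] = node.identifier

    def resolve(self, label: str) -> CVTerm:
        """Map a term or one of its captured synonyms to its node."""
        return self._by_id[self._by_label[label.lower()]]


cv = ControlledVocabulary()
cv.add(CVTerm("EX:0001", "dog", "Illustrative definition text.", ["canis"]))
print(cv.resolve("canis").identifier)  # EX:0001
```

Rejecting a label that is already in use is one simple way to resolve homonyms at entry time, while the synonym list captures alternative labels for the same node.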
A glossary is a
simple list of terms in a particular
domain of knowledge with definitions and explanations in natural language
which explain the meanings of newly introduced or uncommon terms.
Any list of words
whose entries refer to entries in
another list. In contrast to a thesaurus, a dictionary usually defines
words [needs work].
A graph G consists of two sets
N and E. N is a non-empty set of nodes, and E is a set of edges, an edge being a pair of nodes
from N. G is directed if its edges are directed. The
node from which a directed edge originates is called the source and the one in
which it terminates is the target. A path in a directed graph is a sequence of
nodes x_0, x_1, ..., x_n (n > 0) where every two adjacent nodes x_i and x_{i+1}
(0 ≤ i ≤ n − 1) are source and target, respectively, of some edge.
The path is direct if n = 1; indirect otherwise. The path is called a cycle
if x_0 and x_n are the same node. A graph is acyclic if it
has no cycles.
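The definitions of path, cycle and acyclic graph can be checked mechanically. The following is a minimal sketch, assuming edges are given as (source, target) pairs; the function names has_path and is_acyclic and the node labels are hypothetical.

```python
def has_path(edges, source, target):
    """Return True if some path x_0, ..., x_n (n > 0) leads from source to target."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    stack = list(successors.get(source, ()))  # nodes reachable by a direct path
    seen = set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(successors.get(node, ()))
    return False


def is_acyclic(nodes, edges):
    """A graph is acyclic if no node has a path leading back to itself."""
    return not any(has_path(edges, node, node) for node in nodes)


nodes = {"a", "b", "c"}
dag_edges = {("a", "b"), ("b", "c"), ("a", "c")}
cyclic_edges = dag_edges | {("c", "a")}

print(is_acyclic(nodes, dag_edges))     # True
print(is_acyclic(nodes, cyclic_edges))  # False
```

The search starts from the successors of the source rather than from the source itself, so that a node only counts as reaching itself when a real path with n > 0 exists.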
A hierarchy
is a nested set of symbols or terms
(in most cases words or word compositions). In a hierarchy the principle used
to build the nested structure is not specified; it can be any transitive
relation (e.g. part-of, is-a, ….) and even multiple relations at the same
time. The term refers to the graphical structure and does not specify the
semantics behind the parent-child relationship. In this sense
nested XML
elements are hierarchical when displayed as such, but the
meaning of 'B being nested in A' is not defined within the XML. The meaning
of a hierarchy is specified by whatever the meaning of its hierarchical
relationship is.
There are single-parent
hierarchies (mono-hierarchies) and multiple-parent
hierarchies (poly-hierarchies or directed acyclic graphs, DAGs), in which
one term can be found under more than one parent. Multiple
parenthood is a well-established practice for profiting from multiple inheritance
of properties.
Key characteristic:
Graph structure
When the relation used to build the hierarchy
is one transitive relation only, i.e. the nested (child) term stands in an
'is-a' or ‘part-of’ relationship to its parent term throughout, we speak of a taxonomy (from the Greek verb τάσσειν or tassein = "to classify"
and νόμος or nomos = law, science, cf.
"economy"). Taxonomy was once only the science of classifying living
organisms.
The taxonomy is a
hierarchy (usually a collection of controlled vocabulary terms) built according
to one intrinsic property of the items to be taxonomized (e.g., whole-part,
genus-species, type-instance). Some taxonomies allow poly-hierarchies, which
means that a term can have multiple parents. If a term has children in one
place in a taxonomy, then it has the same children in every other place where
it appears.
A taxonomy is a directed
acyclic graph satisfying the following conditions [6]:
(1) The nodes in the graph are classes.
(2) An edge between x and y represents a direct
taxonomic (IS-A) relationship from x to y. x is called a child
(or subclass or subcategory) of y and y a parent (or
superclass) of x. A class–relationship–class
triple (x, IS-A, y), called a relation,
can also be used to represent the edge between x and y.
(3) A taxonomic (IS-A) relationship holds
between classes x and y (i.e., (x, IS-A, y))
if (a) x is a child of y, or (b) there exists a class z such that the two relations
(x, IS-A, z) and (z, IS-A, y) hold.
If (x, IS-A, y) holds, x is called a descendant of y and y an ancestor of x; in such cases, x is more specific than y (or is subsumed by y) and y is more general than x.
(4) There is one and only one class, called the root of the taxonomy,
which has no parents. Every class except the root has at least one parent.
(5) The classes x_1, x_2, ..., x_n (n > 1) are
called siblings if they all have the same parent.
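Conditions (2) to (5) translate directly into a few graph queries. The sketch below is illustrative only: the class name Taxonomy, the example classes (dog, cat, mammal, animal) and the reading of "siblings" as classes sharing at least one parent are assumptions made for the example.

```python
class Taxonomy:
    """A set of direct IS-A edges, each given as a (child, parent) pair."""

    def __init__(self, is_a_edges):
        self.parents = {}      # class -> set of its direct parents
        self.classes = set()
        for child, parent in is_a_edges:
            self.classes.update((child, parent))
            self.parents.setdefault(child, set()).add(parent)

    def ancestors(self, x):
        """All y for which (x, IS-A, y) holds, following condition (3)."""
        found = set()
        stack = list(self.parents.get(x, ()))
        while stack:
            y = stack.pop()
            if y not in found:
                found.add(y)
                stack.extend(self.parents.get(y, ()))
        return found

    def is_a(self, x, y):
        return y in self.ancestors(x)

    def roots(self):
        """Classes without parents; condition (4) demands exactly one."""
        return {c for c in self.classes if not self.parents.get(c)}

    def siblings(self, x):
        """Other classes sharing at least one parent with x (assumed reading of (5))."""
        own_parents = self.parents.get(x, set())
        return {c for c in self.classes
                if c != x and self.parents.get(c, set()) & own_parents}


taxonomy = Taxonomy([("dog", "mammal"), ("cat", "mammal"), ("mammal", "animal")])
print(taxonomy.is_a("dog", "animal"))  # True, via (dog, IS-A, mammal) and (mammal, IS-A, animal)
print(taxonomy.roots())                # {'animal'}
print(taxonomy.siblings("dog"))        # {'cat'}
```

Only the direct edges are stored; the indirect IS-A relations of condition (3) are derived on demand rather than asserted.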
The difference between
a classification and a taxonomy is that a taxonomy classifies
in a structure according to one defined relation between the entities and that
a classification uses more arbitrary (or extrinsic) grounds. As an example of intrinsic
grounds, spinach is a vegetable and not every vegetable is spinach, so
spinach is a subclass of vegetable. The decision to place spinach in the class
vegetable is based upon data intrinsic to the entities, so this
would be a piece of taxonomy (a taxonomy with a subclass hierarchy). A
classification of vegetables according to the sortal “Do I like to eat it”
would be based on an extrinsic
property. This would lead to a
classification, not a taxonomy. A taxonomic relation is a relation between
entities in the taxonomy (the is_a relation
in most cases), whereas a classification relates
the entities to something that is external.
When the relation used
to build the taxonomy is of 'part-of'
type, then we call such a taxonomy a Meronymy. For example,
'finger' is a meronym of 'hand' because a finger is part of a hand.
A collection of terms
allocated to resources by end users in order to categorise or index them in a way that these end users consider useful is called a folksonomy.
Terms in such 'democratic' folksonomies are typically added in a fast,
pragmatic, decentralized and uncontrolled manner, without necessarily making the underlying
structures or principles explicit. The process of folksonomic data
annotation (in most cases of websites) is intended to make a body of information
increasingly easy to search, discover, and navigate for human users. A
well-developed folksonomy is accessible as a shared vocabulary that is both
originated by and familiar to its primary users. Part of the appeal of folksonomies
is their independence from search engine censorship (which is currently applied by
all major software companies, e.g. Symantec, eBay and Google).
A thesaurus is an associatively networked list of words
and their descriptions in natural language. The terms refer to each other through different, often informal relations.
A thesaurus does not need to have a taxonomic structure. Usually it is a list
of controlled terms that refer to each other verbally. The relationships vary
in their level of detail (they can be simple ‘synonymous’,
‘broader_than’, or even ‘related_to’
relations). A formal definition of a thesaurus designed for indexing
(according to wiki) is: "A list of every important term (single-word or
multi-word) in a given domain
of knowledge and a set
of related terms for each term in the list."
A DAG is a directed
graph with no directed
cycles; that is, for any node, there is no nonempty directed
path starting and ending on itself. The most famous prototype for a DAG is
the gene ontology controlled vocabulary.
Ontologies were
mentioned as ‘categories’ in Aristotle's Metaphysics, but the word 'ontology'
itself was first established in the 17th century. The Encyclopaedia Britannica
defines ontology as “the theory or study
of being as such; i.e., of the basic characteristics of all reality”. This
is a philosophy-centred definition. The field has exploded with the dawn of
information technology, and the term has shifted in meaning within this field.
“Ontology” is the buzzword used on the internet when discussing
the semantic web. The WebOntology working group at W3C emphasises that ontologies are a machine-readable set of
definitions that create a taxonomy of classes and subclasses and relationships
between them.
The word ontology was
introduced to the biocommunity mainly through the Gene Ontology, an effort that in
fact built a taxonomic CV. This has created much confusion over what an
ontology is. An ontology resembles both a kind of taxonomy plus definitions and
a kind of knowledge representation language that allows additional
relations to be captured, not just the one used to build the taxonomic structure. In an
ontology one can specify the relation which is used to build the hierarchy
alongside many others. A clear border between a rich “taxonomy” and a “simple
ontology” is nevertheless hard to define.
The fundamental
difference between a classification and an ontology is in the richness of information
available formally. Both provide a list or structure of classes, but a
classification stops at that point, whereas an ontology also provides further information on the classes such as
definitions and properties like attributes and relations.
Ontology is defined in the DIP Glossary as “The
formalization of a terminology (set of terms and possibly their interrelations)
used in some domain of discourse. An ontology represents consensual knowledge
about a domain of discourse (in form of terms and possible interrelation among
them) in a formal way that can be shared between agents and makes this
knowledge accessible by machines. …” The most popular definition of an ontology
from the Semantic Web and AI perspective is the one provided in [11], http://ksl-web.stanford.edu/KSL_Abstracts/KSL-92-71.html : “An
ontology is an explicit specification of a conceptualization”, where “a
conceptualization is an abstract, simplified view of the world that we wish to
represent for some purpose.” Ontologies can be considered as RAs intended to
represent knowledge in the most formal and re-usable way possible. Formal
ontologies (as considered in AI) are represented in logical formalisms (like
OWL) which allow automatic inference over them or over datasets aligned to them. We
would describe an ontology as a CV
expressed in a formal representation language, which makes it possible to formally
capture a defined semantics. The best-known representation languages
used to structure ontologies are OWL (DL semantics) and OBO. Ontology
representation languages differ in their semantic expressivity. Ontologies are
rich enough to express meanings as formal and hence computer-accessible models
through the use of defined, related RUs. Ontology representation languages have a
defined syntax, semantics and grammar. Usually
the use of any one of the following semantic idioms is regarded as making a CV
an ontology: object properties, cardinalities, restrictions and axioms.
Def agreed by Barry, taken from
semeda: In ontologies, nodes from a CV (each of which is associated
with an identifier, term, definition, and an optional set of synonyms) are linked by directed edges, thus forming a graph. This graph
represents a counterpart structure on the side of entities (classes,
universals) in reality, and its edges represent the relations (e.g. is-a or part-of)
which hold between these entities. If a node has a parent node in the is-a hierarchy, then we say that the corresponding
class is subsumed by this parent
node.
Another rather
pessimistic definition [12] states an ontology is “a common language to
express human confusion”….
From “Data, Information, and Process Integration with Semantic Web Services
(DIP)”, http://dip.semanticweb.org
and https://bscw.dip.deri.ie/bscw/bscw.cgi/0/3016:
Knowledgebase (KB) is a term with a wide usage and multiple
meanings. It can be seen as a dataset
described through some REPRESENTATIONAL ARTIFACTS that bear formal semantics.
A KB, similar to an ontology, is represented with respect to a knowledge
representation (or just a logical) formalism, which usually allows automatic inference. It could include multiple
axioms, definitions, rules, facts and statements.
In short, a
knowledgebase is the ontology in use: when it is instantiated or its classes
are used to annotate data. In this sense the ontology is the
T(erminological)box and the data annotated through the Tbox is called the
A(ssertional)box. Both together make a knowledgebase.
In general, word processing tools offer
the following possibilities for encoding metadata about RU types in text
styles: [to be added: Formatting convention when using ontological RUs in
literature – see OBO RO paper][13]
Underlined, Italics,
Bold (these three can only be applied in word processing software),
UPPERCASE, lowercase, S p a c e d, demarcating affixes, e.g.
prefixes like > : # or _, e.g. _prefixed
circumfixes like " '
* '# , e.g. *circumfixed*.
A more direct way is to explicitly state the
type of an RU as a prefix, e.g. class:some_class, property:some_property.
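The prefix notation is easy to process mechanically, which is one reason it is convenient. The following is a minimal sketch; the function names parse_prefixed_ru and format_prefixed_ru are hypothetical, and the set of accepted prefixes is restricted to the two shown in the text as an assumption.

```python
KNOWN_RU_TYPES = {"class", "property"}  # assumed: only the two prefixes from the text


def parse_prefixed_ru(token: str):
    """Split 'type:name' into (type, name), rejecting unknown or missing prefixes."""
    ru_type, separator, name = token.partition(":")
    if not separator or ru_type not in KNOWN_RU_TYPES or not name:
        raise ValueError(f"not a prefixed RU: {token!r}")
    return ru_type, name


def format_prefixed_ru(ru_type: str, name: str) -> str:
    """Render an RU name with its type stated explicitly as a prefix."""
    if ru_type not in KNOWN_RU_TYPES:
        raise ValueError(f"unknown RU type: {ru_type!r}")
    return f"{ru_type}:{name}"


print(parse_prefixed_ru("class:some_class"))            # ('class', 'some_class')
print(format_prefixed_ru("property", "some_property"))  # property:some_property
```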
Another possibility is to put the term in XML-style
elements which state of what type the RU is, e.g.
<class>some_class</class>. This is not recommended.
We recommend the following formattings for depicting different RUs:
Universal: BOLD_UPPERCASE, e.g. DOG
Particular: ‘normal lowercase’ (with single apostrophes), e.g. ‘fido’
Class: bold_lowercase, e.g. dog
All simple word terms, e.g. a preferred name: normal (with double apostrophes), e.g. “canis”
Instance: ‘bold_lowercase’ (with single apostrophes), e.g. ‘fido’
Properties between classes: italic_lowercase, e.g. is_a
Properties between all other RU types: bold_italic_lowercase, e.g. instance_of
Three kinds of binary relations can be
distinguished according to their domain and range types [13]: cc, ci, ii
An example:
The universal DOG is represented as an ontological class dog with the preferred term name “canis”. The class dog in the representational artefact is instantiated through the
instance ‘fido’ which models that
particular ‘fido’, which sits on your lap.
dog is_a mammal. The dog has_mood
. ‘fido’ has_mood “sleepy”.
· Become acquainted with the capabilities and limitations of
1. the representation formalism you use
2. its implementation language
3. the ontology engineering tool of your choice.
· Save often! Always save to a new version number including the date. Protégé-OWL is not yet completely stable.
'Undo' is difficult and bugs occasionally corrupt ontologies beyond retrieval.
· Don’t get into 'analysis paralysis'! You will not get it right the
first time! Sometimes one has to throw things away and start again. Do not get into ‘naïve euphoria’ either.
Not every fancy, just-built piece of representation is an ontology worth
bothering others with.
· Don’t get stuck in the ‘Meta-Ether’. Do not try to
capture all possible metadata. Only formalize what is of immediate use for the
project's outcome. You have to stop capturing
· Don’t confuse the 3 layers of reality. Always be
aware of the level you are modeling. Try to model consistently ‘reality
only’ and do not mix it with ‘models of reality’ within your model.
The ontology is your model of reality; therefore do not try to model an experiment AND the description of an experiment
(i.e. a protocol) unless you really need the 'protocolness'.
· Avoid overloaded term names. The use of overloaded terms
such as “experiment”, “method”, “technique” and “instructions” has to be avoided.
They are ambiguous and have too many meanings across diverse domains. In
this example, a series of events or actions should be represented as a
single atomic “Protocol” or a collection of atomic “Protocols” rather than by using all the terms
above.
WordNet 2.0, http://www.cogsci.princeton.edu/cgi-bin/webwn
defines a module as a self-contained component (unit or item) that is used in
combination with other components. This is also the case for RUs and for RAs
built out of RUs. Build your RA in such a way that you have clearly separated,
orthogonal modules that relate to each other. These modules correspond to
upper level classes in your ontology.
[add]
Names of RUs (including the ones for relations) should
have the same meaning on every occasion of use and refer to the same universals and kinds
of entities in reality. Each name should refer to exactly one RU, and each
RU should represent exactly one entity in reality (a universal in the case of a
class). This principle of univocity excludes homonyms, terms that are used as
names of more than one RU. For example, if you use the term ‘cell’ as a name of
the class representing (the type of) cells as found in all organisms, the same
term should not be used as a name for a more specialized class representing
(the type of) cells as found only in plants. Likewise, the term ‘part of’
should not be used to name more than one relation, e.g., partonomy, set
membership, etc.
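A violation of univocity can be detected automatically by looking for names that are assigned to more than one RU. The sketch below is illustrative only: the function name find_homonyms, the identifiers (EX:0001 and so on) and the case-insensitive comparison of names are assumptions made for the example.

```python
from collections import defaultdict


def find_homonyms(name_assignments):
    """Return every name that designates more than one RU.

    name_assignments is an iterable of (name, ru_identifier) pairs.
    """
    rus_by_name = defaultdict(set)
    for name, ru_id in name_assignments:
        rus_by_name[name.strip().lower()].add(ru_id)
    return {name: sorted(ids) for name, ids in rus_by_name.items() if len(ids) > 1}


assignments = [
    ("cell", "EX:0001"),     # cells as found in all organisms
    ("cell", "EX:0002"),     # cells as found only in plants
    ("part of", "EX:0010"),  # a single relation, used univocally
]

print(find_homonyms(assignments))  # {'cell': ['EX:0001', 'EX:0002']}
```

An empty result means that each name refers to exactly one RU; it says nothing about whether each RU represents exactly one entity in reality, which still requires human review.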
Furthermore:
Don’t confuse universals with ways of getting to know types
Don’t confuse universals with ways of talking about types
Don’t confuse universals with data about types
Complements of classes such as ‘non-mammal’
or ‘non-membrane’ are not necessarily themselves classes and don’t designate genuine universals. Similarly, do not represent the absence of an NMR magnet as the presence
of the non-existence of an NMR magnet, e.g.: 'NMR magnet' has_status
"absent". Which universals exist is not a function of our
biological knowledge. Be aware that terms such as ‘unknown’ or ‘untypified’ or
‘unlocalized’ do not designate genuine universals. The positivity recommendation may need to be weakened; sometimes it can
make sense to have e.g. an "ex-vivo" role or a “non-living_organism”.
No distinction
without a difference. A child class must differ from its parent class in a distinctive way. A child class must share all the properties of its
parent classes (inheritance principle) and have additional ones that the
parents do not have. Each class must be defined by a formula which states the
necessary and sufficient conditions for being an instance of the corresponding
universal. The sibling classes of a given parent class should have differentia
which are really distinct. This means that the universals of these classes at
least have distinct (ideally non-overlapping = single inheritance) extensions.
The distinction between each pair of siblings must be explicitly represented
(opposition principle).
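The inheritance principle and the need for a differentia can be expressed as two set comparisons. This is a minimal sketch, assuming each class is described by the set of its property names; the function name check_child and the property has_owner are hypothetical.

```python
def check_child(parent_props, child_props):
    """Return a list of problems; an empty list means both checks pass."""
    parent_props, child_props = set(parent_props), set(child_props)
    problems = []
    missing = parent_props - child_props
    if missing:
        # Inheritance principle: the child must share all parent properties.
        problems.append(f"missing inherited properties: {sorted(missing)}")
    if not (child_props - parent_props):
        # No distinction without a difference: the child must add something.
        problems.append("no differentia: child adds nothing to its parent")
    return problems


mammal = {"has_mood"}
dog = {"has_mood", "has_owner"}

print(check_child(mammal, dog))     # []
print(check_child(mammal, mammal))  # ['no differentia: child adds nothing to its parent']
```

The check only compares property names; whether the added properties are genuinely distinct from those of the sibling classes remains a modelling decision.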
To characterize classes, formulate
intrinsic properties (properties that are inherent to the universal represented
by the RU) rather than extrinsic ones (properties that are asserted from
outside, e.g. accession numbers). ‘Intrinsic’ describes a characteristic or
property of some thing or action which is essential and specific to that thing
or action, and which is wholly independent of any other object, action or
consequence. A characteristic which is not essential or inherent is extrinsic
(from http://en.wikipedia.org/wiki/Intrinsic).
No class in the
hierarchy should have more than one superclass when starting to build an
ontology.
Sometimes a class seems to have multiple valid parent
classes, because:
• The word represents a complex concept.
• The word is a homonym (has more than one meaning).
• The discussion brings related concepts to light.
Refrain from using multiple parenthood at the beginning, because multiple
parenthood resulting in multiple inheritance can generate subtle but systematic
ambiguity in the meaning of the formal is_a and part_of relations used [14, 15]. Do not press the is_a relation into service to mean a variety of different things (see
univocity principle). Otherwise a relaxed reading of is_a can lead to assertions of is_a relations which erroneously cross the divide between different
ontological categories.
Domain experts should build single-parenthood taxonomies of their views
of reality. Other domain experts build the same for theirs, and only later will all these taxonomies be ‘multidimensionally’ aligned within OBO,
yielding secure common nodes which make consistent (!) multiple inheritance
possible.
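A single-parenthood taxonomy can be verified with a few lines of code. The sketch below assumes the hierarchy is available as a list of (child, parent) is_a pairs; the function name and the class names in the example are hypothetical.

```python
def classes_with_multiple_parents(is_a_pairs):
    """Map each class with more than one asserted parent to its parents."""
    parents = {}
    for child, parent in is_a_pairs:
        parents.setdefault(child, set()).add(parent)
    return {
        child: sorted(found)
        for child, found in parents.items()
        if len(found) > 1
    }


# Hypothetical is_a assertions, for illustration only.
pairs = [
    ("NMR_magnet", "magnet"),
    ("NMR_magnet", "instrument_part"),
    ("magnet", "device"),
]
print(classes_with_multiple_parents(pairs))
```

Classes reported here are candidates for review at the start of ontology building, not necessarily errors in a later, aligned ontology.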
There are however many opinions on
this issue. The above statements represent the ‘realist’ perspective on things
and we might discuss this matter further, when we feel there is a real need for
multiple parenthood.
[Alan Rector's normalisation and untangling practices have to be discussed
here, too…]
Each class representing a universal in a
representational artifact is labelled with a human readable class name. Class
names should be short, easy to remember and self-explanatory. The human readable class name should be used as default browser key or display key when navigating through the class hierarchy and should
therefore be as intuitive as possible to the ontology engineer building the
ontological structure. However this class name will not necessarily be used as
the main search attribute by the end-users or agents when they are searching
for classes. For this a short and intuitive class name should be captured as
preferred synonym, which would be a less explicit term of highest usage frequency
found in the domain literature, i.e. the term with the highest user acceptance.
Class
names should be precise, concise and linguistically correct (i.e.
they should conform to the rules of the language used). Often terms for RUs
are not precise, i.e. they do not capture the intended meaning. Imprecise terms
are especially problematic in the absence of good definitions. For example the
term “anatomic_structure, system or substance” does not give us
any clue as to whether the scope of the adjective prefix “anatomic” is
restricted to structure or extends also to system and substance. This ambiguity
can lead to problems like the following: If “anatomic” is restricted to
“structure” only, then “drug” and “chemical” would be classified under this
class, since these are clearly substances. If it is not restricted “drug” and
“chemical” could not be classified under this class.
Avoid
overloaded and highly ambiguous words and morphemes. A
sensible trade-off has to be found between conciseness and unambiguity on one
side, and intuitiveness and usage frequency within the domain on the other side.
For the preferred name, avoid adding many
semantically equivalent words, because this distracts and slows down the
perception of the intended meaning [16].
The class should
represent and be named after the intrinsic, underlying nature of the universal
to be represented, not according to extrinsic properties or roles a class can
play in a particular context. Embodying the whole meaning of the class - with all its
relationships to other classes - in its name is in most cases neither possible
nor recommended. Keep semantics in the definitions and formalize it
explicitly as properties and axioms. For example, a class “distinct_identifiable_physical_part”
should be just called “physical_part”. For the human-preferred name readability should have
higher priority than constraining interpretation through the class names. For
the class name that is used for OE, it is the other way round.
Epistemological statements (using meta-level jargon) don't
belong in class names, so avoid calling the class “instrument” “instrument_class” or
the relation “has_part” “has_part_relation”. Since each class 'A'
implicitly means 'the
class A', prefixes or suffixes involving “_class” must be avoided. The
same applies to suffixes like
"_entity" and "_type".
Be
explicit: try to avoid ellipses and apocopes,
because what you leave out or regard as implicitly clear is not necessarily
known to others, and certainly not to computers. Ellipsis and apocope
are rhetorical figures
of speech: omissions of sentence parts, words or word
parts that are normally required by strict
grammatical or lexical rules but not by sense, as found in vernacular language. The missing words are implied by
the context in human language. Use of ellipsis often points to slang words, which
should be avoided or captured as synonyms, e.g. "chemo" for
"chemotherapy".
(Aposiopesis is a
special form of rhetorical ellipsis (wiki). A typical example is: NMR
detects Receptor, and the Receptor the transmitter, in which the second
instance of the word detects is implied rather than explicit.)
The Plant Ontology used to use 'cell' to mean
'plant cell' in this way, which led to problems when they had to extend the
ontology to deal with bacteria in plants. They have now changed the definition
and name of their former 'cell' to ‘plant cell’ and created a broader
‘cell’ class. The general rule is, for
every expression 'E': 'E' means: E. The term ‘E’ means what the word ‘E’ means,
but the word ‘E’ may mean different things...
Sometimes hyphen usage is a hint of ellipsis
usage. This should be avoided, e.g. "bio- and genetechnology" would
be "biotechnology and genetechnology" and then probably modelled as
two separate classes "biotechnology" and "genetechnology".
Confusingly we sometimes use the same general
terms to refer both to universals and collections of particulars. Consider:
· HIV is an
infectious retrovirus
· HIV is spreading very rapidly through
This however could also be regarded as an
ellipsis: the first "HIV" stands for "HIV-Virus",
the second stands for "HIV-Disease".
One definition of
synonymy, as proposed by ISO 1087-1:2000: A synonym is a “… relation between or
among terms in a given language representing the same concept, with a note to
the effect that terms which are interchangeable in all contexts are called synonyms; if they are interchangeable only
in some contexts, they are called quasi-synonyms.“ [I don’t think
‘quasi-synonym’ should exist, see next chapter]. The number of synonyms for a
class is not limited.
Should the same text string be used as a synonym for
more than one class? How do we handle
Homonyms? [???]. If you edit or delete a class name, the
old name can still be a valid synonym, e.g. if you change
"respiration" to "cellular_respiration", think of keeping
"respiration" as a synonym (but in this case make it a superclass…).
This helps other users to find "familiar" classes. 'Jargon' type
phrases, abbreviations and
acronyms are synonymous with the full
name as long as they are not used in any other sense elsewhere.
Translations of the class name into other languages are sometimes captured as
synonyms, too. We recommend capturing translations in a different
element; OWL, for example, lets you set the 'lang' attribute,
e.g. for the rdfs:label annotation property.
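Keeping the old name as a synonym when a class is renamed can be made part of the editing routine. The following sketch is a hypothetical, minimal data structure (the class OntologyClass and its fields are assumptions, not part of any existing tool); it uses the "chemo" / "chemotherapy" pair mentioned earlier as example data.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """Hypothetical minimal record for one class: a name and its exact synonyms."""

    name: str
    synonyms: list = field(default_factory=list)

    def rename(self, new_name):
        """Rename the class and keep the old name as a synonym."""
        if new_name == self.name:
            return
        # The new name must not remain listed as its own synonym.
        if new_name in self.synonyms:
            self.synonyms.remove(new_name)
        if self.name not in self.synonyms:
            self.synonyms.append(self.name)
        self.name = new_name


therapy = OntologyClass("chemo")
therapy.rename("chemotherapy")
print(therapy.name, therapy.synonyms)
```

Whether the old name should really become a synonym, or rather a superclass as in the "respiration" case, still has to be decided by the curator.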
As we saw above, some ontologists perceive synonyms as
not always 'synonymous' in the strictest sense of the word, as they feel they
do not always mean exactly the same as the class they are attached to. Some ‘synonyms’ seem to be broader or narrower in
meaning than the class name; a synonym may be a related phrase or alternative wording or
spelling, or may use a different system of nomenclature. Having a single, broad
relationship between a class and its synonyms is adequate for most search
purposes, but for applications such as semantic matching, the inclusion of a
more formal relationship set can be valuable. For this purpose synonym types are sometimes introduced. However, we do not recommend capturing such
‘synonym types’ as the GO style guide suggests. Capture only exact synonyms.
Thesaurus information should be kept in thesaurus (SKOS) semantics and not
called synonym, e.g. “broader tag” instead of “broader synonym”.
One should also
capture object property synonyms (see section 4.1 of http://www.w3.org/TR/owl-guide).
Ideally, abbreviations in names should be avoided and
acronyms resolved. Names for RUs should be explicit, e.g. "number_of_residues" should be
used instead of a totally unintuitive "n_res". Abbreviations and Acronyms can have different
meanings in other domains, e.g. "Ca" for calcium could be mistaken
for "CA", which means cancer in many other fields. You will be
surprised how many different resolutions and meanings an acronym can have, e.g.
try “NMR” in the tool http://www.acronymfinder.com/ and you will get
about 20 meanings:
News Media
Representative, Nielsen Media Research, National Museum of Racing and Hall of
Fame (Saratoga Springs, NY), National Monuments Record (UK), Not My Rip
No Moves Received
(online gaming), Network Measurement Report, Non-conforming Material Report,
N-Modular Redundant (reliability), No Mail Receptacle (USPS) , No Maintenance
Requirement and New Mobile RAPCON.
When an acronym is commonly used with very high
frequency in everyday language in place
of its full name (then called an anacronym), for example “laser”, it can be used within a class name, while its
resolved name should be listed as synonym.
Top level
classes should never have abbreviations or acronyms in their names; however, there are bottom level classes in
which an acronym or abbreviation could be used. In these cases of compound
terms on the bottom level the acronym should be unambiguous and be resolved in at
least one of the synonyms. When an abbreviation is well known, unambiguous,
appears to be needed and recurs in many RU names, then use the
abbreviation. NMR and HIV are border cases; use your judgement here. Only the main-focus
acronyms that are found frequently in the ontology can stay as they are.
Resolving e.g. “NMR” as
“nuclear_magnetic_resonance_spectroscopy” in each RU within
an NMR ontology makes too many terms unnecessarily long and hard to read.
Do not
allow abbreviations which employ expressions with other meanings ('chronic
olfactory lung disorder' should never be abbreviated to 'cold'). If acronyms can’t be
avoided, capitalize them. There is no
clear policy on when to spell out abbreviations, so use your common sense.
Proprietary
names should be captured as they are, as long as
this is not prohibited by ‘allowed character’ rules for the
element used to represent them. They
are allowed to break typographical conventions, e.g. there can be an
"AVANCE_II_spectrometer" (starting with a capital letter) and there
can be a CamelCase brand name like “SampleJet”.
Since product names often get very cryptic
(e.g. a Bruker NMR magnet has the product name “US_2”), we recommend a
convention that renders these more understandable: Use the company name as
prefix, the product name as infix and the product type (superclass) as headword/suffix,
e.g. use “Bruker_US_2_NMR_magnet” instead of “US_2”.
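The company / product / product-type convention is simple enough to apply automatically. A minimal sketch, assuming the three parts are available as separate strings (the function name product_class_name is hypothetical):

```python
def product_class_name(company, product, product_type):
    """Join company (prefix), product (infix) and product type (headword)."""
    parts = [company, product, product_type]
    # Spaces inside a part are replaced by the underscore delimiter.
    return "_".join(part.strip().replace(" ", "_") for part in parts)


print(product_class_name("Bruker", "US_2", "NMR_magnet"))  # Bruker_US_2_NMR_magnet
```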
[parsers,
add ]
Names should be lower case letters throughout except for acronyms, which
are capitalised (if their use in class names can't be avoided), and proprietary names, which are written as such. Acronyms and brand names can break the convention
rules unless RDF field restrictions prevent this. E.g. there can be an "NMR_instrument"
(starting with a capital letter) and there can be a CamelCase brand name like
“SampleJet”. Other KR domains (semantic web / OWL, Protégé group) use capitals
for beginning class names, while properties start with lower case letters.
Internal
capitalization is however enforced by some computer systems, and mandated by
the coding standards of many programming languages, e.g. Java coding style
dictates that UpperCamelCase be used for classes, and lowerCamelCase be used
for instances and members. So unless you plan to use auto-generated Java
classes or any MDA approaches to convert the ontology into software code, avoid
CamelCase.
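The casing rule (lower case throughout, except fully capitalised acronyms and listed proprietary names) can be sketched as a token-wise check. The function name and the idea of passing proprietary names as an explicit allow-list are assumptions of this example.

```python
def casing_problems(name, proprietary=()):
    """Return the underscore-separated tokens that break the casing rule.

    A token is accepted if it is all lower case, all upper case (treated
    as an acronym), contains no letters, or is a listed proprietary name.
    """
    problems = []
    for token in name.split("_"):
        if token in proprietary:
            continue
        has_letters = any(ch.isalpha() for ch in token)
        if not has_letters or token.islower() or token.isupper():
            continue
        problems.append(token)
    return problems


print(casing_problems("NMR_instrument"))  # []
print(casing_problems("SampleJet_holder"))  # ['SampleJet']
print(casing_problems("SampleJet_holder", proprietary={"SampleJet"}))  # []
```

The check cannot tell a genuine acronym from a word typed in capitals, so its output is a list of candidates for manual review.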
Terms designating
RUs should
consist mainly of alphabetic characters,
numerals and underscores. Whether
you will be allowed to use the space as word delimiter depends on the way the
implementation handles the strings for the representational unit in question. Avoid special characters where possible. Avoid character-combinations that may have a
special meaning in regular expressions, programming languages or XML. This recommendation
is largely dependent on what the parsers for the implementation format for the
specific RU can handle.
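A simple way to find characters outside the recommended set (alphabetic characters, numerals and underscores) is to strip the allowed ones and look at what remains. This is a sketch with a hypothetical function name; it restricts letters to unaccented A-Z and a-z, and it also reports hyphens and spaces, which some implementations may tolerate.

```python
import re


def disallowed_characters(name):
    """Return the sorted set of characters other than letters, digits and '_'."""
    leftovers = re.sub(r"[A-Za-z0-9_]", "", name)
    return sorted(set(leftovers))


print(disallowed_characters("copper_compound"))  # []
print(disallowed_characters("cost<total>"))  # ['<', '>']
```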
No accents, subscripts or
superscripts are allowed
(e.g. cm3
replaces cm³ and CO2 replaces CO₂). The
names of chemical elements from the periodic table should be written in full
and should not be abbreviated with their symbols (use hydrogen, copper
and zinc rather than H, Cu and Zn). Greek symbols should be
spelled out, e.g. "alpha" instead of α. Temperature designations like 37 °C can be
represented as
37C or, better, be represented formally through a proper units
ontology.
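Removing superscripts, subscripts and accents and spelling out Greek letters can be partly automated. The sketch below relies on Unicode compatibility decomposition from the Python standard library; the function name and the (deliberately partial) Greek letter table are assumptions of this example.

```python
import unicodedata

# Partial mapping, for illustration only; extend as needed.
GREEK_LETTER_NAMES = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def to_plain_name(name):
    """Spell out Greek letters and strip superscripts, subscripts and accents."""
    for symbol, spelled in GREEK_LETTER_NAMES.items():
        name = name.replace(symbol, spelled)
    # NFKD turns superscript/subscript digits into plain digits and splits
    # accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks (the accents).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_plain_name("cm³"))  # cm3
print(to_plain_name("CO₂"))  # CO2
print(to_plain_name("α_helix"))  # alpha_helix
```

Note that this approach simply drops accents; it does not produce longhand encodings such as "ue" for "ü".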
Full stops, exclamation marks and question marks do
not belong in class names.
Various kinds of punctuation
connect name parts, including separators such as spaces, hyphens, and grouping
symbols such as parentheses. These may have:
a) No semantic meaning. A naming rule may state that word separators
will consist of one blank space or exactly one special character (for example
the underscore) regardless of semantic relationships of parts. Such a rule
simplifies name formation.
b) Semantic meaning. Separators can convey semantic meaning by, for
example, assigning a different separator between words in the qualifier term
from the separator that separates words in the other part terms. In this way,
the separator identifies the qualifier term clearly as different from the rest
of the name. For example, in the data element name
“Cost_Budget-Period_Total_Amount” the separator between words in the qualifier
term is a hyphen; other name parts are separated by underscores.[???]
Other languages, e.g.
Asian languages, form words using two characters which, separately, have
different meanings, but when joined together have a third meaning unrelated to
its parts. This may pose a problem in the interpretation of a name because
ambiguity may be created by the juxtaposition of characters. A possible
solution is to use one separator to distinguish when two characters form a
single word, and another when they are individual words.
Class name terms should be delimited by the "_" (underscore)
separator. The underscore substitutes for the space character. Whether you will be allowed to use the space as
word delimiter depends on the way the implementation handles the strings for the
representational unit in question. Under the OBO umbrella one can find
"MyClass", "My Class", "My-Class",
"My_Class", "My_class" and "my class" conventions,
sometimes even within one ontology. One convention is not necessarily better or
worse than the other as long as it is used consistently within the ontology. Java
programmers, for example, use the "MyClass" (CamelCase) convention,
because that is the standard for naming Java classes, whereas text miners use the
"My class" convention, because it is easier to tokenize by natural
language processing tools. The CamelCase convention has trouble capturing
class names like “Sample_pH”, which would then read “SamplePH”. XML-based
languages don't like the space as a separator, so check how your parser copes
with it in the (meta-) RU which captures the name for the RU.
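Since consistency within one ontology matters more than the particular convention, it can help to detect which separator conventions are actually in use. The following is a rough heuristic sketch; the function names and the category labels are hypothetical.

```python
import re


def separator_convention(name):
    """Guess the word-separator convention of a single name (heuristic)."""
    if " " in name:
        return "space"
    if "_" in name:
        return "underscore"
    if "-" in name:
        return "hyphen"
    if re.search(r"[a-z][A-Z]", name):
        return "camel_case"
    return "single_word"


def conventions_used(names):
    """Group names by the separator convention they appear to follow."""
    found = {}
    for name in names:
        found.setdefault(separator_convention(name), []).append(name)
    return found


print(conventions_used(["MyClass", "My Class", "My-Class", "My_Class"]))
```

More than one key in the result (ignoring "single_word") suggests that conventions are mixed and the names should be harmonised.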
The hyphen should be
avoided as word-separator and it should be used as in normal written English. Java will interpret
the Hyphen as a minus. Using the hyphen as separator would also cause ambiguity
when using hyphens when required by English, e.g. “copper-based_compound” and when used to restrict or refine the meaning
of a name, e.g. within homonym resolution: "bow-boat_part" and
"bow-the_weapon" as is still done in some ontologies. In general we
recommend avoiding overloading the hyphen or equivalent characters with
meanings and using the hyphen only as it is used in natural language.
The Hyphen has many meanings which we take for granted, but which have
to be assigned more explicitly to be processed by computers. When using the
hyphen one should be aware that its meanings can conflict: It can generally
mark an undefined "somehow-related-to" relationship, it can mark a
closer semantic binding as in “copper-based_compound” and can encode
substantiation like in "abdomen-sonography", but it can also mark a
divergence in meaning between the two words, as in "black-white". In “bio- and
genetechnology” it encodes an apocope, standing for the morpheme
“technology”. Sometimes the hyphen encodes different logical connectors like "and" or "or" and it can be used
to separate syllables when breaking a word in two at the end of a line. In
sentences it can of course also encode separation marks for additional thoughts
squeezed into a sentence as in “Enzymes – except Prions – are useful Proteins”
The hyphen also demarks numerical,
spatial or temporal lengths as in “1–4 telephone calls”, “Bremen–Hamburg” and “25.09.–28.12”, or is used as a minus or to indicate an omission as
in “the PC is worth 300,–“.
We need to differentiate between the hyphen and a dash. There are two kinds of dashes: the n-dash and the m-dash. The n-dash is
called that because it is the same width as the letter "n". The m-dash
is longer, the width of the letter "m". We use the n-dash for numerical ranges, as in "6-10
years." When we need a dash as a form of parenthetical punctuation in a sentence, we use the m-dash.
The slash
"/" means OR or AND in most cases and should be avoided
in class names as should logical connectives in general.
Consistency is required if encountering this special case.
Where there are differences in the accepted spelling between British and
US usage, use the US form, e.g. polymerizing, signaling rather than
polymerising, signalling.
A common source of misspelled tags is the translation from other
alphabets or characters. For example, the Umlaut, commonly used in German, is
usually represented by the Latin-1 character set. Since this character set is
often unavailable, Germans frequently represent an Umlaut character by means of
a longhand encoding, such as "ue" for "ü". Consistency is
required in these special cases to avoid mixture of "ü"s and
"ue"s.
Names for RUs should
be in the
singular form throughout. This prevents redundancy and
misclassifications, e.g. creating a class "experiments"
(plural) and then "experiment" as its subclass
deeper in the hierarchy (true only if the idiom used is checked to keep a
unique string, e.g. :NAME field in Protege). If you want to import legacy XML
or generate XML feeds from the ontology you have to use the singular form
anyway, since this is the expected convention for XML tags.
Class names are
always nouns, so use "randomisation"
instead of "randomise" if you intend to model a class, use
"randomises" if you model a property. Nouns are the
most concrete part of speech. Verbs can be converted to nouns. Adjectives
and adverbs, however, seldom convey meanings captured via atomic classes. They
correspond more to properties [section needs work ???].
Class and property names (verbs) should be
uniformly captured in present tense.
Sometimes a time
perspective is indicated within class or property names, i.e. ”to_be_measured”,
“measuring”, “measurement_taken”. Class names should be
normalized consistently into the present tense form or better be tense-less
nominals, e.g. “measurement”.
If you have to capture
plurals you have several possibilities, e.g.
“protocols”, “set_of_protocols”, “protocol_set” and “protocol_collection”. The
last form is recommended (just add
the “_collection” suffix to the singular class name), because it is easier
to spot (also for text mining). It is preferred over “collection_of_x” because
it is placed alphabetically directly beneath its singular form within the
hierarchy. The “X_set” convention has to be avoided since the word “set” is
highly ambiguous. Use plurals sparingly
and only if you really think you will need them for the application.
Creating for each singular x a plural-container of the form “x_collection”
creates a lot of classes, which we might not use at all. An instance of
'protocol' is a protocol and an instance of 'protocol_collection' is a collection/set
of protocols. Be aware of the difference: Each class 'A' in an ontology has the implicit
meaning 'the class A'.
[Refine, (Chebi comment)]
NOTE: The realist distinguishes set
theory ‘classes’ (not the type we use the word class for here) from collections/sets
for ontological classes (types): Both
classes and collections are marked by granularity, but collections are
timeless. A set theory class endures through time and survives the turnover in its instances.
A set theory class is not
determined by its instances (as a
state is not determined by its citizens and as an organism is not determined by
its molecules). A collection/set
is determined by its members. It is
an abstract structure, existing outside time and space. The set of human beings
existing at t is (timelessly) a different entity from the set of human beings
existing at t' because of births and deaths.
Rules for compound term names should be
investigated, e.g.:
a) The object class term shall occupy the
first (leftmost) position in the name.
b) Qualifier terms shall precede the part
qualified. The order of qualifiers shall not be used to
differentiate names.
c) Descriptive property terms
shall occupy the next position.
d) Terms designating the parentclass
shall occupy the last position.
e) If a word in the name is redundant, i.e. it duplicates
a word in the property term, one occurrence should be deleted.
f) Do not put the type of the RU you
model (i.e. '_class' or '_property') at the end of the class name.
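Rule f) can be supported by a small helper that proposes a name without the meta-level suffix. This is a sketch only: the suffix list and the function name are assumptions, and a hit is merely a candidate for review, since a suffix such as "_type" may occasionally be a legitimate part of a name.

```python
# Assumed list of meta-level suffixes; adapt to the ontology at hand.
META_SUFFIXES = ("_class", "_entity", "_type", "_relation", "_property")


def strip_meta_suffix(name):
    """Return the name without a trailing meta-level suffix, if it has one."""
    for suffix in META_SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


print(strip_meta_suffix("instrument_class"))  # instrument
print(strip_meta_suffix("has_part_relation"))  # has_part
```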
Names for RUs that
show up in the hierarchy (i.e. the browser or display key) and should
be read quickly for orientation purposes should be at least four characters
long and as short as possible, so as to
be easily readable and understandable. Avoid creating human
readable or preferred names that look like full sentences. Ideally, short and maximally
intuitive names are to be preferred. Names are useful only if they are in fact used
[see JacobKoehler paper."intelligibility
of GO terms" + DILS paper].
Word compositions longer than five
words and very complex morphemes should be avoided. When class names
are made out of more words, try to use words that are already defined in higher
hierarchy levels of the ontology. Build
compound names out of simpler ones from the ontology in a consistent LEGO-like
approach. Consistent means that the binding
operators (words used to connect the other parts of the class name) are used in the same sound manner throughout the ontology. ‘Recycle’ words whenever possible.
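The two length recommendations (at least four characters, no more than five words) can be expressed as a small check. The thresholds come from the text above; the function name, the problem labels and the assumption that words are underscore-delimited are choices made for this sketch.

```python
def length_problems(name, min_chars=4, max_words=5):
    """Return labels for the length recommendations that a name violates."""
    problems = []
    if len(name) < min_chars:
        problems.append("too_short")
    # Words are assumed to be delimited by underscores.
    words = [word for word in name.split("_") if word]
    if len(words) > max_words:
        problems.append("too_many_words")
    return problems


print(length_problems("sample_temperature_in_autosampler"))  # []
print(length_problems("pH"))  # ['too_short']
```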
A formal class name can be given to a
class, i.e. a name for the class that is formally controlled through
linguistic rules and axioms. Examples are OBOL-normalized names, which adhere to
defined principles of word/morpheme/affix order and form, or class names that
use a controlled natural language (CNL) such as KANT, ClearTalk or Attempto
Controlled English (ACE). CNLs are subsets of natural languages whose grammars
and dictionaries have been restricted in order to reduce or eliminate both
ambiguity and complexity. CNLs can improve readability for human readers and
improve computational processing of the text.
Sometimes one encounters rather long names for RUs, which encode a lot
of semantics within the name. These complex names are compositions of many
words and therefore are called compound terms. They often consist of a noun
phrase, like "sample_temperature_in_autosampler" embedding a prepositional term (localizational property like
"in_autosampler").
[Compositionality – see Chris
Mungall's OBOL , see Okren]
Try not to rely on too small a set
of relations. GO, for example, captures relations
implicitly and indirectly within class names by constructing class names that
contain syntactic operators such as 'site_of', 'within', 'extrinsic_to',
'space', 'region', and so on. This is a result of a lack of (e.g. asserted ‘location’)
relations. It then simulates assertions
of location by means of 'is_a' and 'part_of' statements involving such
composites, for example in:
extracellular region is_a cellular component
extrinsic to membrane part_of membrane
When the representational formalism allows properties to be formalized and
the atomic components are already present, these classes can be refactored / dissected / decomposed into more primitive existing classes (atoms) and attributes or relations
between them (in OWL-speak: you build a named/defined class from primitive
classes and restrictions). This is encouraged for OWL ontologies. When only an is_a
hierarchy (without properties) is provided, compound names should be kept in
the long form to capture what
the user really wants to express and one has to keep the semantics within the
class.
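As an illustration of such a decomposition, the sketch below splits a compound name at an embedded preposition into a head class, a relation and a target class. The mapping from prepositions to relation names (here "located_in") is purely hypothetical, as is the function name; a real refactoring needs a curated relation set and manual review.

```python
# Hypothetical mapping from embedded prepositions to relation names.
PREPOSITION_TO_RELATION = {"in": "located_in", "within": "located_in"}


def decompose(name):
    """Split a compound name into (head class, relation, target class).

    Returns None if no known preposition is embedded in the name.
    """
    tokens = name.split("_")
    for index, token in enumerate(tokens):
        if token in PREPOSITION_TO_RELATION and 0 < index < len(tokens) - 1:
            head = "_".join(tokens[:index])
            target = "_".join(tokens[index + 1:])
            return head, PREPOSITION_TO_RELATION[token], target
    return None


print(decompose("sample_temperature_in_autosampler"))
# ('sample_temperature', 'located_in', 'autosampler')
```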
When working with CVs one should aim to be reasonably descriptive, even
at the risk of some verbal redundancy or longer names. That is why one often
finds rather long class names in taxonomic CVs.
When word combinations with genitive, dative or accusative case occur,
variants are possible, e.g. combination into one single word, e.g. Breaking_off_the_experiment → experiment_breakoff, or connection with a hyphen, e.g. NMR_of_Hydrogen → Hydrogen-NMR.
According to DIN 12/1993, when new terms are created out of existing
already defined class names the following types of multi-word terms can be
distinguished (B. Schaeder, Fachlexicographie: Fachwissen und seine
Repraesentation in Woerterbuechern, 1994, Tübingen):
Determinative term linkage:
A second term occurs additionally, as a feature in the content of the
original term, whereby the latter is restricted. The resulting multi-word term
is a subterm. E.g. fast_NMR.
Disjunctive term linkage:
The new multi-word term encompasses the scope of both constituent terms.
E.g. GC_MS.
Integrating term linkage:
Objects associated to terms are combined into the next higher whole.
E.g. sponsor-investigator.
Conjunctive term integration:
The new term merges the contents of both constituent terms, and is their
next common subterm. E.g. investigator_study.
[To be evaluated…]
Simple (sometimes
hyphen separated) and bimorphemic compound terms like
"histology-result" should only be atomised into histology and result
when the occurring morphemes represent single important classes themselves
which are of use in other multi-word creations. E.g. for a clinical trial the
atomic morphemes "ethics" and "commission" are not
important, so a multi-word term like "ethics_commission" can stay like
this and needs only be defined once as is.
The standard procedure for refactoring / splitting a class
is to obsolete the original class and add a
suitable comment directing annotators to the new classes (see Metadata
Annotation document on http://msi-ontology.sourceforge.net/). Classes are merged in cases where two classes have exactly the same
meaning in all contexts (i.e. are synonymous). Usually this situation arises
when one class exists, and another wording of the same concept is added as a
new class instead of as a synonym, either because a curator didn't find the old
class or didn't know it meant the same thing.
The word-stem should be used to formalize class
names; affixes to names should be avoided where possible and in any case be
used consistently. When an ontology has many terms starting with the same
prefix, for example “sample_number”, “sample_origin”, it suggests the need for
transforming the suffixes into properties of a [prefix]-class when building
the ontology. If subclasses are named using the class-name and a further
descriptive morpheme, this should be done in a consistent way throughout the
subclasses. For example, a class "receptor" can have two subclasses
named “catecholamine_receptor” and “peptide_receptor” (naming them just
“catecholamine” and “peptide” would be bad practice since ellipses have to be
avoided and “peptide” designates a completely different class anyway). So there
should not be the names “catecholamine_receptor” and “peptide”. If one prefixes
a "receptor"-subclass name in the form xy_receptor, e.g.
"adrenaline_receptor" (having the ligand as xy (prefix)), one can't
integrate receptors that are named according to their downstream signal transduction
module, e.g. "G-protein_coupled_receptor" (and not the ligand), in a
consistent way. Infixes, circumfixes,
articles, conjunctions and possessive forms of words should be used
consistently, but be avoided when possible.
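Names that share a leading word, such as “sample_number” and “sample_origin”, can be found automatically and then reviewed as candidates for properties of a common class. A minimal sketch, with a hypothetical function name and underscore-delimited names assumed:

```python
from collections import defaultdict


def shared_prefixes(names, min_count=2):
    """Group names by their first word and keep groups of at least min_count."""
    groups = defaultdict(list)
    for name in names:
        head, separator, _rest = name.partition("_")
        if separator:  # ignore single-word names
            groups[head].append(name)
    return {
        prefix: members
        for prefix, members in groups.items()
        if len(members) >= min_count
    }


names = ["sample_number", "sample_origin", "receptor", "peptide_receptor"]
print(shared_prefixes(names))
# {'sample': ['sample_number', 'sample_origin']}
```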
According
to the realist view on ontologies, logical connectives such as "and",
"or" and "not" should not be used within names for RUs, because
they will be formalised as constraints and axioms later (and hence will allow
for reasoning). 'rabbit or whale' does not designate a special universal of
mammal. In general, OWL allows you to build named/defined classes and label
them accordingly.
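Names containing logical connectives can be flagged with a token-based check, so that words which merely contain "or" or "and" (such as "organ") are not reported. The function name is hypothetical, and names are assumed to be delimited by underscores or spaces.

```python
import re

CONNECTIVES = {"and", "or", "not"}


def connectives_in(name):
    """Return the logical connectives that occur as whole words in a name."""
    tokens = re.split(r"[_\s]+", name.lower())
    return [token for token in tokens if token in CONNECTIVES]


print(connectives_in("rabbit_or_whale"))  # ['or']
print(connectives_in("organ"))  # []
```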
Where possible, words from the metalevel (the representation formalism /
KR language) should not be used within names for RUs. The use of
database or ontology language keywords, for example "Model",
"Class", "KIF", "Clips" and "OWL" and
xml style tags or characters designating tags or regular expressions should be
avoided when possible, because you never know whether all parsers you might
need to use will handle these. Also when translations into other formats have
to be made you can be sure not to run into parser problems in these other
formats.
Other words and morphemes to be avoided are highly
ambiguous ones, e.g. “set” and “setting”, which are among the most
ambiguous words in English. "Set" alone has over 20 different
meanings (e.g. it can refer to the process of setting parameters or to a collection of
parameters).
Avoid anything that is
related to xml or regular expressions in your class name, since it might cause
problems in other parsers you might want to use later.
Class definitions should provide the context and meaning of the class in
a way that eases its interpretation. The
definition should contain important keywords that describe the class's inherent
attributes and relations to other classes in natural language. However, in reality
proper definitions cannot be created for all universals, especially at the
root level of the ontology (e.g. it is hard to define “thing”). A class should be given a humanly
intelligible definition only when the necessary and sufficient conditions for
being an instance of the corresponding universal are really understood. Before
that, do not make up pseudo-definitions (e.g. circular definitions), but
provisionally collect the necessary conditions in the comment field. Proofread
your definitions carefully to eliminate typos and double spaces. As with class
names, avoid using abbreviations that may be ambiguous. Keep in mind
definitions will aid textmining approaches also, so be formal and consistent.
If you refer to other classes, use their real natural language names and avoid
the ‘artificial’ underscore delimiter.
In practice one would first capture non-formal definitions as they come
from the domain experts, glossaries or gathered by a google:define search.
These are captured with their provenance (meta-) data in a “tempdef” field.
Then one creates a second definition which is more formal and standardized
according to the defined principles mentioned below. [combine with following
chapter:]
You can use different tools
to help you gather initial informal definitions. The most usable are:
http://www.medbioworld.com/advice/dict.html
http://www.pharma-lexicon.com/
Google, define:
WIKI
….
1. Each definition refers to only one class.
2. Definitions should be as brief as possible, but as complex as necessary.
Definitions should be as clear and concise as possible in order to convey the
essence, "Das Wesen" (Silesius), of the universal to the user of the ontology.
3. The definition should be written at the same level of specificity as the
class itself.
4. Definitions should begin with an upper-case letter, may consist of more than
one sentence if necessary, and should always end with a period (full stop).
5. Definitions should define classes and the universals they refer to, not the
words used to refer to classes (class names). In definitions, therefore, avoid
terms like ‘class’, 'descriptor', 'name', etc. that refer to RUs and not to the
universals in reality. E.g. the definition of 'eye' is 'organ of sight', not
'is name of organ of sight', nor ‘class or concept describing an organ of
sight’. Avoid using acronyms within definitions.
6. The definition should state the characteristics (or properties) that
distinguish members of this class from the others (the upper class and
siblings). A formal definition should be clear, concise, and unambiguous
(i.e. you could look at something and say whether or not it belongs to the
entity type).
7. Definitions with too many words like 'and', 'or', or 'where' in them should
be viewed with suspicion.
8. Definitions should use simple, easy to understand words that are meaningful
to most of the users. In the best case all terms in the definition can be found
as classes in higher levels of the ontology and are thus themselves defined.
9. Definitions should be positive and not negative. Definitions like ‘all
animals that are not a mammal’ or ‘all non-membrane proteins’, which do not
designate natural kinds, are not helpful, since complements of universals are
not necessarily themselves universals.
10. The formal rules for definitions laid down by Aristotle should be applied.
When A is_a B, the definition of ‘A’ takes the form: An A is a B which C...
e.g.: “A human being is a mammal which is rational”. Essence = Genus +
Differentiae. Definitions should start in the following way: “A [class
described] is a [superclass], which/that [most relevant intrinsic properties
(attributes and relations to other classes)]. It…. [Enter]”. When using the
word “it”, make sure you always refer to the described class only. If a class
has more than one parent, i.e. multiple parenthood cannot be avoided, mention
all parent classes in the definition.
11. Apart from naming the class being defined at its start, the definition
should be free from words sharing the same root as the thing being defined
(to be represented) and should not contain the class name itself. Avoid
circularity in definitions like these:
An A is an A which is B (person = person with identity documents)
An A is the B of an A (heptolysis = the causes of heptolysis)
12. Each definition should reflect the position in the hierarchy to which a
defined RU belongs. The position of an RU within the hierarchy enriches its own
definition by automatically incorporating the definitions of all RUs above it.
The entire information content of the hierarchy can then be translated cleanly
into a computer representation.
13. The definition must be correct in all possible contexts in which the class
is used, so that the class and all its synonyms are intersubstitutable with its
definition in such a way that the result is both grammatically correct and
truth preserving.
14. Include some examples of well known prototypical instances or subclasses of
the class.
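Some of the more mechanical rules above (rules 4, 7 and 11, plus the
proofreading advice on double spaces and underscores) lend themselves to an
automated check. The following is only a minimal sketch: the function name
check_definition, the threshold of three connectives and the wording of the
messages are assumptions made for illustration and are not part of these
conventions. The leading “A [class] is” phrase required by rule 10 is ignored
when testing for circularity.

```python
import re
from typing import List

# Assumed threshold: more than this many 'and'/'or'/'where' looks suspicious.
MAX_CONNECTIVES = 3


def check_definition(class_name: str, definition: str) -> List[str]:
    """Return a list of rule violations found in one textual definition."""
    problems = []
    text = definition.strip()
    name = class_name.replace("_", " ")

    # Rule 4: upper-case first letter and a closing period.
    if not text or not text[0].isupper():
        problems.append("should begin with an upper-case letter")
    if not text.endswith("."):
        problems.append("should end with a period")

    # Proofreading advice: no double spaces, no underscore delimiter.
    if "  " in definition:
        problems.append("contains double spaces")
    if "_" in definition:
        problems.append("uses the underscore delimiter")

    # Rule 11: drop the leading "A <name> is " and look for the name again.
    lead = re.compile(r"^(?:an?|the)\s+" + re.escape(name) + r"\s+is\s+",
                      re.IGNORECASE)
    remainder = lead.sub("", text, count=1)
    if re.search(r"\b" + re.escape(name) + r"\b", remainder, re.IGNORECASE):
        problems.append("circular: repeats the class name")

    # Rule 7: too many connectives.
    connectives = re.findall(r"\b(?:and|or|where)\b", text.lower())
    if len(connectives) > MAX_CONNECTIVES:
        problems.append("contains many 'and'/'or'/'where' connectives")

    return problems


if __name__ == "__main__":
    print(check_definition("person",
                           "A person is a person with identity documents."))
```

Such a check can only flag candidates for review; whether a definition really
captures the essence of a universal remains a judgement for the curator.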
Additionally, have a look at the following paper by Jacob Koehler:
http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=1482721&blobtype=pdf
[Do we need definitions for particulars that we currently represent as
classes, e.g. do brand names of instrument vendors need definitions???]
In the future definitions might be autogenerated through semantic
conversions. Furthermore proper definitions can serve for quality control using
textmining [17]. Automated inference of class definitions is
already available from the Obol page. Note that these are automated, highly
experimental and subject to change: Obol [http://www.fruitfly.org/~cjm/obol]
Object properties (relations) should have a definition as follows:
“The (name of relation)-relation indicates a (class name from one
relationship) that is (nature of relation) for a (class name from other
relationship).” For example, the definition for the property ‘storage (of
material)’ might read: “A storage-relation indicates a material that is stored
in a facility.” [??? refine]
[refine]
Following the decentralized web paradigm, every single RU (class or relation)
should be versioned independently, rather than versioning the ontology as a
whole. It is therefore necessary to consider conventions for unique identifiers
for RUs. If one tries to edit a set of modular ontologies held together only by
the class name strings, then every time somebody wants to change a name, fix a
spelling error, etc., there is a global change that is intrinsically unreliable
or, if the ontologies are distributed, requires a major organisational effort.
When the identifiers are formal ID numbers and the human readable class names
are kept as labels, you can change the label without disturbing the linkages.
Hence versioning becomes easier when using unique formal identifiers for RUs in
representational artifacts. Some ontology editors, like Protégé-2000, construct
identifiers out of the ontology name and numbers automatically.
A unique identifier MUST NOT be deleted once used. IDs should be conserved at
all times so that, even if a term is ‘defunct’ or has a new ID, someone
searching with the old ID can still find it.
As a rule of thumb, user friendly names for RUs should not cause problems for
human processing, and their IDs should not cause problems for machine
processing. Always keep in mind that an ID is associated with a definition and
a universal rather than with the preferred class name.
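The separation of stable IDs from changeable labels can be illustrated with a
small in-memory registry. This is a sketch under stated assumptions only: the
class names RU and Registry, the ID pattern PREFIX_0000001 and the
obsolete/replaced_by fields are hypothetical and are not prescribed by this
document or by any ontology editor.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RU:
    ru_id: str                         # stable formal identifier
    label: str                         # human readable class name, may change
    definition: str
    obsolete: bool = False             # 'defunct' RUs are flagged, not deleted
    replaced_by: Optional[str] = None  # ID of the successor RU, if any


class Registry:
    """Hypothetical registry that issues IDs once and never removes them."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._units: Dict[str, RU] = {}
        self._next = 1

    def add(self, label: str, definition: str) -> str:
        ru_id = f"{self.prefix}_{self._next:07d}"
        self._next += 1                # numbers are never reused
        self._units[ru_id] = RU(ru_id, label, definition)
        return ru_id

    def rename(self, ru_id: str, new_label: str) -> None:
        # Only the label changes; every link that uses ru_id stays valid.
        self._units[ru_id].label = new_label

    def retire(self, ru_id: str, replaced_by: Optional[str] = None) -> None:
        unit = self._units[ru_id]
        unit.obsolete = True
        unit.replaced_by = replaced_by

    def lookup(self, ru_id: str) -> RU:
        return self._units[ru_id]
```

Note that the sketch deliberately offers no delete operation: a retired RU can
still be looked up by its old ID and points to its replacement.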
The LSID concept introduces a straightforward approach to naming and
identifying data resources stored in multiple, distributed data stores in a
manner that overcomes the limitations of naming schemes in use today. Almost
every public, internal, or department-level data store today has its own way of
naming individual data resources, making integration between different data
sources a tedious, never-ending chore for informatics developers and
researchers. By defining a simple, common way to identify and access
biologically significant data, whether that data is stored in files, relational
databases, in applications, or in internal or public data sources, LSID
provides a naming standard underpinning for wide-area science and
interoperability. An LSID conforms to the URN standards defined by the IETF.
Every LSID consists of up to five parts: the Network Identifier (NID); the root
DNS name of the issuing authority; the namespace chosen by the issuing
authority; the object id unique in that namespace; and finally an optional
revision id for storing versioning information. Each part is separated by a
colon to make LSIDs easy to parse. Here are a few examples:
urn:lsid:pdb.org:1AFT:1 → This is the first version of the 1AFT protein in the
Protein Data Bank.
urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434 → References a PubMed article.
urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2 → Refers to the second version of an
entry in GenBank.
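Because the parts are colon-separated, an LSID can be split with very little
code. The following sketch assumes the layout described above (the literal
“urn”, the network identifier “lsid”, authority, namespace, object id and an
optional revision); the names LSID and parse_lsid are made up for this
illustration and do not belong to any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str                  # root DNS name of the issuing authority
    namespace: str                  # namespace chosen by the authority
    object_id: str                  # object id unique in that namespace
    revision: Optional[str] = None  # optional revision id


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, or raise ValueError."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing 'urn:lsid:' prefix: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty part in LSID: {text!r}")
    return LSID(*parts[2:])


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2"))
```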
LSIDs name and refer to one unchanging data object each. Unlike the
familiar URLs of the World-Wide-Web, LSIDs are location independent. This means
that a program or a user can be certain that what they are dealing with is
exactly the same data if the LSID of any object is the same as the LSID of
another copy of the object obtained elsewhere. The problem with URLs is that
they always point to a particular web server (which may not always be in
service) and worse, that the contents referred to by a URL often change.
A universal naming scheme simplifies the processing of data from a
variety of sources, because the application does not need to have specific,
hard-coded support for each naming scheme. This allows cross-referencing
between data sources to be done implicitly using URIs. One such effort
currently underway is the Life Sciences Identifier (LSID) project. An example
looks like this: urn:lsid:uniprot.org:uniprot:P49841. This LSID names a protein
record in UniProt that is referred to as P49841. It consists of parts separated
by colons: the prefix “urn:lsid:”; the authority name; the authority-specific
data namespace; and the namespace-specific object identifier (here “P49841”).
Each RA has a unique string associated with it, the 'namespace' (NC) of
that RA. The NC serves as an identifier for all the terms in one RA and
designates their origin unambiguously. When using an NC-designated RU it is
clear where the RU comes from (in which ontology it 'lives') and therefore in
which context and how it must be interpreted. Using the NC together with an
RU's identifier, it can be ensured that any RU on the web can be unambiguously
referred to. By maintaining different namespaces for different ontologies it is
possible for one ontology to reference RUs (classes, properties and
individuals) in other RAs in an unambiguous manner and without causing name
clashes. E.g. the OBI ontology at the moment refers to and makes use of the
Dublin Core (DC) ontology for annotating its RUs. It refers to these DC classes
by importing the RA over the web and referring to each DC RU through the NC
"http://purl.org/dc/elements/1.1/".
To ensure that namespaces are unique, they usually are Uniform Resource
Identifiers (URIs). As class names in the OWL language are also part of a
URI, they may not contain spaces or special characters. In practice the
namespace URI is a URL at which the ontology can be found on the internet, e.g.:
For the NMR.owl : http://msi-workgroups.sourceforge.net/ontologies/msi/NMR.owl
For the OBI-ontology: http://obi.sourceforge.net/ontology/OBI.owl
To get the corresponding namespace from a URI, just append “#” to the URI.
For better readability, however, one can internally substitute the full
namespace with a short intuitive prefix, which should be the same as for the
class ID, e.g. “obi” or “nmr”.
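The relation between ontology URI, namespace and short prefix can be sketched
as follows. The prefix table uses the two ontology URIs given above; the
function names namespace_for and expand, the local name “pulse_sequence” and
the pattern used to reject spaces and special characters are assumptions made
for this illustration only.

```python
import re

# Short prefixes mapped to the ontology URIs listed above.
PREFIXES = {
    "obi": "http://obi.sourceforge.net/ontology/OBI.owl",
    "nmr": "http://msi-workgroups.sourceforge.net/ontologies/msi/NMR.owl",
}

# Assumed pattern for local names: no spaces or special characters.
LOCAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def namespace_for(ontology_uri: str) -> str:
    """Derive the namespace by appending '#' to the ontology URI."""
    return ontology_uri.rstrip("#") + "#"


def expand(prefixed_name: str) -> str:
    """Expand 'prefix:local_name' into the full URI of the RU."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    if not LOCAL_NAME.fullmatch(local):
        raise ValueError(f"illegal characters in local name {local!r}")
    return namespace_for(PREFIXES[prefix]) + local


if __name__ == "__main__":
    # 'pulse_sequence' is a hypothetical class name.
    print(expand("nmr:pulse_sequence"))
```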
There is no formal convention for determining the location of an ontology given
its URI, but it is generally recommended that ontologies are made available on
the web at a location that corresponds to their URI, e.g. the FuGO ontology
should be found under http://fugo.sourceforge.net/ontology/
The NC does not necessarily point to a valid URL. This is only a good practice
recommendation. To share the RA with others and let others use RUs from your RA
by import, you need to provide a stable web-accessible link to the ontology. It
is suggested to create a symbolic link in the main directory of the workgroup.
If the file name of the latest ontology version or its physical location
changes, the symbolic link can be updated and there is no need to update or
mail everybody who uses the ontology, e.g. the OBO webmasters.
[refine:]
Physical positions for the OBI.owl file:
http://svn.sourceforge.net/viewvc/*checkout*/fugo/trunk/ontology/OBI.owl?view=checkout
This is good for downloading the source. It always grabs the latest
version, which is extremely useful for bulk-download software that currently
uses OBO.
http://svn.sourceforge.net/viewvc/*checkout*/fugo/trunk/ontology/OBI.owl?revision=44
This is the OWL file itself, but at one specific revision.
http://svn.sourceforge.net/viewvc/fugo/trunk/ontology/OBI.owl
This is the general svn page from which revisions and diffs can be downloaded.
https://svn.sourceforge.net/svnroot/fugo/trunk/ontology/OBI.owl
This is the convention MSI uses for importing the NMR.owl.
The date on which an RA was frozen, or its version number, can be used to
construct URIs for the RA versions.
Ontology URI: http://www.example.com/nmr-ontology
Ontology version URIs: http://www.example.com/nmr-ontology_061004
http://www.example.com/nmr-ontology_061126
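Constructing such version URIs from a freeze date is straightforward. A
minimal sketch, assuming the “yymmdd” date pattern and the underscore separator
seen in the example URIs above (the function name version_uri is made up for
illustration):

```python
from datetime import date


def version_uri(ontology_uri: str, frozen_on: date) -> str:
    """Append the freeze date as 'yymmdd' to the ontology URI."""
    return f"{ontology_uri}_{frozen_on:%y%m%d}"


if __name__ == "__main__":
    print(version_uri("http://www.example.com/nmr-ontology", date(2006, 10, 4)))
```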
Since SF is sometimes very slow, a website with faster access would be better.
I would also suggest a simpler public URL that more closely mimics the OBI
namespace URI - http://obi.sourceforge.net/ontology/OBI.owl
As mentioned, this can be created as a symbolic link to the physical addresses
above or to a position with faster access.
The current OBO library system allows the
specification of separate "source" and "download" metadata
tags.
A symlink (also called a soft link or symbolic link) is a special file that
points to another file or directory; on Unix it is created with the ln -s
command. So, can we just create a soft link:
Log in to obi.sourceforge.net/ontology/
Create the soft link (ln -s takes the target first, then the link name):
ln -s ~/(https:/)/svn.sourceforge.net/svnroot/obi/trunk/ontology/OBI.owl OBI.owl
The "~" stands for your home directory.
The soft link http://obi.sourceforge.net/ontology/OBI.owl then maps to the real
file position https://svn.sourceforge.net/svnroot/fugo/trunk/ontology/OBI.owl
[check, refine]
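Re-pointing a stable link when a new ontology version is released can also be
scripted. The sketch below assumes a POSIX file system on which the web server
follows symbolic links; the function name point_link_at and the file names in
the demonstration are hypothetical. The new link is created under a temporary
name and then renamed over the old one, so the public name never disappears.

```python
import os
import tempfile


def point_link_at(link_path: str, target_path: str) -> None:
    """Create or re-point the symbolic link at link_path to target_path."""
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target_path, tmp_path)  # target first, link name second
    os.replace(tmp_path, link_path)    # rename over the old link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Hypothetical versioned files and one stable public name.
        old_file = os.path.join(folder, "OBI_old.owl")
        new_file = os.path.join(folder, "OBI_new.owl")
        for path in (old_file, new_file):
            with open(path, "w") as handle:
                handle.write(os.path.basename(path))
        stable = os.path.join(folder, "OBI.owl")
        point_link_at(stable, old_file)
        point_link_at(stable, new_file)
        print(os.readlink(stable))
```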
To be able to reference another web-based ontology, the full ontology has to be
imported into the active one. Then we can start “binning” classes, e.g. from
our domain-dependent / community-specific ontology into the more general OBI or
BFO ones.
[Look at the WIKI site : link ???]
See RO ontology (Ref).
Always formulate properties at the most general level possible.
Avoid blurred, non-ontological and non-implementable relations like
associated_with if you plan reasoning applications. A relation like annotates
is not ontological in this sense, as it links classes not to other classes in
nature, but rather to terms in a vocabulary that we ourselves have constructed.
Avoid capturing closely related or even synonymous relations, e.g. derives_from
and develops_from.
The explicit allocation of class key-properties (the ones that define the
essence of a class A and discriminate it within its superclass B) fosters
consistent taxonomisation of lower level classes, because the inheritance of
these properties means that all subclasses at all sublevels can be immediately
cross-checked for consistency with all superclasses at any higher level (this
is a feature of the Protégé frames visualisation in the ‘properties view’, not
the ‘logic view’). It is not enough to capture these properties in the
definitions only, because the GUI tools do not pass them on to the leaf classes
as they do for formally assigned properties. Explicitly formalised properties
help to constrain the interpretation of their domain classes and all
subclasses, which is exactly what is needed to provide the context for
classification. These key-properties help to keep track of the intended
(otherwise implicit) context, all the way downstream to the leaf nodes.
Consider deciding whether the following classification is true or false:
time_independent_study is_a ,...., is_a unfolding_through_time. If we had
assigned a key-property has_timeline to the top level class
“unfolding_through_time” (or process), we would immediately see this
(inherited) property at the leaf node “time_independent_study” in the
‘properties view’ of the tab. With this information visually accessible, we
could decide more easily whether the classification is valid: seeing the
has_timeline property associated with “time_independent_study” feels
counterintuitive at first, which prompts a closer look at the classification or
the definition. However, since a “time_independent_study” is not the same as a
“study_without_timeline”, the classification is correct in this case.
Possible key-properties for a “process”-class could be starts_at,
has_object_participant, induced_through. Key-properties for the “object”
top level class could be has_position, has_mass, ….
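How key-properties assigned at a top level class surface at the leaf nodes can
be shown with a toy hierarchy. This is a sketch only: the intermediate class
“study” is a hypothetical stand-in for the levels elided in the example above,
and the dictionaries and the function inherited_properties are made up for
illustration; they do not reflect how Protégé stores its frames.

```python
from typing import Dict, Optional, Set

# Child -> parent (is_a). "study" is a hypothetical intermediate class.
IS_A: Dict[str, Optional[str]] = {
    "time_independent_study": "study",
    "study": "unfolding_through_time",
    "unfolding_through_time": None,
}

# Key-properties assigned explicitly, here only at the top level class.
KEY_PROPERTIES: Dict[str, Set[str]] = {
    "unfolding_through_time": {"has_timeline"},
}


def inherited_properties(cls: Optional[str]) -> Set[str]:
    """Collect the key-properties of a class and of all its ancestors."""
    collected: Set[str] = set()
    visited: Set[str] = set()
    while cls is not None:
        if cls in visited:
            raise ValueError(f"cycle in is_a hierarchy at {cls!r}")
        visited.add(cls)
        collected |= KEY_PROPERTIES.get(cls, set())
        cls = IS_A[cls]
    return collected


if __name__ == "__main__":
    # The leaf shows the key-property of its top level ancestor.
    print(inherited_properties("time_independent_study"))
```

Listing the inherited key-properties at a leaf in this way is what makes a
questionable classification visible to the curator, who still has to decide
whether it is actually wrong.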
A file-naming convention helps to capture basic metadata in filenames and
provides a simple versioning mechanism for files which our community members
may upload into the file repositories. Any recommendations on this issue depend
not only on the way files are stored and versioned (e.g. whether svn/cvs is
used), but also on what kind of file-related metadata is stored within the
ontology itself (e.g. OWL can capture further data in its metadata sections, or
an external annotation ontology like RA_metadata.owl (link ???) can be
imported, providing descriptors for such RA-related metadata).
In general you would capture only the really necessary information in the
filename, usually the items needed to identify the file unambiguously plus
important file-handling metadata.
Use consistent version naming. A good practice is to align the version number
with the year and month. Name each new publicly available version with the
prefix “v.” followed by the single-digit year and the month, e.g. a version
checked in for deployment in February 2006 would be “v.6.2”. The disadvantage
is that this says nothing about the scale of the advancements achieved between
successive versions.
When no automatic update and versioning system is used, RA files and
directories should be named according to the following syntax (if svn is used,
the ShortRAname is enough):
ShortRAname[_Authority_Version_Date].ext
E.g.: NMR_MSI_v6-9_060920.owl
ShortRAname is a short descriptive name of the RA.
Authority comprises the name of the RA's engineering authority or
organization. Separate author and organization with a dash if both are
featured.
Version comprises the version number. Start the version number with a "v"; use
"-" instead of "." in the version numbering (like "v6-2" instead of "v6.2").
Date comprises the date the file is released. For the date reference, the parts
that change less often should come first, as this eases alphabetical sorting by
date: use "yymmdd".
Ext is the proper extension for the representation language, separated by a "."
(dot). There should be only one dot in the entire filename, and it should be
right before the file extension. "ext" is the standard file extension by which
the file can be associated with an appropriate application that will handle it.
It generally consists of 2 to 4 lower case alphanumeric characters.
Allowed characters: The file name may contain upper and lower case letters,
numerals, "-" (dash) and "_" (underscore). [allowed unix filename characters
??? ]. Spaces, parentheses, or other commonly used characters such as "~",
"&", or "#" will cause the file to be rejected. Use underscores as separators.
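The syntax above can be checked mechanically. The following sketch expresses
it as a regular expression; the pattern, the function name parse_ra_filename
and the decision to treat the bracketed part as all-or-nothing are
interpretations made for this illustration, not a normative grammar.

```python
import re
from datetime import datetime
from typing import Dict, Optional

# ShortRAname[_Authority_Version_Date].ext
FILENAME = re.compile(
    r"(?P<name>[A-Za-z0-9-]+)"            # ShortRAname
    r"(?:_(?P<authority>[A-Za-z0-9-]+)"   # Authority, dash between parts
    r"_(?P<version>v[0-9]+(?:-[0-9]+)*)"  # Version, e.g. v6-9
    r"_(?P<date>[0-9]{6}))?"              # Date as yymmdd
    r"\.(?P<ext>[a-z0-9]{2,4})"           # the only dot, then the extension
)


def parse_ra_filename(filename: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the filename parts, or None if the name breaks the syntax."""
    match = FILENAME.fullmatch(filename)
    if match is None:
        return None
    parts = match.groupdict()
    if parts["date"] is not None:
        try:
            datetime.strptime(parts["date"], "%y%m%d")  # must be a real date
        except ValueError:
            return None
    return parts


if __name__ == "__main__":
    print(parse_ra_filename("NMR_MSI_v6-9_060920.owl"))
```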
A similar convention is practiced at W3C for their published work (e.g. note
their page header information
http://www.w3.org/TR/2004/REC-webont-req-20040210/ ).
This document has been drafted by Daniel Schober and it has received input from
the MSI Ontology WG, OBO WG and OBI WGs’ members, in particular from:
- Luisa Montecchi-Palazzi, Frank Gibson (PSI)
- Chris Mungall (OBO)
- Barry Smith (cBIO, OBO)
- Waclaw Kusnierczyk, Andrew Spears (IFOMIS)
- Gilberto Fragoso (OBI)
- Phillippe Rocca-Serra and Susanna-Assunta Sansone (MSI)
- Susanna Sansone (EBI)
1. D Schober: Metadata Annotations for Representational Units and
Representational Artifacts. 2006.
2. O Fiehn, B Kristal, B van Ommen, LW Sumner, SA Sansone, C Taylor, N Hardy,
R Kaddurah-Daouk: Establishing reporting standards for metabolomic and
metabonomic studies: a call for participation. Omics 2006, 10:158-63.
3. H Hermjakob: The HUPO Proteomics Standards Initiative - Overcoming the
Fragmentation of Proteomics Data. Proteomics 2006, 6:34-38.
4. PL Whetzel, RR Brinkman, HC Causton, L Fan, D Field, J Fostel, G Fragoso,
T Gray, M Heiskanen, T Hernandez-Boussard, et al: Development of FuGO: an
ontology for functional genomics investigations. Omics 2006, 10:199-204.
5. S Bradner: Key words for use in RFCs to Indicate Requirement Levels.
Internet Engineering Task Force 1997, March.
6. S Zhang, O Bodenreider: Law and order: assessing and enforcing compliance
with ontological modeling principles in the Foundational Model of Anatomy.
Comput Biol Med 2006, 36:674-93.
7. SH Brown, M Lincoln, S Hardenbrook, ON Petukhova, ST Rosenbloom,
P Carpenter, P Elkin: Derivation and evaluation of a document-naming
nomenclature. J Am Med Inform Assoc 2001, 8:379-90.
8. O Tuason, L Chen, H Liu, JA Blake, C Friedman: Biological nomenclatures: a
source of lexical knowledge and ambiguity. Pac Symp Biocomput 2004:238-49.
9. B Smith, W Kusnierczyk, D Schober, W Ceusters: Towards a Reference
Terminology for Ontology Research and Development in the Biomedical Domain.
In: KR-MED 2006; 2006.
10. LI Morrow, MF Duffy: The representation of ontological category concepts as
affected by healthy aging: normative data and theoretical implications. Behav
Res Methods 2005, 37:608-25.
11. TR Gruber: A translation approach to portable ontologies. Knowledge
Acquisition 1993, 2:199-220.
12. S Brenner: Life sentences: Ontology recapitulates philology. Genome Biol
2002, 3:COMMENT1006.
13. B Smith, W Ceusters, B Klagges, J Kohler, A Kumar, J Lomax, C Mungall,
F Neuhaus, AL Rector, C Rosse: Relations in biomedical ontologies. Genome Biol
2005, 6:R46.
14. B Smith, J Köhler, A Kumar: On the Application of Formal Principles to Life
Science Data: A Case Study in the Gene Ontology. In: DILS 2004: Data
Integration in the Life Sciences. Lecture Notes in Computer Science; 2004.
124-139.
15. J Bouaud, B Bachimont, J Charlet, P Zweigenbaum: Acquisition and
structuring of an ontology within conceptual graphs. In: Proceedings 2nd
International Conference on Conceptual Structures: Workshop on Knowledge
Acquisition using Conceptual Graph Theory. Lecture Notes in Computer Science;
1994. 1-25.
16. G Vigliocco, DP Vinson, S Siri: Semantic similarity and grammatical class
in naming actions. Cognition 2005, 94:B91-100.
17. J Kohler, K Munn, A Ruegg, A Skusa, B Smith: Quality control for terms and
definitions in ontologies and taxonomies. BMC Bioinformatics 2006, 7:212.
***** NOTE: This document is a work in progress *****
Comments and ideas are welcome and should be sent to: schober@ebi.ac.uk