Enguehard, C., Pantéra, L., "Automatic Natural Acquisition of a Terminology", Journal of Quantitative Linguistics, vol. 2, n°1, p. 27-32, 1995.
Automatic Natural Acquisition of a Terminology

Chantal ENGUEHARD
IRIN - Computer Science Research Institute of Nantes
IUT, 3, rue du Maréchal Joffre - 44041 Nantes Cedex 01 - France
enguehard@iut-nantes.univ-nantes.fr

Laurent PANTERA
U.T.C. University of Technology of Compiègne
B.P. 649 - 60206 Compiègne Cedex - France
Summary
The authors present the ANA system, which is capable of automatically selecting the terminology of a technical domain through the analysis of free text. It uses statistical procedures and a few heuristics inspired by the human learning of a mother tongue. Because the system does not need any syntactical or lexical resources, it is independent of the language (English, French, etc.) and of the level of discourse (technical, colloquial).
Topical paper: documentation and information retrieval, large corpora, knowledge acquisition.
The automatic selection of the terminology of a domain has mainly been studied for automatic indexing. BETTS showed that the best quality systems were those which had a thesaurus at their disposal [BETT 91]. However, predefined thesauri are lacking in many new technical or scientific domains, and building a thesaurus requires the participation of specialists of the domain and of terminologists. This is costly work, strongly dependent on the willingness of the specialists of the domain.
In this paper we present a new approach called ANA ('Automatic Natural Acquisition'). In the first part we present our main ideas, a heuristic modelling the ability to learn a mother tongue, and some tools to recognize the different forms of a single piece of information. The second part covers the general architecture of the ANA system and the details of the procedures we created according to our specifications. In the last part we show some results and evaluations.
Our goal is to define a new way to automatically select the terminology of a domain. These elements of the terminology, which we call terms, could be used to index texts or to build a taxonomy, and also in other tasks of natural language processing such as word-sense disambiguation or text generation [SMAD 90].
Note that our convention is to always write terms in capital letters.
First, the system should be able to treat any text, even if it is not well-written. There is an industrial need for automatic systems capable of dealing with technical texts, but also with interviews (in a knowledge feedback stage, for instance). In these technical texts, correct syntactical structures are not always adhered to, and neologisms frequently occur[1]. In addition, the huge quantity of texts does not allow for manual correction. Consequently, it is inadvisable to rely on any predefined linguistic knowledge such as a syntactical parser or a dictionary.
The main criterion for deciding whether a treatment should be implemented or not is its profitability with regard to the choice of terminology.
If a very simple treatment allows us to avoid many bad terms while missing only a few relevant ones, we use it. For instance, we decided that very short words (one or two letters) could not be selected as elements of terminology. In texts about music, which may mention the "C key", we could thus fail to select "C" as a term. However, this rule guarantees that we never select words such as "we", "if", "of", "be", "it", "by", etc. Because they are very frequent, qualifying such words as terms would result in high noise.
Symmetrically, we use some very simple rules which are not completely reliable: some of the words they qualify as terms are not actually terms (see the results). However, suppressing these terms would require programming complicated rules, and we would thus lose the simplicity of the system which makes it highly adaptive.
By using statistical procedures the system extracts some pieces of knowledge, mainly about the language which is used (it could be English, French, or any European non-agglutinative language) and about the subject treated. This knowledge is stored in three lists:
• 'Functional words'
The words of a language can be divided into 'functional words' (sometimes called 'empty words') and 'semantic words', which have a strong semantic content. Typically, articles, pronouns and some common verbs are 'functional words'. We could cite: "a", "any", "for", "in", "is", "may", "of", "or", "the", "this", "to", etc. There are 60 to 100 items in the list of 'functional words', which is called Wfonc.
• 'Scheme words'
Some 'functional words' indicate a semantic relation between words. For instance, in the expression "box of nails", "of" indicates a certain relation between "box" and "nails". In "colours of paintings" we find the same word "of" between "colours" and "paintings" (even if the relation is not the same, linguistically speaking). The system is capable of selecting some of these 'scheme words' and using them to determine some terms (see below). Usually, there are fewer than 10 'scheme words'.
• 'Bootstrap'
The 'bootstrap' is a set of terms. It constitutes the kernel of knowledge about the domain that the system will enlarge step by step. 20 to 30 items are enough to initiate the discovery of new terms. For instance, in the 'Do It Yourself' domain, the 'bootstrap' could contain words like "HAMMER", "SHELF", "SCREW", "PAINTING", "BRUSH".
The procedures which automatically determine these lists are detailed in [ENGU 92] (p. 110-128).
We call our approach 'natural' because its main ideas come from the observation of the human ability to deal with language.
At the beginning, babies have to learn their mother tongue. They do not know anything about grammar or dictionaries. Nevertheless, after living for months among speaking people, they start to pronounce their very first words, which usually designate objects of their day-to-day life. The hypothesis is that the repetition of the same sound, associated with the same perceptions, makes the baby associate the two, mainly by induction and generalization mechanisms [BRAI 71].
We adapted this idea for a computer: its only perception is the texts we furnish it with. Thus, we designed a procedure to discover relations between the elements of the textual environment (the words) solely by observing the repeated co-occurrence of two facts. Following SMADJA [SMAD 89], we make the hypothesis that frequent co-occurrences of words are semantically significant, but we restrict it to certain words.
• Let a 'fact' be the observation of an entity.
• Considering our application to natural language, an 'entity' is the presence or the absence of a term or of a 'scheme word'.
• We enunciate the postulate: frequent co-occurrences of facts are semantically significant.
Because it is an induction process, this procedure is valid only on condition that a huge amount of text is given to the computer.
We model a way of perceiving text in order to cope with the variability of words. We call this operation flexible recognition of strings. According to the first point of our specification, it must be done without any syntactical rule.
A term is a class of phrases, compound words or single words. For instance, "COLOUR OF PAINTING" represents several different strings: evidently "colour of painting", but also "colours of paintings", "colour of this painting", "colour of any painting", etc.
• Let a 'word' be a string of characters limited by blank characters. W is the set of words.
• A term T is defined as an ordered list of words xi: T = (x1, ..., xn), xi ∈ W, i ∈ {1, ..., n}.
• The set of 'functional words' is noted Wfonc. It is extensionally defined.
example: Wfonc = {"a", "any", "it", "of", "the"}
• We call Restriction of a string X, noted R(X), the ordered list of the words of X which are not functional words.
example: X = "colour of the painting"  R(X) = ("colour", "painting")
• The editing distance between two strings X and Y is the minimum number of insertions or deletions of letters needed to transform X into Y (see [WAGN 74]).
example: distance("hammer", "hamer") = 1 (deletion of one 'm').
• The flexible equality of two words is noted '≈' and defined by:
(X ≈ Y) <=> (distance(X, Y) ≤ 1)
example: "hammer" ≈ "hamer"
• Two strings are flexibly equal (noted '≈||') if the words of their restrictions are pairwise flexibly equal.
example: X = "colour of the hammer"  R(X) = ("colour", "hammer")
Y = "colour of any hamer"  R(Y) = ("colour", "hamer")
("colour" ≈ "colour" and "hammer" ≈ "hamer") => (X ≈|| Y)
We previously discussed the constraints of this system. To summarize, it should:
- not use any syntactical parser or dictionary,
- be tolerant of lexical variations.
To respect these constraints we designed the system in two modules: 'Familiarization' and 'Discovery' (fig. 1).
The declarative knowledge is derived from the study of free texts during the first stage, called 'Familiarization'. This module extracts pieces of information about the language employed in the free texts to be treated, and about the domain (see I.2).
The 'Discovery' module achieves the extraction of the terms in three parallel ways: 'expressions', 'candidates' and 'expansions'. This process is incremental: as new terms are discovered they are added to the set of terms, and this set is used as a new bootstrap to discover more terms. The process converges on a finite set of terms which constitutes the result: it includes the terminology of the domain in question in the treated texts. In addition, some links between terms are automatically generated, and we intend to transform the set of terms into a semantic network.
The three lists previously determined by the 'Familiarization' module constitute the sole knowledge required to discriminate the terms in the free texts.
fig. 1 - General architecture
We now define the three procedures which qualify strings as new terms. We rely heavily upon the postulate defined previously (i.e. "frequent co-occurrences of facts are semantically significant").
• We consider that two facts co-occur when they are separated by less than a fixed number of terms or words. In such a case they are said to be in the same window.
• 'Co-occurrences of facts' will receive different interpretations (see the sketch after this list). A co-occurrence could be:
- two terms, for term extraction by 'expression',
- a term and a 'scheme word', for term extraction by 'candidate',
- a term and a word, for term extraction by 'expansion'.
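The window itself is easy to picture in code. The following is a minimal sketch under assumed names: the paper only speaks of "a fixed number" of separating words, so WINDOW_SIZE = 5 is an illustrative value, and multi-word terms are treated here as single tokens for simplicity.

    # Sketch of the co-occurrence window described above.
    import re

    WINDOW_SIZE = 5  # assumed; the paper fixes no value

    def reduce_text(text):
        """Replace every sign which is neither a letter nor a digit
        by a blank (the 'reduction' step), then split into words."""
        return re.sub(r"[^A-Za-z0-9]", " ", text).lower().split()

    def cooccurring_facts(words, facts):
        """Yield (fact, fact, window) triples for consecutive facts
        separated by fewer than WINDOW_SIZE words."""
        positions = [(i, w) for i, w in enumerate(words) if w in facts]
        for (i, a), (j, b) in zip(positions, positions[1:]):
            if j - i - 1 < WINDOW_SIZE:
                yield a, b, words[i:j + 1]

    words = reduce_text("You could buy this diesel engine, or a petrol one.")
    print(list(cooccurring_facts(words, {"diesel", "engine"})))
    # [('diesel', 'engine', ['diesel', 'engine'])]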
The presentation of these three cases will be illustrated by examples in English from the 'Do It Yourself' domain, with:
- Wfonc = {"a", "any", "for", "in", "is", "may", "of", "or", "the", "this", "to"},
- "of" as a 'scheme word',
- "WOOD", "COLOUR", "BEECH", "TIMBER", "DIESEL", "ENGINE" as terms.
• 'Expression'
A new term is qualified in the expression manner when two existing terms appear frequently (threshold TEXP) in almost the same arrangement. The most frequent arrangement becomes a new term (and is inserted in the semantic network).
example: Here are some items of free text in which the two terms "DIESEL" and "ENGINE" have been identified in the same window:
... "the" "DIESEL" "ENGINE" "is" ...
... "this" "DIESEL" "ENGINE" "has" ...
... "a" "DIESEL" "ENGINE" "never" ...
Result: "DIESEL ENGINE" is qualified as a new term. It is linked to "DIESEL" and "ENGINE" (fig. 2).
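A minimal sketch of this qualification, using the "DIESEL ENGINE" example above; the threshold value TEXP and the helper names are illustrative assumptions, not the authors' code.

    # Sketch of the 'expression' qualification.
    from collections import Counter

    T_EXP = 3  # assumed value of the threshold TEXP

    def qualify_expression(windows, term_a, term_b):
        """windows: word sequences in which both terms were identified.
        The most frequent arrangement of the two terms becomes a new
        term once it reaches the threshold."""
        arrangements = Counter(
            " ".join(w for w in window if w in (term_a, term_b))
            for window in windows)
        if not arrangements:
            return None
        arrangement, freq = arrangements.most_common(1)[0]
        return arrangement.upper() if freq >= T_EXP else None

    windows = [["the", "diesel", "engine", "is"],
               ["this", "diesel", "engine", "has"],
               ["a", "diesel", "engine", "never"]]
    print(qualify_expression(windows, "diesel", "engine"))  # DIESEL ENGINE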
• 'Candidate'
A new term is qualified in the candidate manner when a word is frequently linked to existing terms (threshold TCAND) by a 'scheme word'. This word then becomes a new term.
example: Here are some items of free text in which some terms ("WOOD", "COLOUR", "BEECH", "TIMBER") appear in the same window as the word "shade", with the 'scheme word' "of" between them:
... "any" "shade" "of" "WOOD" "could" ...
... "this" "shade" "of" "COLOUR" "is" ...
... "the" "shade" "of" "BEECH" "may" ...
... "new" "shade" "of" "TIMBER" ...
... "same" "shade" "of" "WOOD" "in" ...
Result: "SHADE" is qualified as a new term.
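A minimal sketch of this qualification, using the "SHADE" example above; the threshold value and the names are illustrative assumptions.

    # Sketch of the 'candidate' qualification.
    from collections import Counter

    T_CAND = 4  # assumed value of the threshold TCAND
    SCHEME_WORDS = {"of"}
    TERMS = {"wood", "colour", "beech", "timber"}

    def qualify_candidates(windows):
        """A word frequently linked to known terms by a 'scheme word'
        (pattern: word, scheme word, term) becomes a new term."""
        counts = Counter()
        for window in windows:
            for i in range(len(window) - 2):
                word = window[i]
                if (word not in TERMS
                        and window[i + 1] in SCHEME_WORDS
                        and window[i + 2] in TERMS):
                    counts[word] += 1
        return [w.upper() for w, c in counts.items() if c >= T_CAND]

    windows = [["any", "shade", "of", "wood", "could"],
               ["this", "shade", "of", "colour", "is"],
               ["the", "shade", "of", "beech", "may"],
               ["same", "shade", "of", "wood", "in"]]
    print(qualify_candidates(windows))  # ['SHADE']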
• 'Expansion'
A new term is qualified in the expansion manner when an existing term appears frequently (threshold TEXPA) with the same succession of words. This succession should not include any term or any 'scheme word', and the new term should neither begin nor end with a 'functional word'.
example: Here are some items of free text in which the term "WOODS" appears in the same window with the same word "soft":
... "use" "any" "soft" "WOODS" "to" "make" "this" ...
... "buy" "this" "soft" "WOODS" "or" "plastic" "for" ...
... "cheapest" "soft" "WOODS" "comes" "from" ...
Result: "SOFT WOODS" is qualified as a new term and inserted in the semantic network with a link to "WOODS" (fig. 2).
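A minimal sketch of this qualification, using the "SOFT WOODS" example above; the threshold value and the names are assumed, and only one-word expansions to the left are sketched here.

    # Sketch of the 'expansion' qualification.
    from collections import Counter

    T_EXPA = 3  # assumed value of the threshold TEXPA
    TERMS = {"woods"}
    SCHEME_WORDS = {"of"}
    FUNCTIONAL_WORDS = {"a", "any", "for", "in", "is", "may",
                        "of", "or", "the", "this", "to"}

    def qualify_expansions(windows):
        """A word which is neither a term, a scheme word nor a
        functional word, and which recurs just before a known term,
        extends that term."""
        counts = Counter()
        for window in windows:
            for i, word in enumerate(window):
                if word in TERMS and i > 0:
                    prev = window[i - 1]
                    if (prev not in TERMS and prev not in SCHEME_WORDS
                            and prev not in FUNCTIONAL_WORDS):
                        counts[(prev, word)] += 1
        return [" ".join(pair).upper()
                for pair, c in counts.items() if c >= T_EXPA]

    windows = [["use", "any", "soft", "woods", "to", "make", "this"],
               ["buy", "this", "soft", "woods", "or", "plastic", "for"],
               ["cheapest", "soft", "woods", "comes", "from"]]
    print(qualify_expansions(windows))  # ['SOFT WOODS']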
• Semantic network
In the semantic network we represent some morphological relations, and also the co-occurrence of terms. "DIESEL" and "ENGINE", for example, could occur together but not as "DIESEL ENGINE": in the sentence "we need diesel for the engine" it could be useful to disambiguate the word "engine" easily by taking into account the proximity of "diesel".
fig. 2 - The semantic network framed by terms
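A toy sketch of such a network as a set of links, under assumed names; the paper does not specify its data structure, so this merely illustrates the disambiguation use mentioned above.

    # Sketch of the semantic network framed by terms.
    links = {
        ("DIESEL ENGINE", "DIESEL"),   # expression -> its components
        ("DIESEL ENGINE", "ENGINE"),
        ("SOFT WOODS", "WOODS"),       # expansion -> the expanded term
        ("DIESEL", "ENGINE"),          # plain co-occurrence of two terms
    }

    def neighbours(term):
        """Terms directly linked to `term` in the network."""
        return ({b for a, b in links if a == term}
                | {a for a, b in links if b == term})

    # "we need diesel for the engine": the proximity of DIESEL helps
    # to disambiguate ENGINE, since the two terms are linked.
    print("DIESEL" in neighbours("ENGINE"))  # True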
• Incremental process
The discovery module is incremental. The system carries on analysing the texts until it does not find any new term. Here are the different stages of treatment for each text (a sketch of the loop follows the list):
1 - The system reduces the text by replacing all the signs which are neither letters nor digits by a blank character. This 'reduction' step avoids the problems which arise from badly punctuated texts (as ours were), and makes the system both simple and easily maintainable.
2 - A lexical analysis is performed to recognize the terms previously discovered.
3 - The system collects some items of text and memorizes them in the relevant objects ('expressions', 'candidates' and 'expansions').
4 - These items of text are analysed in order to discover new terms and include them in the semantic network.
5 - If new terms appeared, the text is analysed again (step 1); if not, the process stops.
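A minimal sketch of this incremental loop: the texts are re-analysed until no procedure yields a new term, so the term set reaches a fixed point. `procedures` stands for suitably adapted versions of the three qualification procedures sketched earlier; all names are illustrative assumptions.

    # Sketch of the incremental discovery loop (steps 1-5 above).
    import re

    def reduce_text(text):
        """Step 1: replace every non-letter, non-digit sign by a blank."""
        return re.sub(r"[^A-Za-z0-9]", " ", text).lower().split()

    def discover(texts, bootstrap, procedures):
        """Enlarge the bootstrap until no new term is found."""
        terms = set(bootstrap)
        while True:
            new_terms = set()
            for text in texts:
                words = reduce_text(text)         # step 1: reduction
                for proc in procedures:           # steps 2-4: recognition,
                    for t in proc(words, terms):  # collection, analysis
                        if t not in terms:
                            new_terms.add(t)
            if not new_terms:                     # step 5: fixed point
                return terms
            terms |= new_terms  # the enlarged set becomes the next bootstrap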
Here are some results obtained from the analysis of scientific papers about acoustics. Because we had only 25,000 words of text we could not perform the 'familiarization' step and had to give the system the lists normally produced by this module:
• Data: 2 'scheme words' ("of", "of the"), 34 'functional words', 29 elements in the 'bootstrap' ("ARRAY", "BEAMFORMING", "BOILING", "BUBBLES", "EXPERIMENTS", "IMAGE", "TRANSMISSION", "PULSE", "LEAKS", "TEMPERATURE", "MONITOR", "INSTRUMENTATION", "REACTOR", "RMS", "RECOGNITION", "SCANNING", "SUBASSEMBLIES", "IMPULSIVE", "SIGNAL", "TRANSDUCERS", "SOUNDS", "STRUCTURES", "SENSORS", "LOCATION", "SODIUM WATER REACTION", "SGU", "ULTRASONICS", "VELOCITY", "WAVEGUIDE")
• Results: 200 new terms were found. Here is a sample:
ACOUSTIC ACTIVITY | ACOUSTIC AMPLITUDES | ACOUSTIC BOILING NOISE DETECTION | ACOUSTIC LEAK DETECTION | ACOUSTIC LEAK DETECTION SYSTEM | ACOUSTIC PULSE | ACOUSTIC SIGNAL AMPLITUDE | ACOUSTIC SOURCE LOCATION | ACOUSTIC TRANSMISSION IN SGUS PLANT AND LABORATORY MEASUREMENTS | ACOUSTIC SURVEILLANCE TECHNIQUES FOR SGU LEAK | ACOUSTIC SURVEILLANCE | ADVANCED SIGNAL PROCESSING | AMPLITUDE | ATTENUATION | ATTENUATION IN X CELL | ATTENUATION OF ACOUSTIC SIGNAL | BACKGROUND NOISE IN A SGU | BEEN SET | BEST ESTIMATED SOURCE LOCATION | DIAMETER OF THE SUBASSEMBLY | DISTANCE TRAVELED IN CELL | DROP VELOCITY | DURING A REACTOR SHUTDOWN | ESTIMATE | ESTIMATED SOURCE LOCATION | EXPERIMENT HAS SHOWN | EXPERIMENTAL ARRANGEMENT | EXPERIMENTAL MODEL | EXPERIMENTAL PROGRAMME | FAST REACTOR | NUCLEAR FAST REACTORS | NUCLEAR REACTOR | PATH LENGTH THROUGH A CELL | PATTERN RECOGNITION | SIGNAL AMPLITUDES | SIGNAL ATTENUTION | SIGNAL PROCESSING | SIGNAL PROCESSING TECHNIQUES | SIGNAL STRENGTH | SIGNAL TO NOISE RATIO | VELOCITY OF SOUND | VELOCITY OF SOUND IN SODIUM
Of course, not all of these are proper terms: "BEEN SET" or "EXPERIMENT HAS SHOWN", for instance, are bad. However, an evaluation has shown that specialists of this particular domain would keep at least three quarters of these terms.
In addition, such a list of terms can be corrected quickly, compared to the time that would be involved in manually selecting terms from free texts.
We carried out some experiments on French interviews about the fast-breeder reactor Super-Phenix. We performed the familiarization step with good results, except for the 'functional words' list, to which we had to add some words. This treatment will have to be improved. All the other initializations were performed automatically.
• Data: 120,000 words of text, 100 'functional words', 6 'scheme words', 125 terms in the 'bootstrap'.
• Results: more than 3000 new terms.
These results have previously been published in [ENGU 92], annexes 3 and 5.
The ANA system selects the terms of any technical or scientific domain. It specializes in large corpora of poor quality, because it learns about the language used in the texts through an induction process. The texts do not have to be corrected, and there is no need for any syntactical parser or lexicon.
Currently we are using the same 'natural' approach to extract some semantic knowledge from the networks built by the ANA system (see [ENGU 92], annex 4).
[BETT 91] Betts, R., Marrable, D., "Free text vs controlled vocabulary, retrieval precision and recall over large databases", Online Inf. 91, London, Dec., p. 153-165, 1991.
[BRAI 71] Braine, M.D.S., "The acquisition of language in infant and child", in C. Reed (Ed.), The Learning of Language, N-Y, Appleton, 1971.
[ENGU 92] Enguehard, C., "ANA, Apprentissage Naturel Automatique d'un réseau sémantique", Thèse de Doctorat, Université de Technologie de Compiègne, December 1992.
[SMAD 89] Smadja, F.A., "Lexical co-occurrence: the missing link", Literary and Linguistic Computing (UK), vol. 4, n°3, p. 163-168, 1989.
[SMAD 90] Smadja, F.A., McKeown, K.R., "Automatically extracting and representing collocations for language generation", 28th Annual Meeting of the Association for Computational Linguistics, NY, USA, p. 252-259, 1990.
[WAGN 74] Wagner, R.A., Fischer, M.J., "The string-to-string correction problem", Journal of the ACM, vol. 21, n°1, p. 168-173, Jan. 1974.
[1] Following [FALZ 89], we say that these texts are written in an operative language. Such languages are characterized by the lack of synonyms and the strong use of the enunciative form.