Enguehard, C., Pantéra, L., "Automatic Natural Acquisition of a Terminology", Journal of Quantitative Linguistics, vol.2, n°1, p.27-32, 1995.

 

Automatic Natural Acquisition of a Terminology

 

Chantal ENGUEHARD

IRIN - Computer Science Research Institute of Nantes

IUT, 3, rue du Maréchal Joffre - 44041 Nantes Cedex 01 - France

enguehard@iut-nantes.univ-nantes.fr

 

Laurent PANTERA

U.T.C. University of Technology of Compiègne

B.P.649 - 60206 Compiègne Cedex - France

 

Summary

The authors present the ANA system, which is capable of automatically selecting the terminology of a technical domain through the analysis of free text. It uses statistical procedures and a few heuristics inspired by the human learning of a mother tongue. Because the system does not need any syntactic or lexical resources, it is independent of both the language (English, French, etc.) and the level of discourse (technical, colloquial).

 

Topical paper: documentation and information retrieval, large corpora, knowledge acquisition.

 

Introduction

The automatic selection of the terminology of a domain has mainly been studied for automatic indexing. BETTS showed that the best systems were those which had a thesaurus at their disposal [BETT 91]. However, predefined thesauri are lacking in many new technical or scientific domains, and building a thesaurus requires the participation of domain specialists and terminologists. This is costly work, strongly dependent on the willingness of the domain specialists.

In this paper we present a new approach called ANA ('Automatic Natural Acquisition'). In the first part we present our main ideas: a heuristic that models the ability to learn a mother tongue, and some tools to recognize the different forms of a single piece of information. The second part describes the general architecture of the ANA system and the details of the procedures we created according to our specifications. In the last part we show some results and evaluations.

I - Main ideas

Our goal is to define a new way to automatically select the terminology of a domain. These elements of the terminology, which we call terms, could be used to index texts or to build a taxonomy, and also in other natural language processing tasks such as word-sense disambiguation or text generation [SMAD 90].

Note that our convention is to always write terms in capital letters.

I.1 - Specifications

No linguistic knowledge

First, the system should be able to treat any text, even if it is not well written. Indeed, there is an industrial need for automatic systems capable of dealing with technical texts, but also with interviews (in a knowledge feedback stage, for instance). In these technical texts, correct syntactic structures are not always adhered to, and neologisms frequently occur[1]. In addition, the huge quantity of texts does not allow for manual correction. It is therefore impractical to rely on any predefined linguistic knowledge such as a syntactic parser or a dictionary.

 

Quantitative criterion

The main criterion for deciding whether a treatment should be implemented is its profitability with regard to the selection of the terminology.

If a very simple treatment allows us to avoid many bad terms at the cost of missing a few relevant ones, we use it. For instance, we decided that very short words (one or two letters) cannot be selected as elements of terminology. In texts about music, which might mention the "C key", we could thus fail to select "C" as a term. However, this rule guarantees that words such as "we", "if", "of", "be", "it", "by", etc. are never selected. Because they are very frequent, qualifying such words as terms would result in high noise.

Symmetrically, we use some very simple rules which are not completely reliable: some of the words they qualify as terms are not terms (see the results). However, suppressing these false terms would require programming complicated rules and thus losing the simplicity that makes the system highly adaptable.

I.2 - Some tools

Statistical tools

By using statistical procedures the system extracts some pieces of knowledge, mainly about the language in use (it could be English, French, or any European non-agglutinative language) and about the subject treated. This knowledge is stored in three lists:

'Functional words'

The words of a language can be divided into 'functional words' (sometimes called 'empty words') and 'semantic words', which have a strong semantic content. Typically, articles, pronouns and some common verbs are 'functional words', for example: "a", "any", "for", "in", "is", "may", "of", "or", "the", "this", "to", etc.

There are 60 to 100 items in the list of 'functional words', which is called Wfonc.

'Scheme words'

Some 'functional words' indicate a semantic relation between words.

For instance, in the expression "box of nails", "of" indicates a certain relation between "box" and "nails". In "colours of paintings" we find the same word "of" between "colours" and "paintings" (even if the relation is not the same, linguistically speaking).

The system is capable of selecting some of these 'scheme words' and using them to determine some terms (see below).

Usually, there are fewer than 10 'scheme words'.

'Bootstrap'

The 'bootstrap' is a set of terms. It constitutes the kernel of knowledge about the domain that the system will enlarge step by step.

20 to 30 items are enough to initiate the discovery of new terms.

For instance, in the 'Do It Yourself' domain, we could find, in the 'bootstrap', words like "HAMMER", "SHELF", "SCREW", "PAINTING", "BRUSH".

 

The procedures which automatically determine these lists are detailed in [ENGU 92] (p. 110-128).
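
To make the shape of this knowledge concrete, here is a minimal Python sketch of the three lists; the values shown are illustrative ones taken from the examples in this paper, not an actual output of the 'Familiarization' module.

    # Minimal sketch (not the authors' code) of the three knowledge lists
    # that constitute the system's only declarative resources.

    # 'Functional words' (Wfonc): 60 to 100 semantically empty words.
    W_FONC = {"a", "any", "for", "in", "is", "may", "of", "or", "the", "this", "to"}

    # 'Scheme words': functional words marking a semantic relation (fewer than 10).
    SCHEME_WORDS = {"of"}

    # 'Bootstrap': 20 to 30 seed terms of the domain, each an ordered list of words.
    BOOTSTRAP = [("HAMMER",), ("SHELF",), ("SCREW",), ("PAINTING",), ("BRUSH",)]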

 

Learning ability

We call our approach 'natural' because its main ideas come from the observation of the human ability to deal with language.

At the beginning, babies have to learn their mother tongue. They know nothing of grammar or vocabulary. Nevertheless, after living for months among speaking people, they start to pronounce their very first words, which usually designate objects of their day-to-day life. The hypothesis is that the repetition of the same sound, associated with the same perceptions, makes the baby associate the two, mainly by induction and generalization mechanisms [BRAI 71].

We adapted this idea for a computer: its only perception is the texts we furnish it with. Thus, we designed a procedure to discover relations between the elements of the textual environment (the words) simply by observing the repeated co-occurrence of two facts. Following SMADJA [SMAD 89], we make the hypothesis that frequent co-occurrences of words are semantically significant, but we restrict ourselves to certain words.

 

• Let a 'fact' be the observation of an entity.

• Considering our application to natural language, an 'entity' is the presence or the absence of a term or of a 'scheme word'.

• We enunciate the postulate: "Frequent co-occurrences of facts are semantically significant."

Because this is an induction process, the procedure is valid only on condition that a huge amount of text is given to the computer.

 

Flexible recognition of strings

We model a way of perceiving text in order to cope with the variability of words. We call this operation flexible recognition of strings. According to the first point of our specification, it must be done without any syntactic rule.

A term is a class of phrases, compound words or single words. For instance, "COLOUR OF PAINTING" represents several different strings: evidently "colour of painting", but also "colours of paintings", "colour of this painting", "colour of any painting", etc.

 

• Let a 'word' be a string of characters delimited by blank characters. W is the set of words.

 

• A term T is defined as an ordered list of words xi : T = (xi), xi ∈ W, i ∈ {1,...,n}

 

• The set of 'functional words' is noted Wfonc. It is extensionally defined.

            example:      Wfonc = {"a", "any", "it", "of", "the"}

 

• We call the Restriction of a string X the ordered list of the words in X which are not functional words.

            example:      X = "colour of the painting"          R(X) = ("colour" "painting")

 

• The editing distance between two strings X and Y is the minimum number of letter insertions or deletions needed to transform X into Y (see [WAGN 74]).

            example:      distance("hammer","hamer") = 1   (deletion of 'm').

 

• The flexible equality of two words is noted '≈' and defined by:

            (X ≈ Y) <=> (distance(X, Y) ≤ 1)                example:      "hammer" ≈ "hamer"

 

• Two strings are flexibly equal (noted '≈||') if the words of their restrictions are pairwise flexibly equal.

            example:      X = "colour of the hammer"            R(X) = ("colour" "hammer")

                                  Y = "colour of any hamer"               R(Y) = ("colour" "hamer")

                                  ("colour" ≈ "colour" and "hammer" ≈ "hamer") => (X ≈|| Y)

 

II - The ANA system

We previously discussed the constraints on this system. To summarize, it should:

- not use any syntactic parser or dictionary,

- be tolerant of lexical variations.

II.1 - General architecture

To respect these constraints we designed the system in two modules, 'Familiarization' and 'Discovery' (fig. 1).

 

The declarative knowledge is derived from the study of free texts during the first stage, called 'Familiarization'. This module extracts information about the language employed in the free texts to be treated, and about the domain (see I.2).

The 'Discovery' module achieves the extraction of the terms in three parallel ways: 'expressions', 'candidates' and 'expansions'. This process is incremental: as new terms are discovered they are added to the set of terms, and this set is used as a new bootstrap to discover further terms. The process converges on a finite set of terms which constitutes the result: the terminology of the domain in question as it appears in the treated texts. In addition, some links between terms are automatically generated, and we intend to transform the set of terms into a semantic network.

The three lists previously determined by the 'Familiarization' module constitute the only knowledge required to discriminate the terms in the free texts.

 

fig. 1 - General architecture

II.2 - Stages of discovery

We now define the three procedures which qualify strings as new terms. We rely heavily upon the postulate defined previously (i.e. "frequent co-occurrences of facts are semantically significant").

• We consider that two facts co-occur when they are separated by less than a fixed number of terms or words. In such a case they are said to be in the same window (see the sketch after this list).

• 'Co-occurrences of facts' have different interpretations. A co-occurrence could be:

- two terms for the term extraction by 'expression',

- a term and a 'scheme word' for the term extraction by 'candidate',

- a term and a word for the term extraction by 'expansion'.
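
For illustration, here is a minimal Python sketch of the window notion; the window size is an arbitrary value of ours, since the text only speaks of 'a fixed number'.

    WINDOW = 5  # illustrative size

    def cooccurring_pairs(tokens, window=WINDOW):
        # Yield every pair of tokens separated by fewer than 'window' positions,
        # i.e. every pair of facts lying in the same window.
        for i, left in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                yield left, tokens[j]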

 

The presentation of these three cases will be illustrated by examples in English from the 'Do It Yourself' domain, with:

- Wfonc = {"a" "any" "for" "in" "is" "may" "of" "or" "the" "this" "to"},

- "of" as a 'scheme word',

- "WOOD", "COLOUR", "BEECH", "TIMBER", "DIESEL" and "ENGINE" as terms.

'Expression'

A new term is qualified in the expression manner when two existing terms frequently appear (threshold TEXP) in almost the same arrangement. The most frequent arrangement becomes a new term (and is inserted in a semantic network).

example:

Here are some items of free text in which the two terms "DIESEL" and "ENGINE" have been identified in the same window:

            ... "the" "DIESEL" "ENGINE" "is" ...

            ... "this" "DIESEL" "ENGINE" "has" ...

            ... "a" "DIESEL" "ENGINE" "never" ...

Result: "DIESEL ENGINE" is qualified as a new term. It is linked to "DIESEL" and "ENGINE" (fig. 2).

'Candidate'

A new term is qualified in the candidate manner when an existing term frequently appears (threshold TCAND) with a word to which it is linked by a 'scheme word'. This word then becomes a new term.

example:

Here are some items of free text in which some terms ("WOOD", "COLOUR", "BEECH", "TIMBER") appear in the same window as the word "shade", with the 'scheme word' "of" between them:

            ... "any" "shade" "of" "WOOD" "could" ...

            ... "this" "shade" "of" "COLOUR" "is" ...

            ... "the" "shade" "of" "BEECH" "may" ...

            ... "new" "shade" "of" "TIMBER" ...

            ... "same" "shade" "of" "WOOD" "in" ...

Result: "SHADE" is qualified as a new term.

'Expansion'

A new term is qualified in the expansion manner when an existing term frequently appears (threshold TEXPA) with the same succession of words. This succession must not include any term or any 'scheme word', and the new term must neither begin nor end with a 'functional word'.

example:

Here are some items of free text in which the term "WOODS" appears in the same window with the same word "soft":

            ... "use" "any" "soft" "WOODS" "to" "make" "this" ...

            ... "buy" "this" "soft" "WOODS" "or" "plastic" "for" ...

            ... "cheapest" "soft" "WOODS" "comes" "from" ...

Result: "SOFT WOODS" is qualified as a new term and inserted in the semantic network with a link to "WOODS" (fig. 2).

 

Semantic network

In the semantic network we represent some morphological relations, as well as the co-occurrence of terms. "DIESEL" and "ENGINE", for example, could occur together but not as "DIESEL ENGINE": in the sentence "we need diesel for the engine", it could be useful to disambiguate the word "engine" easily by taking into account the proximity of "diesel".

 

fig. 2 - The semantic network framed by terms

 

Incremental process

The discovery module is incremental: the system carries on analysing the texts until it finds no new term. Here are the different stages of treatment for each text (a sketch of this loop is given after the list):

1 -   The system reduces the text by replacing every sign which is neither a letter nor a digit with a blank character. This 'reduction' step avoids problems arising from badly punctuated texts (as ours were), and keeps the system both simple and easily maintainable.

2 -   A lexical analysis is performed to recognize the terms previously discovered.

3 -   The system collects some items of text and memorizes them in the relevant objects ('expressions', 'candidates' and 'expansions').

4 -   These items of text are analysed in order to discover new terms and include them in the semantic network.

5 -   If new terms have appeared, the text is analysed again (from step 1); if not, the process stops.
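
For illustration, here is a minimal Python sketch of this loop; collect and discover are hypothetical placeholders standing for steps 3 and 4, not functions of the actual system.

    import re

    def discovery(texts, terms, collect, discover):
        # Incremental loop: 'collect' (step 3) and 'discover' (step 4) stand
        # for the procedures sketched in II.2.
        while True:
            new_terms = set()
            for text in texts:
                # Step 1: 'reduction', every non-letter, non-digit becomes a blank.
                tokens = re.sub(r"[^A-Za-z0-9]", " ", text).split()
                # Step 2: a lexical analysis would tag known terms among the tokens.
                items = collect(tokens, terms)            # step 3: memorize items
                new_terms |= discover(items) - terms      # step 4: qualify new terms
            # Step 5: analyse the texts again while new terms appear.
            if not new_terms:
                return terms
            terms = terms | new_terms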

 

III - Results

English texts

Here are some results obtained by the analysis of scientific papers about acoustics.

Because we had only 25,000 words of text we could not perform the 'Familiarization' step, and had to give the system the lists normally provided by this module:

Data: 2 'scheme words' ("of", "of the"), 34 'functional words', 29 elements in the 'bootstrap' ("ARRAY", "BEAMFORMING", "BOILING", "BUBBLES", "EXPERIMENTS", "IMAGE", "TRANSMISSION", "PULSE", "LEAKS", "TEMPERATURE", "MONITOR", "INSTRUMENTATION", "REACTOR", "RMS", "RECOGNITION", "SCANNING", "SUBASSEMBLIES", "IMPULSIVE", "SIGNAL", "TRANSDUCERS", "SOUNDS", "STRUCTURES", "SENSORS", "LOCATION", "SODIUM WATER REACTION", "SGU", "ULTRASONICS", "VELOCITY", "WAVEGUIDE")

 

Results: 200 new terms were found. Here is a sample:

ACOUSTIC ACTIVITY

ACOUSTIC AMPLITUDES

ACOUSTIC BOILING NOISE DETECTION

ACOUSTIC LEAK DETECTION

ACOUSTIC LEAK DETECTION SYSTEM

ACOUSTIC PULSE

ACOUSTIC SIGNAL AMPLITUDE

ACOUSTIC SOURCE LOCATION

ACOUSTIC TRANSMISSION IN SGUS PLANT AND LABORATORY MEASUREMENTS

ACOUSTIC SURVEILLANCE TECHNIQUES FOR SGU LEAK

ACOUSTIC SURVEILLANCE

ADVANCED SIGNAL PROCESSING

AMPLITUDE

ATTENUATION

ATTENUATION IN X CELL

ATTENUATION OF ACOUSTIC SIGNAL

BACKGROUND NOISE IN A SGU

BEEN SET

BEST ESTIMATED SOURCE LOCATION

DIAMETER OF THE SUBASSEMBLY

DISTANCE TRAVELED IN CELL

DROP VELOCITY

DURING A REACTOR SHUTDOWN

ESTIMATE

ESTIMATED SOURCE LOCATION

EXPERIMENT HAS SHOWN

EXPERIMENTAL ARRANGEMENT

EXPERIMENTAL MODEL

EXPERIMENTAL PROGRAMME

FAST REACTOR

NUCLEAR FAST REACTORS

NUCLEAR REACTOR

PATH LENGTH THROUGH A CELL

PATTERN RECOGNITION

SIGNAL AMPLITUDES

SIGNAL ATTENUTION

SIGNAL PROCESSING

SIGNAL PROCESSING TECHNIQUES

SIGNAL STRENGTH

SIGNAL TO NOISE RATIO

VELOCITY OF SOUND

VELOCITY OF SOUND IN SODIUM

 

Of course, not all of these are proper terms: "BEEN SET" or "EXPERIMENT HAS SHOWN", for instance, are bad. However, an evaluation has shown that specialists in this particular domain would keep at least three quarters of these terms.

In addition, such a list of terms can be corrected quickly, compared with the time that the manual selection of terms from free texts would require.

 

French texts

We carried out some experiments on French interviews about the fast-breeder reactor Super-Phenix. We performed the 'Familiarization' step with good results, except for the 'functional words' list, to which we had to add some words; this treatment will have to be improved. All the other initializations were performed automatically.

Data: 120,000 words of text, 100 'functional words', 6 'scheme words', 125 terms in the 'bootstrap'.

Results: more than 3,000 new terms.

These results have previously been published in [ENGU 92], annexes 5 and 3.

 

Conclusion

The ANA system selects the terms of any technical or scientific domain. It is particularly suited to large corpora of poor quality, because it learns about the language used in the texts through an induction process. The texts do not have to be corrected, and there is no need for any syntactic parser or lexicon.

 

Currently we are using the same 'natural' approach to extract some semantic knowledge from the networks built by the ANA system (see [ENGU 92], annex 4).

Bibliography

[BETT 91]        Betts, R., Marrable, D., "Free text vs controlled vocabulary, retrieval precision and recall over large databases", Online Inf. 91, Dec., London, p.153-165, 1991.

[BRAI 71]         Braine, M.D.S., "The acquisition of language in infant and child", in C. Reed (Ed), "The Learning of Language", N-Y, Appleton, 1971.

[ENGU 92]        Enguehard, C., "ANA, Apprentissage Naturel Automatique d'un réseau sémantique", Thèse de Doctorat de l'Université de Technologie de Compiègne, Décembre, 1992.

[SMAD 89]      Smadja, F.A., "Lexical co-occurrence : the missing link", Lit. and Linguist. Comput. (UK), vol.4, n°3, p.163-168, 1989.

[SMAD 90]      Smadja, F.A., McKeown, K.R., "Automatically extracting and representing collocations for language Generation", 28th Annual Meeting of the Association for Computational Linguistics, NY, USA, p.252-259, 1990.

[WAGN 74]     Wagner, R.A., Fischer, M.J., "The string-to-string correction problem", J. of the ACM, vol.21, n°1, p.168-173, Jan., 1974.

 



[1]    Following [FALZ 89], we say that these texts are written in an operative language. Such languages are characterized by the lack of synonyms and the strong use of the enunciative form.