Dimensions of Applied Linguistics
ARTIFICIAL LANGUAGES: Natural Language Processing -
Some Linguistic Implications


The last two decades have seen a widening interest in, and development of, computer systems for understanding and generating natural languages. What is important to observe is that man-machine interaction in the field of natural language processing has motivated scholars to explore the application of computer systems with two distinct orientations - one with the research objective of 'language engineering', and the other confined to the research goal of 'language theory testing'. While the former field of research focusses on the application of linguistics to the various areas of natural language processing (NLP) with a pragmatic goal in mind, the latter concentrates on the different possible applications of NLP to linguistics for developing formal linguistic theories and testing proposed linguistic models (Grishman, 1986). Both areas of research, on the one hand, imply a sound body of linguistic knowledge and, on the other, have direct implications for the scientific formalization of linguistic theories.

The language engineering aspect of computer systems has the following classes of application.

1. Speech Synthesis


Speech synthesis is the production of artificial speech through man-made devices which generate speech-like sound waves. It is interesting to note that the mechanical 'artificial talker', and the electronically equipped 'Voder' and 'Vocoder' which served as 'talking machines', are slowly being replaced by computerized devices for generating synthetic speech. High-speed digital computers and electronic circuits serve as effective means for generating speech-like sounds. Computer synthesizers are now considered to have more promise than the OVE II, an electronic synthesizer built at the Royal Institute of Technology in Stockholm, Sweden, the Pattern Playback Synthesizer built by the Haskins Laboratories of New York, and the Vocoder system produced by the Bell Telephone Laboratories. For example, a computer-controlled Line Analog Speech Synthesizer (LASS) was completed in the phonetics laboratory of UCLA, USA even two decades earlier (Ladefoged, 1964; Harshman, et al., 1967; Hiki, et al., 1968).
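To make the principle concrete, the following is a minimal sketch, in Python, of formant-style synthesis: a glottal-like pulse train is passed through digital resonators tuned to vowel formants. The formant values (rough approximations for the vowel /a/), the sampling rate and the use of numpy are assumptions made purely for illustration; none of the historical systems named above worked in exactly this way.

    import numpy as np

    RATE = 16000                 # samples per second (an assumed value)
    F0 = 120                     # fundamental frequency (voice pitch) in Hz
    FORMANTS = [(730, 80), (1090, 90), (2440, 120)]  # (centre Hz, bandwidth Hz), rough /a/

    def resonator(signal, freq, bw, rate=RATE):
        """Two-pole digital resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
        r = np.exp(-np.pi * bw / rate)
        theta = 2 * np.pi * freq / rate
        a1, a2 = 2 * r * np.cos(theta), -r * r
        out = np.zeros_like(signal)
        for n in range(len(signal)):
            out[n] = signal[n]
            if n >= 1:
                out[n] += a1 * out[n - 1]
            if n >= 2:
                out[n] += a2 * out[n - 2]
        return out

    def synthesize(duration=0.5):
        n = int(RATE * duration)
        source = np.zeros(n)
        source[::RATE // F0] = 1.0      # impulse train as a crude glottal source
        wave = source
        for freq, bw in FORMANTS:       # cascade one resonator per formant
            wave = resonator(wave, freq, bw)
        return wave / np.abs(wave).max()

    vowel = synthesize()                # a 0.5-second vowel-like waveform

Cascading one such filter per formant is the essence of cascade formant synthesis: the filters shape the flat spectrum of the pulse train into the resonance pattern of a vowel.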


2. Machine Translation


Ever since the computer was invented, automatic translation of natural languages has been a dream of programmers. It became a viable area of language engineering from within the embryonic field of artificial intelligence. The early researchers viewed machine translation (MT) as basically an engineering endeavour. With repeated failures in achieving their stated goal, scholars engaged in this area soon realized that the field of MT belongs to both linguistics and computer science. 'It was then demonstrated that fully-automated high quality machine translation is possible only when the meaning of the input text is taken into account, in addition to its syntax and a version of a bilingual dictionary at the word or even phrase level' (Nirenburg, 1987 : xv). With all the advancement in the field, it is not being considered a fully automated process, simply because current software can neither absorb 'encyclopaedic knowledge' of the world for applying to the translation task nor use meaning at the pragmatic level in the input-output transaction of the translation process. In fact, there are three American-based companies which are at present engaged in producing commercial translation software - (a) Automated Language Processing Systems (ALPS - Provo, Utah), (b) Weidner Communications (Northbrook, Illinois), and (c) Logos Computer Systems (Wellesley, Mass.). Their software products also reflect the state of the art in machine translation. ALPS' Transactive software is designed as 'tools to aid translators' and is engineered to operate in an interactive manner in which a human translator guides the software. Contrary to this, the software products of Weidner and Logos are designed to take the translation as the first input, with a human translator editing the output. In their perspective, the machine serves as a 'junior translator' with an instruction to report to a human 'senior translator'.

On the theoretical level, MT has evolved through three competing strategies during its development: (a) the 'direct' translation strategy, (b) the 'transfer' strategy, and (c) the 'interlingua' strategy (Tucker, 1987 : 22). The direct translation strategy starts from the text of the source language and, through a series of successive operational stages, produces the output text in the target language. It involves neither 'parsers' nor an 'intermediary' language. It depends primarily on classified dictionary information, detailed morphological analysis and sentence patterns, and text-processing software in which the output of one stage serves as the input of the next stage. The Georgetown MT system is the first of its kind (Zarechnak, 1979) and can be considered a classic example of this strategy. The transfer strategy involves three processes and their corresponding stages - (i) analysis of a sentence of the source language as an abstract labelled structure, (ii) transference of this abstracted structure and lexicon of the source language into the structure and lexicon of the target language, and (iii) restructuring of the sentence of the target language as the final output. This strategy has been adopted by such MT groups as GETA in Grenoble (Boitet, et al., 1985). The interlingua strategy makes MT possible through a universal language. The process is drawn primarily from the area of artificial intelligence, which motivates the scholar to equate utterances in an interlingua (i.e., universal language) with formulae of a knowledge representation scheme that involves high-level structures such as scripts, plans, etc. It also involves three stages - (a) abstraction through analysis, i.e., the text of the source language is analysed and represented in the form of a language-free conceptual representation, (b) augmentation of information through inference, i.e., the conceptual representation is provided at this stage with the contextual/world knowledge implicit in the text through an inference mechanism, and (c) reproduction in the target language, that is, the language-free representation is mapped onto a full expression of the target language through a natural language generator which takes into account the appropriateness of the interaction between the inferential information and the abstracted structure (Carbonell, et al., 1981).
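As a concrete illustration of the transfer strategy's three stages, the following Python toy runs a single invented sentence through analysis, transfer and restructuring. The mini-lexicon, the fixed SVO-to-SOV word-order rule and the English-Hindi pairing are assumptions made purely for illustration; no real transfer system is remotely this simple.

    # Stage (i): analyse the source sentence into an abstract labelled structure.
    def analyse(english):
        subject, verb, obj = english.lower().rstrip('.').split()
        return {"subject": subject, "verb": verb, "object": obj}

    # Stage (ii): transfer structure and lexicon into the target language.
    LEXICON = {"mohan": "mohan", "sees": "dekhtaa hai", "shiva": "shiva"}  # invented
    def transfer(structure):
        return {role: LEXICON[word] for role, word in structure.items()}

    # Stage (iii): restructure into target-language (SOV) word order.
    def restructure(structure):
        return f"{structure['subject']} {structure['object']} ko {structure['verb']}"

    print(restructure(transfer(analyse("Mohan sees Shiva."))))
    # -> mohan shiva ko dekhtaa hai

The point of the sketch is the pipeline shape: each stage consumes the previous stage's output, and only stage (ii) touches the bilingual lexicon.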

Undoubtedly, with the availability of microprocessors and advances in the field of artificial intelligence and in pragmatics-oriented linguistic theories, the field of MT has received a new vitality and optimism.

3. Man-machine Interface


This area falls within the field of artificial intelligence. The primary aim of artificial intelligence (AI) is to make computers behave as intelligently as possible. It has been defined as "the science of making machines do things that would require intelligence" (Minsky, 1968). The field ranges over different types of topics, such as theorem-proving, game-playing, pattern recognition, expert systems, knowledge engineering and the use of natural language (Andrews, 1983 : 12). It has been argued by scholars that natural language is the most convenient man-machine interface for communication, particularly for people other than computer scientists. It is true that the computer system is being used in a big way for automatic information retrieval. Nevertheless, still more concerted research is being carried out to make the computer function as a conversational partner. Scholars are of the opinion that 'those concerned with the design of information systems should now be concentrating on functional requirements for the user-oriented, natural language systems of the future' (Lancaster, 1977 : 39). The proliferating use of computer systems and the genuine interest of different sections of society in the use of computers in daily information processing also create a need for natural language processing (Sager, 1981).

Here we must differentiate between 'formal' language, 'natural' language and 'real' language. We must realize that a computer programme is the embodiment of a formal system. As all operations that take place inside the computer are based on the binary principle, all instructions as well as data have to be ultimately in binary form. The computer's internal language that operates in this binary world is known as machine language. Machine language is computer-specific, and thus each model of computer has its own unique machine language; that is the reason why the machine language programme of one computer cannot normally be fed into another model of computer. Programme writing became easier only with the introduction and subsequent development of high-level languages like FORTRAN, COBOL and BASIC. Such programming languages are excluded from the category of 'natural languages' simply because they do not correspond to aspects of real languages like English, Russian, Hindi, etc. For example, a compiled programming language such as ALGOL models aspects of the language of mathematics. It is to be noted that all non-natural compiled programming languages are meant to be used by computer specialists and are based on the model of mathematics, logic, etc. Contrary to this, 'natural' languages also have the capability to respond meaningfully to the input requirements of a computer system, but are oriented basically to model the features of real language directly. As pointed out by Benson (1979), 'natural' languages are instances of formal language, since they too are designed languages meant to be used as input to a computer programme. But they are called 'natural' because they accept a reasonable fragment of some real language as the command language of the programme.
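A minimal sketch of a 'natural' language in Benson's sense might look as follows: a designed formal language whose command patterns accept a small fragment of real English. The query patterns and the toy database are invented for illustration.

    import re

    BOOKS = {"syntax": 12, "semantics": 7}        # an invented toy database

    # Each command pattern accepts a fragment of real English.
    PATTERNS = [
        (re.compile(r"how many books on (\w+)"),
         lambda m: str(BOOKS.get(m.group(1), 0))),
        (re.compile(r"list (?:all )?topics"),
         lambda m: ", ".join(sorted(BOOKS))),
    ]

    def answer(query):
        query = query.lower().rstrip("?")
        for pattern, action in PATTERNS:
            match = pattern.search(query)
            if match:
                return action(match)
        return "Query not understood."

    print(answer("How many books on syntax?"))    # -> 12
    print(answer("List all topics"))              # -> semantics, syntax

The system is still a formal language - only strings matching its patterns are accepted - but the accepted strings are themselves well-formed fragments of English.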

Research in Natural Language Processing (NLP) brought scholars from many disciplines into a single fold. For the first time, a highly successful interdisciplinary workshop called TINLAP (Theoretical Issues in NLP) was held at MIT in 1975 with the purpose of "bringing together researchers and students from computational linguistics, psychology, linguistics and artificial intelligence to provide a forum at which people with different interests in, and consequently different emphasis on, the problem of language understanding could learn of the models developed and difficult issues faced by people working on other aspects of understanding". TINLAP-2 was organized in 1978 at the University of Illinois with six wide-ranging topics: (1) Language representation and psychology, (2) Language representation and reference, (3) Discourse: speech act and dialogue, (4) Language and perception, (5) Language mechanism in natural language, and (6) Computational model as a vehicle for theoretical linguistics.

In fact, researchers engaged in NLP made the language theory testing aspect of computer systems a promising field of enquiry. The earlier research was primarily confined to the testing of grammars proposed by theoretical linguistics, such as Friedman's Transformational Grammar Tester (Friedman, 1971). The NLP research field motivated scholars to develop complete understanding systems by taking into account those areas of linguistic enquiry which have so far been inadequately explored by linguistics. One can identify at least two such fields: (a) representation of knowledge and (b) development of 'parsers'.



(a) Representation of Knowledge


One finds a number of suggestions for structuring information, such as 'frames' (Minsky, 1975), 'scripts' (Schank and Abelson, 1977) and 'information formats' (Sager, 1975). Frames identify new information in terms of known patterns central to the analysis of texts. Scripts aim at capturing one's knowledge about stereotyped sequences of events. Predicate-argument relations in the context of a particular verb are the key concept of 'information formats'.
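The following Python fragments sketch, with invented slot names and events, what each of these three schemes amounts to as a data structure; they are schematic stand-ins, not the original notations of Minsky, Schank and Abelson, or Sager.

    # A Minsky-style frame: a known pattern whose slots are filled by new input.
    restaurant_frame = {
        "type": "restaurant-visit",
        "slots": {"diner": None, "food": None, "bill-paid": None},
    }

    # A Schank/Abelson-style script: a stereotyped sequence of events.
    restaurant_script = ["enter", "order", "eat", "pay", "leave"]

    # A Sager-style information format: predicate-argument relations
    # keyed to a particular verb.
    give_format = {"verb": "give",
                   "arguments": {"agent": None, "recipient": None, "object": None}}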

Attempts have also been made to test the operational efficiency and psychological reality of representational models. For example, the LNR research group of the University of California at San Diego has developed a representational format for meaning and a system for testing it. As reported by Gentner (1978), verb meaning is accepted as a starting point for two reasons: (a) verbs provide the central organizing semantic structure in sentence meaning, and (b) verbs are tractable. Meanings of verbs in this format are represented in terms of inter-related sets of sub-predicates such as CAUSE or CHANGE. The following basic assumptions underlying the representation can be tested as hypotheses:

(a) a verb's representation captures the set of immediate inferences that people normally make when they hear or read a sentence containing the verb,
(b) in general, one verb leads to many inferences,
(c) these networks of meaning components are accessible during comprehension, by an immediate and largely automatic process,
(d) the set of components associated with a given word is reasonably stable across tasks and contexts,
(e) surface memory for exact words fades quite rapidly, so that after a short time, only the representational network remains (Gentner, 1978 : 3).

In this representational model, the nodes and arrows correspond to the concepts and their relationships. It is also suggested that more paths in the representation mean more conceptual paths in the memory. For example, let us take the following three sentences of Hindi.

(1) mohan ke paas tasviir thii
'Mohan had the picture'
(2) mohan ne shiva ko tasviir dii
'Mohan gave Shiva the picture'
(3) mohan ne shiva ko tasviir bechi
'Mohan sold Shiva the picture'

The meanings of the three verbs mentioned above - honaa, denaa and bechnaa - as interconnected subpredicates, and their relative complexities, can be shown by the following graphic representations; a sketch of the same decomposition in code follows them.


[Figures: graphic representations of honaa, denaa and bechnaa as networks of interconnected subpredicates]


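A rough Python sketch of these decompositions as nested subpredicate structures is given below. The exact decompositions (POSSESS for honaa; CAUSE and CHANGE of possession for denaa; the same plus a reverse transfer of money for bechnaa) are illustrative guesses in the spirit of the LNR format, not the group's own notation. Counting subpredicates shows the increasing complexity that the graphs depict.

    # Each subpredicate is a tuple: (PREDICATE, argument, ...).
    honaa   = ("POSSESS", "mohan", "tasviir")                  # a simple state
    denaa   = ("CAUSE", "mohan",                               # an act causing a
               ("CHANGE", ("POSSESS", "mohan", "tasviir"),     # change of
                          ("POSSESS", "shiva", "tasviir")))    # possession
    bechnaa = ("AND", denaa,                                   # giving, plus an
               ("CAUSE", "shiva",                              # exchange of money
                ("CHANGE", ("POSSESS", "shiva", "money"),
                           ("POSSESS", "mohan", "money"))))

    def count_subpredicates(node):
        """More subpredicates imply more conceptual paths in memory."""
        if not isinstance(node, tuple):
            return 0
        return 1 + sum(count_subpredicates(child) for child in node[1:])

    print(count_subpredicates(honaa),      # 1
          count_subpredicates(denaa),      # 4
          count_subpredicates(bechnaa))    # 9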

(b) Development of 'Parsers'


Central to a system for natural language information processing is a parsing programme that produces syntactic analyses of input sentences utilizing a programming language (specially designed for writing natural language grammars), a word dictionary, and procedures for transforming string parse trees into informationally equivalent structures. String analysis or computation provides the structure of a sentence through the string of its constituent units. These units serve as items to which syntactic and semantic constraints apply. These units are also accepted as the information carriers of the sentence.
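As a toy illustration of string analysis, the sketch below segments a sentence, via a word dictionary, into a string of constituent categories and matches that string against stored sentence patterns. The dictionary entries, category labels and pattern are all invented.

    DICTIONARY = {"mohan": "N", "saw": "V", "the": "T", "picture": "N"}
    PATTERNS = {("N", "V", "T", "N"): "subject - verb - object"}

    def string_analysis(sentence):
        # Segment into constituent units and look up their categories.
        categories = tuple(DICTIONARY[word] for word in sentence.lower().split())
        # The category string is the unit to which constraints then apply.
        return categories, PATTERNS.get(categories, "no pattern matched")

    print(string_analysis("Mohan saw the picture"))
    # -> (('N', 'V', 'T', 'N'), 'subject - verb - object')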

There are many kinds of 'parsers' which have come into use, for example, PARSIFAL (Marcus, 1980), the ATN (Augmented Transition Network) parser, the ELI processor (Riesbeck and Schank, 1976), Wilks' parser (Wilks, 1975) and the MOPTRANS parser (Lytinen, 1984). Marcus has argued in favour of a deterministic model of parsing. The grammar interpreter called PARSIFAL therefore has a structure based upon the hypothesis that a natural language parser need not simulate a non-deterministic machine. His 'Determinism Hypothesis' claims that 'natural language' can be parsed by a computationally simple mechanism that uses neither backtracking nor pseudo-parallelism, and in which all grammatical structure created by the parser is 'indelible' in that it must all be output as part of the structural analysis of the parser's input (Marcus, 1978 : 236). Marcus has discussed in detail two specific universal properties of human languages pointed out by Chomsky (1973, 1975 and 1976), i.e., the Subjacency Principle and the Specified Subject Constraint. He has demonstrated that these two constraints fall out naturally from the structure of a grammar interpreter called PARSIFAL. The result thus provides indirect evidence for the Determinism Hypothesis (Marcus, 1978). In fact, most natural language parsers are developed in the direction of processing somewhat constrained input sentences (Carbonell and Hayes, 1983). Attempts have also been made to develop a parser which can also parse semi-grammatical sentences by application of a formalism called Tree Adjoining Grammars (TAGs). Formal properties of TAGs have been explicated by Joshi and Levy (1982) and Shankar and Joshi (1985). According to the followers of this type of parser, TAGs first define a set of elementary trees and an adjunction operation that produces complex trees through a set of rules that combine simple trees. (A tree is simply the structural description of a sentence of a given language.) First, basic sentence structures are studied and abstracted, and are stored in a set of tree banks. Adjunction is then made operative as a derivational process.
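The following toy sketch shows the flavour of adjunction: an auxiliary tree is inserted at a matching node of an elementary tree, and the displaced subtree is re-attached at the auxiliary tree's foot node. Trees are nested tuples of the form (label, children...); this is a drastic simplification of real TAG structural descriptions, and the example trees are invented.

    # Elementary (initial) tree for "Mohan sleeps".
    initial = ("S", ("NP", "Mohan"), ("VP", ("V", "sleeps")))

    # Auxiliary tree adjoining an adverb at VP; the foot node "VP*"
    # marks where the displaced VP subtree is re-attached.
    auxiliary = ("VP", "VP*", ("Adv", "soundly"))

    def adjoin(tree, aux, site):
        """Insert `aux` at the subtree labelled `site`, re-attaching the
        displaced subtree in place of the foot node (site + '*')."""
        if not isinstance(tree, tuple):
            return tree
        if tree[0] == site:
            return tuple(tree if part == site + "*" else part for part in aux)
        return (tree[0],) + tuple(adjoin(child, aux, site) for child in tree[1:])

    print(adjoin(initial, auxiliary, "VP"))
    # -> ('S', ('NP', 'Mohan'), ('VP', ('VP', ('V', 'sleeps')), ('Adv', 'soundly')))

The derived tree for "Mohan sleeps soundly" is thus built by combining two stored simple trees rather than by rewriting rules.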

In recent years, interest has been shown by scholars in making the parser semantically informed. Some parsers have absorbed merely certain semantic features into their syntactic analysis, as is the case with the METEO-TAUM parser (Chandioux, 1976), while some of them incorporate semantics as the basic premise for the analysis of the input text, as is the case with Wilks' parser (1975). An argument for an integrated approach to processing was put forward in which it was stated that while syntax needs semantics, semantics also needs syntax. It was suggested that a parser should take care of both syntactic and semantic constraints. Lytinen (1987) thus argues that his MOPTRANS parser is an integrated parser, in the sense that syntactic and semantic processing take place in tandem.
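A minimal sketch of such tandem processing is given below: case roles are assigned syntactically while each filler is simultaneously checked against semantic (selectional) constraints. The semantic lexicon and verb constraints are invented; this illustrates the idea of integration, not the actual MOPTRANS design.

    SEMANTIC_CLASS = {"mohan": "human", "tasviir": "object"}           # invented
    VERB_CONSTRAINTS = {"dekhnaa": {"agent": "human", "patient": "object"}}

    def integrated_parse(agent, verb, patient):
        constraints = VERB_CONSTRAINTS[verb]
        # Syntactic step (role assignment) and semantic step (class check)
        # are applied together, role by role.
        for role, filler in (("agent", agent), ("patient", patient)):
            if SEMANTIC_CLASS.get(filler) != constraints[role]:
                return None           # reject: selectional constraint violated
        return {"verb": verb, "agent": agent, "patient": patient}

    print(integrated_parse("mohan", "dekhnaa", "tasviir"))   # accepted
    print(integrated_parse("tasviir", "dekhnaa", "mohan"))   # rejected -> None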

Linguistic Perspective and Implications


We have so far discussed the relationship between language or linguistic studies and the computer or natural language processing. It is, however, important to know what valuable input theoretical linguistics can provide to computer processing. While discussing this topic, linguists and computer scientists should both realize that computer processing and translation machines are constrained by the state of the art in linguistics. This places an added responsibility on linguists and the field of linguistics.

Scholars have discussed the role of computers in language research (Srivastava, 1983; Sedelov and Sedelov, 1979) and of microcomputers in primary language learning (Keith and Glover, 1987). It is interesting to see what the language-sensitive 'problem-areas' are for natural language processing and how linguists and the field of linguistics can help resolve those problems. Similarly, for linguists it has become necessary to reflect upon the question: what can theoretical linguistics learn from the experience gained in the field of natural language processing and computational models? The following are directions for the development of a theory of linguistics for making itself relevant to computer processing.

(1) Making Linguistics Formal


Since a computer programme is the embodiment of a formal system, it requires linguistic knowledge to be stated in the formalized terms of logic and mathematics. The formal characteristics of linguistic theory have been stressed in generative grammar. This has been achieved in this model in two ways. Firstly, it makes the notion of 'grammar', rather than language, fundamental to its theory; language is considered by Chomsky to be epiphenomenal. Secondly, it makes the grammar syntax-oriented. It should be emphasized that the sense of syntax in this model has now been extended to include the fields of phonology and semantics, and is defined as 'the computational component of the language faculty'. In the words of Chomsky, "so let us put forth a thesis, which will ultimately be the thesis of autonomy of syntax. It says that there exists a faculty of the mind which corresponds to the computational aspects of language, meaning the system of rules that give certain representations and derivations. What would be the parts of that system? Presumably the base rules, the transformational rules, the rules that map S-structures onto phonological representation, and onto logical forms, all that is syntax" (Huybregts and Riemsdijk, 1982 : 114).


(2) Making Computational Power Restrictive


It is generally agreed that the computational apparatus of a grammar cannot be let loose to become so powerful that it allows even those systems that are not possible in human languages. The apparatus should be just powerful enough to allow all the facts of natural languages, but not so powerful that it generates even those structures that never occur in natural languages. A linguistic theory has therefore to specify the possible form of a human grammar and the constraints upon such a grammar. In the generative model, the term 'constraint' refers to a condition which restricts the application of a rule. The constraints which are universal to all languages qualify as universal properties of language, while those restricted to a given language are said to be language-specific properties. For example, the consonant clusters of English are governed by a sequence constraint: in initial position they are restricted in number to three (e.g., split, string, scream), and in a sequence #C1C2C3V, C1 can only be [s], C2 [p, t, k], and C3 [r] or [l]. As pointed out earlier, some of the constraints claimed by Chomsky, such as the Subjacency Principle and the Specified Subject Constraint, fall out naturally from the grammar interpreter called PARSIFAL by Marcus.
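Stated as a checking procedure, the cluster constraint might be sketched as follows. The check operates on phoneme symbols rather than spelling (in 'scream', for instance, the /k/ is written 'c'), so the phonemic clusters in the test data are supplied by hand.

    def valid_initial_cluster(c1, c2, c3):
        """In an initial #C1C2C3V sequence: C1 = [s], C2 in [p,t,k], C3 in [r,l]."""
        return c1 == "s" and c2 in ("p", "t", "k") and c3 in ("r", "l")

    # Hand-supplied phonemic clusters for each test word.
    for word, cluster in [("split", ("s", "p", "l")),
                          ("string", ("s", "t", "r")),
                          ("scream", ("s", "k", "r")),
                          ("*srkim", ("s", "r", "k"))]:
        print(word, valid_initial_cluster(*cluster))
    # split True, string True, scream True, *srkim False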

(3) Making Theory More Integrative


Man and machine are both excellent symbol manipulators. Viewed from the standpoint of semiotics, the science of signs, language as a system of symbols has to be analyzed on three distinct levels - (a) syntax, which studies the relationship between symbol and symbol, (b) semantics, which studies the relationship between symbol and object, and (c) pragmatics, which studies the relationship between symbol and its users. Parsers are to be developed in order to interpret the sentence on all these levels in an integrative manner.

Look at the following three sentences:
(a) Mohan frightens Lila.
(b) Mohan observed the ball.
(c) Mohan did not kiss Lila.

All these sentences give more than one reading, but the causes of their ambiguity are different. In the case of (a), it is the relationship between the two symbols, Mohan and frighten, which gives two possible interpretations: (1) A - Vt, i.e., Mohan in an agent relationship with a transitive verb; the Hindi equivalent will be 'mohan lilaa ko daraataa hai'; and (2) I - Vi, i.e., Mohan as an instrument and the verb as intransitive; the Hindi equivalent of this reading will be 'lilaa mohan se dartii hai'.

In the case of (b), the ambiguity arises because of the referent of the word 'ball' which can be interpreted either as 'gala affair' as in the sentence 'I attended the ball' or as a 'spherical object' as in the sentence 'I kicked the ball'.

The sentence (c) is ambiguous because a reader can negate any major constituent of the structure:

             S
            / \
          NP   VP
              /  \
             V    NP

The three possible interpretations are thus:
(4) Someone may have kissed Lila, but not Mohan.
(5) Mohan may have done something to Lila, but did not kiss her.
(6) Mohan may have kissed someone, but not Lila.

Similarly, one can find out the reason for the anomalous nature of the second sentence of each of the following pairs; a sketch of one such check follows the examples.

(a1) mohan so rahaa hai
'Mohan is sleeping'
*(a2) peR so rahaa hai
'The tree is sleeping'
(b1) kal jo aadmii maaraa jaaega, vo so rahaa hai
'The man who will be killed tomorrow is sleeping'
*(b2) kal jo aadmii maaraa gayaa, vo so rahaa hai
'The man who was killed yesterday is sleeping'
(c1) abe shyamu! tu kyaa so rahaa hai
'O Shyamu! are you sleeping'
*(c2) O pitaa jii, tu kyaa so rahaa hai
'O father, are you sleeping'
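The reasons apparently differ across the three semiotic levels: (a2) violates a selectional restriction (the verb requires an animate subject), (b2) is semantically contradictory (a man already killed cannot be sleeping), and (c2) violates a pragmatic constraint (the familiar pronoun 'tu' addressed to an honoured person such as one's father). The first of these can be sketched as a simple feature check; the feature assignments below are invented for illustration.

    # Invented feature lexicon: peR 'tree' lacks the [animate] feature.
    FEATURES = {"mohan": {"animate", "human"},
                "aadmii": {"animate", "human"},
                "peR": set()}

    def sleep_subject_ok(subject):
        """Selectional restriction: the subject of 'sleep' must be [+animate]."""
        return "animate" in FEATURES.get(subject, set())

    print(sleep_subject_ok("mohan"))   # True  -> (a1) is well-formed
    print(sleep_subject_ok("peR"))     # False -> (a2) is anomalous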

(4) Making Interpretation Context Sensitive


Information processing requires knowledge of the world and of the context of situation of utterances. It also requires information sharing between sentences. It is true that the sentence 'My typewriter has bad intentions' is anomalous because it violates a selectional restriction rule; but if we replace 'typewriter' by 'dog', 'snake' or 'microbe', whether the resulting sentence is judged to be anomalous can be determined, as pointed out by Palmer (1976 : 46), only by what we know about the intelligence of dogs, snakes and microbes, i.e., by our knowledge of the world. Similarly, observe the following two sentence pairs:

(a) Mohan was looking for the glasses. He was looking for the glasses for him to drink.
(b) Mohan was looking for the glasses. He was looking for the glasses for him to read.

The first sentences in (a) and (b) are identical, but we associate different meanings with the word 'glasses' in each case because of the information in the following sentence. It is obvious that there is no way to understand or translate the first sentence in (a) and (b) by reading these sentences in isolation. Without building a 'context' by looking across sentences and making inferences across statements, a viable theory of information processing cannot be developed. It is to be stressed that for natural language processing we have to make our language analysis at all three levels: grammatical analysis, information processing across sentences, and encyclopaedic knowledge.
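A toy sketch of this cross-sentence inference is given below: the sense of 'glasses' is resolved from a cue verb in the following sentence. The cue table is invented and is, of course, vastly simpler than a real discourse model.

    # Invented cue verbs pointing to each sense of 'glasses'.
    SENSE_CUES = {"drink": "drinking vessels", "read": "spectacles"}

    def resolve_glasses(following_sentence):
        """Pick a sense of 'glasses' from a cue word in the next sentence."""
        words = following_sentence.lower().rstrip(".").split()
        for cue, sense in SENSE_CUES.items():
            if cue in words:
                return sense
        return "unresolved"

    print(resolve_glasses("He was looking for the glasses for him to drink."))
    # -> drinking vessels
    print(resolve_glasses("He was looking for the glasses for him to read."))
    # -> spectacles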

Scholars actively engaged in the field of NLP are well aware of the fact that, along with the description of the phonology, morphology, syntax and semantics of a natural language, they have to face at a certain stage problems of anomalous sentence constructions, contradictory statements and ambiguous linguistic structures. On many occasions they have to deal with the problem of meaning at the pragmatic level, where the same sentence may play different roles in different discourse settings. They also face the problem of organizing sentences logically so that a coherent dialogue structure is generated. For resolving 'problems' at all these levels, an NLP project needs a qualified linguist.

Linguists associated with an NLP project, however, must be aware that linguistic problems may or may not be the real problems of NLP, and under no circumstances should their problems be substituted for the real NLP problems. It should be emphasized that core linguistics centres on the question 'what language is' and extends its domain at the periphery to resolving problems related to the question 'what language does'. While it involves linguistic questions, the field of NLP is basically concerned with problems that are related to the question 'how language works', and that too in the context of artificial intelligence. Secondly, most linguistic information comes packaged to a linguist as part of a formal grammar, but as pointed out by Raskin (1987 : 49), 'the linguist should be smart enough to know that packages are not ready for use in NLP'. Being an applied linguist, he has to make these rules applicable in different situations. Lastly, as stated above, the linguistic implications for NLP are many and multifaceted. As a scientific field of enquiry, linguistics has developed in many areas to an extent that it can provide ready solutions to many problems of NLP, but there are still many areas, such as pragmatics, conceptual processing, information sharing between sentences, discourse implicatures, and meaning in relation to world knowledge, in which linguistics has yet to become ripe for a paradigm shift. It is in this context that we said earlier that the development of the field of NLP is also constrained by the state of the art in linguistics.