Lexicography in India
Part 2: Dictionary Making: Theory and Techniques

The Computer, A Tool For Dictionary-Making In India

 Susie Andres

 

Introduction

The last decade or two have witnessed a remarkable increase in the production of computers and a significant diversification in the uses to which they are put.  The relevant question today, in an ever-widening area of research in the sciences and the humanities, as well as in industrial development, is: “How can we ‘harness’ the computer to help us solve our production and research problems?” In order to help you, whose presence at this conference evidences a keen interest in lexicography, answer for yourselves the question, “Can the computer help us with the task of dictionary-making in India?”, I will try to convey some idea of (1) what a computer is, (2) what it can do that can be relevantly applied to dictionary-making, (3) how the lexicographer’s materials and purposes can be “plugged in” to the computer, and (4) what facilities are available for these purposes here in India.

 

I. What the Computer Is

The computer is an electronic machine capable of performing a number of simple computational operations (it can add, subtract, divide, and multiply) which may be combined to calculate solutions to complex problems.  It can be thought of as a large box containing thousands of little electronically sensitized storage cells, which comprise the machine’s “memory”.  These cells (usually called memory ‘locations’ or ‘registers’) can be individually activated in such a manner and in such sequences that the following functions can be performed automatically:

1.        Data, usually punched or recorded in a specified format, can be read from punched cards, paper tape, or magnetic tape and stored in the machine’s memory for immediate processing; or it can be read off one of these and stored (in the same or a different format) on magnetic tape or punched cards for processing at a later time.

2.             Pieces of data can be shifted from one location to another in the machine’s memory, always displacing any information previously stored in the location that is being filled with the new information.

3.           Arithmetic operations may be performed on bits of data and the results stored in memory locations in the computer, from which they may be printed out or transferred to magnetic tape or punch cards, or they can be retained in the computer memory and serve as new data for other operations.

4.             A number of differing operations may be applied to the same piece of data before a new piece is processed.

5.             Any operation or sequence of operations may be repeated again and again in order to process a large number of stretches of data.

6.             Pieces of data, consisting either of the data supplied by the user or of the results of computation performed by the machine, may be compared with one another for size.  Accompanying “stage directions” indicate what instructions are to be carried out next, given certain results of the comparison.  This operation is one of the main features of alphabetization programs.

II. Lexicographic Functions That the Computer Can Perform

By means of the operations listed above, the computer can do a number of the jobs that face the lexicographer.  True, they are all essentially forms of high-speed clerical work, and the dictionary-maker will still need to use his intelligence and skill to execute those phases of his task that require more thought.  However, once the lexicographer’s “raw materials” (the language data) have been adequately prepared for presentation to the computer, and once the machine has been supplied with instructions on how to do the job, it can do the work much more accurately and thousands of times faster than any human being ever could.  The time thus saved leaves the lexicographer free to concentrate on those parts of the total project that require a more rigorous application of his intelligence.  The following are some of the lexicographic facilities that the computer user has at his disposal.

1.             The computer can prepare a concordance from text punched on paper tape or cards and print out the results.  All of the information that is to be accounted for in the concordance (text name, source date, word and/or morpheme breaks, line and page numbers) must be coded into the text.  There is room for a fair degree of complexity in such a concordance if the aims of the dictionary program call for it.  The computer cannot judiciously select appropriate semantic units or the best examples of their use; nor can it supply meanings or glosses.  However, it can pair glosses with lexical items if the data consist of a text with matching morpheme-by-morpheme, word-by-word, or phrase-by-phrase glosses.  The tasks it cannot perform can be done by the lexicographer once he has the complete list of words in context, printed in alphabetical order, before him.

2.    It can delete from information stored on tape any stretches of data that are not needed, and are identified as such, and rewrite and reprint the condensed file.

3.             It can make editorial changes, including generalised orthographic changes, in stored data and again rewrite the corrected file.

4.             Using a built-in alphabetization routine, it can alphabetize lists of data according to the Roman order, or, given a set of recoding instructions, it can alphabetize such lists in any other systematic order desired.1

5.             Using one set of dictionary data, properly encoded, it can first alphabetize the whole set using the first item of each entry as a basis for the “sort”, and then it can realphabetize the whole set using the gloss or the meaning from each entry as the basis for the “sort”. That is, it can alphabetize and print out two-way bilingual dictionaries.  In fact, it can prepare dictionaries containing any number of languages, with each of the languages by turn treated as the “target” language.

6.             Incorporating some or all of the above functions, it can arrange blocks of information, such as expanded dictionary entries, as main and sub-entries, and cross-reference these entries.

7.             When appropriately programmed, it can prepare tapes that will control the operations of such auxiliary machines as phototypesetters and electro-mechanical plotters.  Phototypesetters are machines that prepare photographic masters for offset printing in a variety of type styles, and electro-mechanical plotters are machines that “draw” masters for offset printing, using any form of script whose plotting instructions have been coded into them.  Successful experiments with the former have been carried out at the Massachusetts Institute of Technology Computing Laboratory, using the Photon 560 phototypesetting machine,2 and successful experiments with producing Devanagari characters by the latter method have been conducted at the Tata Institute of Fundamental Research in Bombay, using the CalComp Plotter.3

8.             The computer can permanently store processed data on magnetic tape for use in the preparation of future editions of a published work.  The data can be read off tape by an editing program, which makes the necessary changes and rewrites the data on tape, in any desired format, ready for the printing-out or typesetting phases of the project.  All of this can be done without the long and tedious manual manipulation of data, manuscript typing, and typesetting involved in the ordinary production of new editions.
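The two-way alphabetization described in item 5 can be sketched in a few lines of a present-day language. This is only an illustration, not the program used in the work reported here: the three entries and the name `two_way_lists` are assumptions made for the example, and the sort uses the ordinary Roman (code-point) order.

```python
# Hypothetical sample entries: (vernacular headword, English gloss).
entries = [
    ("paanii", "water"),
    ("aag", "fire"),
    ("ghar", "house"),
]

def two_way_lists(pairs):
    """Return (vernacular-English, English-vernacular) lists, each alphabetized."""
    # First sort: the headword is the basis for the sort.
    forward = sorted(pairs, key=lambda entry: entry[0])
    # Second sort: swap the fields so that the gloss becomes the sort key.
    reverse = sorted(((gloss, head) for head, gloss in pairs),
                     key=lambda entry: entry[0])
    return forward, reverse

forward, reverse = two_way_lists(entries)
for head, gloss in forward:
    print(f"{head:10} {gloss}")
for gloss, head in reverse:
    print(f"{gloss:10} {head}")
```

The same set of entries is stored once and sorted twice, which is the point of item 5: adding a third language only adds one more field and one more sort.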

    Perhaps the best example of how much of the manipulatory machinery just described has been utilised in dictionary making is the work done by Wolfgang Wölck at the Research Center for the Language Sciences at Indiana University in Bloomington.  Wölck and his co-workers developed a set of computer programs for the preparation of a computerized dictionary of Andean languages (see Wölck, 1969).  The dictionary is actually a working file of the Quechua language of South America, which has been stored in some part of the computer’s auxiliary memory.4 Each file element consists of a Quechua lexical entry with English and Spanish glosses and other relevant information.  The kinds of information included with each entry are: (1) a file number, used for bookkeeping purposes and for permitting easy access to elements that need revision or correction, (2) language name, (3) data item, the head lexical item to which all of the other information pertains, (4) allomorph(s), (5) derivatives, (6) grammatical class, (7) whether the item is a loan word or native, (8) dialect of the data item, (9) source of the data, (10) English gloss, (11) Spanish gloss, (12) citation(s), (13) dialectal cognates, (14) dialect(s) of the cognates, (15) adopted spelling or variants, (16) entry date, (17) transcriber’s name, (18) alphanumeric codes, (19) semantic domain, and (20) comments.

    This file constitutes a “master” file from which various kinds of information can be extracted by the computer, when it is specifically programmed to do so, and printed out in any desired form.  The following are some of the listings that can be obtained:

 

1.             Alphabetized bilingual or trilingual word lists ordered by any one of the languages included.

2.             Listings by dialect.

3.             Headwords and cognates from other dialects (useful for comparative work).

4.             Entries listed according to historical sources of approximately the same date and area of use (useful for the study of historical change).

5.             A listing of all loan words within a specific semantic domain.

6.             Multilingual glossaries.

7.             Comprehensive dictionaries, with cross-referencing and including most of the information given.
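How such listings fall out of a single master file can be sketched as follows. The record below carries only a handful of the twenty fields named above, and the field names, the sample Quechua entries, the dialect labels, and the domain labels are all illustrative assumptions; none of them is taken from Wölck's actual file.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    file_number: int        # bookkeeping number
    headword: str           # the "data item"
    dialect: str
    english: str
    spanish: str
    is_loan: bool
    semantic_domain: str

# Hypothetical sample records, for illustration only.
master = [
    Entry(1, "wasi", "Cuzco", "house", "casa", False, "dwelling"),
    Entry(2, "waka", "Cuzco", "cow", "vaca", True, "animals"),
    Entry(3, "kawallu", "Ayacucho", "horse", "caballo", True, "animals"),
    Entry(4, "allqu", "Ayacucho", "dog", "perro", False, "animals"),
]

def word_list(entries, order_by):
    """Alphabetized list ordered by any one field, e.g. 'english' or 'spanish'."""
    return sorted(entries, key=lambda e: getattr(e, order_by))

def by_dialect(entries, dialect):
    """All entries recorded for one dialect, alphabetized by headword."""
    return sorted((e for e in entries if e.dialect == dialect),
                  key=lambda e: e.headword)

def loans_in_domain(entries, domain):
    """All loan words within a specific semantic domain."""
    return sorted((e for e in entries if e.is_loan and e.semantic_domain == domain),
                  key=lambda e: e.headword)

print([e.headword for e in loans_in_domain(master, "animals")])
print([e.english for e in word_list(master, "english")])
```

Each listing is simply a selection followed by a sort on one field, so new listings can be added without touching the stored file.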

 

III. How the Lexicographer Turned Computer Programmer Sets Up His System of Communication with the Machine

Perhaps we will most easily gain an insight into what a lexicographer must know, and be able to communicate to the computer, about his entire project (his material and the goals at each stage of its development) by considering briefly how a computer user would approach his problem.

   The machine cannot “think”.  It can do only what it is instructed to do in a precise, carefully-laid-out plan of work called a computer program.  Before he can write the necessary instructions for the machine, the programmer must plan his procedure.  He does this by drawing up a “blue print”, called a Flow Chart, which graphically describes the order in which the required operations are to be performed.  Following this “outline”, he uses a special computer language to specify one by one the operations which the machine must perform, and the order in which they are to be executed.

   In order to get a clearer picture of how this is done, let us look briefly at a flow chart that outlines the processes involved in the production of a concordance.  First of all, let us consider a sampling of the kinds of concordance we might want to prepare.  The column headings in Figure 1 indicate three degrees of complexity among the many that are available to the concordance-maker.  Under INPUT we indicate the form in which the data must be presented to the computer, and under OUTPUT we describe what will be printed out for consumer use.

   It will be immediately obvious that it is the third column that would be of greatest interest to a lexicographer.  Here, the INPUT would consist of text with a unit-by-unit literal English translation given immediately below each line, as exemplified below.5

MAI      BHAARAT         ME       HUU

  I              INDIA              IN         AM

     Each line of data would, moreover, be labelled with an index reference to the text name, page, and line from which it was taken.  The OUTPUT would be an adequately indexed concordance of all the words in the text, or of all except commonly occurring words like pronouns, some auxiliary verbs, and postpositions, which are likely to be irrelevant to the purposes of the lexicographer.  This output could then be checked by the lexicographer and culled of all undesirable entries and/or citations.  The reduced word list, with accompanying glosses and citation lines, could then serve as the INPUT to the dictionary processing program.
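The pairing of each text unit with the gloss typed beneath it, and the dropping of commonly occurring words, might be sketched as follows. The function name `pair_glosses` and the contents of the exclusion list are assumptions made for this example; the two input lines are the sample given above.

```python
# Hypothetical exclusion list: a pronoun, a postposition, and an auxiliary verb.
STOP_WORDS = {"MAI", "ME", "HUU"}

def pair_glosses(text_line, gloss_line, stop_words=frozenset()):
    """Pair each unit of the text with the gloss in the same position below it."""
    words = text_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        # The scheme only works if the translation is strictly unit by unit.
        raise ValueError("text and gloss lines must contain the same number of units")
    return [(w, g) for w, g in zip(words, glosses) if w not in stop_words]

pairs = pair_glosses("MAI BHAARAT ME HUU", "I INDIA IN AM", STOP_WORDS)
print(pairs)
```

The length check matters in practice: an idiom glossed by a single English word would break the one-to-one alignment unless it is joined into one unit beforehand.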

   Now, consider the logical sequence in which the computer must be made to perform certain manipulations of text data in the process of preparing a concordance.  These operations are outlined in the Flow Chart in Figure 2.  First, the machine reads a line of data.  Then it checks to see whether that is the last line in the data.  (The last line is marked in a special way.)  If the line just read is not the last line, the computer isolates the first word in the line and, after checking to make sure that it has read a word (that is, that the words in the line have not all been “used up”), it writes the word, and the line of data from which it was taken, on tape.  Then, following the arrow, it goes back and isolates the next word in the line and writes it on tape, again including the line from which it was taken.  This cycle is repeated until all the words in the line have been identified and written on tape.  The next time it encounters the question, “Have all the words in the line been processed?”, it follows the YES arrow back to the instruction to read a line of data; then it goes through the same procedure as before, reading the next line of data and separating out all the words in that line, one at a time.  Then it processes the next line, and so on, until it has processed all the lines of data.  This time, when it encounters the question, “Is this the end of the data?”, it follows the YES arrow from the first diamond and goes to a new part of the program.  It rewinds the tape on which all the words have been written and CALLS another part of the whole program, called SORT.  This is a subprogram which reads all the words off the tape and alphabetizes the whole list.  When SORT has finished that job,6 the main part of the concordance program instructs the machine to print out the alphabetically arranged list of words, each matched by its source line of data.  When the end of the list is reached, the machine stops.
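The sequence of operations just described can be sketched as a short program. It is a modern illustration of the flow chart's logic, not the original concordance program: a list stands in for the work tape, the built-in sort stands in for SORT and uses the Roman (code-point) order, and the second sample line is hypothetical.

```python
def build_concordance(lines):
    """Return (word, source line) records in alphabetical order of the words."""
    records = []                      # plays the role of the work tape
    for line in lines:                # read a line; the loop ends with the data
        for word in line.split():     # isolate each word of the line in turn
            records.append((word, line))   # write the word with its source line
    # The SORT step. Python's sort is stable, so identical words
    # stay in the order in which their lines were read.
    records.sort(key=lambda record: record[0])
    return records

data = [
    "MAI BHAARAT ME HUU",
    "BHAARAT BARAA DESH HAI",         # hypothetical second line of text
]
concordance = build_concordance(data)
for word, source in concordance:
    print(f"{word:10} {source}")
```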

   Inside the SORT subprogram mentioned above is another subprogram that translates the romanized representations of Devanagari characters into a number code suitable for alphabetization according to the Devanagari order.  It, too, is too complex to discuss here, but it might be of interest to include a sample of the INPUT code, with the characters represented by some of the code configurations, and the number code into which they can be converted.
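The idea of recoding romanized text into numbers before sorting can be sketched as follows. The romanization and the ordering table here are simplified assumptions made for the example (retroflex and nasal distinctions are left out), and they are not the code actually used in the programs described; only the principle is the same: each unit is replaced by its rank in the Devanagari order, and the ranks are then compared.

```python
# Simplified, hypothetical table: romanized units listed in Devanagari order,
# vowels first, then consonants.
ORDER = ["a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au",
         "k", "kh", "g", "gh", "c", "ch", "j", "jh",
         "t", "th", "d", "dh", "n", "p", "ph", "b", "bh", "m",
         "y", "r", "l", "v", "s", "h"]
CODE = {unit: rank for rank, unit in enumerate(ORDER)}

def encode(word):
    """Translate a romanized word into a list of number codes."""
    codes, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in CODE:              # prefer two-letter units such as 'aa', 'gh'
            codes.append(CODE[pair])
            i += 2
        elif word[i] in CODE:
            codes.append(CODE[word[i]])
            i += 1
        else:
            raise ValueError(f"no code for {word[i]!r} in {word!r}")
    return codes

words = ["ghar", "kamal", "aam", "baal"]
print(sorted(words))                  # Roman order
print(sorted(words, key=encode))      # order of the number codes
```

Under this table 'k' and 'gh' precede 'b', so the two printed orders differ, which is why the recoding step has to run before SORT compares anything.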

 

IV. Facilities Available in India

There are large computer installations in several research institutes in India, such as the Tata Institute of Fundamental Research in Bombay and the Indian Institute of Technology in Kanpur.  Since I am familiar only with the former institute, I will restrict myself to discussing the facilities available there.  The computer installation at the Tata Institute of Fundamental Research comprises a Control Data Corporation 3600 Computer System, several smaller computers, and some auxiliary machines, one of which is the CalComp Plotter mentioned above.  The CDC 3600 is quite a large computer, probably the largest in India, and has proved to be an efficient system for the development of the programs described below.

    While working in India under the sponsorship of the Summer Institute of Linguistics, Dr. Colin Day and Mr. Warren Glover, aided by time grants, developed a series of computer programs that either are dictionary-processing routines or can be used to aid in dictionary-making.7  Some of them have since been modified to permit alphabetization according to the Devanagari order and the writing of output in Devanagari characters.  The following is an inventory of the computer programs presently available for use.

1.       Vocabulary-sorting programs that will print out and store on tape for future use (a) vernacular-English and English-vernacular word lists of up to 3500 bilingual entries, alphabetized according to either the Roman or the Devanagari order, or (b) vernacular-English-regional language, English-vernacular-regional, and regional-vernacular-English word lists of up to 2500 trilingual entries, again in either Roman or Devanagari order.

2.       Editing programs to edit texts and vocabulary lists or to make orthographic changes in vocabulary lists.

3.       Programs that will make frequency counts of words in text and calculate the percentage occurrence of each.  Such counts are useful when words are to be selected for inclusion in a concise dictionary.

4.       A simple concordance program8 which can readily be modified to any degree of complexity desired.

5.       A set of programs (in the development of which Dr. Ramani of the Tata Institute of Fundamental Research and I collaborated) which will convert an alphabetized dictionary into Roman and/or Devanagari script in a format suitable for offset printing.  Figure 4 is a sample trilingual word list actually produced by computer.
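The frequency count mentioned in item 3 of this inventory is easily sketched. This is an illustration only, not the program held at the Institute; the function name and the sample text are hypothetical.

```python
from collections import Counter

def frequency_table(text):
    """Return (word, count, percentage of all running words), most frequent first."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return [(word, n, 100.0 * n / total) for word, n in counts.most_common()]

sample = "ghar paanii ghar aag ghar paanii"   # hypothetical running text
table = frequency_table(sample)
for word, n, pct in table:
    print(f"{word:10} {n:3} {pct:6.2f}%")
```

A compiler of a concise dictionary could then keep only the words whose percentage exceeds whatever threshold the project adopts.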

 

The above mentioned programs are constantly being revised.  It is hoped that any or all of them can be made sufficiently flexible and comprehensive to handle many of the dictionary-making needs of India.9 I can foresee, moreover, that the development of such files as the one set up for Andean languages would be a notable asset to the field of language research in this country.

 

FOOTNOTES

1I have used such a device to alphabetize word lists in the Devanagari order.

2See Barnett, 1965

3This project was carried out in collaboration with Dr. S. Ramani, a Research Fellow at the Tata Institute of Fundamental Research in Bombay.  Dr. Ramani wrote the program that generates Devanagari characters and advised me on how to write the actual plotting instructions, which were to be implemented by the program.  I want to take this opportunity to thank him for his willingness to work on this project as well as for all of the time and effort he put into it.  I also want to thank Professor Ashok R. Kelkar of Deccan College in Poona for making helpful criticisms and suggestions when I was developing an alphanumeric code (consisting chiefly of Roman characters) for Devanagari characters and also when I was designing the Devanagari characters that were to be reproduced by the plotting program.

4It was interesting for me to note that the computer used by Wolfgang Wölck and his associates to develop their Andean dictionary was the same model as the one we use at the Tata Institute in Bombay.  This suggests that it is not too much to expect that we might develop such a project in India.

5Idioms would probably best be recorded and treated as units; that is, word boundaries would need to be eliminated by markers of some sort in order to prevent the treatment of the words in the idiom as separate units.

6We need a separate FLOW CHART for SORT, but it is too complex to include here.  Its main feature is that it arranges words or phrases, which are represented in the machine as numbers, in order of size.  The number representation is normally such that alphabetization will be done in the Roman order, unless other coding instructions are given.

7The Summer Institute of Linguistics acknowledges with gratitude the help given by Professor R. Narasimhan, Head of the Computer Group at the Tata Institute of Fundamental Research, both in giving of his time to discuss some of the projects undertaken by the Institute and also by making grants available for the development of the projects.

8This program was prepared by Dr. Ramani of the Tata Institute.

9I would like to add that it seems quite reasonable to expect that the characters used in some of the Indian writing systems other than the Devanagari could also be designed for production on the CalComp Plotter.  An interesting interview that I had with Mr. R.K. Joshi, a calligrapher at Ulka Ads in Bombay, earlier in the month (March 1970) gave rise to the hope that eventually a number of complete sets of plotting instructions for Devanagari characters (or any other system of characters) could be prepared to provide a choice in the calligraphic style to be used for plotted output.  (Mr. Joshi’s chief interest is in the development of a simplified and somewhat modified and extended Devanagari script which might be used for all Indian languages.)

 

REFERENCES

Andres, Susie, and S. Ramani. 1970a. “The codification of the Devanagari script for automatic data-processing.” Indian Linguistics, Vol. 31, No. 3, pp. 91-102.

----------------------------------------1970b. “A note on programming a character generator for the Devanagari script.” Technical Report No. 82. Bombay: Computer Group at the Tata Institute of Fundamental Research. Mimeo.

Barnett, Michael P. 1965. Computer Typesetting: Experiments and Prospects. Cambridge, Massachusetts: The M.I.T. Press.

Bradley, Henry G., and William R. Merrifield. 1965. “On constructing bilingual dictionaries.” Unpublished paper. North Dakota: Summer Institute of Linguistics.

Harrell, Richard S. “Some notes on bilingual lexicography.” Unpublished paper. Georgetown University.

Kay, Martin. 1969. “The computer system to aid the linguistic field worker.” P-4095. Santa Monica, California: The RAND Corporation.

Kelkar, Ashok R., and Lachman M. Khubchandani. 1968. “The possibility of using computer methods for the historical dictionary of Sanskrit: an assessment.” Unpublished report. Poona: The Deccan College Postgraduate and Research Institute.

Wölck, Wolfgang. 1969. “A computerized dictionary of Andean languages.” Language Sciences, No. 8, December 1969. Bloomington: Indiana University, Research Center for the Language Sciences.

 

Concordance Possibilities

1. Simple, unreferenced, with Roman alphabetization

INPUT: Text typed with lines not exceeding 80 characters in length, in whatever Romanization the user chooses to employ for the representation of Indian characters.

OUTPUT: A list of the words of the text, arranged according to the Roman alphabetical order and each accompanied by the lines of text from which it was taken.

2. Unreferenced, with Hindi alphabetization

INPUT: Text typed with lines not exceeding 80 characters in length, in the code designed to represent Devanagari for the programs described in this paper, or in Devanagari characters in lines not exceeding 50 characters in length. **

OUTPUT: A list similar to the one described under 1, but arranged alphabetically in the Devanagari order.

3. Referenced and glossed, with Hindi alphabetization

INPUT: Text typed in lines of 70 Roman characters (in the specified code) or 40 Devanagari characters, with unit-by-unit literal English translation below each line.  Spaces 71-74 would be reserved for a 4-letter code to identify the text, spaces 75-78 for the page number of the text, and spaces 79-80 for the line number.

OUTPUT: A list of the words from the text, arranged according to the Devanagari alphabetical order and each accompanied by (1) its English gloss, (2) the line of the data from which it was isolated, and (3) the source identification code and the page and line number specifying its location in the text.

 ** In this case a key-punch operator would need to be trained to punch the required code directly from the text in Devanagari.

Figure 1
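The fixed column layout specified for the referenced concordance (text in spaces 1-70, text code in 71-74, page in 75-78, line in 79-80) can be read back as sketched here. The function name `parse_card`, the text code "HIND", and the page and line values are assumptions made for the example.

```python
def parse_card(card):
    """Split one 80-column input line into text and index reference."""
    card = card.ljust(80)             # pad short lines to the full card width
    return {
        "text": card[0:70].rstrip(),  # spaces 1-70: the text itself
        "source": card[70:74],        # spaces 71-74: 4-letter text code
        "page": int(card[74:78]),     # spaces 75-78: page number
        "line": int(card[78:80]),     # spaces 79-80: line number
    }

# Hypothetical card: sample text from page 12, line 7 of a text coded HIND.
card = "MAI BHAARAT ME HUU".ljust(70) + "HIND" + "0012" + "07"
record = parse_card(card)
print(record)
```

The sketch assumes that the reference fields are always filled with digits; a card with blank page or line fields would need separate handling.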



Hindi English Collocational Dictionary

Shreeprakash Kurl

A study of collocations is a study of the nature and potentiality of the co-existence of words strung together in a basic sentence.  Some linguists have examined this phenomenon of language under two different terms: collocation and colligation.  They define collocation as the study of the co-occurrence of lexical items and colligation as the study of the co-occurrence of grammatical items.  Collocations present the speaker with an open and wide range of choice in associating lexical items, whereas colligations, being the study of grammatical patterning, leave the speaker with a closed and fixed choice of items.  However, the present paper cannot afford to draw any watertight compartments between the two levels of this study, keeping them apart and handling them separately as if they were not part of a basic sentence.  All the components of a basic sentence are glued together by two types of rules, grammatical and semantic.  The grammatical rules give the syntactic components an order of arrangement, whereas the semantic rules give those components a network of internal relationships: the potentiality to associate with a particular type of word and to reject other types.  Thus a collocational study sets up a device to examine the capacity for co-occurrence of the components of a sentence.

             Collocations include two types of expressions: clichés and idioms.  When words become hackneyed and almost meaningless through overuse, they are called clichés.  An idiom is a group of words whose meaning cannot be understood just by learning the meanings of its components in isolation.  For instance, a foreigner learning Hindi may know the meaning of the verb ‘ukhaŗnaa’ and also the meanings of all those nouns which can associate with this verb, yet it may still be difficult for him to discover that

‘(qaafilaa) ukhaŗnaa’ means ‘(tribe) to move’

‘(tabiiyat) ukhaŗnaa’ means ‘to lose (one’s interest)’

‘(fauz) ukhaŗnaa’ means ‘(army) to disintegrate’

‘(baazaar) ukhaŗnaa’ means ‘(market) to be closed’

‘(saakh) ukhaŗnaa’ means ‘to lose (one’s credit)’

             Similarly, the meaning of the noun ‘aavaaz’ may be known to him, and also the meanings of the verbs ‘aanaa, uţhaanaa, karnaa, khulnaa, ţuuţnaa’, but when these verbs associate with ‘aavaaz’, he may find it difficult to figure out that

‘aavaaz aanaa’ means ‘to hear a noise or sound’

‘aavaaz uţhaanaa’ means ‘to raise an objection’

‘aavaaz karnaa’ means ‘to make a noise’

‘aavaaz kasnaa’ means ‘to quip, make a passing remark’

‘aavaaz khulnaa’ means ‘to regain (one’s lost) voice’

‘aavaaz ţuuţnaa’ means ‘one’s voice to crack’

‘aavaaz denaa’ means ‘to call aloud’

‘aavaaz baiţhnaa’ means ‘to become hoarse’

‘aavaaz bharraanaa’ means ‘to become hoarse’

‘aavaaz maarnaa’ means ‘to call aloud’

‘aavaaz maariijaana’ means ‘to become hoarse, have laryngitis’

‘aavaaz lagaanaa’ means ‘to call aloud’

 Similarly, his knowledge of the meaning of collocations like ‘acchaa lagnaa, aakh aanaa, buxaar aanaa, shaadii karnaa’ etc. will not help him until he discovers the grammatical and semantic rules governing their use.  For example, he will be expected to be familiar with the following framework before he ventures to make a possible Hindi sentence using the above collocations.

1    X1 ko     Y1               ‘acchaa lagnaa’     ‘to be pleasing’
2    X1        Y1 se            ‘shaadii karnaa’    ‘to marry X’
3    X1        Y1 se Z1 kii     ‘shaadii karnaa’    ‘to marry X to Y’
4    X1 ko                      ‘buxaar aanaa’      ‘to have fever’
5    X1 kii                     ‘aakh aanaa’        ‘to have pink eyes’

X1 = animate, logical subject, always followed by the postposition ‘ko’, therefore always in the oblique case (1, 4).

Y1 = either animate or inanimate, grammatical subject, never followed by any postposition, therefore in the direct case.

X1 = animate, grammatical subject, never followed by a postposition except in the perfective aspect, therefore in the direct case (2, 3).

Y1 = animate, followed by the postposition ‘se’ (2, 3).

Z1 = animate, logical as well as grammatical object, always followed by the postposition

X1 = animate, functioning as an adjective, always followed by an adjectival postposition, therefore in the oblique case (5).

This type of analysis will not only give the learner a ready-made grammatico-semantic frame of the target language but will also make him conscious of the nature of the language.

   Collocations could also be a powerful source for the study of the complex social traditions of the culture of the target language.  The learner will find the collocation ‘first husband’ or ‘previous husband’ very rare, if not totally impossible, in the Hindi language.  However, he can quite frequently find the collocations ‘first wife’ or ‘previous wife’, which reflect the male orientation of Indian society.

   The teaching of Hindi as a foreign language is no longer limited to the teacher’s question, “yeh kyaa hai” ‘what is this’, and the student’s answer, “yeh qalam hai” ‘this is a pen’, or to grammatical transformations like that of “mai jaataa huu” ‘I go’ into “mujhe jaanaa chaahiye” ‘I should go’.  Today foreign students are getting more and more interested in learning Hindi not only as a vehicle of communication or exchange of thoughts but also as a vehicle of cultural exchange.  The growing strength of the native land of Hindi, together with the strong urge for a better and easier way of understanding the country, obliges Hindi linguists to do a lot more than simply write textbooks.  It is my considered opinion that the teaching of Hindi can also be facilitated by the proper use of a linguistically oriented collocational dictionary, as it is the only place where each word can be treated individually as well as in association with other words.  The tendency to associate is the most fundamental pattern into which lexical items enter, and it is noticed that Hindi learners often make mistakes in this area.  Such a dictionary will not only help them learn the right associations but will also help them select the right sociolinguistic register and acquire competence in translation.
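For a dictionary kept in machine-readable form, the five frames tabulated earlier might be stored as sketched below. The representation is an assumption made for illustration (the names `Frame`, `frames_for`, and `pattern` are hypothetical); the collocations, postpositions, and glosses are those of the framework itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    collocation: str
    gloss: str
    slots: tuple    # (slot name, postposition); "" marks the direct case

FRAMES = [
    Frame("acchaa lagnaa", "to be pleasing", (("X", "ko"), ("Y", ""))),
    Frame("shaadii karnaa", "to marry X", (("X", ""), ("Y", "se"))),
    Frame("shaadii karnaa", "to marry X to Y", (("X", ""), ("Y", "se"), ("Z", "kii"))),
    Frame("buxaar aanaa", "to have fever", (("X", "ko"),)),
    Frame("aakh aanaa", "to have pink eyes", (("X", "kii"),)),
]

def frames_for(collocation):
    """All frames recorded for one collocation."""
    return [f for f in FRAMES if f.collocation == collocation]

def pattern(frame):
    """Spell out the frame as a sentence pattern, e.g. 'X ko Y acchaa lagnaa'."""
    parts = [f"{name} {post}".strip() for name, post in frame.slots]
    return " ".join(parts + [frame.collocation])

for frame in frames_for("shaadii karnaa"):
    print(pattern(frame), "=", frame.gloss)
```

Keeping the postposition with each slot lets the same collocation appear under more than one frame, as ‘shaadii karnaa’ does.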

ऊपर (uupar) (X1 ke + uupar aanaa) to resemble (लड़का बिलकुल अपने बाप के ऊपर आया है)

1. on
2. upon, on the top of
3. above
4. upstairs

1. ऊपर उठना a. to rise, aspire
b. to be lifted
2. ऊपर उठाना a. to promote
b. to lift, uplift
3. ऊपर का outer, outward
4. ऊपर का खर्च overhead charges
5. ऊपर का दूध formula milk
6. ऊपर का पद higher position
7. ऊपर की बात a. something supernatural, something demonic
b. formal
8. ऊपर के दाँत false teeth, dentures
9. ऊपर के लोग superior officer
10. ऊपर वाला God
11. ऊपर ऊपर से a. through the back door
b. from the top
12. ऊपर से a. on the top of
b. superficially

माल (maal) m.

1. wealth, property
2. goods, luggage, thing(s)
3. merchandise
4. delicacies, goodies
5. a beautiful girl, a real dish
*6. an ugly, boring or crude girl, a dog
7. a string to turn the spinning wheel

माल उड़ाना to gorge on delicacies
माल काटना to embezzle
माल खाना see 1, 2
माल पीना see 1, 2, 3
माल मारना see 1, 2, 3, 4

लगना (lagnaa) Int +

1. to be attached
2. to be harnessed
3. to be fixed
4. to be affixed
5. to be stuck

लगना (lagnaa) Int -

पर में लगना Y to irritate Y
(यह दवा बहुत तेज है। मेरे जख्मों पर लगती है।)
को लगना Y to agree with X
(उनका स्वास्थ्य सुधर रहा है। दवा उनको लग रही है।)
को लगना Y to agree with X
(सुरेश को लगता है कि पिता जी आज नहीं आएँगे।)


Some Problems In Compiling Bilingual Dictionaries

G.N. Reddy

 

This paper presents briefly some of the editorial problems encountered by Indian lexicographers in the production of bilingual dictionaries with English as the source language.  Normally the work of dictionary making, whether the dictionary be bilingual or unilingual, has two phases or aspects, namely (1) selection of the material, the ‘words’, for the main entries and (2) definition and/or explanation for each main entry.  The problems that arise in dictionary making can be discussed under these two aspects separately, because they are neither identical nor of the same complexity.  In this paper I am concerned only with some of the problems relating to the first aspect, that is, the selection of the English ‘word’ material for the main entries in a bilingual dictionary.

   The selection of English words for a bilingual dictionary largely depends upon the judgement of the editor or compiler, who necessarily has to take into consideration the type of users of his dictionary and their needs.  While unabridged dictionaries in English may contain as many as 500,000 words, the usual desk or college dictionary may contain 120,000 to 150,000 words.  In these English dictionaries, the selection and inclusion of the word material is decided upon from the point of view of the user.  Indian lexicographers do not seem to have made any attempt to prepare an English word list of their own, taking the Indian user into consideration.  They usually base their word list on one or two of the standard dictionaries in English, according to their judgement.  The compilers are usually silent on the criteria for their omissions or additions, even while they acknowledge a particular dictionary as their basis.  Their selection of word material seems to be more arbitrary than anything else.  To illustrate this point, we can look into four English bilingual dictionaries (abbreviated KD, TD, BSID, and CEHD in the table below) which are largely based on the Concise Oxford Dictionary (COD).

     A few entries from these dictionaries may be examined.

 

COD entry               KD     TD     BSID   CEHD

(A) baboo/babu          √      √      x      x
    badmash             √      √      x      x
    baksheesh           √      √      x      x
    bandicoot           √      √      x      √

 

1.   “In the matter of selection of words, the committee have taken as their guide the Concise Oxford Dictionary (edition not mentioned).  In the grouping of words under single heading the Concise English Dictionary by Charles Annandale has been generally followed.  The Shorter Oxford Dictionary and the Webster’s International Dictionary have also been frequently consulted for meanings and for additional useful words, scientific, technical etc.”

E.R. Srinivasamurthy,

Chief Editor in the Preface

2.   “The Oxford English Dictionary (Concise.  1958 edn.) be taken as the basis for preparing the basic word list”.

Dr. A.C. Chettiar

Chief Editor in the Preface.

3 & 4. These two dictionaries do not make any particular reference to any English dictionary, but the editors seem to have largely followed the COD.

 

COD entry               KD     TD     BSID   CEHD

    cabob (s)           x      √      x      x
    canarese            √      √      x      x
    chela               √      √      x      √
    cooly/coolie        √      √      √      √
    koran               √      √      x      √

(B) albeit-             x      √      x      x
    cat-, catah-, cath- √      x      √      √
    cheilo-, /chilo     x      x      x      x
    ecto-               √      x      x      x
    pre-                √      x      √      x
    pro                 √      x      √      x

(C) -cy                 x      x      x      x
    -ery                √      x      x      x
    -ful                √      x      x      x
    -fy                 √      x      x      x
    -ry                 √      x      x      x

(D) Abderite            √      √      x      x
    Abernethy           x      x      x      x
    abigail             x      √      x      x
    absquatulate        x      √      x      x
    accidie (accedia)   x      x      x      x
    Aceldama            √      x      x      x
    ack emma            x      x      x      x
    Agnus Dei           x      x*     √      x
    alb                 √      x      x      √
    Baedekar            √      x      x      √
    braise v.t.         √      x      x      √
    caber               √      √      x      √
    calando             x      √      x      √
    camerlingo          x      √      x      x
    chop suey           x      √      x      x
    chin chin           x      √      √      √
    Christy Ministrels  √      √      √      √
    crouton             x      √      x      √
    cruller             x      √      x      x

* Only ‘Agnus’ is given.

 

In the above list, a few samples under four categories are given to show that no dictionary seems to have consistent criteria either for inclusion or for deletion of certain lexical items.  A few Indian words in English are given in category A.  There seems to be no justification for including such Indian borrowings in the Indian bilingual dictionaries, except for those items which have undergone some significant phonological and/or semantic change.  Under categories B and C, prefixes and suffixes are given respectively, and in their entry also the Indian dictionaries do not seem to have any definite criteria.  A criterion could be set up in this regard at least for those prefixes and suffixes which are semantically definable.  Category D includes words which usually have a low frequency of occurrence from the point of view of the Indian reader.  The lexicographers, in the absence of any definite criteria or frequency study with regard to such words, appear to have made their selection or deletion rather arbitrarily.

    It must be accepted that it is difficult to maintain that all the words in the English vocabulary are of equal importance, either to the native speaker or to a foreigner.  Normally, a bilingual dictionary with a definite purpose to serve need not include all the words in the source language.  The OED contains approximately half a million words that are defined and explained, and out of this multitude Shakespeare is said to have used only 25,000.  In each language a few thousand words, which constitute the core of the language, serve about 90 percent of the communication needs of its speakers.  Thus, the word list for a bilingual dictionary must be made keeping in view the class of users.  Instead of being made arbitrarily, according to the whims and fancies of the editors, such a list may be made by a frequency study based on the English books and writings generally used in India.  This study need not be confined to a particular linguistic region of India, as the English used in India is generally the same for all the linguistic regions.
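    The kind of frequency study suggested here can be sketched in a few lines of Python.  What follows is only an illustration under stated assumptions, not a method proposed in this paper: the function name core_word_list, the simple tokenising rule, and the 90 percent default are all hypothetical choices.  The sketch counts the running words in a collection of texts and returns the most frequent words that together account for the chosen share of all running words.

```python
import re
from collections import Counter


def core_word_list(texts, coverage=0.90):
    """Return the most frequent words that together make up at least
    `coverage` of all running words (tokens) in `texts`.

    Assumption: a 'word' is a run of letters, optionally followed by an
    apostrophe and more letters; everything is lower-cased.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

    total = sum(counts.values())
    selected = []
    covered = 0
    # Most frequent first; ties are broken alphabetically so that the
    # result is the same on every run.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if total == 0 or covered / total >= coverage:
            break
        selected.append(word)
        covered += freq
    return selected


if __name__ == "__main__":
    sample = ["the cat and the dog", "the cat sat"]
    print(core_word_list(sample, coverage=0.6))
```

    In an actual study the texts would be the English books and writings generally used in India, and the resulting list would still have to be reviewed by the editor, since raw frequency says nothing about which senses of a word the user needs.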

    We are now under the influence of both British and American English.  British and American lexicographers differ both in their lists of lexemes and in their spelling.  This again poses a problem for Indian lexicographers.  Some Americanisms which are not found in British dictionaries are also gaining currency in our country, for example: bleachers, bluet, burro, jumbo, okay/O.K., pinto, pixilated, etc.  The COD marks some of these Americanisms with an asterisk.

    It is to be recognised that India is