15.01.08

How to make a dictionary- 15.1.2007

COMPUTATIONAL LEXICOGRAPHY




Criteria for Good Lexicography:

• Quantity:
– Completeness of coverage:
o extensional coverage: number of entries
o intensional coverage: number of types of lexical information

• Quality:
– Correctness of information:
o Types of lexical information
– Consistency of structure:
o Macrostructure
o Microstructure

o Mesostructure



Concordance:
• A KWIC (KeyWord In Context) concordance is a special kind of preliminary, corpus-based dictionary:
– each word in a text corpus is paired with its contexts of occurrence in this corpus
• Note: a Google results page, which shows each search term with a snippet of its surrounding text, can be seen as a special form of KWIC concordance
• Example text:
“My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.”



Alphabetically ordered KWIC:




Simplest KWIC procedure:
1. Corpus creation: make a corpus of texts in electronic format
2. Tokenisation (pre-process each text):
- process punctuation marks
- break the text into context units (lines/sentences)
3. Keyword list extraction (all words in text)
4. Context collation (for each keyword)
5. Search for KWIC in corpus
6. Store and format the output for printing or hypertext (CD, web)




KWIC: Dictionary Making
• The function of a KWIC is
– to make searching for lexical information more efficient by putting context information about words in one place, for making “Word Sketches” (Adam Kilgarriff):
• grammatical descriptions: parts of speech
• dictionaries: examples of use, collocations, ...
• Project: Make concordances from your text corpora and use them to collect lexical information for your Toolbox lexical databases


The Status of Dictionaries:
• Remember that the dictionary is
– one of the three main components of language documentation:
• corpus of recordings and texts
• dictionary
• sketch grammar
– the central component of any linguistic description
– the most useful linguistic product for use by the speech community, or non-linguists in general



The Ibibio Dictionary:
• The Ibibio Dictionary
– uses information from Elaine Kaufman's Ibibio Dictionary
– the information was re-typed into an Office table format
– this was converted into
• Toolbox format for further lexicographic extension
• LaTeX for formatting (cf. the Ibibio Concordance)
• Project: extend the Ibibio corpus, concordance, dictionary in scope & context





QUIZ:
• What are the 6 main steps in KWIC concordance construction?
• Explain each of these steps:




KWIC procedure: 1. Corpus creation
• My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.



KWIC procedure: 2. Tokenisation
• In the text:
My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.
• Process
– upper case (capital) letters
– punctuation marks
• To produce:
my first sight of england was on a foggy march night in 1973 when i arrived on the midnight ferry from calais
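
The normalisation above can be sketched in Python. This is a minimal illustration, not the course's own tool: the function name `tokenise` is hypothetical, and simply deleting ASCII punctuation is only one possible way to "process" it (it would turn "don't" into "dont", for instance).

```python
import string

def tokenise(text):
    """Lower-case a text, delete ASCII punctuation, split on whitespace."""
    lowered = text.lower()
    # str.maketrans with a third argument maps those characters to None,
    # so translate() deletes every punctuation mark.
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

text = ("My first sight of England was on a foggy March night in 1973 "
        "when I arrived on the midnight ferry from Calais.")
tokens = tokenise(text)
print(" ".join(tokens))
```

Joining the tokens with single spaces gives the normalised line shown above.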


KWIC procedure: 3. Keyword List
• Replace each sequence of SP (space) characters by an LF (linefeed) / NL (newline), giving one word per line
• Sort the list alphabetically
• Remove duplicate words
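
These three operations can be sketched in Python as follows. The variable names are illustrative assumptions. Note that a plain character sort puts the number 1973 before all the letters.

```python
import re

normalised = ("my first sight of england was on a foggy march night in 1973 "
              "when i arrived on the midnight ferry from calais")

# Replace each run of spaces by a newline: one word per line.
one_per_line = re.sub(r" +", "\n", normalised)

# Remove duplicates with a set, then sort alphabetically.
keywords = sorted(set(one_per_line.split("\n")))

print("\n".join(keywords))
```

The example sentence has 22 word tokens, but "on" occurs twice, so the keyword list has 21 entries.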









KWIC procedure: 4. Contexts




KWIC procedure: 5. Search
• For example:
– on is found in the middle of the following context units:
• was on a
• arrived on the
– arrived is found in the middle of the following context units:
• i arrived on
• i arrived on
– etc.
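
The search step can be sketched in Python, assuming a context unit is the keyword plus one word on each side, as in the examples above. The function name `contexts` and the `window` parameter are hypothetical.

```python
def contexts(tokens, keyword, window=1):
    """Return every context unit that has `keyword` in the middle."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Slices are clipped at the text edges, so the first and
            # last words simply get shorter contexts.
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [token] + right))
    return hits

tokens = ("my first sight of england was on a foggy march night in 1973 "
          "when i arrived on the midnight ferry from calais").split()

print(contexts(tokens, "on"))
print(contexts(tokens, "arrived"))
```

A larger `window` value gives wider contexts, for example `contexts(tokens, "on", window=3)`.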



KWIC procedure: 6. Output format
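
One possible plain-text layout, given as a minimal Python sketch: the hits are sorted by keyword and the left context is right-aligned, so that all keywords start in the same column. The function name `format_kwic` and the (left, keyword, right) triples are assumptions made for illustration. The data are the context units from the search step.

```python
hits = [
    ("was", "on", "a"),
    ("arrived", "on", "the"),
    ("i", "arrived", "on"),
]

def format_kwic(hits):
    """Format (left, keyword, right) triples as an aligned KWIC listing."""
    # The widest left context sets the column where keywords begin.
    width = max(len(left) for left, _, _ in hits)
    lines = []
    # sorted() is stable, so hits for one keyword keep their text order.
    for left, keyword, right in sorted(hits, key=lambda h: h[1]):
        lines.append(f"{left:>{width}}  {keyword}  {right}")
    return "\n".join(lines)

print(format_kwic(hits))
```

The same triples could be written out as an HTML table or a LaTeX tabular instead, for hypertext or print.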













































