Linguistics 158: Computer-aided methods in linguistics
John B. Lowe
Department of Linguistics
University of California, Berkeley - Spring 1997
Assignment 4: Corpora
(due: Th 13 Mar, 1997)
The point of this assignment is:
- To demonstrate that you can use a largish text resource as the basis for drawing inferences about linguistic patterns.
- To get an interesting, though perhaps preliminary, result. You should try to find at least one new fact that, as far as you know, no one has yet observed.
Instructions
Find a corpus.
- If you created a "textbase" or corpus in Assignment #3, you may wish to use it. It does not matter for purposes of this assignment that the text is not perfect: it matters that it is in good enough shape to be used for some analysis.
- You may find or build a corpus based on pre-existing material. If you search the web, you will find a number of sources of "full-text", dictionaries, thesauri, and other corpora. You could download one or more of these to meet the requirements of this assignment. If you are interested, a large number of dictionaries and other lexical sources in Sino-Tibetan and Bantu languages are available.
- If you would like access to data in some of the large English text corpora demonstrated in class, contact me and I will arrange either for you to do the searching yourself or for you to get a subcorpus of sentences which you can analyze using a microcomputer.
Provide an analysis.
- That is, using corpus tools (e.g. CONC, KTagger) or other software which you find or develop (e.g. HyperCard), perform some sort of linguistic analysis on the corpus.
- Your analysis should take into account any weaknesses or shortcomings of your data. If your dataset is merely a sample, or if you know that it has certain random or systematic deficiencies as a result of scanning or the way you chose the sample, try to account for these explicitly.
- Here are some suggestions for things you might do
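One possible starting point, offered as a minimal sketch and not as the course's own method: a keyword-in-context (KWIC) concordance of the general kind that concordance tools such as CONC produce. The function name `kwic`, the `width` parameter, and the sample text are all hypothetical, chosen for illustration; a real analysis would read your own corpus file instead.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`.

    Hypothetical helper: matching is case-insensitive, and `width` is the
    number of characters of context kept on each side of the match.
    """
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        start, end = match.span()
        # Flatten line breaks so each hit prints on a single line.
        left = text[max(0, start - width):start].replace("\n", " ")
        right = text[end:end + width].replace("\n", " ")
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Made-up sample text; substitute the contents of your own corpus.
sample = "The cat sat on the mat. A dog chased the cat.\nThe cat ran away."
for line in kwic(sample, "cat", width=15):
    print(line)
```

Sorting the returned lines by their right-hand context is a common next step, since it groups together the words that tend to follow the keyword.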
What to turn in.
- What you turn in should be in the format of either:
- a squib, that is, a brief, sometimes witty literary effort. This should be at least six (6) pages long. Include some bibliographical references.
- a research prospectus or pilot project. In this case, you should have three sections, each a page or two long: an introduction, stating the goal of the research and expected results; a methodology section, which outlines how you carried out the research; and a results section, which describes what you found and what the implications of the research are. Since this is a pilot project, you need not come to any definite conclusions. If it looks like you are on to something, you may wish to expand the effort (either by getting more or better data or by doing a more thorough analysis) and use this assignment as the basis for a final project.
- Brownie points are given if you use the computational work you do in this assignment in meeting the requirements of another linguistics class.
"Answers" to homework
This document is: http://www.linguistics.berkeley.edu/Lx158/assignments/Corpora.html