19 January 2010

give and take: math and linguistics

this is a response to the excellent post "Why Linguists Should Study Math" over at The Lousy Linguist, which i found via fellow Cornellian @nmashton on twitter. i was going to just write a comment there, but i realized that it would probably become rather long.

first of all, let me say that i am in absolute agreement with the sentiments put forth by Chris in his post. in fact, i'm going to be auditing the brand new, never-before-offered Statistics for Linguists course this semester. but i think that one major point needs to be added.

simply: there is a grave asymmetry in linguists learning math versus mathematicians (or statisticians, or computer scientists, etc.) learning linguistics.

here's the scenario. you're a grad student in linguistics. this means that you went to high school once, and probably were rather good at most of your subjects, or you wouldn't be a student anymore. in high school, they made you learn math. if you were really good at it, you made it through single-variable calculus; if not, probably trig. even if you didn't like it and haven't touched math since, you should have a decent sense of How Math Works, in case you need to pick it up again.

but the converse just isn't true. i've audited the NLP course at Cornell, which is taught by an excellent professor in the CS department who has a very solid grounding in theoretical linguistics. but that almost doesn't matter given the fact that there are zero prerequisites for the course. that's right, no LING101, no nothing. the demographics of the ~80-person lecture break down roughly as 70 CS undergrads, 9 linguistics undergrads, and 1 lonely linguistics grad student.

so what's the big problem? they'll learn as they go, right? learning by doing is the best way, no? wrong. as has been shown time after time on Language Log and elsewhere for this and other fields (law, education, etc.), these would-be NLPers have a complex against linguistics. i think they recognize that they're uninformed on the finer points of linguistic theory, but because "hell, i speak a language!" they don't think they need any more expertise to solve complex linguistic problems. throw more code at it, throw more servers at it, we can brute force our way through. i've watched them re-invent the wheel, and it's a square wheel with an off-center axle. and they're not looking to refine its design, or ask those crazy round-wheeler linguists what they've got cooking in their lab. instead they're trying to make titanium and carbon-fiber square wheels, thinking that will improve things. the mantra is to strive for good enough rather than (i concede, unattainable) perfection.

i think that linguists are more and more cognizant of the need for mathematical training. and for those who just aren't math types, they're willing to go find fellow linguists who are, or even statisticians and computer scientists outside their departments to collaborate with. but nobody comes knocking on the linguistics department door. it's open, guys, and seriously, you could stand to visit. we won't bite.


John L said...

It's also the case that most of current linguistic theory (as opposed to linguistics in general) is unsuitable for computational consideration. Here's what Bayer et al. said about a dozen years ago in Chapter 8 of "Using Computers in Linguistics" (see URI at end); it hasn't changed much since, I'm afraid.
In response to the demands imposed by the analysis of large corpora of linguistic data, statistical techniques have been adopted in CL which emphasize shallow, robust accounts of linguistic phenomena at the expense of the detail and formal complexity of current theory.

Nevertheless, we argue in this chapter that the two disciplines, as currently conceived, are mutually relevant. While it is indisputable that the granularity of current linguistic theory is lost in a shift toward shallow analysis, the basic insights of formal linguistic theory are invaluable in informing the investigations of computational linguists; and while corpus-based techniques seem rather far removed from the concerns of current theory, modern statistical techniques in CL provide very valuable insights about language and language processing, insights which can inform the practice of mainstream linguistics.
In other words, a large subset of language can be handled with relatively simple computational tools; a much smaller subset requires a radically more expensive approach; and an even smaller subset something more expensive still. This observation has profound effects on the analysis of large corpora: there is a premium on identifying those linguistic insights which are simplest, most general, least controversial, and most powerful, in order to exploit them to gain the broadest coverage for the least effort.


Ed Cormany said...

John, the snippet you quoted here seems accurate (i haven't had a chance to read the linked items yet). and you're right, i think little has changed. that's what i was getting at with the square wheel metaphor. the first pass gets something like 60-75% of cases right. not bad! but instead of making the leap to the higher level to get 85 or 90+ percent right, the trend is to polish the first pass so it gets 76 or 77% right. i'm not saying that polishing might not be useful, but it would make sense to go for bigger returns first, where possible, and then revisit it.

and before anyone jumps on me, those numbers are just for illustrative purposes, but i think they're not too far off, at least for certain NLP tasks.

John Lawler said...

As you'll see when you do read it, it's not 75%, but closer to 95%, at least if you're looking at meaning. And that was 15 years ago. Without parsing at all. How much are we willing to pay for the extra 5%?

Chris said...

Ed, excellent response. I think you've made a good point that the linguistics part is generally considered "easy" or "trivial" compared to the computational part. but I guess we all have our biases. Physicists think the physics part is the tough part, biologists think the biology part is the tough part, etc. True teamwork is rare, unfortunately.

As for John & John (same John??), I spent a few years in the NLP industry (info extraction), and the 95% sounds about right for things like entity tagging, POS tagging, and such. But wow, that last 5% made my head spin. Truly difficult.

Ed Cormany said...

ok, good for me to get my numbers straight. in the NLP course i took, the goal to shoot for was 65-70% on tasks like Chris mentioned, but we were beginners.

perhaps what i was thinking of were more complex tasks, like machine translation. even if it's at 95%, giving it a little syntactic help rather than saying "omg, our algorithm doesn't do wh-questions* right, we better just enlarge our corpus and n-gram catalog" could go a long way.

*this is presuming that the people working on these things would recognize a wh-question if it hit them in the face. and that's one of the more transparently named linguistic terms, at least for English speakers.
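to make the "enlarge our corpus and n-gram catalog" point concrete, here's a toy bigram model (my own illustration, not code from any system discussed above). it shows what a purely count-based approach does and doesn't capture: local word co-occurrence, yes; the long-distance dependency between a fronted wh-word and its gap, no.

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Count bigram frequencies over tokenized sentences,
    padding each with start/end markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

# tiny made-up corpus for illustration
corpus = [
    "what did you see".split(),
    "you saw the dog".split(),
    "the dog saw you".split(),
]
counts = train_bigrams(corpus)

# the model captures local co-occurrence: "saw" can follow "you"
print(bigram_prob(counts, "you", "saw"))   # 1/3 in this corpus

# but it has no notion that in "what did you see", transitive "see"
# ends the sentence only because its object has been wh-fronted;
# to the model, "see" sentence-finally is just another count
print(bigram_prob(counts, "see", "</s>"))  # 1.0 in this corpus
```

the counts will happily tell you which strings are frequent, but nothing in them represents the fronted wh-word binding the gap after "see". that dependency is exactly the kind of structure a syntactician could hand the engineers for free.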

Chris said...

FYI, regarding those numbers, they are inevitably based on some "gold standard" created by human annotators. Breck Baldwin has a nice post on the value of gold standards wrt certain NLP tasks (e.g., gene name recall) at his LingPipe blog HERE.