07 December 2009

seek and ye shall not find

morphological revelations on my morning comb through twitter and facebook statuses:

whoa. on the other hand, this isn't entirely unexpected. morphology tends to be entropic, that is, it favors simplicity and regularity and minimal expression, and moves in that direction over time. this doesn't mean the language apocalypse is upon us any more than the heat death of the universe, as predicted by physical entropy, is. just like physical entropy, language entropy can be locally reduced by other factors, particularly token frequency. that is to say—in the broadest terms—speakers are likelier to hang on to irregular forms of words that are used all the time, and tend to regularize words that aren't as common.

that brings me to my "whoa" moment. i just hadn't realized that 'seek' was possibly on the cusp of regularization. so the question is, how does 'seek'/'sought' stack up to other verbs with past tense forms in -ought? to get a comprehensive list, i turned to a reverse dictionary, which yielded just five non-compound -ought pasts: bought, fought, thought, brought, and our test case, sought. next, to test their frequencies, i headed to wordcount.org, a nifty visualization of frequency in the British National Corpus. admittedly the BNC might not give the most precise results for predicting the tendencies of young speakers in Michigan, but it should be accurate enough. here are their ranks (not token counts; smaller numbers indicate higher frequency):

buy/bought: 785/1129
fight/fought: 1484/3204
think/thought: 102/152*
bring/brought: 631/461
seek/sought: 1875/1895

the data reveals that i perhaps shouldn't be as surprised as i was. 'seek' is the least frequent of the five verbs, although strangely 'fought' is the least frequent past tense form. i starred 'thought' since its frequency is probably affected considerably by use of the noun 'thought'. also of note is the fact that 'bring' is the only item whose past tense is more frequent than the base form; this is due to the fact that 'bring' requires a progressive present tense ("I bring the wine" ≠ "I am bringing the wine" but rather "I (habitually) bring the wine"). despite—or perhaps owing in part to—its frequency, 'bring' is subject to taking on a different irregular pattern, 'bring'/'brang'/'brung' in many children's speech and some adult dialects.
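for anyone who wants to check the comparisons against the ranks listed above, here is a minimal sketch in Python. the numbers are copied straight from that list; the variable names (ranks, rarest_base, and so on) are just illustrative, and the only assumption is that a larger rank number means a less frequent word.

```python
# ranks copied from the list above (BNC ranks via wordcount.org);
# each entry is (rank of base form, rank of past form).
# a smaller rank means a more frequent word.
ranks = {
    "buy": (785, 1129),
    "fight": (1484, 3204),
    "think": (102, 152),
    "bring": (631, 461),
    "seek": (1875, 1895),
}

# least frequent base form = largest base rank
rarest_base = max(ranks, key=lambda verb: ranks[verb][0])

# least frequent past form = largest past rank
rarest_past = max(ranks, key=lambda verb: ranks[verb][1])

# verbs whose past tense is more frequent (smaller rank) than the base form
past_more_frequent = [
    verb for verb, (base, past) in ranks.items() if past < base
]

print(rarest_base)         # seek
print(rarest_past)         # fight
print(past_more_frequent)  # ['bring']
```

this only compares ranks, so it says nothing about how large the frequency gaps actually are; for that you'd need the token counts, which i didn't collect.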

anyhow, to wrap this up, it looks like 'sought' might well be the best candidate of these forms to undergo regularization, even if i hadn't expected it before. the only other form that might do the same is 'fought'-->'fighted', but i think that would be even more surprising...i'm actually wondering why its frequency turned out to be so low in the BNC.

a postscript: although i certainly have 'sought' as the past tense of 'seek' in its basic sense "to look for", 'seeked' is also in my lexicon. it's the past tense of the relatively new lexical item 'seek' "to move rapidly through a video or audio clip". 'sought' is terrible as its past tense:

i seeked ahead 2 minutes to skip the commercials.
*i sought ahead 2 minutes to skip the commercials.

this kind of regularization is a common symptom of generating a new, distinct lexical entry from an existing form, cf. the classic case bad/worse/worst vs. bad/badder/baddest.

[UPDATE] regarding 'wrought', which is very low frequency and which i (rightly) eliminated from consideration as not being a productive past form: i commented the following on the ongoing facebook thread that prompted all this:
'wrought' is a strange case...it's actually the old past participle of 'work' (e.g. "wrought iron" = "worked iron" ≠ "wreaked iron"), and the historical past tense of 'wreak' is regular 'wreaked'. they got conflated because both 'work' and 'wreak' were used in the "____ havoc" idiom. since 'wrought' is almost never used outside the idiom any more, it probably doesn't fit into the regularization question here.

27 August 2009

surprisal for dogs

today's Frazz comic:

i don't know if anybody has actually done research on dogs' abilities to learn frequency-based patterns (although we had cottontop tamarins not so long ago). and unlike the grammar-pattern-sensitive monkeys, Mario didn't even wait to confirm the probability-based prediction; he just went for it.

07 February 2009

AND??? and i hate you, congress.

why, why do i do things like try to figure out what has been going on in the Senate regarding the scazillion-dollar stimulus plan? it only a) gets my blood pressure up and b) confirms that our elected representatives are morons, or at least have stared at legislative doubletalk for so long that their judgements about English have been seriously compromised. exhibit 1: Senate Amendment 309, introduced by Thomas Coburn (R-OK)

At the appropriate place, insert the following:

SEC. __. LIMIT ON FUNDS.

None of the amounts appropriated or otherwise made available by this Act may be used for any casino or other gambling establishment, aquarium, zoo, golf course, swimming pool, stadium, community park, museum, theater, art center, and highway beautification project.

take another look. "…museum, theater, art center, and highway beautification project"??? that is one hell of a project. in fact, i'm pretty sure you won't be finding any such mega-conglomerate initiative anywhere in the original bill. and they passed this amendment. what a waste of time. idiots.

of course, having someone proofread the damn thing and change 'and' to 'or' would have saved the nonsense that this will create in conference committee, in the courts if the bill is signed into law with this amendment in place, &c. &c.

01 February 2009

Pearls Before Swine takes on English-only

sums it up pretty well, i should say:

27 August 2008

flavors of English on Google

i was just looking through the site statistics for this here blog. one of the most interesting and useful bits of information that statcounter provides me is the set of search terms that people use. i would say that 99% of these searches are done on Google; we really have drunk the pagerank kool-aid. a lot of searches are pretty lengthy and specific (e.g. "kobe bryant interview in italian" or "who is the girl in the benny lava video?"). one recent search stuck out to me, though. somebody searched for just the word "whomever", and wound up at my previous post "The Office on whomever". i thought that was pretty remarkable. i clicked through on the link that statcounter provided me and saw that the search was made on google.co.uk, and that descriptively adequate was on the front page of results, at position number 6.

then, for whatever reason, i decided to re-run the search using google.com. my post was nowhere to be found on the first page. the results were entirely different. descriptively adequate finally showed up at #14 on the list of results. what's going on? certainly google hasn't written different versions of pagerank to deal with different localizations of English? as far as cataloguing search results goes, the fact that a bunch of Americans in California wrote the algorithm shouldn't adversely affect Brits and the like.

i couldn't stop there. i ran the search on all of the English Google localizations that i could think of, and got even more different results. i've also noted the number of total results that Google estimates, which also (oddly) vary by localization.

localization     # (rank)   total hits
google.com       14         7,480,000
google.co.uk     6          8,200,000
google.ca        7          8,180,000
google.com.au    10         8,190,000
google.com.nz    7          8,460,000

as i was compiling this table i remembered that Google mucks with your search results if you're signed in (which i of course had to be in order to access blogger, without which i couldn't be writing this post). i signed out, and on google.com the DA link rose to #4. i guess i should just be happy i'm on the front page on all of these searches. but there are still lingering, bizarre questions.

why does Google report different numbers of hits for different localizations?
no clue. (comments are open!)

what is causing the rank fluctuations even when i'm not logged in?
some clue. on all of the non-US localizations there is a feature "search pages from [country name]". perhaps i've got fewer australian sites linking to my blog, so my rank is slightly lower in australia than in the US or great britain.

why the hell is Google biasing my custom algorithm against my own damn blog?!
i mean throw me a bone here, guys.

and the baffler...
why do i get this on google.ca?
i mean, you're kidding, right? i'm sure that the frequency of whatever is much higher than that of whomever, but 8 million hits on a word that's in the dictionary should be enough data for google to not question my intent. and why only canadians, eh? this, of course, isn't the first time that i've seen weird spelling suggestions on Google. so perhaps they really do think they know something about English varieties that i don't?