How Many Words is Fluent?

Posted on June 20, 2010

Googling around the internet I found a lot of sites where people had written in saying, “I am studying language XYZ, and I want to know how many words I have to know to be able to read a newspaper.”

This question is particularly relevant for people who are studying Chinese, where each word is a character, and most students know the exact number of characters that they can read. Students who have been studying Spanish, German, or Vietnamese for a period of years, by contrast, generally wouldn’t know the exact number of words that they understand, and may not even know an approximate number.

This information is relevant for anyone studying a foreign language, including English, particularly if your goal is to study at a university overseas or to work in a professional job in the foreign language environment.

Checking a number of websites, the answers varied substantially.

On aksville.com, someone took the time to write a long reply, explaining that major newspapers, such as USA Today, are written at a 6th to 8th grade level and require approximately 3,000 words to read.

On another site, blogonebytes.com, someone wrote: “I read somewhere that to be able to carry on a good conversation in “Mandarin Chinese” one should know about 3,000 characters, and about 7,000 characters to read technical books.”

A follow up comment by a reader on the same site said, “You will need to know a minimum of 3000 characters to be proficient. You will need to be able to speak and understand in the range of 5000-7000 characters.”

According to Omniglot, a site which I tend to have a lot of respect for, “The largest Chinese dictionaries include about 56,000 characters, but most of them are archaic, obscure or rare variant forms. Knowledge of about 3,000 characters enables you to read about 99% of the characters used in Chinese newspapers and magazines. To read Chinese literature, technical writings or Classical Chinese though, you need to be familiar with at least 6,000 characters.”

I had always heard that the range was somewhere between 1,500 and 3,000 words to read a newspaper. In the case of Chinese, I know that I can read right about 3,000 characters, and yet, I absolutely cannot read a newspaper. If you hand me a newspaper, I can pick out words that I know, but I can’t actually read and understand the stories.

In Bangkok, I have several friends who are extremely conversant in Thai, and they can read a menu. But they would need an entire day and a dictionary to read a single newspaper story. And even then, they wouldn’t understand everything.

With German, after four years of studying and working as a translator and researcher in the country, I can obviously read anything. But I have no idea how many words I know. Now that I am embarking on my study of Bahasa Malay, and also making plans to go back and finish learning Vietnamese, I am becoming very curious about how long it will take to get my reading level anywhere close to what it is in English or Spanish. My own experience with Chinese made me question this 3,000 word figure. Also, as a person who earns most of his living from writing for magazines, newspapers, and books, I would hate to believe that I only write with a 3,000-word vocabulary, and on a 6th to 8th grade level.

As many times as I attended 9th grade, you would think I would be writing at least at high school level.

The two facts that I wanted to verify were the average reading level of The New York Times, my hometown paper, and the average number of words per edition.

The first question was easy to answer.

The May 2, 2005 edition of “Plain Language At Work Newsletter”, published by Impact Information Plain-Language Services, explained that there are two generally accepted scales for determining the reading level of various publications. They are the Rudolf Flesch Magazine Chart (1949) and the Robert Gunning Magazine Chart (1952). Both charts analyzed aspects of a magazine or newspaper such as average sentence length in words and number of syllables per 100 words. Based on this information, they assigned a school grade reading level to the publication. According to this rating system, The Times of India was considered the most difficult newspaper in the world, with a reading level of 15th grade. The London Times scored a 12th grade reading level, as did the LA Times and the Boston Globe. The survey must have been flawed, however, because they assigned The New York Times a reading level of 10th grade, which is lower than the LA Times, when everyone knows quite well that New York is better than California or any other place which is not New York.
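
To make those two measurements concrete, here is a minimal Python sketch of the published Flesch Reading Ease formula, which is built from exactly these inputs: average sentence length and syllables per 100 words. It is an illustration only. Whether the magazine charts cited above rest on this exact formula is an assumption, and the counts in the example are made up. Higher scores mean easier text.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease: 206.835 - 1.015*ASL - 0.846*(syllables per 100 words)."""
    avg_sentence_length = total_words / total_sentences
    syllables_per_100_words = 100 * total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 0.846 * syllables_per_100_words


# Hypothetical counts for a short passage (not taken from any real newspaper).
score = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
print(round(score, 3))
```

Longer sentences and longer words both push the score down, which is why a paper full of long, multi-clause sentences lands at a higher grade level.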

If you get most of your news from Time Magazine, you might be pleased to know that Time and TV Guide both scored a 9th grade reading level.

The survey didn’t cover newspapers written in languages other than English, but if we assume that we are shooting for an average 10th grade level, this will probably be close to what you need to read a newspaper in any language.

The next question was much harder to answer. How many words do I need to read the New York Times? I have never believed the low estimates of 3,000 or less, simply because every event that happens anywhere in the world, any human situation can appear in the Times as a news story and could of course, require the appropriate vocabulary.

To answer the question, I went to the June 4, 2010 New York Times online and I chose 8 articles, taken from several different sections, because I assumed they would all require different vocabulary. The stories were: “Pelicans, Back From Brink of Extinction, Face Oil Threat”, “BP Funneling Some of Leak to the Surface”, “John Wooden, Who Built Incomparable Dynasty at U.C.L.A., Dies at 99”, “An Appraisal : Wooden as a Teacher: The First Lesson Was Shoelaces”, “Should you be able to discharge student loans into bankruptcy?”, “On the Road to Rock, Fueled by Excess” as well as other tidbits, announcements and follow up articles.

In some cases, if the articles were very long, I didn’t take them in their entirety, assuming there would be much repetition of words.

In all, I took parts of about 8 stories, comprising 51 pages of text. The stories I took didn’t even represent 10% of the total content of the June 4, 2010 online edition of The New York Times.

I pasted the words into a Word document and converted them to a single-column table, which ran over 450 pages long. Then I sorted the table alphabetically. Up to this point, it was easy, just pressing buttons. Next, I had to go through all 450 pages, all tens of thousands of words, removing duplicates. It was one of the most tedious exercises I have ever conducted in my life. It was exactly the type of obsessive compulsive behavior that gets people locked up in mental institutions. It took 16 hours. By the 10th hour, I began hallucinating. Nearing the 12th hour, I believed I was a hummingbird of some kind.

I allowed plural forms of nouns, so I counted “car” once and “cars” once. I also included all forms of a verb, so “walk” once, “walked” once, and “walking” once. I counted proper nouns, including place names, as the names of people and countries will come up in the news and you need to know them. Also, in foreign language, particularly Asian languages, the grammatical forms and proper names may not even be recognizable if you haven’t studied and learned them.
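
The counting rules above can be sketched in a few lines of Python. This is not the procedure used for the article (that was done by hand in a word processor); it is a minimal illustration under some assumptions of its own: words are lowercased so that “The” and “the” count once, hyphenated compounds stay together, and only straight apostrophes are handled. The function name `unique_words` and the sample sentence are made up for the example.

```python
import re


def unique_words(text):
    """Return the sorted list of distinct word forms found in text.

    Inflected forms ("car"/"cars", "walk"/"walked"/"walking") are kept
    as separate entries, and proper nouns are counted like any other word.
    """
    # A word is a run of letters, optionally joined by hyphens or apostrophes.
    tokens = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)
    return sorted({token.lower() for token in tokens})


sample = "The car passed two cars. He walked; she is walking, and they walk."
words = unique_words(sample)
print(len(words))
print(words)
```

Pasting the text of the articles in place of `sample` would give a count comparable to the hand count, apart from the lowercasing.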

When I was finished, I found that the random sampling of stories I chose contained 4,139 unique words. This was much higher than the estimates I had read on some websites, but was well in line with what I suspected. If I had the energy to complete a similar analysis of the entire edition, I would have to believe the number would increase. And if we monitored the newspaper over a period of one month, analyzing the text every day, and comparing the vocabulary against an accumulated list, I would imagine that it would grow. Most likely the difference in vocabulary from day to day would be small, but still, the necessary vocabulary would increase.
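
The month-long experiment described above, comparing each day's text against an accumulated list, could be sketched like this in Python. It is only an illustration: the three “editions” are invented one-line stand-ins, and the helper names are hypothetical.

```python
import re


def word_forms(text):
    """Set of distinct lowercase word forms in one day's text."""
    return {w.lower() for w in re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)}


def cumulative_vocabulary(daily_texts):
    """Size of the accumulated word list after each day's edition."""
    seen = set()
    sizes = []
    for text in daily_texts:
        seen |= word_forms(text)  # add today's words to the running list
        sizes.append(len(seen))
    return sizes


# Made-up one-line "editions", only to show the shape of the result.
editions = [
    "Oil threatens the pelicans",
    "The oil leak reaches the surface",
    "Pelicans face the oil leak",
]
print(cumulative_vocabulary(editions))
```

The list can only grow or stay flat from one day to the next; how quickly it flattens out on real newspaper text is exactly the open question.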

When I compared the dialogues in my Chinese textbooks with the vocabulary that appeared in these New York Times articles, much of what I learned in school turned out to be useless. For example, all foreign language textbooks have chapters devoted to shopping at the market, where you have to memorize tedious lists of fruits and vegetables. In these Times articles, not a single fruit name was mentioned. None of my Vietnamese, Chinese, or Bahasa textbooks includes the names of heads of state of various countries. But obviously, these names came up in world news stories.

Below is a small sampling of words that I found in the news stories which I don’t know how to say in Chinese. I question, however, whether the average 9th grader would know some of these words. Do 9th graders know: abetted, absinthe, archeo-feminist, or bearish?

abetted albeit assesses bankruptcy biofuels
able-bodied. Amandine assessment batch biography
abortions ambivalent assets bawdy-sweet black-clad
absinthe anachronistic asthmatic bearish bleak
absurd. anarchic audience-pleasing Bedford blemish
accord Appended aura befriended blockade
across-the-board Archbishop autobiography behind-the-back blowout
activists archeo-feminist autograph-seekers benefits bond
Advocates articulate awfully best-selling booster
aerodynamic assertion babbles bioenergy breakthrough

Names and proper nouns are important for understanding news stories. In language textbooks you may learn the names of major countries and the capital cities, but news happens in small cities and even villages as well. To read the news you need to know the names of political parties, famous people, economic theories, financial indices, global corporations, educational institutions, associations, and international organizations such as the UN.

All of these names were taken from the same collection of stories. Do you know how to say these in Vietnamese or write them in Thai?

Cypriot Delta Geneva Mediterranean Bihar
Baltic Democrat Greece Nehru Turkish-controlled
Brooklyn Denmark Uttar Metropolitan Nasdaq
Iranian Dow Midwesterner Mayor Polytechnique
Louisiana Durbin Scotch Reich Iskenderun.
pro-Greek Dutch-Irish Rev. Latino Kentucky.
California Baptist BENJAMIN Bonaventure/Agence Burke/Associated
Cambridge Chicago-based Berkeley Pennsylvania. Bush
Cyprus Barataria-Terrebonne Navy BP Dallas-Fort
Audubon Gandhi. Bess Dalit Arce

How many of the above terms were you able to translate or transliterate into the language that you study? This is the level of reading that an adult native speaker can do, and this should be your goal. If the task doesn’t seem daunting enough, remember, in this article, we were only concerned with vocabulary. But you could have a vocabulary of a million words and still not be able to understand a newspaper or a book. For real communication, you need a comprehensive approach to language, which includes culture, syntax, context, and grammar.

It’s a long stretch. I know. And it can seem impossible. But remember, every Sunday in New York City Catholic mass is said in 29 languages. For more than a century, large numbers of immigrants, my family included, have been coming to America and Canada in search of a better life. Most of them learned English with less than half of the education of the average person reading this article.

So, if your Grandma and Grandpa could learn a new language to a level of functionality, so can you.

33 Comments

  1. Antonio, I was fascinated by your procedure for counting unique words. It’s brilliant! I tried it now using a NYT paragraph. My only question is how exactly did you get the words from the paragraph into a column of single words that could be alphabetized? Of course, it’s possible to do it by hand, but very tedious….

    Your comment about memorizing lists of fruits and vegetables vs. no fruit mentioned in your sample made me laugh a lot.

    I do think that the total number of words needed for “fluency” is higher now than in the past because of the internet and everyone’s access to so many news stories about different topics ranging from animal behavior to scientific phenomena to actual political news, not to mention entertainment news.

    Sincere thanks for your linguistic and cultural efforts!!

  2. Thanks for the support. I originally tried counting the words and sorting them by hand, but as you said, that was too tedious. Then I contacted my brother, who is much smarter about computers than I am. He taught me how to import the whole text into a table, with one column and unlimited rows. Next, I used a sort function to alphabetize the list, and then did a second sort which eliminated duplicate words. It sounds more complicated than it was, but these were all functions that exist in Microsoft Word.

  3. Antonio, you are great. Every time I read a book in English, there are new words I have never seen before, and they appear only once, so I forget their meaning. Maybe you know of books for learning English in which every story introduces more English words but also repeats the words from the earlier stories.
    Hope that you understand me.
    Thank you

  4. If you ever want to do this again without the 16 hours of tedium, you can copy and paste your alphabetized column into Excel and then input this text into the cell immediately to the right of the first cell:
    =IF(A2=A1,"Dup","-")

    Then pull and drag that formula cell all the way down to the end.

    It will then show '-' for a single word
    and show 'Dup' for the duplicates.

    Then copy that formula column, paste right over it with Paste Special (right click to get to it), and select 'Values.'

    Sort the -/Dup column, delete all the Dups, and now you can jump to the end of the column and see how many words that gives you. (Just look at the row number.)
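
The same adjacent-comparison trick can be sketched outside a spreadsheet. The Python below is a minimal illustration, not a translation of any particular workbook: it assumes the list is already sorted and compares each entry with the one before it, and the function name `mark_duplicates` and the sample words are made up.

```python
def mark_duplicates(sorted_words):
    """Flag each entry of a sorted list: "Dup" if it repeats the previous entry, else "-"."""
    flags = []
    previous = None
    for word in sorted_words:
        flags.append("Dup" if word == previous else "-")
        previous = word
    return flags


column = sorted(["cars", "bond", "car", "bond", "car", "car"])
flags = mark_duplicates(column)
unique_count = flags.count("-")  # one "-" per distinct word
print(list(zip(column, flags)))
print(unique_count)
```

Because identical words sit next to each other after sorting, a single pass is enough, and the number of "-" flags is the number of unique words.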

  5. Antonio
    Saya juga belajar Bahasa Indonesia. Y hablo castellano and English. I actually found this article through Google. I am at about 1,000 or so words in Bahasa, and I cannot read a newspaper yet. I do wonder though if your process is not flawed a little bit. I do think that the names listed are international, maybe not in the poor school systems that we have here in the U.S., but many would know those names. Also, a plural form of a noun is not really a different word; once we learn car we know cars as a plural. That’s pretty much first grade or less. Walk and walking is second grade. I wonder what that number would have been if we excluded plurals and tenses?

  6. Sam, thanks for commenting. Actually it is not true that the country names in the article are international and everyone should know them in every language. In fact, that doesn’t even make sense. My experience in studying Chinese, for example, is that you absolutely can’t guess at what the Chinese character for Geneva would be and that 90% of your Chinese or Taiwanese teachers wouldn’t know because their education in world geography is so low. Also, as for the US having a poor education system, you are right that we rank 18th in the world, but that is out of 300. So, you are still talking about only a small population who have a better primary education. And in more than 50% of the world’s countries, less than 60% of the population has access to high school, whereas in the US roughly 97% of kids graduate high school and more than 80% pursue higher education, even if they don’t graduate.

    As for knowing plurals in first grade, are you claiming that in first grade you already knew how to make plurals in Chinese, Korean, Bahasa, and every other language you may wish to study? If not, then we need to count plural forms as distinct words. In German, for example, there are about five or six ways to form a plural and you really need to know them all.

    Walk and walking, are you saying that you already know how to make the progressive/continuous tense in each of the world’s languages, and that you believe that this is something we all know already, so it shouldn’t be counted?

    I was careful to explain how I did my informal analysis, to avoid creating bias. And, yes, I know it was cumbersome and possibly not accurate. If you have a better method for counting words, please do a similar article and post your data here. Interestingly, you did confirm my supposition, that 1,000 words is not enough to read a newspaper.

  7. I’m really sorry to say, I agree completely with Sam. “Car” and “cars” are the same word, and it is not useful in any way to consider them as unique words. Counting plurals, tenses and inflections would easily create the semblance of knowing double or triple one’s actual vocabulary. It makes the most sense to talk about words in the sense of lexemes. Counting only lexemes, I have read studies that estimate that the average American knows anywhere from 50,000 to 100,000 English words. That is a huge range, I know, but it just goes to show how difficult a task it is to pin down the exact number of words one needs to know. Consider also that some languages, especially English with an estimated 500,000+ word and ever-expanding lexicon, create their nuances through the knowledge of many words. English is my first language, but hardly a week goes by that I do not learn a new word from any of the various books or newspapers that I read, and I’m one with a college reading level! If I’m constantly learning new words in the language I have been speaking my entire life, then I can only reasonably expect to do the same with every other language I learn.

    It really is not a question of a finite number, but a continual process and dedication to improve. Keep reading, keep picking up new words, ALWAYS ask what a word means when someone says one that you are not familiar with. Even in one’s own mother tongue.

  8. I loved this article, especially how you tediously examined those articles to come up with unique words. However, I think you are overlooking something. Chinese characters are nothing like English words. Usually in English, one word simply means one thing. However, in Chinese, one character can be combined with other characters to make up new words. For instance, take their word for heart(xin). It can be combined with love(ai) to mean compassion(aixin), with open(kai) to mean happy(kaixin), with small(xiao) to mean careful(xiaoxin), and with leave(li) to mean centrifugal force(lixin). I could go on with at least several dozen more examples of just this one word. In Chinese, if you know a handful of characters you can put them together and you really know a whole lot of words. That is why 3000 characters seem so few, but in reality those characters make up tens of thousands of words.

    And another note on what you said about proper names. In English, we have about as many names as we can think of. It’s just shy of unlimited. However, in China, they simply have characters and are forced to use the characters they have to express various proper names. A few countries for example: Beautiful(mei) country(guo) means America. Law(fa) country(guo) means France. Day(ri) origin(ben) means Japan. Horse(ma) come(lai) west(xi) asia(ya) means Malaysia. A lot of the words they use are sound-alikes (like Malaysia). But almost all of the words consist of characters that mean something else.

    One final note is that Chinese does not have plurals. And they also do not have tenses (in the way we do). For example, there is no difference in Chinese between “cat” and “cats”. There is also only one word for “walk”, “walked”, and “walking”.

    So when they say that it takes 3000 characters to read a newspaper, I feel like that is fairly accurate. English “words” are much different than Chinese “characters”. Within those 3000 characters I’m sure there are more than 20,000 “words”.
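
The distinction being argued here, between knowing the characters a word is built from and knowing the word itself, can be made concrete with a small Python sketch. It uses only the four pinyin examples from the comment above. Treating each pinyin syllable as if it were one unique character is a simplifying assumption (real Chinese has many homophones), and the “learner” below is hypothetical.

```python
# Words from the comment above, written as tuples of pinyin syllables
# (one syllable standing in for one character).
words = {
    ("ai", "xin"): "compassion",
    ("kai", "xin"): "happy",
    ("xiao", "xin"): "careful",
    ("li", "xin"): "centrifugal force",
}


def spellable(word, known_characters):
    """True if every character of the word is one the learner recognizes."""
    return all(character in known_characters for character in word)


# Hypothetical learner: recognizes five characters, has studied two words.
known_characters = {"xin", "ai", "kai", "xiao", "li"}
learned_words = {("kai", "xin"), ("xiao", "xin")}

spellable_words = [w for w in words if spellable(w, known_characters)]
understood_words = [w for w in spellable_words if w in learned_words]
print(len(known_characters), len(spellable_words), len(understood_words))
```

Five characters are enough to spell all four words, but this learner has only studied two of them. A character count measures the first number, while reading a newspaper depends on the last.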

  9. I found your article very interesting and useful. I am an undergrad who is getting a degree in China in Chinese. Although I am not quite fluent in Chinese, I can read a newspaper (not word-for-word, working on that) except for the specialized articles (also working on that). I do feel I should point out that 3000 characters are not actually 3000 words. I doubt I know 3000 characters (I don’t count and don’t know anyone who does) but I know well over 3000 words. This is because most Chinese words are actually combinations of characters (usually 2 in length). I do disagree a bit with Tyler’s post in that while most last names are fixed, first names are invented and not fixed (but you still know it’s a name). Take my name, 李想: 李 is a last name, one of the most common, and 想 is my first name (also a common character), but I could be called 李瑞. This 瑞 is less common and a student might not recognize it but could probably still tell that it is a name.
    Also, many meanings of double or triple character words are not obvious and cannot be easily deconstructed. So I do feel knowing 3000 characters may prepare you for reading a newspaper in Chinese, but you had also better know at least 6,000 words if you want to get past a simple article and into more elevated and varied news.
    An example of this system is the HSK (汉语水平考试), a horrible exam really; however, they divide the different levels up by number of words, not characters. An example would be level six, which tests 5000 words (supposedly) and does not even test all of the 3000 most commonly used characters.
    This was just an add-on.

  10. Hi, man, you and I have the same obsessions (next time use Excel and you can do the same work in minutes, or use a free text analyzer).
    Excuse my English. I’m Mexican and normally I don’t write in English.
    OK, I did the same. I bought a New York Times Sunday edition and, while reading the whole newspaper including ads, I crossed out each word I couldn’t understand. It was massive: in just 50 pages I didn’t understand about 5000, but I can understand your post entirely. So, as you do, I don’t take so seriously the idea that with a small number of words you can get by in another language.
    There is a study, not very trusted by me, that says that an average person who speaks Spanish uses 300 words to speak. I just laughed.
    I must know about 20,000 words in English and I can’t read the New York Times, and as you said, words like calf, blender, and axe count on my list, but are almost useless for reading the NYT.

  11. A fascinating and entertaining article, the part about the humming bird hallucination was hilarious! I think someone mentioned using Excel formulas to automate the task and on a similar note the hours of tedium you described immediately made me think that a Perl script with regular expressions could have handled the task in seconds and even taken care of pretty complex expressions to deal with regular plurals, tenses and the like.
    I would love to know exactly how many words I know in Korean because I can hold a conversation competently, but also, admittedly, reading a newspaper I struggle to do more than pick out individual words. Unfortunately I don’t suppose there could be such a convenient way of automating this task of counting every single word I know, ho hum…

    Again thanks for a great article.

  12. There are several problems with your logic.

    First, the 3,000 words are not just any word. You need to know the most common 3,000 words. Therefore, if you are a cook and know what the word parboil means, that will not do you any good in reading a general text, unless you are reading about cooking.

    Second, Omniglot states “Knowledge of about 3,000 characters enables you to read about 99% of the characters used in Chinese newspapers and magazines.” The key phrase here is 99%. That means for every 100 words, you get to drop one word off the end of the list that only occurs once. In that case, how much closer did you get to your 3,000 word mark?

    Third, your exercise didn’t account for contextual inference, where one does not actually know a word, but can infer its meaning (or the general concept of the written sentence) through all the other words around it. In other words, your mind skips the word it does not know but you still understand the sentence. Consider, for example, the sentence “The eddy of the water pulled the boat off course.” You might not know the word eddy, but you know the words water, pulled, boat, and course.
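
The 99% point is a claim about coverage of running text, which is a different measurement from the count of unique words. A minimal Python sketch of that measurement follows. The function name and the toy token list are made up for illustration, and nothing here was run on the newspaper sample discussed in the article.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """How many of the most frequent distinct words cover `target` of all running tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, frequency) in enumerate(counts.most_common(), start=1):
        covered += frequency
        if covered / total >= target:
            return rank
    return len(counts)


# Toy text: one very common word, one fairly common word, one rare word.
tokens = ["the"] * 90 + ["water"] * 9 + ["eddy"]
print(words_needed_for_coverage(tokens, 0.95))
print(words_needed_for_coverage(tokens, 1.0))
```

In the toy text, two of the three distinct words already cover at least 95% of the tokens, while full coverage needs all three. On real text the same effect is what lets a short list of very frequent words cover most of a page while the total vocabulary of that page is much larger.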

  13. Thanks for this intelligent response. I agree that you need the 3,000 most common words. If you have attended school or used basic language learning materials, then one would assume that you have learned the most common 3,000 words and not words specific to a technical field, such as photography, neuroscience or others. However, I think part of my point is that the 3,000 most common words are not enough to read a newspaper, because any words that have ever existed could and do appear in newspaper articles. The example I keep coming back to is the name of every city, country, or region in the world, every world leader or famous person, every concept, such as the great depression, post traumatic stress disorder, recession, fiscal policy. These are all words I found in recent newspaper articles, and which you would not learn in the first 3,000 words in school. As for the 99% point, you are suggesting there is an error margin of 1%. That’s fine. I would be willing to admit an error margin of 5% and it still wouldn’t change the outcome of the research, which is that with 3,000 words you can’t read a newspaper. As for context, you are right that many words can be guessed from context. But research has proven that if you only understand 55% of a text, you won’t be able to infer the meaning. Without a doubt, in Chinese, I read a tad over 3,000 characters but can’t derive ANY meaning from a newspaper, apart from isolated words. In Vietnamese my vocabulary is 4,000 words, and the result is the same: I can make very little sense of a newspaper.

  14. Actually, I did use Excel to isolate and count words, but I am not very tech savvy. If you can develop a better program I would be very excited to be able to feed pages of the New York Times into it and not only count the unique words on a given day, but over a period of 30 days, see which words repeat and how often… I am absolutely convinced that even a figure of 3,000 words is not only low but simply wrong, and way, way off. The complexities of the collection of news stories on any given day go way beyond the vocabulary and concepts you would learn in two or three years of Vietnamese studied at the university level in Saigon. I completed more than half of the four year program and saw that we would never get there.

  15. Thanks, but I need everyone to stop telling me to use Excel. I did use Excel, and it didn’t take minutes, it took days. If all of you are capable of doing this type of analysis in minutes, but didn’t, then you lack motivation. “Those who can read, but don’t, are no better than those who can’t.” Thanks for what you said about 300 words. Berlitz bases their whole approach on an assumption that in daily life we only use 500 words. I find this preposterous. Also, if you have ever worked with someone who learned their language from Berlitz or Pimsleur, where these assumptions are being made, they are so unbelievably limited in what they can and cannot understand. For one thing, they don’t normally have any synonyms. They might know bucket but not pail. And they wouldn’t have colloquialisms or expressions, such as “to kick the bucket.” They wouldn’t have proper names and concept names…. I appreciate your comment because you are an English learner, and are speaking from experience. Thank you again.

  16. I think that there must be some miscommunication between this frequently recurring figure of 3,000 characters and the reality of reading a newspaper. Since everyone who comments or sends me email agrees that with 3,000 words you can’t read a newspaper, this must be true. What may also be true, however, for Chinese is that with 3,000 characters you can compose all or most of the words necessary to read a newspaper. But even if this is true, I think it is a moot point. I know that kai shin means happy. But if I didn’t know that, and I saw kai and shin, would I be able to deduce that it meant happy and not open-heart surgery? These low numbers of 300-500 words for daily functioning and 3000 for reading a newspaper do not seem to stand up to any logic or analysis.

  17. On your first point, I think you may be right, that with 3,000 characters you could combine and recombine to read a newspaper story. BUT, if I didn’t already know lixin is centrifugal force, would I just deduce that? If I saw heart and power, would I just say, “Ah yes, that must mean centrifugal force”? I suspect the average person wouldn’t come up with that meaning, and so, even though you know 3,000 characters, you wouldn’t be able to read a newspaper. You would need to actually know the thousands and thousands of words that occur in a text.

    On your second point: It’s the same as your first point. I happen to know that mei guo is America. But if I didn’t know that mei guo is America, would I see mei and guo and make the assumption, first, that it is a proper name (even though it is not written with a capital letter) and, second, that it means America? Absolutely not. I know mei guo because I learned mei guo. Again, in context and spoken, when someone says horse(ma) come(lai) west(xi) Asia(ya) you would probably guess that it meant Malaysia, but in a newspaper, I would read all four of those characters, which I recognize, but wouldn’t necessarily deduce that they meant Malaysia.

    On your third point about plurals, I am not sure why you included that. Yes, we all know that Chinese does not have plurals, but to say they don’t have tense is too simple. Verbs do not change for tense, but you have words which you add to denote future and past and even progressive (continuous). Granted, tense plays less of a role in Chinese than in English, but it exists. You are full on wrong when you said walk and walking are the same. Chinese has continuous tense.

    Your final point is EXACTLY my point, with 3,000 characters, YOU CAN’T read a newspaper because a newspaper has 20,000 words. 80% of those 20,000 words would be meaningless to you, even if you know the individual characters, unless you actually studied those specific words.

  18. I have in fact done some compilations on both TV programs (subtitles) and web pages (around 10 different sites and around 60,000,000 words). There are of course some words that are extremely common and others that can be common for the time being. Right now there is much more talk about the debt crisis, Greece, etc. than there was last year. Overall, however, with only the 2,000-3,000 most common characters you can form a very high percentage of the words in a typical newspaper. You can also read the subtitles on even more films. For daily usage this will cover a great deal of what you will ever encounter. It will, however, not give you enough if you like to read classical novels, technical manuals, or branch-specific articles. I find it easy to live with that, however. English is not my native language either, so I have to look up words all the time as well. I never expect to be fully fluent in Chinese, but I can really enjoy films and newspapers and read them with relatively little problem with my limited knowledge (around 2,000+ characters and 7,000 words). For an excellent study on this, based on movie subtitles, you can read this university study: http://expsy.ugent.be/subtlex-ch/

  19. I’m sorry to say, but if you have trouble grasping an article in Chinese you simply don’t have solid knowledge of 3000 characters. I can grasp the basics of an average article with a vocabulary of about 3000 words. If I had that number down in unique characters I’d have an almost perfect understanding. I don’t doubt that at all.

  20. I’m a native Spanish speaker. Once I did an admittedly highly unscientific calculation of the words needed to fluently read material like newspapers, magazines, sci-fi novels, technical books, etc., based on the fact that I can read all that material having only to reach for the dictionary once every two or three pages. I took an English/Spanish dictionary, selected a few pages at random, counted how many words I knew on each page, and did a quick extrapolation to the rest of the dictionary. I know about 12,000 English words, and those words allow me to read more or less advanced material. I’m currently studying German and Chinese (yes, yes, ich bin verruckt). My study of German reinforces my impression that you need to know around 12,000 words to fluently read in a foreign language.
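
The dictionary-sampling estimate described in this comment is simple arithmetic: average the number of known words on the sampled pages and scale up by the number of pages. A minimal Python sketch follows. The page counts and the dictionary size are invented for illustration and are not the commenter's actual figures.

```python
def estimate_vocabulary(known_per_sampled_page, total_pages):
    """Extrapolate known words from a few sampled dictionary pages to the whole dictionary."""
    average_known = sum(known_per_sampled_page) / len(known_per_sampled_page)
    return round(average_known * total_pages)


# Hypothetical sample: words recognized on three random pages of a 1,000-page dictionary.
estimate = estimate_vocabulary([10, 14, 12], total_pages=1000)
print(estimate)
```

With only a handful of sampled pages the estimate can swing widely, so sampling more pages is the cheapest way to tighten it.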

  21. I am a native Mandarin speaker from Malaysia. One thing I have to admit is that we need to expose ourselves to the language we are learning if we wish to become fluent in it. I have read Mandarin newspapers since childhood, almost daily. I also read a lot of books, especially literary and religious ones, when I was in high school. One thing I should stress is that I RARELY USED THE DICTIONARY, yet I was still able to understand the words that I read. I estimate that I learned at least 80% of the words without looking them up in the dictionary. However, when I began to learn English and Malay, since subjects are taught in those languages in national secondary school, I was forced to study the meaning of words with the help of a dictionary, word by word, and that was really a toil. In addition, to improve my English, I decided to restrict myself from using other languages for about one year, to study 100 vocabulary words a day, and to read for hours daily, as I was truly annoyed when I couldn’t fully understand what the columnists wrote in the paper.

  22. Since you counted run and runs as two separate words, and you counted walk, walked, walking as three separate words, your total count of 4139 “separate” characters actually shows that one only needs about 3,000 English lemmas (head words, meaning that run and runs would be counted as one head word) to read these newspaper articles.

    As for Chinese, I think if you really know the meaning of the 2500 most frequently used characters, and you know the most commonly used words (combinations of 1-3 characters) formed by these 2500 characters, you can read very well in Chinese. The reason it takes you a long time to read the first few articles in a foreign language is that you have not spent enough time reading real articles while studying the language; it is not because you do not know enough vocabulary. Isn’t this good news for you?

    I passed the TOEFL and GRE with flying colors when I was in China, but when I came to the US, it took me 30 minutes to understand a four-line paragraph in the first textbook I read. Now, having lived in the US for over 20 years, I have forgotten many of those big GRE words, but I can read like a native.

    If you want to know more about language acquisition, read Dr. Stephen Krashen’s work.

  23. There are macros for Microsoft Word which will take a text document up to several hundred thousand characters in length, select the individual words, eliminate the most common words, and produce a list of the unique words sorted in alphabetical order or by word frequency. It takes a minute to run on a document such as a novel. Do a Google search for “microsoft word frequency list vba”. It does require fiddling with VBA, or getting a little help on the web.

  24. First of all I have to admit I really enjoyed reading this article. Unlike many other commentators, I agree that words such as “work”, “works”, “working”, “worked”, etc. should have been counted separately. Right now I’m in the process of studying German, and my God, that language is hard. For example, for every single noun you have to learn the proper article (because often their genders make no sense), the genitive case suffix and the plural. Also, in every language I’ve learnt so far (ok, I’ve learnt only several Indo-European languages), there were many irregular verbs, so one has to learn how to conjugate them by heart.
    As well, every single language has specific idioms and phrases, which translated literally into one’s mother tongue would make no sense. I believe I don’t have to mention that every language has some words with multiple meanings. In my country, students studying English often joke with sentences such as “Can you please translate me to the second page of the street.”
    Now, speaking of the number of words used in daily newspapers and today’s certain-grade literacy, I disagree. The estimates and tests conducted in the 40s and 50s most certainly cannot be applied today. The language constantly develops, but children today are less and less literate. Half of today’s “facebook generation” doesn’t know the difference between “their”, “they’re” and “there”, which is something every non-native English speaker knows after only a semester of studying it.

    Newspaper articles also vary a lot. Of course there was no fruit mentioned, but it doesn’t mean one doesn’t have to study it. I believe there’s a better chance one will be going to the market in Taiwan than discussing the current political situation in Latvia. (For example, I can discuss processes ongoing in nuclear power plants in English, but I’d have a lot of difficulty trying to follow a simple cookbook recipe.)

    I most certainly hope that one fine day someone will write an approximation of how many words correspond to each level in a certain group of languages. I can’t find any such breakdown for the beginner, intermediate, advanced and fluent levels.

  25. I am facing some problems while speaking in English. Please give me some tips.

  26. My opinion is that one should do lots of extensive reading and thus be prepared for any unknown foreign text. 3,000 words or characters are never enough to grab and read a newspaper. You must be able to comprehend and be familiar with over 150,000 words (don’t be discouraged). In addition to this, considering that the English language includes over 100,000,000 words (that is another truth) and an average native English speaker retains over 150,000 words, you should again exceed a capacity of 100,000 words. This is never an impossible thing to achieve. Since there are no article or plural rules in English, it is quite easy to learn over 150 words a day. That makes 4,500 words in a month and 54,000 in a year. In two years you might not be a fluent speaker, but I guarantee you would be a good word-knower and comprehension master. So grab a 1,000-page S. King novel and start to read it, highlighting the unknown words you encounter. Afterwards, dig up those words in an electronic dictionary and see the results in one month. Since I’ve been learning German, I have realized that learning English words is much faster, because if you want to learn a German word you should also memorize the article, plural, and genitive form of that word. In English you only retain the plain form of the word. Start right away and feel good, friends :)

  27. A few years ago I was staffing a table at a bilingual job fair. Applicants passed by and dropped off resumes. I remember one American guy who talked to us. He had “fluent in Spanish” written on his resume. Having learned the language myself to an advanced level and being a lover of the language-learning process, I always like to ask people about their process. I asked him where he learned Spanish. He responded bluntly: 50,000 flash cards, as if to tell me, no sweat, I’m better than everyone else and can do it all on my own. We gave him a couple of chances to work some Spanish into the conversation but he didn’t bite. I could only suspect that his spoken Spanish was limited. Maybe it’s helpful in Chinese, but in my experience, focusing on word counting is just going to hold you back. I’d rather learn 50 words a week and actually go out and practice them (and perhaps remember a half or a third of them) than memorize 500 words in a week and eventually forget them all. Living and practicing in the language is worlds better than memorizing vocab. That has to be done to some extent (especially in the beginning), but the quicker you can forget about that and just go out and have fun, the quicker you will learn.

  28. How exactly are you counting new characters in Chinese? For example, I know the equivalent of about 2000 words in English in Chinese (not nearly enough to read newspapers/magazines btw). How many characters does this equate to though? When you say 3000 characters do you mean 3000 different individual characters that form components of different words? Because if that is what you’re referring to I assume it is enough for newspapers + literature.

    Louis

  29. Dear Fellow Antonio, I have been researching word counts from my favorite books for the last few months. The one tool that made it possible for me to accomplish this was a program called TextWrangler. Once you get the words sorted alphabetically into a column, a single click will remove all duplicates.

    If you get the Times electronically, you should be able to process each edition in a few minutes. Good luck.

  30. “This question is particularly relevant for people who are studying Chinese, where each word is a character, and most students know the exact number of characters that they can read. ”

    Most words in Chinese are composed of two characters. There is not a character for each word, that would limit them to a vocabulary of around 4000 words.

  31. I see a lot of people criticizing me for suggesting you could learn a language by memorizing vocabulary. At no point did I suggest that. I am simply counting how many words is fluent, as the title suggests. No matter how you learn Chinese, part of any evaluation is an estimate of how many characters you know. If you read ANY of my other articles on language acquisition, it should be clear that I would NEVER advocate memorizing vocabulary to learn a language.

  32. I also can’t believe how many comments I got like this one: Comment:
    “This question is particularly relevant for people who are studying Chinese, where each word is a character, and most students know the exact number of characters that they can read. ”

    Most words in Chinese are composed of two characters. There is not a character for each word, that would limit them to a vocabulary of around 4000 words.

    Yes, I agree, most Chinese words are composed of more than one character. That in NO WAY has any bearing on the point of this article. If you have some constructive research to share, please do. But I am not sure what your point is.

  33. In terms of counting unique words, if you have access to a computer running Linux or a Mac, you can do this in a trivial fashion. If you have a Windows computer you can download a free package named GnuWin32.

    Once you have your raw text in a file, let’s say words.txt, simply do this at a command line:

    sed -e "s/[ \t,.?!]/\n/g" words.txt | sort -u | wc -l

    Subtracting one from this will tell you the number of unique words in the document.

    The way this works is:
    sed -e "s/[ \t,.?!]/\n/g" words.txt
    splits the words into multiple lines such that there is one word per line. It also strips some of the common punctuation so that the word “now” in the middle of a sentence and the word “now.” at the end of a sentence are not treated as different words. To make this completely accurate you might have to add other punctuation to the expression to make sure it gets removed. This will also add a blank line to the list of words.

    The next part of the expression:
    | sort -u
    takes the output of the previous step and sorts it, and gets rid of any duplicates.

    The last part of the expression:
    | wc -l
    counts the number of lines, which corresponds to the number of words, except that you will manually need to subtract 1 from the result to make up for the blank line that was inserted into the first part.

    I have made some simplifying assumptions here, such as not worrying about hyphens, not worrying about if the count is off by 1, not dealing with every type of punctuation, etc. Hopefully if you ever decide to do something like this again, this can save you 99.9% of the time that you spent on this in the first place.
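    A variation on the pipeline above can be sketched as a small shell function. This is not the commenter's exact command: it swaps sed for tr so that every non-letter character (not just a hand-picked list of punctuation) acts as a separator, folds upper and lower case together so that “Now” and “now” count once, and counts only non-empty lines so no manual subtraction is needed. The function name count_unique_words is hypothetical, and the sketch assumes plain English (ASCII) text; accented letters and Chinese characters would need a different approach.

```shell
#!/bin/sh
# Hypothetical helper (not from the original comment):
# print the number of distinct words in the text file given as $1.
# Assumes plain ASCII English text.
count_unique_words() {
    # tr -cs : replace each run of non-letter characters with one newline,
    #          leaving one word per line
    # tr     : fold upper case to lower case so "The" and "the" match
    # sort -u: sort the words and drop duplicates
    # grep -c: count the non-empty lines, so a stray blank line is ignored
    tr -cs '[:alpha:]' '\n' < "$1" |
        tr '[:upper:]' '[:lower:]' |
        sort -u |
        grep -c .
}
```

    With the function defined in your shell, running count_unique_words words.txt prints the count. Like the original pipeline, it still treats “run” and “runs” as two different words, and it splits contractions such as “don't” at the apostrophe.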
