How many data do you have?

Some people will correct the phrase “the data is” to “the data are”, claiming that the word data is plural. The basis of the claim is that in Latin the word data is the plural of the unit datum and that our English word is derived from the Latin. You may guess that I don’t buy this or I wouldn’t be writing this post.

It pained me so much to write the phrase “the data are” that I did a little bit of research. And by research I mean that I googled it, found some people that agreed with me, and quit looking. Despite my biased approach to research I do believe there is a logical argument for “the data is”.

First, English is not Latin, and it’s not good enough to accept a grammar rule on the basis of Latin alone. Since the question is about the plurality of the word, let’s note the different kinds of nouns. There are two kinds of nouns in English that are non-singular: count nouns and mass nouns. Count nouns are things like pencils and books, for which the singular is a single unit of the object. Mass nouns are things like water that don’t have a natural unit and require one in order to be counted (liters of water can be counted).

Let’s highlight some more differences between count nouns and mass nouns. Since count nouns are things that can be counted, we can ask questions like “how many pencils do you have?” and expect a reasonable answer. We can’t ask “how many” for mass nouns: “how many water do you have?” doesn’t make sense, since there is no unit to count. Instead we can ask “how much water do you have?” Another difference is one of Brian’s peeves: you can have fewer pencils and less water, but you can’t have less pencils or fewer water.

So at this point, if we want data to be a plural noun, we have to be prepared to answer the title question of this post “How many data do you have?”. Note that this is not “how many data points” or “how many bytes of data” since both of those include an additional unit, but simply “how many data”. We also have to be prepared to say things like “Mark has fewer data than Susan” rather than “less data”.

If you’re still concerned that we’re breaking from Latin, let’s consider the word stamina. My understanding is that the Latin was a plural. But in English, stamina is not a plural. We adapt words from other languages for English, but we’re not bound by their grammar. I suggest getting ahead of the game and using data as a mass noun rather than a plural.

6 Responses to How many data do you have?

  1. bpt2 says:

    Dutch grammar makes more of an effort to preserve Latinate plurals in borrowed words. Dutch speakers of English often transfer these forms to English, probably because they don’t realize we’ve Anglicized the plural. As a result, I regularly hear about interesting musea in Holland.

    Brad, not only needn’t we be bound by foreign grammar, we also needn’t be bound by foreign definitions! My favorite German word is still the one for cell phone, which is an English borrowing: “das Handy”.

  2. brianbunton says:

    I’ve mainly heard “data” used as a plural from two groups of people: 1) British, and 2) those who end emails with “Ciao”.

  3. adawes says:

    Ok, I agree about the pencils and water, but isn’t it true that you can have less pencil? That’s when you’ve sharpened it too much and you have to get a new one. Conversely, you can have fewer waters; that’s when you’re counting H2O molecules and you start to run out. The plurality of the word affects whether it is a mass or count noun, so I don’t think you can use that as an argument. I would say that I have less data or I have fewer data points. But then again, I’m one of those “the data are” types.

    English is fluid: do whatever you like with it, but expect complaints. The bottom line for us is that if the copy editor for PRL scratches out “the data is” and writes “the data are” then it is best to go with it.

  4. bmarts says:

    Yes, you can use “fewer” with a mass noun if you give it a unit, fewer data _points_ or fewer water _molecules_. But that doesn’t make the original noun plural. I read somewhere that US journals are using “the data is” while European journals are using “the data are”.

  5. Raphael says:

    A team is a collection of players.
    A data set is a collection of data.

    The player is …
    The datum is …
    The players are …
    The data are …
    The team is …
    The data set is …

    If I do anything statistical, I do it on a data set, not the data. It is not proper to say, “After an ordinary least squares regression, the data produce a trend line.” Rather, it would be, “After an OLS regression, the data set produces a trend line.”

    The confusion derives from lazy and/or unknowing people substituting “data” for the proper “data set.” 99 times out of 100 you will not go wrong if you train yourself to use “the data set is …” instead of “the data (to be).” But beware, because you do have that occasional proper use of datum and its plural: “The data form a data set.”

  6. telescoper says:

    I blogged on this myself before coming across this item. You can find my thoughts here.

    My point is that, like many nouns, “data” has both count and non-count forms. Take the example “hair”. This is a non-count (or mass) noun when referring to the hair on someone’s head but a count noun when referring to individual strands. The example you give of “water” also has both count and non-count forms. You can refer to water as stuff, in which case it is non-count, but you can certainly ask for “a water” or “two waters” in a bar. In the latter case the unit of measurement is understood from the context, probably a glassful.

    In its guise as a count noun, “data” is the plural of “datum” and it means “readings”, “measurements” or even “numbers”. In this usage it refers to enumerable things, each of which has well-defined properties. You can have one datum and you can have fewer data.

    So I think the plural form of the count noun “datum” has a valid specific use, especially in scientific papers.

    The case of its use as a non-count noun is more problematic, although the meaning is clear. If data means an undifferentiated, unspecified or unlimited mass of information then it is a mass noun. The question “how much data do you have?” involves this usage; the answer should give a unit (e.g. Gbytes). Although many mass nouns have only singular forms, it does NOT follow that because it is a mass noun it must be singular. Take the word “clothes”. This is a non-count noun (you can’t have one clothe or two clothes) but exists only in the plural form. On the other hand, the non-count form “statistics” looks like a plural but takes a singular verb (“statistics is a difficult subject”); the count form is straightforward.

    There consequently isn’t an obvious case for “data” in this usage to be either plural or singular so I go with what sounds best, which I think is the singular, just like “statistics”.

    I don’t really like “data point” and “data set”, which seem very clumsy to me. “Data point” uses a second word to avoid using the singular of the first word. Why not say “datum”? “Data set” allows you a so-called metonymic shift to talk about the set (singular) rather than the data, but what have you gained but a circumlocution?

    Language should and does evolve, but accepting that English is a dynamic thing is not the same as allowing it to degenerate through carelessness. Language isn’t just useful; it is also beautiful and should be cherished.

