I blog like a male - and you?
I kind of stumbled into this funny program this morning, while I was loitering around the net, and the first question I asked myself when I found it was: can we even believe it? So, of course, I ran a test on myself first, and then I decided to run a couple of experiments, just to see what popped up. I would not really attach huge value to the "experiments", as I did not control for text length much, my sample population was minute - to use a euphemism - and I used Google Translate. But I think I did find something interesting nonetheless: the program and the algorithm have an inherent language bias. Or at least, they seem to.
More below the fold.
As the program works best with passages that are 500 words long or longer, I decided to first try it out with one of my long posts about peer-reviewed research. And of course, the program decided that I was, by a large margin, writing like a male. So then I decided to try something else - just to see how sure the "program's judgement" was - and I picked another post, also talking about research, but not about a specific paper. Again, I have some serious balls according to the Genie. Here is a snapshot of the results.
The Genie is based on an algorithm which basically counts certain "gender-specific" words in a passage, calculates a score for each of them, and then gives you a cumulative score. I am using inverted commas because these are very basic words, but the authors behind the algorithm have noticed that one gender uses some of these significantly more often than the other gender does, and therefore this special word count could be used to estimate how "male" or "female" the writing is. Now, I want you to look at the numbers - the table right under the score - and notice which words are the most frequent in my article. Here is a list of those words that got a score equal to or higher than 100: "with", "not" - on the feminine side - "are", "as", "is", "the" - on the masculine side. The male score was 1182, and the female score 781, total word count 607. We are cutting it pretty close, but if you look at the table you will see how much more frequent the "male" words seem to be.
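To make the counting idea concrete, here is a minimal Python sketch of a scorer of this kind. It is only an illustration under my own assumptions: the word lists are taken from the words mentioned in this post, but the weights are invented, and the function name `score_text` is hypothetical. The Genie's real word table and weights are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical per-word weights, invented for illustration only.
# They are NOT the values used by the Genie.
FEMININE_WEIGHTS = {"with": 50, "not": 25, "and": 5, "if": 45}
MASCULINE_WEIGHTS = {"the": 10, "a": 5, "is": 10, "are": 30, "as": 25, "who": 20}


def score_text(text):
    """Count the listed words and return cumulative female and male scores."""
    # Lowercase the passage and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    # Each word contributes (number of occurrences) x (its weight).
    female = sum(counts[w] * wt for w, wt in FEMININE_WEIGHTS.items())
    male = sum(counts[w] * wt for w, wt in MASCULINE_WEIGHTS.items())

    if male > female:
        verdict = "male"
    elif female > male:
        verdict = "female"
    else:
        verdict = "undecided"

    return {"female": female, "male": male, "words": len(words), "verdict": verdict}


if __name__ == "__main__":
    print(score_text("The results are as expected"))
```

With a scorer like this, a single very common word such as "the" can pile up a large share of the total just by appearing often, which is worth keeping in mind when reading the tables.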
Which is all good. I would have to run the program on my writing, say, 100 times, controlling for length and maybe topic, to really get an idea of how consistent this is. I tried three times in total, with articles longer than 500 words, and the Genie showed no doubts whatsoever.
But one trend turned out to be clear. The most frequent words in my articles - and I mean, in absolute terms - are all articles, specifically "a" and "the". There also seem to be a lot of descriptive and/or passive sentences, which would explain the "are" and "is". These are all very frequently used words in the language of the Country of Jokes. So I decided to be naughty, and run a tiny little experiment.
For my experiment, I picked a male writer and a female writer from the Country of Jokes. I also picked passages longer than 500 words, as advised by the Genie. Then, I ran them through Google Translate. You'll think I am crazy, right? Because literal translation with Google Translate, we all know it, makes hardly any sense at all. But what it does is keep everything pretty much as it is - including the number of articles and passive sentences. Of course, I do not use literal translation when I write in English, but I do retain certain tendencies from my mother tongue, tendencies I often have to work hard to overcome.
I ran the male writer first, after translating everything to English. And the Genie, of course, got it right. Now I want you to look at the screenshot.
Let's look at the most frequent words from both sides, above 100 points. On the feminine side, we have "if", "not", "and"; on the masculine side, we have "are", "who", "the" - and interestingly, "as" got 92. Not completely identical, I grant you, but the frequency of "are" and "the" was quite high.
I then switched to the female writer. Here is her screenshot. And, of course... she was male, according to the Genie.
Although the male score was almost certainly not significant, let's look again at the words with more than 100 points. On the feminine side, we find "with", "not", "and"; on the masculine side, "are", "who", "is", "the", "a". The "the" alone gathered 700 points.
Noticed anything by now? Writers from the Country of Jokes seem to use language in a very similar way regardless of gender...or, do they? The most important step in all this is the translation, or the transposition from one language to another. The results do not necessarily imply that there is no gender bias. They simply imply that writers from the Country of Jokes use a lot of articles and passive or descriptive sentences. Which is undoubtedly the case in real life.
This shows, even with all the limitations of this tiny experiment, that the Genie probably has an inherent language bias. Maybe one day a program will really be able to tell male from female from word counts; but it should be adjusted for the different languages, and even so... why would you leave it to a program to tell you something about your gender?
Humans, I swear, are the only things I will never understand.
5 of you rambled:
With a little bit of Bayesian reasoning, it shouldn't be too difficult to correctly predict most blog authors' gender. At least in the scientific blogosphere, I would guess that more than 80% of authors are male. I tested the server with three different long posts from three female science bloggers, and all of them were predicted to be males.
Where does your "80%" guess come from?
Just a guess. Do you have a better figure?
No, I don't.
And I'd like to say that my article suggests that the Genie is language-biased, not that it can correctly "guesstimate" that the majority of bloggers out there are males. In fact, I do not even know if that's the case.
You also said you can use Bayesian reasoning to predict a blogger's gender...would you mind explaining how?
I managed to get the result "female" with the text "I love to talk with many people, even if they are strangers." - I classed it as fiction (appropriately) and got a female score of 99 and a male score of 36. Of course, a short piece of text like that would be unreliable for any real statistical test, but it's amusing nonetheless.