Figures 1, 2, and 3 show accuracy measurements for the token unigrams, token bigrams, and normalized character 5-grams, respectively, for all three systems at various numbers of principal components. For the unigrams, SVR reaches its peak [...]. Interestingly, it is SVR that degrades at higher numbers of principal components, while TiMBL, said to need fewer dimensions, manages to hold on to the recognition quality.

LP peaks much earlier [...]. However, it does not manage to achieve good results with the numbers of principal components that were best for the other two systems. Furthermore, LP appears to suffer some kind of mathematical breakdown for higher numbers of components.

Although LP performs worse than it could on fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, whereas TiMBL chooses from a wide range of numbers, generally far lower than is optimal. We expect that the performance with TiMBL can be improved greatly with the development of a better hyperparameter selection mechanism.

For the bigrams (Figure 2), we see much the same picture, although there are differences in the details. SVR now already reaches its peak [...]. TiMBL peaks a bit later at [...] with [...]. And LP just mirrors its behaviour with unigrams. [...] LP keeps its peak at 10, but now even lower than for the token n-grams [...]. However, all systems are in principle able to reach the same quality, i.e. [...].

Even with an automatically selected number, LP already profits clearly [...].

Figure 2: Recognition accuracy as a function of the number of principal components provided to the systems, using token bigrams.

And TiMBL is currently underperforming, but might be a challenger to SVR when provided with a better hyperparameter selection mechanism.

We will focus on the token n-grams and the normalized character 5-grams. As for systems, we will involve all five systems in the discussion.

However, our starting point will always be SVR with token unigrams, this being the best performing combination. We will only look at the final scores for each combination, and forgo the extra detail of any underlying separate male and female model scores (which we have for SVR and LP; see above).

[...] When we look at his tweets, we see a kind of financial blog, which is an exception in the population we have in our corpus.

The exception also leads to more varied classification by the different systems, yielding a wide range of scores. SVR tends to place him clearly in the male area with all the feature types, with unigrams at the extreme with a score of [...]. SVR with PCA, on the other hand, is less convinced, and even classifies him as female for unigrams (1.[...]).

Figure 4 shows that the male population contains some more extreme exponents than the female population. The most obvious male is author [...], with a resounding [...]. Looking at his texts, we indeed see a prototypical young male Twitter user: [...]

From this point on in the discussion, we will present female confidence as positive numbers and male as negative.

Figure 3: Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams.

All systems have no trouble recognizing him as a male, with the lowest scores around 1 for the top function words.

If we look at the rest of the top males (Table 2), we may see more varied topics, but the wide recognizability stays. Unigrams are mostly closely mirrored by the character 5-grams, as could already be suspected from the content of these two feature types. For the other feature types, we see some variation, but most scores are found near the top of the lists.

On the female side, everything is less extreme.

The best recognizable female, author [...], is not as focused as her male counterpart. There is much more variation in the topics, but most of it is clearly girl talk of the type described in Section 5. In scores, too, we see far more variation. Even the character 5-grams have ranks up to 40 for this top [...].

Another interesting group of authors is formed by the misclassified ones. Taking again SVR on unigrams as our starting point, this group contains 11 males and 16 females.
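The selection of misclassified authors can be sketched as follows. This is a minimal illustration, assuming the sign convention used in this discussion (female confidence positive, male negative) and a separation threshold at zero; the function names, the author identifiers, and the scores are invented for the example and are not taken from the corpus.

```python
from collections import Counter

def predicted_gender(score: float) -> str:
    """Map a signed confidence score to a gender label.

    Assumes positive means female and negative means male.
    Treating a score of exactly 0.0 as male is an arbitrary choice
    made for this sketch.
    """
    return "F" if score > 0 else "M"

def misclassified(authors):
    """Return (author_id, true_gender, score) for every wrong prediction."""
    return [(a, g, s) for a, g, s in authors if predicted_gender(s) != g]

# Invented toy data: (hypothetical author id, true gender, signed score).
toy = [("a1", "M", -2.1), ("a2", "F", 1.4), ("a3", "F", -0.8), ("a4", "M", 0.3)]

errors = misclassified(toy)
errors_per_gender = Counter(gender for _, gender, _ in errors)
```

Counting the errors per true gender, as in `errors_per_gender`, gives the kind of breakdown reported above for SVR on unigrams.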

We show the 5 most [...].

Figure 4: Confidence scores for gender assignment with regard to the female and male profiles built by SVR on the basis of token unigrams. The dashed line represents the separation threshold, i.e. [...]. The dotted line represents exactly opposite scores for the two genders.

Table: Top ranking females in SVR on token unigrams, with ranks and scores for SVR with various feature types.

With one exception (author [...] is recognized as male when using trigrams), all feature types agree on the misclassification.

This may support our hypothesis that all feature types are doing more or less the same. But it might also mean that the gender just influences all feature types to a similar degree. In addition, the recognition is of course also influenced by our particular selection of authors, as we will see shortly. Apart from the general agreement on the final decision, the feature types vary widely in the scores assigned, but this also allows for both conclusions.

The male who is attributed the most female score is author [...]. On re-examination, we see a clearly male first name and also profile photo. However, his Twitter network contains mostly female friends.

This apparently colours not only the discussion topics, which might be expected, but also the general language use. The unigrams do not judge him to write in an extremely female way, but all other feature types do. When looking at his tweets, we [...]. This has also been remarked by Bamman et al. [...] There is an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models.

The most extreme misclassification is reserved for a female, author [...]. This turns out to be Judith Sargentini, a member of the European Parliament, who tweets under the [...] (footnote 14). Although clearly female, she is judged as rather strongly male [...]. In this case, it would seem that the systems are thrown off by the political texts.

If we search for the word parlement (parliament) in our corpus, which is used 40 times by Sargentini, we find two more female authors (each using it once), as compared to 21 male authors with up to 9 uses. Apparently, in our sample, politics is a male thing. We did a quick spot check with author [...], a girl who plays soccer and is therefore also misclassified often; here, the PCA version agrees with the original unigrams and misclassifies even more strongly ([...] versus [...]).

In later research, when we try to identify the various user types on Twitter, we will certainly have another look at this phenomenon.

[...] Are they mostly targeting the content of the tweets, i.e. [...]? In this section, we will attempt to get closer to the answer to this question. Again, we take the token unigrams as a starting point. However, looking at SVR is not an option here. Because of the way in which SVR does its classification (hyperplane separation in a transformed version of the vector space), it is impossible to determine which features do the most work.

Instead, we will just look at the distribution of the various features over the female and male texts. Figure 5 shows all token unigrams. The ones used more by women are plotted in green, those used more by men in red.

The position in the plot represents the relative number of men and women who used the token at least once somewhere in their tweets. However, for classification, it is more important how often the token is used by each gender.
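The two quantities contrasted here can be computed as in the following sketch: the percentage of authors of each gender who use a token at least once (the plot position), and the relative frequency with which each gender uses it. The function name `token_usage` and the toy tweets are hypothetical, and the sketch assumes at least one non-empty author per gender. It does not compute the class separation value from Section 4, whose definition is not repeated here.

```python
from collections import Counter

def token_usage(authors):
    """authors: list of (gender, tokens) pairs, with gender "F" or "M".

    Returns {token: {gender: (pct_authors_using, relative_frequency)}}.
    Assumes each gender has at least one author and at least one token.
    """
    n_authors = Counter(gender for gender, _ in authors)
    n_tokens = Counter()
    users = {"F": Counter(), "M": Counter()}   # authors using the token
    counts = {"F": Counter(), "M": Counter()}  # total occurrences
    for gender, tokens in authors:
        n_tokens[gender] += len(tokens)
        counts[gender].update(tokens)
        users[gender].update(set(tokens))      # count each author once
    vocabulary = set(counts["F"]) | set(counts["M"])
    return {
        tok: {
            g: (100.0 * users[g][tok] / n_authors[g],
                counts[g][tok] / n_tokens[g])
            for g in ("F", "M")
        }
        for tok in vocabulary
    }

# Invented toy data; the counts are illustrative, not corpus measurements.
toy = [
    ("F", ["ik", "mis", "mama", "ik"]),
    ("F", ["nagellak", "ik"]),
    ("M", ["fifa", "bier", "fifa"]),
    ("M", ["ik", "bier"]),
]
usage = token_usage(toy)
```

In this toy data, `usage["fifa"]["M"]` shows a token used by only half of the male authors but with a high relative frequency, which is the situation where the two measures diverge.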

We represent this quality by the class separation value that we described in Section 4. As the separation value and the percentages are generally correlated, the bigger tokens are found further away from the diagonal, while the area close to the diagonal contains mostly unimportant and therefore unreadable tokens.

On the female side, we see a representation of the world of the prototypical young female Twitter user. And also some more negative emotions, such as haat (hate) and pijn (pain). Next we see personal care, with nagels (nails), nagellak (nail polish), makeup (makeup), mascara (mascara), and krullen (curls).

Clearly, shopping is also important, as is watching soaps on television (gtst). The age is reconfirmed by the endearingly high presence of mama and papa. As for style, the only real factor is echt (really). The word haar may be the pronoun her, but just as well the noun hair, and in both cases it is actually more related to the [...].

Identity disclosed with permission.

And by TweetGenie as well.

An alternative hypothesis was that Sargentini does not write her own tweets, but assigns this task to a male press spokesperson. However, we received confirmation that she writes almost all her tweets herself (Sargentini, personal communication).

Figure 5: Percentages of use of tokens by female and male authors. The font size of the words indicates to which degree they differentiate between the genders when also taking into account the relative frequencies of occurrence.

We do also see more expressions of self, with ik (I) and first person verbs such as wil (want) and heb (have), but these are much less distinguishing.

On the male side, we see a rather different world. Apart from bier (beer), we see an enormous number of soccer-related words, with fifa at the most extreme, and then club names, scores, competitions, playing, winning, losing, etc.

On the right edge of the plot, though, we do also observe some function words. The location adverbs daar (there) and waar (where) appear to be a more male thing, as well as some prepositions like per (per), bij (at, near) and voor (for, before). Finally, mentioning other users is apparently more often done by men.

For both genders, the tokens are dominated by the young person's world. It is no wonder that classifying different types of authors, such as politicians and financial bloggers, is more problematic.

Figure 6: Percentages of use of the most frequent function words and punctuation by female and male authors.

Although most distinguishing tokens appear to be related to content, we do observe some style-related tokens. In Figure 6, we show a plot for the top function words (or rather tokens), which was the only feature type focusing on style in our experiments.

We can now observe various distinguishing tokens which were so far lost in the dense cloud of words. They correspond to what earlier research (see Section 2) has observed. Females show more personal pronouns, such as the already mentioned ik (I), but also me (me) and jou (you), as well as the reduced possessive pronoun mn (mijn; my).

Males write more objective structures, with the mentioned prepositions, and also articles like de (the) and een (a). Here we also find more third person constructions, with is (is), hij (he) and zijn (his, or plural are).

Looking at the bigrams, which we will not plot here, we see a few more style-related constructions appearing. On the female side, we see niet meer (not anymore), ik mis (I miss) and the [...]. On the male side, there are also mostly combinations of already observed unigrams, but also the more pragmatic ending of tweets with the word man, in man!

All in all, there appear to be quite a few features related to style after all. Furthermore, the top function words are doing quite well, with [...]. On the other hand, we cannot escape the impression that even these style features are more often related to what is being tweeted about than to personal writing style.

Conclusion and Future Work

We have investigated how well the gender of authors on Twitter can be determined on the basis of token or character n-grams.

We find that recognition is possible with a high accuracy, up to [...]. Furthermore, some of the errors are probably related to the fact that the authors in question are different from the typical Twitter users dominating our data set.

The best feature type for recognition appears to be the token unigrams, with the most distinguishing tokens linked to the typical activities of the dominant Twitter users. As for classification systems, Support Vector Regression clearly performs best with all feature types. During our investigation into gender recognition, we have also experimented with the use of Principal Component Analysis as a preprocessing step to classification. It was already known that this step was necessary for k-nn learning.

We found that SVR is actually hampered rather than helped by the preprocessing. Its accuracy degrades when using PCA, although often not significantly. TiMBL, even with PCA, does not reach the same accuracy level, and only accomplishes scores similar to SVR's scores for token skip bigrams and unnormalized character trigrams.

However, TiMBL's lower quality is mostly a matter of hyperparameter selection. The number of principal components provided to the learners was determined automatically on the basis of development data. When we examined the systems' accuracy for fixed numbers of principal components, TiMBL was often at the same accuracy level as SVR, and it was LP that was falling behind.
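Selecting the number of principal components on development data can be sketched as a simple search over candidate values. This is only an illustration of the general idea, not the selection mechanism actually used by the systems: the function name `select_n_components`, the tie-breaking rule, and the accuracy values are all assumptions made for the example.

```python
from typing import Callable, Iterable

def select_n_components(candidates: Iterable[int],
                        dev_accuracy: Callable[[int], float]) -> int:
    """Pick the candidate with the highest development-set accuracy.

    Ties go to the smaller number of components, which is an
    assumption of this sketch rather than a documented rule.
    """
    best_n, best_acc = None, float("-inf")
    for n in sorted(candidates):
        acc = dev_accuracy(n)
        if acc > best_acc:  # strict '>' keeps the smaller n on ties
            best_n, best_acc = n, acc
    if best_n is None:
        raise ValueError("no candidates given")
    return best_n

# Invented accuracies for illustration only; not measurements from the paper.
toy_dev_accuracy = {5: 0.80, 10: 0.86, 50: 0.86, 200: 0.83}
chosen = select_n_components(toy_dev_accuracy, toy_dev_accuracy.__getitem__)
```

A selector of this kind is only as good as the development-set estimate it receives, which is consistent with the observation that a system can perform well at a fixed number of components and still lose accuracy through a poor automatic choice.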

It has remained unclear to which degree gender can be recognized on the basis of style features. Although the use of all unigrams for classification yields far better results than the use of the most frequent function words, the latter are certainly not doing badly. Furthermore, our closer examination in Section 5 [...]. We will revisit this question when we have larger n-gram sets available which can be assumed to be largely domain-independent.

[...] Not only did we predict just one user trait, but we also considered just a very select class of users, namely individual users with a significant tweet volume.

We will still need to test the minimum number of words on which the classifier can maintain its current high quality. Furthermore, we will need to build classifiers to distinguish between individual user accounts, shared user accounts, accounts controlled by boards of editors, and tweetbots.

It may also be useful to distinguish between different uses of Twitter, such as professional communication and social chitchat, and build separate metadata estimators for these different uses. Even more importantly, we will need to look beyond very specific lexical features. If we base metadata on a limited number of such features, we will never be able to use the resulting data for studying language use or social behaviour.

If we tried, we would fall victim to circular reasoning, such as observing that only men ever play soccer, [...].

We are currently laying the basis for the construction of such sets in other work (van Halteren and Oostdijk, Submitted).

Therefore, if we ever want to automatically add metadata, it will have to be with as many information sources as possible, preferably only using that metadata on which various sources agree.
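The agreement requirement could be sketched as follows. This is a hypothetical illustration rather than a proposed system: the function name `agreed_label`, the source names, and the minimum of two agreeing sources are assumptions made for the example.

```python
from typing import Mapping, Optional

def agreed_label(predictions: Mapping[str, Optional[str]],
                 min_sources: int = 2) -> Optional[str]:
    """Return a label only when enough sources give one and all agree.

    predictions maps a (hypothetical) source name to its label, or to
    None when that source has no opinion. Returns None on disagreement
    or when fewer than min_sources sources gave a label.
    """
    given = [label for label in predictions.values() if label is not None]
    if len(given) < min_sources or len(set(given)) != 1:
        return None
    return given[0]

# Hypothetical sources: text classifier, first name lookup, profile photo.
unanimous = agreed_label({"text": "F", "first_name": "F", "photo": None})
conflict = agreed_label({"text": "M", "first_name": "F", "photo": "F"})
too_few = agreed_label({"text": "M", "first_name": None, "photo": None})
```

Under this rule, an author whose text-based label conflicts with the other sources is left without automatically assigned metadata instead of receiving a possibly wrong label.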

