The Readability of Blogs
You need 11.9 years of formal education to easily understand this site. Well, that's if you believe a readability test called the Gunning-Fog Index. The Gunning-Fog Index is basically an algorithm that scores text on sentence length and the proportion of complex words (words with three or more syllables). After crunching the numbers it comes up with a readability score that's supposed to predict how easily people will be able to digest the text. The Wikipedia article for the Gunning-Fog Index mentions that comic book text typically has a score around six, Reader's Digest typically scores around eight, Newsweek scores around ten, and so on. This puts onfocus.com on par with the readability of Time magazine.
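If you're curious what the index is actually measuring, here's a quick sketch of the textbook formula in Perl. This is just an illustration of the standard definition, not necessarily the exact calculation any particular tool uses.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Gunning-Fog Index, textbook definition:
    #   0.4 * ( (words / sentences) + 100 * (complex words / words) )
    # where a "complex" word has three or more syllables.
    sub gunning_fog {
        my ($words, $sentences, $complex) = @_;
        return 0.4 * ( ($words / $sentences) + 100 * ($complex / $words) );
    }

    # Example: 500 words in 25 sentences, 50 of them complex.
    printf "Fog index: %.1f\n", gunning_fog(500, 25, 50);   # prints 12.0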
The first time I ran into a demo of readability tests was at this page: Juicy Studio Readability Test. You can plug in a URL, and get back a Gunning-Fog Index score, and some other scores. I thought it was interesting and moved on. But for some reason it's been sticking in the back of my mind.
I'm bringing this up because I've been thinking quite a bit about the ways we measure blogs. And most of our measurement tools are fairly blunt. If you ask blog-measurement site Technorati what it "thinks" about your favorite blogs, you'll get machine answers like the number of inbound and outbound links. You'll get some info about traffic over time and Technorati's computed rank compared to other blogs. You'll see post-frequency and a list of common topics culled from RSS categories and Technorati tagging.
On the other hand, if I were to ask you some questions about your favorite blogs, you could probably tell me exactly why you like them. And it wouldn't have anything to do with inbound links or the other machine-based metrics. I'm guessing most of your answers would involve the writing style, tone, the topics the author covers, the fact that everyone else reads it, or maybe your personal relationship with the author.
You can't quantify something like tone, so you can't put computers to work analyzing tone. (I'd love to have a snark score for blogs.) But readability scores are a step toward a more human-style metric, and the scores can be crunched, analyzed, graphed, and averaged by computers. And I like the idea that the readability scores are lying there dormant within the sentences themselves, waiting to be tapped.
I'm not a linguist, so I don't know how accurately these scores reflect readability. But I was interested enough in readability as a metric to do some digging around. A search on CPAN turned up the module Lingua::EN::Fathom, which accepts arbitrary text and returns the Gunning-Fog Index score along with several other scores, including the Flesch Reading Ease score and the Flesch-Kincaid grade level. I thought it might be fun to plug in the top ten or so English-language blogs as reported on Technorati popular to see if there's a "sweet spot" reading level among the most popular blogs. Of course many factors go into a blog's success, but I thought readability could be one reason some blogs hit the top of the tail and others don't. If nothing else, I figured I could find out if blog readers are more of a Reader's Digest sort of audience or more of a Time magazine sort of audience.
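For reference, here's roughly how the module gets used. The method names (new, analyse_block, fog, flesch, kincaid) are from my reading of the Lingua::EN::Fathom documentation, so check the POD if they've changed.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::EN::Fathom;

    # Analyze a block of plain text (HTML already stripped out).
    my $sample = "The quick brown fox jumps over the lazy dog. " x 20;

    my $fathom = Lingua::EN::Fathom->new();
    $fathom->analyse_block($sample);

    printf "Gunning-Fog Index:    %.2f\n", $fathom->fog;
    printf "Flesch Reading Ease:  %.2f\n", $fathom->flesch;
    printf "Flesch-Kincaid grade: %.2f\n", $fathom->kincaid;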
So I cooked up a little Perl script that takes a list of RSS feeds, loops through the posts, strips out HTML, and calculates readability scores. If you want to run it yourself, you can grab the code here:
reading_levels.pl
In addition to the Lingua::EN::Fathom module, you'll need LWP::Simple for fetching feeds, XML::RSS::Parser for parsing them, and Math::Round::Var for rounding the scores. Add a list of feed URLs you want to analyze to the top of this file, and then run it on the command line, like this:
perl reading_levels.pl > reading_levels.txt
Once finished, the file reading_levels.txt will have a report with the individual reading levels for the sites, and an average for the group.
Caveats: this isn't a very robust feed parser, some feeds only have excerpts rather than full posts, and some feeds simply don't work with this script. I used the full feed posts if multiple feeds were available, and I skipped any sites that didn't parse.
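For the curious, here's the general shape of the loop. Treat it as a sketch rather than the real reading_levels.pl: the module calls (parse_string, query, text_content, analyse_block, and Math::Round::Var's round) follow my reading of each module's docs, and the feed URL is a placeholder.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use XML::RSS::Parser;
    use Lingua::EN::Fathom;
    use Math::Round::Var;

    # Swap in the feeds you want to analyze.
    my @feeds = ('http://www.example.com/index.xml');

    my $parser = XML::RSS::Parser->new;
    my $rnd    = Math::Round::Var->new(0.01);

    foreach my $url (@feeds) {
        my $xml  = get($url)                   or next;  # skip feeds that won't fetch
        my $feed = $parser->parse_string($xml) or next;  # skip feeds that won't parse

        # Collect the text of every post, minus the HTML.
        my $text = '';
        foreach my $item ( $feed->query('//item') ) {
            my $node = $item->query('description') or next;
            my $body = $node->text_content;
            $body =~ s/<[^>]+>/ /g;   # crude HTML strip
            $text .= "$body\n";
        }
        next unless $text =~ /\S/;

        my $fathom = Lingua::EN::Fathom->new();
        $fathom->analyse_block($text);

        printf "%s\n  Fog: %s  Flesch: %s  Kincaid: %s\n", $url,
            $rnd->round($fathom->fog),
            $rnd->round($fathom->flesch),
            $rnd->round($fathom->kincaid);
    }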
So, what did I find? Well, here's the report for the top several English-language blogs as reported today by Technorati:
reading_levels.txt
(I skipped Post Secret because there's not much text to analyze.) The average Gunning-Fog Index score was off the "wide audience" charts at 14. That means the average person would need over 14 years of formal education to understand these blogs easily. The average Flesch Reading Ease score was 46.9 on a 100-point scale where higher means easier to read. That's on par with state insurance form requirements. (Seriously!) And the average Flesch-Kincaid grade level was 11.8, meaning the writing is pitched at high school seniors, which is high on the scale. The most "ideal" site for a wide audience was Daily Kos, with a low Flesch-Kincaid grade level (9.05) and an above-average Flesch Reading Ease score of 56.48.
So, what does this mean? I have no idea. My prediction that the most popular blogs would have very good readability scores didn't quite hold up. I can't pinpoint a "sweet spot", but maybe blog readers enjoy more densely layered text. (Think Time instead of Newsweek, but not quite Harvard Law Review.) I might take a look at sentence length and percentage of complex words next and see how those measure up.
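If I do, Lingua::EN::Fathom appears to expose both numbers directly, something like this (again, check the docs for the exact method names):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::EN::Fathom;

    # One post's text, with the HTML already stripped out.
    my $post_text = "Some post text goes here.";

    my $fathom = Lingua::EN::Fathom->new();
    $fathom->analyse_block($post_text);

    printf "Average words per sentence: %.1f\n", $fathom->words_per_sentence;
    printf "Percent complex words:      %.1f\n", $fathom->percent_complex_words;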
I still think measuring readability has promise. Earlier today Anil was talking about TL;DR syndrome, and I think the popular blogs capitalize on this with short, frequent posts. But I also wonder if text density plays a role. So in addition to saying, "too long; didn't read," I think there's the possibility of "too dense; didn't read." (Insert joke here.)