Baby’s first semantic blog citation metrics #oer16

Note: If you want to see semantometrics done for real, do have a read about the Jisc-supported work by Petr Knoth and Drahomira Herrmannova. My approach is loosely inspired by their research – which I’ve been fortunate enough to see presented on a couple of occasions – but in terms of reliability, technical finesse and generally having a clue I’m very much a Robbie Dupree to their Doobie Brothers. I should also be clear that what follows is entirely my own effort.

In all honesty, and if we look at Vivien Rolfe’s superb systematic reviews of open education, we’d have to conclude that literature in the area leaves a fair bit to be desired. Faced with this I’ve heard it said on a number of occasions that “the good stuff is in the blogs”, and I decided that the time had come to test this.

If open education blogs do have academic merit, I would expect them to be cited in the more traditional literature, both around the subject each covers and further afield. This might seem circular – but as there are clearly gaps in the literature one might reasonably expect blog posts to be filling these.

For the purposes of this experiment (as presented at OER16) I looked at five blogs that I felt were consistently high quality, and that I had seen frequently referenced in conference presentations and similar:

Now the “scholarly graph” (the ways in which publications are interconnected by a network of citations) is notoriously hard to mine, unless you happen to be Thomson Reuters or are in a position to give them money. I am neither, so I needed to use the tools I had available to me, which are necessarily incomplete and variable in coverage.

Google Scholar is far from perfect, but it does let me search for references in a roundabout way. What I did was search for the root domain of the blog as a text string, to generate a corpus of literature that included (most likely) a link or citation to a specific page of the blog. I then went to export the results of the query and found that a simple export to .csv is not possible. You can export individual records to a small selection of bibliographic software, but not the result set as a whole.

I eventually found the marvellously named “Harzing’s Publish or Perish”, which appears to use some Tony Hirst-esque page-scraping magic to automate what would otherwise be a laborious task. I cleaned out duplicate records and self-blog citations (pages from the blog itself being returned, which happened quite a lot for popular individual posts), then rendered the results to the spreadsheet available here. (I also had a special issue with David Wiley – hey, don’t we all! – as many early papers cited his Open Publication license as a means of licensing their work. I got round this by searching only for citations to his “blog” directory.)
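For anyone wanting to repeat that clean-up step in code rather than by hand, here is a minimal Python sketch. It assumes each record is a dictionary with hypothetical `title` and `url` fields (the actual columns exported by Publish or Perish may well differ), and it treats two records as duplicates when their titles match after lower-casing and whitespace normalisation, which is only one possible rule.

```python
from urllib.parse import urlparse


def clean_records(records, blog_domain):
    """Drop self-blog citations and duplicate titles from a list of records.

    `records` is a list of dicts with hypothetical "title" and "url" keys;
    `blog_domain` is the root domain of the blog being studied.
    """
    seen_titles = set()
    cleaned = []
    for record in records:
        host = urlparse(record["url"]).netloc.lower()
        # Self-blog citation: the "citing" page lives on the blog itself.
        if host == blog_domain or host.endswith("." + blog_domain):
            continue
        # Normalise the title so trivial differences do not hide duplicates.
        key = " ".join(record["title"].lower().split())
        if key in seen_titles:
            continue
        seen_titles.add(key)
        cleaned.append(record)
    return cleaned


# Made-up records, for illustration only.
sample = [
    {"title": "Open practice", "url": "https://journal.example.com/1"},
    {"title": "open  practice", "url": "https://mirror.example.net/1"},
    {"title": "A popular post", "url": "https://blog.example.org/post"},
]
print(clean_records(sample, "example.org"))
```

A real clean-up would probably need fuzzier matching, since the same paper can appear under slightly different titles.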

[I also searched for “edupunk” just out of interest – as this was a 2008 term coined to deal with a range of activities that included blogging instead of formal publication. To be honest, it didn’t tell me very much other than that the majority of publications on edupunk are in Spanish.]

Publish or Perish also returns a bunch of those research power statistics that occasionally come up in conversation – and I was tickled to play with the h-index for each corpus of papers it returned. The h- (or Hirsch) index is generally seen as an author-level metric, but it can be calculated to characterise any group of papers. In this instance it gives a reasonable measure of the kind of influence that the papers that cite each blog have.

  • David Wiley – 23
  • George Siemens – 38 (! – this is very high for an education-related subject)
  • – 29
  • Martin Weller – 20
  • Audrey Watters – 12

All of these are – if you are the kind of person who cares about your h-index – pretty respectable showings. In particular it should be noted that Audrey is a journalist (and a damned fine independent one whom you should support!) who makes no pretense to be writing academic research or comment.
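If the h-index is unfamiliar: a set of papers has an h-index of h when h of those papers have each been cited at least h times. The figures above came straight from Publish or Perish, but the calculation itself is simple enough to sketch in Python. The citation counts below are made up for illustration.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Made-up citation counts for a small corpus of citing papers.
example_corpus = [10, 8, 5, 4, 3, 0]
print(h_index(example_corpus))  # prints 4
```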

[I should note here that I also looked at commonality between pairs of blogs – are any particular two likely to be cited together? The short answer is yes – people who cite George Siemens’ blog also tend to cite David Wiley’s blog. The data is on the spreadsheet.]
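One simple way to check that kind of pairing is to count the papers that appear in both blogs' citing corpora. The sketch below assumes each corpus has been reduced to a set of paper identifiers (titles here); the blog names and titles are placeholders, and this is not necessarily how the spreadsheet figures were produced.

```python
from itertools import combinations


def shared_citers(citing_papers):
    """Count, for every pair of blogs, the papers that cite both.

    `citing_papers` maps a blog name to a set of paper identifiers.
    """
    overlaps = {}
    for (blog_a, papers_a), (blog_b, papers_b) in combinations(
        sorted(citing_papers.items()), 2
    ):
        overlaps[(blog_a, blog_b)] = len(papers_a & papers_b)
    return overlaps


# Placeholder data, for illustration only.
corpora = {
    "blog_a": {"paper 1", "paper 2", "paper 3"},
    "blog_b": {"paper 2", "paper 3", "paper 4"},
    "blog_c": {"paper 5"},
}
for pair, count in shared_citers(corpora).items():
    print(pair, count)
```

Raw counts favour the blogs with the largest corpora, so a fairer comparison might divide each count by the size of the smaller corpus.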

My next step was to find the five most common words used in the titles of the papers citing each blog, and compare these to the top five words used in the blog itself (using the standard list of English stop-words built into Voyant Tools). A proper look at this topic might take more words from each source, and employ a weighting based on the ratios between word counts, but at this point it was Sunday evening before the conference.
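The same word count can be roughed out in a few lines of Python. This sketch uses a tiny hand-rolled stop-word list as a stand-in, so it will not match Voyant Tools' built-in list or its tokenisation exactly; the titles are invented.

```python
import re
from collections import Counter

# A deliberately tiny stand-in for a proper English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "is", "with"}


def top_terms(texts, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-words across a list of texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in stop_words)
    return [word for word, _ in counts.most_common(n)]


# Invented titles, for illustration only.
titles = [
    "Open education and the MOOC",
    "MOOC learning in open education",
    "Learning with open resources",
]
print(top_terms(titles))
```

Running the same function over the text of the blog posts and over the titles of the citing papers gives the two lists to compare.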


It’s possible to do a manual sort to look for any interesting patterns – in this case what struck me is how often citing papers talked about technology-related issues or *shudder* MOOCs, whereas the blogs were more likely to consider students or use terms like OER.


I decided to automatically compare titles to the list of common terms from the blog in question using a simple Excel formula. For each citing paper I recorded whether its title contained at least one of the blog’s common terms, then averaged those results across the corpus to create an index of semantic prediction. Basically, a higher number (with 1 being the highest possible) meant that the terms commonly used in the blog were terms likely to be found in the titles of papers citing the blog. Here’s how that stacks up:

  • Wiley – 0.70
  • Siemens – 0.77
  • Connectivism – 0.58
  • Weller – 0.58
  • Watters – 0.12
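The index above was calculated in Excel, and the exact formula isn't reproduced here. As a rough Python equivalent, the sketch below scores each title 1 if it contains any of the blog's common terms and 0 otherwise, then takes the mean. Matching on whole words is an assumption (the spreadsheet may have matched substrings), and the titles and terms are made up.

```python
import re


def semantic_prediction(titles, blog_terms):
    """Share of titles (0 to 1) containing at least one of the blog's common terms."""
    terms = {term.lower() for term in blog_terms}
    if not titles:
        return 0.0
    hits = 0
    for title in titles:
        words = set(re.findall(r"[a-z]+", title.lower()))
        if words & terms:
            hits += 1
    return hits / len(titles)


# Made-up titles and terms, for illustration only.
citing_titles = [
    "Open education in practice",
    "MOOCs and technology",
    "Students and OER",
    "A history of printing",
]
common_terms = ["open", "students", "oer", "learning", "education"]
print(semantic_prediction(citing_titles, common_terms))  # prints 0.5
```

With only five terms and a yes/no score per title, a small corpus can swing the index a long way, which is worth bearing in mind when reading the lowest score.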

Now Petr Knoth and Drahomira Herrmannova postulate that a greater semantic distance between the citing paper and the cited resource implies a greater contribution to knowledge – works that bridge disparate areas of research could be seen as more valuable as new knowledge has been synthesised via a new connection formed (kinda rhizomatic, wouldn’t you say?)

By this measure Audrey’s work can be said to have the greatest contribution to knowledge, with David and George much more likely to be cited in the field in which they write. Whether this tells us anything meaningful in the grand scheme of things is open to question, not least because of the general shonkiness of my methods – this being just a first look which has much scope for improvement.

It also made me think about Cameron Neylon’s concerns about our poor understanding of the nature of citation. Citation is one of those things that seems simple at first thought, but has a huge layer of social and cultural practices built on top of it. Unless we have a better understanding of the reasons for each citation (as quote source, acknowledgement, attribution of ideas or methods, cultural norm in a domain of research…) we can’t really assume that all citations carry equal weight – though most serious citation metrics do so. Some of this may be categorisable using Knoth and Herrmannova’s deep text analysis alongside a carefully designed system of categories and indicators (another reason I am watching that project and other citation experiments with huge interest).

So – to answer my initial questions:

  • is all the good stuff in the blogs? There is clearly a lot of good stuff in blogs, which is frequently cited by literature that itself is highly cited. I’d love to look at other areas of research with similarly important blogs to compare – any suggestions would be welcome!
  • are blogs cited only within the domains they write in? Broadly no, though some blogs are more often cited in the domains they most closely identify with than others. Though Audrey’s blog citations showed a low level of semantic prediction, this may be because there were fewer citations overall, and I would like to refine the metric to account for this (possibly via some form of sampling?)
  • is this interesting enough to look at further? Absolutely!

Here are the slides from my and Viv’s presentation at OER16

39 thoughts on “Baby’s first semantic blog citation metrics #oer16”

  1. As I said on Twitter, I was sorry to miss your presentation at OER16. First, thanks so much for introducing me to semantometrics though with formulae like that it may take us some time to become better acquainted:)
    I was intrigued by your hypothesis: is all the good stuff in the blogs? More than 4 years ago I blogged about a writing project (that sadly was never completed) on the relationships between blogs and academic papers in writing development. I have just reread the comment stream and it’s another case of that being better than my post.
    Where is the ‘good stuff’ on OER? Obviously there’s lots of good stuff in blogs and we saw evidence of brilliant work at OER16 last week but I am curious about relationships between blogs, presentations and academic papers. The work published in 2015 that had the most impact on me was this special issue and I was curious to see if it popped up at OER16. I wasn’t surprised that Edwards popped up in Catherine Cronin’s keynote because of discussions we have had since the Special Issue was published. Inspired by your work, I searched the OER16 site and realised that I could search the programme doc for author names from the SI. The result was interesting. The only other citation to that Special Issue was from Jeremy Knox, who was an author in the SI and cited Edwards in his presentation, so that was a link already present, so to speak. Edwards’ paper was one of 2 that were Open Access in the SI and it was mentioned in the popular opening keynote, so hopefully it will have some traction (positive or negative) in OER circles. One barrier is the paywall, although at least one author did their best to circumvent this.
    Of course there could have been a lot of blogging on that Special Issue that I haven’t seen and research papers citing it (including 2 of my own) may still be working their way through peer review but I was a bit saddened by what I didn’t find.

    1. Thanks for this. I was wondering if there was any interest in being critical about openness and the history of these debates. The fact my article is OA means anyone can get it and I have tweeted a link beforehand.

  2. Impressive. Really a good presentation. Although I missed the presentation itself, I was searching for this kind of article, which provides all the highlights of it. Thanks a lot for sharing.

