Note: If you want to see semantometrics done for real, have a read about the Jisc-supported work by Petr Knoth and Drahomira Herrmannova. My approach is loosely inspired by their research – which I’ve been fortunate enough to see presented on a couple of occasions – but in terms of reliability, technical finesse and generally having a clue I’m very much a Robbie Dupree to their Doobie Brothers. I should also be clear that what follows is my own efforts entirely.
In all honesty, and if we look at Vivien Rolfe’s superb systematic reviews of open education, we’d have to conclude that literature in the area leaves a fair bit to be desired. Faced with this I’ve heard it said on a number of occasions that “the good stuff is in the blogs”, and I decided that the time had come to test this.
If open education blogs do have academic merit, I would expect them to be cited in the more traditional literature, both around the subject each covers and further afield. This might seem circular – but as there are clearly gaps in the literature one might reasonably expect blog posts to be filling these.
For the purposes of this experiment (as presented at OER16) I looked at five blogs that I felt were consistently high quality, and that I had seen frequently referenced in conference presentations and similar:
- David Wiley – Iterating towards openness
- George Siemens – Elearnspace, and Connectivism.ca (the latter was deleted earlier this year, but was an important point of reference)
- Martin Weller – Edtechie (note that Martin’s marvellous blog is now elsewhere)
- Audrey Watters – Hack Education
Now the “scholarly graph” (the ways in which publications are interconnected by a network of citations) is notoriously hard to mine, unless you happen to be Thomson Reuters or are in a position to give them money. I am neither, so I needed to use the tools I had available to me, which are necessarily incomplete and variable in coverage.
Google Scholar is far from perfect, but it does let me search for references in a roundabout way. What I did was search for the root domain of the blog as a text string, to generate a corpus of literature that (most likely) included a link or citation to a specific page of the blog. I then went to export details of the results and found that a simple export to .csv is not possible – you can export individual records to a small selection of bibliographic software, but not the result set as a whole.
I eventually found the marvellously named “Harzing’s Publish or Perish”, which appears to use some Tony Hirst-esque page-scraping magic to automate what would otherwise be a laborious task. I cleaned out duplicate records and self-blog citations (pages from the blog itself being returned, which happened quite a lot for popular individual posts), then rendered to the spreadsheet available here. (I also had a special issue with David Wiley – hey, don’t we all! – as many early papers cited his Open Publication License as a means of licensing their work. I got round this by searching only for citations to his “blog” directory.)
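For the curious, the cleaning step can be sketched in a few lines of Python. The field names (“Title”, “Source”) are hypothetical stand-ins for whatever the Publish or Perish export actually contains – this is a sketch of the idea, not the tool I used (which was a spreadsheet):

```python
# Hypothetical record fields: "Title" for the paper title, "Source"
# for the URL or venue the record points at.
def clean_records(rows, blog_domain):
    """Drop duplicate titles and self-blog citations (pages hosted on
    the blog itself, matched by domain in the record's source)."""
    seen = set()
    cleaned = []
    for row in rows:
        title = row["Title"].strip().lower()
        if title in seen:
            continue  # duplicate record
        if blog_domain in row.get("Source", "").lower():
            continue  # the blog citing itself
        seen.add(title)
        cleaned.append(row)
    return cleaned
```

Matching on the root domain is crude but mirrors the original search, which was itself just a text-string match on the domain.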
[I also searched for “edupunk” just out of interest – as this was a 2008 term coined to deal with a range of activities that included blogging instead of formal publication. To be honest, it didn’t tell me very much other than that the majority of publications on edupunk are in Spanish.]
Publish or Perish also returns a bunch of those research power statistics that occasionally come up in conversation – and I was tickled to play with the h-index for each corpus of papers I returned. The h- (or Hirsch) index is generally seen as an author-level metric, but it can be calculated to characterise any group of papers. In this instance it gives a reasonable measure of the kind of influence that the papers that cite each blog have.
- David Wiley – 23
- George Siemens – 38(! – this is very high for an education related subject)
- Connectivism.ca – 29
- Martin Weller – 20
- Audrey Watters – 12
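The h-index itself is simple to compute from a list of per-paper citation counts, whatever the group of papers happens to be – a minimal sketch:

```python
def h_index(citation_counts):
    """h-index: the largest h such that h papers each have >= h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # at least `rank` papers have `rank` or more citations
        else:
            break
    return h
```

Applied to the citation counts of the papers citing a blog, this gives the corpus-level figures above.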
All of these are – if you are the kind of person who cares about your h-index – pretty respectable showings. In particular it should be noted that Audrey is a journalist (and a damned fine independent one whom you should support!) who makes no pretense to be writing academic research or comment.
[I should note here that I also looked at commonality between pairs of blogs – are any particular two likely to be cited together. The short answer is yes – people who cite George Siemens’ blog also tend to cite David Wiley’s blog. The data is on the spreadsheet.]
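The post doesn’t pin down how commonality was measured; one simple option is the Jaccard overlap between the two sets of citing papers – this sketch assumes papers can be identified by some stable key such as title:

```python
def cocitation_overlap(citing_a, citing_b):
    """Jaccard overlap between the sets of papers citing two blogs:
    1.0 means identical citing sets, 0.0 means no papers in common."""
    a, b = set(citing_a), set(citing_b)
    if not (a | b):
        return 0.0  # avoid dividing by zero when both sets are empty
    return len(a & b) / len(a | b)
```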
My next step was to find the five most common words used in the titles of the papers citing each blog, and compare these to the top five words used in the blog itself (using the standard list of English stop-words built into Voyant Tools). A proper look at this topic might take more words from each source, and employ a weighting based on the ratios between word counts, but at this point it was Sunday evening before the conference.
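A minimal sketch of that word-frequency step, with a tiny hand-rolled stop-word list standing in for Voyant Tools’ built-in English one:

```python
from collections import Counter
import re

# A toy stop-word list; Voyant Tools ships a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for", "on", "with"}

def top_terms(text, n=5):
    """Return the n most common non-stop-words in a body of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]
```

Run once over the blog’s text and once over the concatenated citing-paper titles, this yields the two top-five lists being compared.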
It’s possible to do a manual sort to look for any interesting patterns – in this case what struck me is how often citing papers talked about technology-related issues or *shudder* MOOCs, whereas the blogs were more likely to consider students or use terms like OER.
I decided to automatically compare titles to the list of common terms from the blog in question using a simple Excel formula. For each title I scored whether it contained at least one of the common terms, then averaged those scores to create an index of semantic prediction – a higher number (with 1 being the highest possible) means that the terms commonly used in the blog are also likely to be found in the titles of papers citing it. Here’s how that stacks up:
- Wiley – 0.70
- Siemens – 0.77
- Connectivism – 0.58
- Weller – 0.58
- Watters – 0.12
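The semantic prediction index described above can be sketched in a few lines of Python (note that substring matching is a simplification of the Excel approach – “open” would also match “reopened”):

```python
def semantic_prediction(titles, blog_terms):
    """Fraction of citing-paper titles containing at least one of the
    blog's common terms (1.0 = every title shares a term)."""
    if not titles:
        return 0.0  # no citing papers, no prediction
    terms = [t.lower() for t in blog_terms]
    hits = sum(
        1 for title in titles
        if any(term in title.lower() for term in terms)
    )
    return hits / len(titles)
```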
Now Petr Knoth and Drahomira Herrmannova postulate that a greater semantic distance between the citing paper and the cited resource implies a greater contribution to knowledge – works that bridge disparate areas of research could be seen as more valuable as new knowledge has been synthesised via a new connection formed (kinda rhizomatic, wouldn’t you say?)
By this measure Audrey’s work can be said to have the greatest contribution to knowledge, with David and George much more likely to be cited in the field in which they write. Whether this tells us anything meaningful in the grand scheme of things is open to question, not least because of the general shonkiness of my methods – this being just a first look which has much scope for improvement.
It also made me think about Cameron Neylon’s concerns about our poor understanding of the nature of citation. Citation is one of those things that seems simple at first thought, but has a huge layer of social and cultural practices built on top of it. Unless we have a better understanding of the reasons for each citation (as quote source, acknowledgement, attribution of ideas or methods, cultural norm in a domain of research…) we can’t really assume that all citations carry equal weight – though most serious citation metrics assume just that. Some of this may be categorisable using Knoth and Herrmannova’s deep text analysis alongside a carefully designed system of categories and indicators (another reason I am watching that project and other citation experiments with huge interest).
So – to answer my initial questions:
- is all the good stuff in the blogs? There is clearly a lot of good stuff in blogs, which is frequently cited by literature that is itself highly cited. I’d love to look at other areas of research with similarly important blogs to compare – any suggestions would be welcome!
- are blogs cited only within the domains they write in? broadly no, though some blogs are more often cited in the domains they most closely identify with than others. Though Audrey’s blog citations showed a low level of semantic prediction, this may be because there were fewer citations overall, and I would like to refine the metric to account for this (possibly via some form of sampling?)
- is this interesting enough to look at further? Absolutely!
Here are the slides from Viv’s and my presentation at OER16