What The Hell Do You Even Do?


I guess it’s a valid question, considering how this blog started. At a high level, my PhD looked at entity linking as applied to cultural heritage documents. But what does that mean?

An excellent idea that has been implemented over the past several years is the digitization of our archives. Newspapers, census records, letters, diaries, journals, depositions (some hundreds or even thousands of years old) are being photographed or scanned and uploaded into massive digital archives that can be made available online. There are two major advantages to all this work:

  1. These documents are digitally preserved. Should anything ever happen to the original, we at least have a digital copy to fall back on.
  2. Access to these documents is now much easier and more open for everyone. Especially in the case of documents where the original copy is too fragile to be handled, providing a digital copy means that anyone can see the item and read it without endangering the original.

At its simplest, digitization involves storing a picture of the item in question. However, as Optical Character Recognition (OCR) and Handwriting Character Recognition (HCR) models improve, more and more of the actual text on these pages is being transcribed and made available alongside the images. Billions of words which talk about our shared history, all available online for anyone to read. How exciting!

The problem is that it is too much data to traverse without having some way to organise it. Traditional archival methods are all well and good, but what if a computer could provide a more fine-grained structure? What if, for instance, the computer could recognize people and places being discussed in these documents? What if it could link references across collections and archives, creating a web of knowledge derived from text that has travelled from medieval Europe to the present day?

There are numerous ways in which you could achieve this. I looked specifically at the challenge of getting a computer to recognise people and places in 17th-century English-language documents. The computer would read the text, spot sequences of words that looked like an entity, and then attempt to connect these mentions to a graph of entities which could then be used to build links between different texts.
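To make that read, spot, and link pipeline a little more concrete, here is a minimal Python sketch. It is a toy illustration, not a description of the actual research system: the knowledge base, the entity IDs, the example sentences, and the function names (`spot_mentions`, `link_mentions`, `build_links`) are all invented for this post. It assumes that a run of capitalised words is a candidate mention and that linking is a simple lookup of known aliases, which is far cruder than anything you would want for real 17th-century spelling.

```python
import re

# Hypothetical toy knowledge base: entity ID -> known surface forms.
# A real system would use a much larger graph of entities.
KNOWLEDGE_BASE = {
    "person:oliver_cromwell": {"Oliver Cromwell", "Cromwell"},
    "place:dublin": {"Dublin"},
}

# Invert the knowledge base so a lower-cased alias maps to its entity ID.
ALIAS_INDEX = {
    alias.lower(): entity_id
    for entity_id, aliases in KNOWLEDGE_BASE.items()
    for alias in aliases
}

# Naive mention spotter: one or more consecutive capitalised words.
MENTION_PATTERN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def spot_mentions(text):
    """Return candidate entity mentions in the order they appear."""
    return MENTION_PATTERN.findall(text)


def link_mentions(text):
    """Pair each candidate mention with an entity ID, or None if unknown."""
    return [
        (mention, ALIAS_INDEX.get(mention.lower()))
        for mention in spot_mentions(text)
    ]


def build_links(documents):
    """Map each linked entity ID to the set of document IDs mentioning it."""
    links = {}
    for doc_id, text in documents.items():
        for _mention, entity_id in link_mentions(text):
            if entity_id is not None:
                links.setdefault(entity_id, set()).add(doc_id)
    return links


if __name__ == "__main__":
    # Invented example sentences, purely for illustration.
    documents = {
        "doc_a": "Oliver Cromwell arrived in Dublin in August.",
        "doc_b": "The deponent saw Cromwell near Drogheda.",
    }
    for entity_id, doc_ids in sorted(build_links(documents).items()):
        print(entity_id, sorted(doc_ids))
```

Notice that the spotter happily picks up "The" and "August" as candidate mentions, and that "Drogheda" goes unlinked because it is not in the toy knowledge base. Those failure cases (false candidates, missing entities, and spelling variation) are where most of the real difficulty lives.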

As PhDs go, it was a tough one. But I made my contributions, and have continued to expand on the work since my dissertation was submitted. Some day I may write a blog post which talks about this research in detail, but that’s beyond the scope of what I want to talk about here.

Since then I’ve done a lot of work in modelling archival content. I worked extensively on the Virtual Treasury of Ireland, reconstructing 700 years of lost Irish history. That project produced enormous quantities of high-quality text from recovered documents, all of which can now be accessed online and used to do Interesting NLP Stuff™.

I also worked a lot on the identification of misinformation and disinformation online. Here, the kind of work that I do has a slightly different application. We’re not connecting people across history. Rather, we’re trying to spot stories which perpetuate harmful false narratives. Again, things are challenging here, and it was at this point that I made my escape to sea.

Now that I’m looking to return to academia, I’m taking a little bit of time to pick through my previous work and decide which avenues I want to keep pursuing, and which ones I will leave behind. There’s a lot to be talked about while I do this. Not only the research, but also how one builds a career in academia. This is something that I am very much learning as I go along, and this blog may be an appropriate place to document and talk about it.

My plan is to write stream-of-consciousness articles quite regularly (daily, if possible) on this site. The subject of these articles will heavily depend on what I have spent the day doing. Sometimes it may be technical, if I have spent a day working on code. It may be a discussion of a research paper. It may be plans for designing a new teaching module. Whatever the case, I want to document my progress as I attempt to reach the next rung of the academic ladder. That’s no small feat, but I hope that by writing about it here, other people will learn from whatever mistakes and successes I have.