
Jenifer Dodd: Data Curation: Simple Tools for Starting Projects

Posted on Monday, January 15, 2018 in DH Center Blog, News.

The first step for much digital humanities work is data curation: collecting data, putting it into a format that makes sense for your project, and making sure it doesn’t contain any mistakes. While some DH technologies have steep learning curves, data curation can be done by anyone.

Data Curation: Simple Tools for Starting Projects

ArcGIS, Python, R… These are the kinds of tools that come to mind when people think of the digital humanities. But these are also technologies with fairly steep learning curves, which can put people off from embarking upon DH projects. However, most DH projects start off with much simpler tools—often, nothing more sophisticated than an Excel spreadsheet. In this blog, I’m going to talk about some of the simpler tools that can be beneficial in the early stages of a DH project. I work with text, so these tools will be useful primarily to people who also work with text. The tools in question are used as part of the initial data curation step in a DH project, where you collect and format your data before using other tools to visualize or analyze it.

The first question for a text-based project is often, “Is my text machine-readable?” Machine-readable simply means that the text can be recognized and processed by a computer; if you can copy and paste the text from your PDF into a document, it’s machine-readable. If you’re working with contemporary sources or anything born digital, the answer is probably yes. But if you’re working with older texts (either physical texts or older scanned ones), they may not be.

This leaves you with two options: manually type in the information you need from that text, or OCR the text. OCR stands for “optical character recognition,” and simply means converting an image of text into machine-readable text. There are several free online tools for OCR, and Adobe Acrobat can also be used to OCR text. There is also a program called ABBYY FineReader that allows more control over the OCR process. For instance, with ABBYY FineReader, you can select specific parts of the page to OCR and leave out headers, page numbers, or other extraneous information. You can also OCR graphs and tables. ABBYY FineReader does come with some limitations: it doesn’t work with handwritten texts, it’s not free (though we do have it in the Digital Humanities Lab! Come visit!), and it does have a bit of a learning curve. Here, you might ask yourself how helpful learning the software will be in the long run. Depending on the length of the texts you’re working with, it may be quicker to type the information by hand than to OCR it. However, if you anticipate working extensively with texts that aren’t already machine-readable, taking the time to learn ABBYY FineReader may be well worth it. And again, the learning curve with ABBYY FineReader isn’t as steep as with something like Python; this is software you can learn well enough in a day to OCR most texts reasonably.

Once you have your text, you’ll want to curate it in some way—most likely, in a spreadsheet of some kind. While this step is pretty straightforward, even the most careful scholar will occasionally mistype words (or paste mis-OCR’d words). You may also decide, after entering half your data, that you want to rename an item or make other changes. This is where OpenRefine, an open source application, comes in.

OpenRefine allows you to import a spreadsheet or database and alter the data in various ways. You can see a list of all the values that occur in a given column, spot instances where a frequently used term has been misspelled, and easily correct the misspelling. You can also use OpenRefine to split or combine columns (if, for instance, you imported data that listed something like “Nashville, TN” and want to change that to a city column and a state column), to see rows where a column is empty (you may have missed some data), and so on. Much of this isn’t entirely different from Excel or other spreadsheet or database software. The chief benefits of OpenRefine are that: (1) the software doesn’t store formulas in the cells, which saves users a step, (2) the software makes it easier to step backward and forward through your edits than a program like Excel does, and (3) the software allows you to filter your data on multiple factors at once (using what OpenRefine calls “facets”) and to group similar values together (“clustering”), which makes the data easier to search and alter. OpenRefine also presents data more visually than Excel, which makes it easier to navigate, particularly with larger datasets. OpenRefine has several other features that may be useful for various projects. For instance, you can batch fetch URLs, which may benefit digital humanists working with social media or location data. As with ABBYY FineReader, there’s a slight learning curve here, and you’ll want to play around with OpenRefine for a while to find the methods that work for you and your data. However, this is also software that can be meaningfully figured out in a day or two.
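For readers curious about what these checks amount to under the hood, here is a minimal Python sketch of three of them: listing the values in a column, finding rows where a column is empty, and splitting “Nashville, TN” into a city and a state. This is not OpenRefine’s own code or API, only an illustration using Python’s standard library. The sample data, the column names (title, place), and the function names are all invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the column names are invented for illustration.
SAMPLE_CSV = """title,place
Letter to editor,"Nashville, TN"
Town meeting notes,"Nashvile, TN"
Church bulletin,"Nashville, TN"
Diary entry,
Market report,"Memphis, TN"
"""


def load_rows(text):
    """Read CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def value_counts(rows, column):
    """Count each distinct non-blank value in a column.

    Rare values that closely resemble common ones are often typos.
    """
    return Counter(row[column].strip() for row in rows if row[column].strip())


def blank_rows(rows, column):
    """Return the rows where the given column is empty."""
    return [row for row in rows if not row[column].strip()]


def split_place(rows, column="place"):
    """Return copies of the rows with added 'city' and 'state' keys.

    Assumes values look like 'City, ST'; blank values yield blank parts.
    """
    result = []
    for row in rows:
        city, _, state = row[column].partition(",")
        result.append({**row, "city": city.strip(), "state": state.strip()})
    return result


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    # List every place value with how often it occurs, most common first.
    for value, count in value_counts(rows, "place").most_common():
        print(count, value)
    # Show which entries are missing a place.
    print("Missing place:", [r["title"] for r in blank_rows(rows, "place")])
    # Show the first row after splitting the place column.
    print(split_place(rows)[0])
```

In the sample data, the misspelled “Nashvile, TN” appears once while the correct spelling appears twice, which is the kind of discrepancy a list of column values makes easy to spot. OpenRefine does all of this through a point-and-click interface, so none of this code is needed to use it.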

Once you have a solid dataset, you can move on to visualization or analytic techniques. But the first step is data curation: collecting your data, putting it into a format that makes sense for your project, and making sure your data doesn’t contain any mistakes. While many humanists avoid the digital humanities because they don’t have prior experience with computer programming or software like ArcGIS and assume that every digital humanities project requires the use of such tools, data curation is something that can be done by anyone. If you think you have a project suited for digital work, consider starting out with the tools discussed in this blog and then thinking about ways to visualize or analyze your data once it’s laid out in a comprehensible format. While some projects really require the use of ArcGIS or Python, many can utilize simpler tools once the data has been curated. Others can involve collaboration with scholars who have more technological know-how. There are many ways around the question of learning software or programming languages, but each of these solutions still requires well-curated data.

Letter block images by Leo Reynolds, shared at https://www.flickr.com/people/lwr/ under Creative Commons license.