September 29, 2016

Mining the Big Sky's Big Data


Tammy Troup
Digital Services Manager
Montana Historical Society

Datasets are a treasure trove of information for historians and social scientists who draw on relatively recently developed methods of historical analysis to support theories, develop new interpretations, and think deeply about the implications of patterns. While a blog post is too short to delve deeply into this topic, the MHS wishes to draw attention to the datasets in our collections, and we encourage the use and analysis of big data.
MHS Datasets
MHS recently shared three datasets on the Socrata data portal currently supported by the State Information Technology Services Division (SITSD). Since the software, people, and commitments are outside the organizational control of the MHS, researchers should assume links may change and should prepare citations which note that the dataset is held by the MHS. The MHS will maintain copies of the datasets, will commit to ensuring access, and will provide data accuracy and integrity statements. Datasets are presented under a public domain license, which permits researchers to export, use, and append to the dataset.


Current datasets
Preparing the dataset
Historical datasets can be complicated to develop since historical data is not always structured consistently and handwritten data can be difficult to read. When data is already structured for machine readability, it is fairly easy to map data into new fields, parse information, or aggregate data. Standardized information sets such as a handwritten table are also fairly easy to structure, but unstructured data must be hand-entered and the dataset creator must make decisions about field names, content standards, and normalization. In practical terms, this means that the data in a handwritten ledger (Figure 1) will easily map to a table or XML file (Figure 2). However, the creator of a dataset drawn from military enlistment cards (Figure 3) will need to make the following decisions:
  • Field names – i.e., metadata terms, whether local or drawn from a professional authority.
  • Data content standards – if none are present, a standard will need to be defined or developed. Content standards are simply the rules for data entry which ensure consistency.
  • Data normalization – the process of organizing and cleaning data in order to reduce redundancy.
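The three decisions above can be sketched in a few lines of Python. This is only an illustration, not the process MHS used: the card labels, the standardized field names, and the rule that dates are entered as YYYY-MM-DD are all assumptions made for the example.

```python
import re

# Hypothetical mapping from labels as they might appear on a card to
# standardized field names. These names are illustrative assumptions.
FIELD_NAMES = {
    "Name": "name",
    "Place of Birth": "birthplace",
    "Date Enlisted": "enlistment_date",
}

MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def standardize_date(text):
    """Apply an assumed content standard: dates are stored as YYYY-MM-DD."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(\d{4})\s*", text)
    if not match:
        return text.strip()  # leave unparseable values for manual review
    month = MONTHS.get(match.group(1)[:3].lower())
    if month is None:
        return text.strip()
    return f"{int(match.group(3)):04d}-{month:02d}-{int(match.group(2)):02d}"


def normalize_record(raw):
    """Rename fields, collapse stray whitespace, and standardize dates."""
    record = {}
    for label, value in raw.items():
        field = FIELD_NAMES.get(label.strip(), label.strip())
        value = " ".join(value.split())  # collapse runs of whitespace
        if field == "enlistment_date":
            value = standardize_date(value)
        record[field] = value
    return record


# Invented sample entry, not taken from any MHS record.
card = {"Name": "  Example   Soldier ", "Date Enlisted": "Apr. 6, 1917"}
print(normalize_record(card))
```

Values that do not match the expected pattern are passed through unchanged, so that a person can review them rather than having the script guess.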
Figure 1. Enlistment records from Fort Assiniboine (identified as Assinniboine in original ledger) in a structured table, from MC 46.
Figure 2. Table of data in Excel (left) and structured data in XML format (right)
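As a rough illustration of how readily a transcribed ledger table maps to XML, the following Python sketch converts comma-separated text into XML elements. The column headings, the element names "ledger" and "entry", and the sample row are assumptions made for the example; they are not taken from the MHS files.

```python
import csv
import io
import xml.etree.ElementTree as ET


def table_to_xml(csv_text, root_tag="ledger", row_tag="entry"):
    """Turn each table row into one XML element with a child per column.

    Assumes column headings contain only letters, digits, and spaces,
    so that they can be turned into valid XML element names.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(root, row_tag)
        for column, value in row.items():
            tag = column.strip().lower().replace(" ", "_")
            ET.SubElement(entry, tag).text = value.strip()
    return ET.tostring(root, encoding="unicode")


# Invented sample table, not taken from the ledger in Figure 1.
sample = "Name,Rank\nExample Soldier,Private\n"
print(table_to_xml(sample))
```

Because every row already shares the same columns, no decisions about field names or content rules are needed beyond choosing the element names.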

 







Figure 3. Enlistment card from the digitized Military Enlistments (Montana) 1890-1918



Example Methodology – State Prison Records

The State Prison Records dataset is drawn from digitized prison records which are presented on the Montana Memory Project in the collection Montana State Prison Records, 1869-1974. A team of stalwart volunteers—Marie McAlear and Anthony Schrillo—led by staff member Caitlin Patterson spent eight years digitizing, collecting metadata, and uploading the materials from the highly used public documents. Information about intriguing and unusual cases is recorded elsewhere on this blog. In order to understand larger patterns, though, researchers need access to the dataset created through metadata development.
We normalized the dataset by reducing more than 28,000 lines of metadata to roughly 15,000 unique records, standardized the content, parsed columnar data, and quantified some of the information. Because the metadata is presented as a dataset, researchers may filter on fields such as Crime, Location, Gender, Descent, Occupation, and Religion, and may look for spatial or temporal patterns using Location or Incarceration Date.
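For researchers who export the dataset, the kind of deduplication and filtering described above might look like the following Python sketch. It is not the procedure MHS followed. The field names Crime and Incarceration Date come from the dataset description, while the "Name" field, the YYYY-MM-DD date format, and the sample rows are assumptions made for the example.

```python
from collections import Counter


def deduplicate(rows, key_fields):
    """Keep the first row for each distinct combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(f, "").strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def filter_rows(rows, criteria):
    """Return rows whose fields exactly match every value in criteria."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def count_by_year(rows, date_field="Incarceration Date"):
    """Count rows per year, assuming dates begin with a four-digit year."""
    return Counter(r[date_field][:4] for r in rows if r.get(date_field))


# Invented sample rows, not taken from the State Prison Records dataset.
rows = [
    {"Name": "Example Person A", "Crime": "Burglary", "Incarceration Date": "1901-05-02"},
    {"Name": "example person a ", "Crime": "Burglary", "Incarceration Date": "1901-05-02"},
    {"Name": "Example Person B", "Crime": "Forgery", "Incarceration Date": "1910-11-20"},
]

unique = deduplicate(rows, ["Name", "Incarceration Date"])
print(len(unique))
print(filter_rows(unique, {"Crime": "Forgery"}))
print(count_by_year(unique))
```

Which fields define a "unique record" is itself a research decision; two different people may share a name, so the key fields should be chosen with the original records in hand.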
However, simply filtering for a crime or demonstrating a pattern will result in flabby analysis. Trends identified in datasets need to be comparatively analyzed using state and local demographics, labor and culture statistics, and/or national crime data. Broad patterns of movement and human activity must be known and taken into consideration. Secondary sources should be read in order to understand historical context, original records reviewed, military enlistment cards searched, and newspaper accounts studied. Researchers might even visit the Old Montana Prison and Montana towns to reflect on the social, economic, cultural, and environmental conditions which lead to crime and incarceration. It’s also important to look for the impact of incarceration and perhaps use a network analysis to look for generational trends, recidivism, and the haunting social impact of incarceration.
Big data analysis is a powerful tool for historical research, but it is not an end. Look at the numbers, but feel for the pulse.
Please contact Tammy Troup at ttroup@mt.gov for more information.