About Muninn WWI
The Muninn Project is a multidisciplinary, multinational academic research project investigating millions of records pertaining to the First World War in archives around the world. Our aim is to take as many digital images of documents as we can, extract the written data using advanced computer technology, and turn the resulting information into structured databases that support a number of different research agendas. For an idea of how all these elements fit together, please see this flow chart.

Muninn is currently at an early stage. We are still adding members to our team and applying for the funds we need to begin the process of document data extraction. Having said that, our progress to date has been very promising. We have already assembled an international team of experts from a broad range of fields, and we will shortly be in a position to announce the first of our data acquisition deals, which should give us access to many terabytes of digitised documentary sources.

Documentary Sources

Our initial goal is to extract the raw data from every documentary source relating to the First World War that we can find, so long as it fulfils two criteria: a) it must be available as a digital image, that is, a photograph or scanned image of the document, and b) it must be written on some kind of standard, pre-printed form. In practice, this mostly means personnel records, although we are very interested in other sources such as regimental diaries, deck logs, muster rolls and so on. If you want to see what these look like, we strongly suggest you go to the National Archives of Australia's excellent web page Mapping Our Anzacs, where you will find every Australian WWI service record in beautiful, crisp scans. These kinds of records are ideal for a study such as ours.

Data Extraction

These images of documentary sources will be electronically read using the Sharcnet computer array in Southern Ontario.
Sharcnet is very large: the main cluster is, we're told, the size of a good-sized playing field, with its own electrical substation and thousands of rack-mounted computers working together. Nevertheless, even with this phenomenal computing power at its disposal, the project will still need to confront two significant hurdles.

The first is that a very high percentage of our documents are manuscript, that is, handwritten as opposed to typescript. The optical character recognition of handwritten documents is vastly more difficult than the equivalent process for typescript.

The second problem is simply one of scale. The past few years have seen major digitisation initiatives on the part of archives around the world. This means that the documents we hope to process will number in the tens of millions. We are very fortunate in that we are probably going to be given data processing equipment which can handle this volume of data. But simply working through the logistics of moving archive collections onto hard drives and having them shipped to a data centre is an interesting problem. We estimate that the Australian data mentioned above is in the neighbourhood of seven terabytes in size, and the Australian archive is one of the smaller ones.

“Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”
Tanenbaum, Andrew S. (1996). Computer Networks. New Jersey: Prentice-Hall. p. 83.

Information Organisation

The data we extract from each document will only tell a tiny part of a much bigger story. We plan on rolling all this raw information together into a series of huge, complex database products which will be the basis for all the social science and humanities research to come out of the project. These databases will not just be indices indicating what data lies within each document. They will also use intelligent cross-referencing to add information to each document.
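
One way to picture this kind of cross-referencing is as a join between two extracted tables: periods of service taken from personnel records, and dated location entries taken from unit diaries. The sketch below is only an illustration. The record types, field names (`person`, `unit`, `start`, `end`, `day`, `location`), the `locate` function and the sample rows are all hypothetical, and it does not describe the project's actual schema or method.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ServicePeriod:
    """Hypothetical row extracted from a personnel record."""
    person: str
    unit: str
    start: date
    end: date


@dataclass(frozen=True)
class DiaryEntry:
    """Hypothetical row extracted from a unit diary."""
    unit: str
    day: date
    location: str


def locate(periods, diary):
    """Return sorted (person, day, location) tuples for every diary entry
    that falls inside a person's period of service with the same unit."""
    # Index the diary entries by unit so each service period only
    # scans the entries for its own unit.
    by_unit = {}
    for entry in diary:
        by_unit.setdefault(entry.unit, []).append(entry)

    placed = []
    for period in periods:
        for entry in by_unit.get(period.unit, []):
            if period.start <= entry.day <= period.end:
                placed.append((period.person, entry.day, entry.location))
    return sorted(placed)


# Invented sample data, for illustration only.
periods = [
    ServicePeriod("Person A", "Unit X", date(1916, 1, 1), date(1916, 6, 30)),
    ServicePeriod("Person B", "Unit X", date(1916, 5, 1), date(1916, 12, 31)),
]
diary = [
    DiaryEntry("Unit X", date(1916, 3, 10), "Location 1"),
    DiaryEntry("Unit X", date(1916, 5, 20), "Location 2"),
    DiaryEntry("Unit Y", date(1916, 5, 20), "Location 3"),
]

for person, day, location in locate(periods, diary):
    print(person, day.isoformat(), location)
```

With machine-extracted data, a real join would also have to cope with misread names, dates and unit designations, which this sketch ignores.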
For example, it might be possible to combine personnel records with data from unit diaries to get information on which individuals were in which location at a given point in time. This ability to combine locational, organisational and personal data will allow us to model a whole series of interesting research questions, from the shapes of the institutions themselves to how people, resources, and even diseases flowed through those institutions and across the landscapes of space and time.

Research Projects

Building our data sets is, of course, only the first step. The extracted Muninn WWI datasets will be made available to any researcher who wants to use them, but we are also undertaking a series of 'official Muninn' projects which will be funded through the Muninn grants, and which will begin while the process of data extraction is still taking place.

These initial research projects are an integral part of the Muninn WWI project, each equal in importance to our development work on data extraction and organisation. But they will also feed back into the data extraction work in a virtuous circle. Our humanities and social science researchers will be able to help guide the computer scientists and statisticians in producing the highest quality and most useful information that they can. In turn, the scientists will provide the arts researchers with the raw material for countless exciting advances.

The current list of Muninn sub-projects is as follows. But keep in mind that we are still very open to new ideas and proposals.
For more information, see our research page here.

Problems and Limitations

In order to conduct our research accurately and responsibly, we must be fully aware of the limitations of our sources. In our case, there are two principal limitations.

1. Data Extraction Systems of Limited Accuracy

No computer can read a handwritten document as accurately as a human being, and our data extraction process cannot hope to give us a completely accurate reading of the data in any given document. We estimate that, at best, we can currently achieve only 80% accuracy. Having said that, 80% of millions is still a lot. Our partially accurate data set will, we hope, be good enough to look at medium- and large-scale phenomena such as epidemics and institutions. It would not be appropriate for small-scale studies, such as histories of individual people or of military units below the battalion level.

There are some tricks that we can use to raise the accuracy a little. For example, we hope to get access to some hand-entered data which has been compiled, over the years, by historians making indices of historical documents the old-fashioned way. Our work will never be able to replace these hand-coded datasets, but we might be able to use them to train the computer to recognise information more accurately by allowing it to compare its 'guess' as to what a given document contains against a known 'right answer'.

2. Non-Availability of Documents Due to Destruction

The second big problem is the destruction of documents. British documents, in particular, have been subject to heavy attrition, both from administrative destruction and from bomb damage sustained during the Blitz. This greatly reduces the pool of available documents, and makes some forms of social and demographic history difficult.
This destruction, combined with the inaccuracy of the data extraction process and the non-availability of some records which have yet to be digitised, means that many possible research goals will never be realised. We must always keep in mind that this project necessarily works with a sample of the available documentation, not the whole corpus. The sample may be very large indeed, but it is still a sample, and our research must bear this in mind when we examine possible biases in the documentary record.

Our Next Steps

Much work remains to be accomplished before Muninn WWI can become a fully-fledged working project. We are still interested in speaking to researchers who might find these data sources useful to their work. In particular, we are very interested in making further contacts within the community of WWI historians. As historians are less likely than researchers in other disciplines to be interested in forming research units, we are experimenting with the idea of establishing a 'committee of advisers': historical researchers who can observe the data as it is produced without needing to be integrated into the research project in a way that would impair their research independence.

Funding

We are also seeking the funding which will make much of the project possible. On 15 July 2009, we will be submitting a grant application to the tri-national Digging into Data competition. If our application is successful, it would allow us to begin work in academic year 2010-11. However, there is much to do and we are anxious to start sooner if we can. We are, therefore, also in the process of seeking out seed money to begin our work in the upcoming academic year (2009-10).