I wanted to work on projects that benefit society, and Google Summer of Code (GSoC) was the perfect opportunity. I worked with Free UK Genealogy, an organization that aims to transcribe family data, and explored Named Entity Recognition (NER) in depth. I built an end-to-end framework that extracts text from scanned wills using OCR, tags the named entities, and seeds the results into a database. A major constraint was the lack of adequate training samples for building an accurate NER model. To address this, I used a bi-LSTM-CRF architecture, which can learn from a small labeled training set combined with unlabeled corpora. The model was based on the paper “Neural Architectures for Named Entity Recognition” (Lample et al., 2016), and I used spaCy’s language model to train the NER system.
A hybrid tagging scheme proved very helpful for labeling the overlapping entity sets present in the data. The application developed through GSoC has enabled the organization to extract structured data from scanned wills from 18th-century England and launch a new database for its users.
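To give a feel for what the CRF layer adds on top of the bi-LSTM, here is a minimal sketch of Viterbi decoding over per-token tag scores. It is an illustration only, not the project's code: the function name `viterbi_decode`, the three-tag BIO set, the example tokens, and every score below are made-up assumptions. In a real bi-LSTM-CRF the emission scores would come from the network and the transition scores would be learned during training.

```python
from typing import Dict, List

def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence."""
    # best[i][t] is the best score of any tag path ending in tag t at token i.
    best = [{t: emissions[0][t] for t in tags}]
    back = []  # back[i-1][t] is the previous tag on that best path
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[i][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical example: all tokens and scores are invented for illustration.
tags = ["O", "B-PER", "I-PER"]
tokens = ["Thomas", "Smith", "yeoman"]
emissions = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.0},  # "Thomas": tagger slightly prefers O
    {"O": 0.0, "B-PER": 0.2, "I-PER": 2.0},  # "Smith": clearly inside a name
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.1},  # "yeoman": not a name
]
transitions = {
    "O":     {"O": 0.0, "B-PER": 0.0,  "I-PER": -10.0},  # I-PER cannot follow O
    "B-PER": {"O": 0.0, "B-PER": -1.0, "I-PER": 1.0},
    "I-PER": {"O": 0.0, "B-PER": 0.0,  "I-PER": 0.5},
}

# Picking the best tag per token independently gives an invalid sequence.
greedy = [max(tags, key=lambda t: e[t]) for e in emissions]
# Decoding the whole sequence with transition scores repairs it.
decoded = viterbi_decode(emissions, transitions, tags)
print(list(zip(tokens, greedy)))
print(list(zip(tokens, decoded)))
```

With these invented numbers, per-token tagging labels "Thomas" as O and "Smith" as I-PER, which is not a valid BIO sequence. Scoring the whole sequence lets the penalty on O followed by I-PER push the decoder toward B-PER, I-PER, O instead. This kind of sequence-level consistency is one reason a CRF output layer can help when labeled data is scarce.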
Project link: https://github.com/FreeUKGen/ProbateParsing