NLP: Understanding Unstructured Text

Key Outcomes

  • Demonstrated ability to research deep learning literature and translate findings into value for an internal product offering.
  • Led the design, implementation, and productionisation of a geo-information service, increasing coverage of previously poorly reported data on customer operating locations.
  • Designed and deployed a tool to facilitate manual labelling for named entity recognition tasks. Reduced manual labelling time and effort across the team.

More Information

At Xero, my team focused on developing machine learning systems to enrich our understanding of our customers and enhance business functions across the marketing, sales, product, and platform areas. As part of this, I took ownership of designing, implementing, and productionising a system to extract data describing where our customers operated. The resulting service was integrated into a wider machine learning service the team already provided internally.

Approach

As stated, the main goal of the service was to provide an understanding of where our customers operated. The data containing this information was messy and unstructured, extracted from sources such as free-field form data, websites, and text extracted from images and invoices. The team had an existing deep learning model, written in PyTorch, for running specific classification tasks on a similar dataset. With the goal of identifying locations in passages of text, I adapted this model for named entity recognition tasks, following and adapting the approaches outlined in several academic papers: End-to-end Sequence Labelling via Bi-directional LSTM-CNNs-CRF, Named Entity Recognition with Bidirectional LSTM-CNNs, and Neural Architectures for Named Entity Recognition.
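To give a flavour of the modelling approach, the sketch below shows a minimal BiLSTM sequence tagger in PyTorch. It is illustrative only: layer sizes and the tag set are placeholders, and the character-level CNN and CRF layers described in the cited papers are omitted for brevity.

```python
# Minimal BiLSTM sequence-labelling sketch in PyTorch (illustrative sizes and tags).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> per-token tag scores: (batch, seq_len, num_tags)
        embedded = self.embedding(token_ids)
        lstm_out, _ = self.lstm(embedded)
        return self.classifier(lstm_out)

# Tags follow the usual BIO scheme for a single location entity type.
tags = ["O", "B-LOC", "I-LOC"]
model = BiLSTMTagger(vocab_size=10_000, embed_dim=100, hidden_dim=128, num_tags=len(tags))
scores = model(torch.randint(1, 10_000, (2, 12)))  # two example sentences of 12 tokens
predictions = scores.argmax(dim=-1)                 # predicted tag index per token
```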

This approach required labelled data to train the model on; however, due to the specific nature of the problem, a labelled dataset was not readily available. As such, I designed and deployed an internal solution to allow efficient labelling of training data for named entity recognition tasks. I defined the infrastructure with Terraform, deployed the application on our AWS infrastructure, and built a pipeline that allowed data labelling tasks to be tracked, read from, and saved to Amazon S3.
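As a rough sketch of what reading and writing labelled examples to S3 might look like with boto3, assuming JSON documents and placeholder bucket and key names (not the actual pipeline):

```python
# Hedged sketch: persisting and loading labelled NER examples via S3 with boto3.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-labelling-bucket"  # placeholder bucket name

def save_labelled_example(task_id, tokens, tags):
    """Persist one labelled sentence as a JSON document keyed by task id."""
    record = {"task_id": task_id, "tokens": tokens, "tags": tags}
    s3.put_object(
        Bucket=BUCKET,
        Key=f"ner-labels/{task_id}.json",
        Body=json.dumps(record).encode("utf-8"),
    )

def load_labelled_examples(prefix="ner-labels/"):
    """Iterate over all saved labelled sentences for training."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            yield json.loads(body)
```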

Using transfer learning, the adapted model was fine-tuned on location extraction tasks. The extracted locations were usable, but still of limited value for larger data processing tasks due to the complex nature of addresses, as explored very aptly in this post: Falsehoods programmers believe about addresses. A rules-based parsing approach would be insufficient to cover the breadth of possible cases, especially considering that the target dataset included addresses from countries all over the world. Researching different approaches to this problem uncovered many API solutions, but these would have been too expensive given the scale at which we hoped to run the service. Looking into open-source alternatives, I found libpostal, an open-source C library that uses statistical NLP methods to parse and normalise street addresses from all over the world. Using this, it was possible to retrieve address components such as street number, street name, suburb, and city.
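For illustration, libpostal's Python bindings (the postal package) expose this component extraction directly; the example address and the exact integration details are not from the production service:

```python
# Parse a free-form address into labelled components with libpostal's Python bindings.
from postal.parser import parse_address

components = parse_address("10 Downing Street, Westminster, London SW1A 2AA, UK")
# parse_address returns (value, label) pairs, e.g. ('10', 'house_number'),
# ('downing street', 'road'), ('london', 'city'), ('sw1a 2aa', 'postcode').
parsed = {label: value for value, label in components}
```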

After parsing these address components, the final stage of the service involved making the data available in a format that could be easily ingested to produce insights. Geocoding the addresses would allow for location pinpointing and comparisons with other locations; however, again, a paid API solution would not be cost-effective at scale. Instead, I developed a custom geocoder by creating an Elasticsearch cluster and populating it with address data and associated geo-coordinates from various open-source datasets (for example, OpenStreetMap).
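The sketch below illustrates the custom geocoder idea: an Elasticsearch index holding address strings alongside geo_point coordinates, queried with a text match on the parsed address. The index name, mapping, endpoint, and example document are illustrative assumptions rather than the actual setup:

```python
# Minimal custom-geocoder sketch using the Elasticsearch Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
INDEX = "addresses"

# Index mapping: the address text plus a geo_point for its coordinates.
es.indices.create(index=INDEX, mappings={
    "properties": {
        "address": {"type": "text"},
        "location": {"type": "geo_point"},
    },
})

# Documents would be bulk-loaded from open datasets such as OpenStreetMap;
# a single illustrative document:
es.index(index=INDEX, document={
    "address": "221b baker street london",
    "location": {"lat": 51.5238, "lon": -0.1586},
})

def geocode(address_text):
    """Return the best-matching coordinates for a parsed address string."""
    result = es.search(index=INDEX, query={"match": {"address": address_text}}, size=1)
    hits = result["hits"]["hits"]
    return hits[0]["_source"]["location"] if hits else None
```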

Consequently, this work produced an efficient and cost-effective machine learning pipeline that extracted location information for businesses and retrieved approximate geocoded coordinates, which were saved and made available for analytical and reporting purposes.