Leveraging Structured Data for Natural Language Understanding

Structured data and text documents have traditionally lived on separate planets. We help organizations bring the two worlds together using advanced AI methods, tools and technologies.

Searching and analyzing text documents and messages has come a long way since the days of simple full-text search engines. Modern machine learning techniques are being used successfully to capture more of the meaning in natural language texts, identify types of entities, find associations and disambiguate terms far more accurately than ever before.
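To make this concrete, the sketch below uses the open-source spaCy library to type entities in a sentence. It assumes the small English model en_core_web_sm has been downloaded, and it is meant purely as an illustration of the kind of technique described here, not of any particular product of ours.

    import spacy

    # Load a small pretrained English pipeline (assumes:
    # pip install spacy && python -m spacy download en_core_web_sm).
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Apple reported quarterly revenue of $90 billion from Cupertino.")

    # Each recognized span carries a predicted entity type,
    # e.g. ORG, MONEY or GPE.
    for ent in doc.ents:
        print(ent.text, ent.label_)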

The basic principle of these techniques is to learn from what was said before. The more often a term is used, and the more varied the contexts it appears in, the more complete the picture becomes. That is why these methods benefit greatly from very large amounts of text data.
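This distributional idea can be sketched in a few lines with the gensim library. The toy corpus and parameter values below are illustrative only; as the comments note, real applications need vastly more text.

    from gensim.models import Word2Vec

    # A toy corpus; real applications need orders of magnitude more text.
    sentences = [
        ["margin", "improved", "on", "higher", "revenue"],
        ["margin", "declined", "on", "lower", "revenue"],
        ["profit", "improved", "on", "higher", "revenue"],
    ]

    # Every additional context in which a term occurs refines its vector.
    model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)

    # Terms used in similar contexts end up with similar vectors.
    print(model.wv.most_similar("margin", topn=2))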

The problem is that much of the domain knowledge that provides the frame of reference for the employees and customers of a particular organization, or the members of a particular community, is not available in the form of extensive text corpora. Some of it is defined in relatively few authoritative documents such as product documentation, contracts, regulations or technical specifications. A lot of it is formally modelled (or at least somewhat structured) and stored in databases, database schemas, spreadsheets, or communications and document metadata.

Enabling question answering and data analysis spanning all of these heterogeneous sources has traditionally incurred substantial data integration cost. We are developing tools to reduce that cost using advanced AI methods and a pragmatic approach of gradual enhancement, carefully avoiding many of the pitfalls of large data integration projects.
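As a rough sketch of what gradual enhancement can look like, the snippet below uses an existing structured source, a hypothetical products.csv with name and product_id columns, to annotate free text without any upfront schema unification. It illustrates the general idea, not our actual tooling.

    import csv
    import re

    # Load an existing structured source as-is (hypothetical file and columns).
    with open("products.csv", newline="") as f:
        catalog = {row["name"].lower(): row["product_id"]
                   for row in csv.DictReader(f)}

    def tag_products(text):
        """Return (product_id, name) pairs for catalog names found in the text."""
        lowered = text.lower()
        return [(pid, name) for name, pid in catalog.items()
                if re.search(r"\b" + re.escape(name) + r"\b", lowered)]

    # Structured metadata now enriches plain text search results.
    print(tag_products("The warranty covers the Widget X but not the adapter."))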


Finance: Start Making Sense

Financial reporting is one area where expert knowledge has been standardized for decades (IFRS, GAAP) and, more recently, formalized for machine processing (XBRL). We use this wealth of structured data to make sense of the natural language text in SEC filings, earnings calls and market commentary.
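Because XBRL instance documents are XML, pulling out tagged facts needs nothing beyond the Python standard library. The sketch below assumes a local filing.xml and matches only facts in one us-gaap namespace; real filings add contexts, units and dimensions that it ignores.

    import xml.etree.ElementTree as ET

    # The year-versioned US GAAP namespace varies across filings,
    # so this exact URI is an assumption.
    US_GAAP = "{http://fasb.org/us-gaap/2023}"

    tree = ET.parse("filing.xml")  # hypothetical local XBRL instance

    # Each fact is an element whose tag is a namespaced concept name
    # and whose text is the reported value.
    for el in tree.getroot().iter():
        if el.tag.startswith(US_GAAP):
            concept = el.tag[len(US_GAAP):]
            print(concept, el.text)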

Accounting data provides a snapshot of what has happened and how much of it has happened up to a particular point in time. What it does not tell us is why things have happened and what is supposed to happen next. Some of that can be gleaned from the textual parts of regulatory filings and other public communication.

The language used in these texts is of course closely related to the quantitative information in the reports. This goes beyond accounting data: each industry has its own quantitative indicators, such as cost per click in advertising or net interest margin in banking.
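Two such indicators, with made-up figures for illustration: cost per click is advertising spend divided by clicks, and net interest margin relates net interest income to average earning assets.

    # Illustrative figures only.
    ad_spend, clicks = 12_000.0, 8_000
    interest_income, interest_expense = 500.0, 320.0  # in millions
    avg_earning_assets = 9_000.0                      # in millions

    cost_per_click = ad_spend / clicks
    net_interest_margin = (interest_income - interest_expense) / avg_earning_assets

    print(f"CPC: ${cost_per_click:.2f}")      # $1.50
    print(f"NIM: {net_interest_margin:.2%}")  # 2.00%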

These concepts have intrinsic logical properties and relationships that are never directly explained in the reports themselves or in the questions asked by analysts and investors. We use this background knowledge to improve our ability to automatically learn the language around these concepts and the implications of what is being said.
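A deliberately simplistic sketch of what such background knowledge can look like in code: each ratio concept is tied to its parts, so the direction of the ratio can be inferred from statements about the parts. The concept names are hypothetical, and this illustrates the principle rather than our actual model.

    # concept -> (numerator, denominator); hypothetical names.
    RELATIONS = {
        "cost_per_click": ("advertising_spend", "clicks"),
        "net_interest_margin": ("net_interest_income", "earning_assets"),
    }

    def implied_direction(concept, num_change, den_change):
        """Infer how a ratio moves given signed changes (+1, 0, -1) of its parts."""
        numerator, denominator = RELATIONS[concept]
        if num_change > 0 and den_change <= 0:
            return f"{concept} rises ({numerator} up, {denominator} not up)"
        if num_change < 0 and den_change >= 0:
            return f"{concept} falls ({numerator} down, {denominator} not down)"
        return f"{concept}: direction ambiguous"

    # A filing reports higher ad spend on flat clicks: cost per click must
    # have risen, even though no text ever spells out that rule.
    print(implied_direction("cost_per_click", +1, 0))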

In spite of recent advances in AI, we are still far from teaching machines to understand language as well as humans do. But spotting trends in tens of thousands of reports across sectors, industries and geographical regions exceeds our mental capacity and strains our organizational capability.

Topolyte provides analysis tools that help analysts and investors sort through large volumes of regulatory filings and financial commentary at a higher semantic level than has typically been the case.


Training, Consulting and Software Development

We provide training, consulting and software development services in areas related to our core specialties in data science and data engineering.