Is the PDF the cul-de-sac of data?

Martin Pickrodt, Chief Information Officer, Mesitis

The information age has given a new meaning to Francis Bacon’s “Knowledge is Power”. Information and data are becoming ever cheaper to create, store and transmit; data is everywhere and forms the foundation for business decisions, loan approvals and marketing strategies. At Canopy, we need to analyse financial data and statements for aggregation purposes, which is what we will focus on here.

While most companies have embraced this digital age, old habits die hard and we can still find a lot of paper trails: orders, invoices, bank statements, investment analyses et cetera. As a user of the data, this can be very frustrating; we need to process our data for purposes of analysis, validation, accounting or even regulatory requirements.

Therefore, our dream is open standards and the willingness of counterparties to provide us with ‘our data’ in a proper format. Ravi Menon, Managing Director at the Monetary Authority of Singapore (MAS), emphasized in a recent speech the importance of open data standards: common standards help against fragmentation, inefficiency and inconvenience; seamless data sharing will enable a higher quality of data which is free of error and commonly understood. It will also allow the aggregation of data and make it intelligible, meaning machine readable and machine useable.

Yet companies still work with paper or its modern day derivative: the PDF file.

PDF documents are great: every computer can open them and the reader can see exactly what the sender of information intended to show. This is a major advantage over any text document format which may or may not take the liberty of changing the formatting upon opening and wreak havoc on any nicely crafted design. PDFs are therefore stable display platforms avoiding any issues relating to PEBKAC, a popular term describing ‘Problem Exists Between Keyboard And Chair’.

And yet, PDF documents are terrible: they do not fulfil the requirement of being machine readable. While you can open the document for visual inspection or printing, you are not able to further process that information. It is the millstone around the neck of your data: you may have it, but using it in any substantial way is a chore. That is because PDF is a document format and not a data exchange standard. The recipient is therefore stuck with data that is not machine useable.

Why are PDF documents so popular then? Apart from the stability of the viewing experience, senders of information see the non-machine-readability as an advantage. As the data cannot easily be used any further, a perception of safety is created. Comparisons are harder to draw, insights are harder to glean and you will find it difficult to bring your dataset to a competitor. It’s the placeholder for analog technology in a digital world, a neat replacement for paper.

What can be done with the quantities of data that we would like to use? We need to bring it back into a real electronic format, that much is clear. Many companies choose the hard way: manually re-entering the relevant items. Users of large amounts of data, especially when faced with repeat processes like monthly statements, resort to outsourcing: a back office in a remote part of the world that will do the grunt work. This is a solution that will fail the test of true scalability as well as accuracy, not to mention security concerns.

At Canopy, we are heavy users of PDF data and we have an overarching requirement for accuracy as well as privacy. The inevitable solution is then to electronically read out and re-interpret the PDF statement itself. We call it cracking the statement. Extracting the data is the easy part. PDF conversion solutions are plentiful and can do a good job in the initial extraction. From here the hard part starts: this pile of data needs to be structured and transformed. Table headlines and columns need to be recognized, headers and disclaimers need to be ignored and the content of fields needs to be filled. This is not very straightforward, but with a few bright programming minds the formats can be cracked. Once done, there is no more need for the labour-intensive outsourcing solution halfway around the world. Data is machine readable again, errors become extinct and we can establish straight-through processing. Additionally, speed of transformation improves significantly and the number of data points can be increased without worrying if the back office needs expansion. Data becomes information again.
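To make the structuring step more concrete, here is a minimal sketch in Python. It is an illustration only and not a description of the actual Canopy implementation. It assumes the text has already been extracted from the PDF line by line, that columns are separated by two or more spaces, and that the table has three columns. The column names, the noise patterns and the function name parse_statement are all hypothetical.

```python
import re
from decimal import Decimal

# Hypothetical column names; every real statement format defines its own.
HEADER = ("Security", "Quantity", "Market Value")

# Hypothetical markers for lines that carry no data (page headers, disclaimers).
NOISE = re.compile(r"^(Page \d+|Statement of|Disclaimer)", re.IGNORECASE)

# Assumption: the extraction step separates columns by two or more spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def to_number(text):
    """Convert a figure such as '45,300.50' into an exact Decimal."""
    return Decimal(text.replace(",", ""))


def parse_statement(lines):
    """Turn raw extracted text lines into a list of structured records."""
    records = []
    in_table = False
    for raw in lines:
        line = raw.strip()
        if not line or NOISE.match(line):
            continue  # ignore blank lines, page headers and disclaimers
        cells = tuple(COLUMN_GAP.split(line))
        if cells == HEADER:
            in_table = True  # table headline recognized; data rows follow
            continue
        if in_table and len(cells) == len(HEADER):
            name, quantity, value = cells
            # A malformed figure raises an error here instead of being
            # silently accepted, in line with the accuracy requirement.
            records.append({
                "security": name,
                "quantity": to_number(quantity),
                "market_value": to_number(value),
            })
    return records


# Invented sample input for illustration only.
sample = [
    "Statement of Holdings   31 March",
    "Security  Quantity  Market Value",
    "ACME Corp  1,200  45,300.50",
    "Globex Ltd  300  9,870.00",
    "Page 1 of 2",
    "Disclaimer: values are indicative only.",
]

for record in parse_statement(sample):
    print(record)
```

In practice each bank's statement layout would need its own header definition and noise rules, which is where most of the effort in cracking a format goes.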

PDF extraction is only a temporary solution on the path of making data feeds ubiquitous. We hope that companies and especially financial institutions will listen to the gentle nudge from regulators and the call from clients in this matter. Until then, automated extraction is a powerful application for bank customers and beyond. Not only is it highly accurate and fast, it also saves costs.