Continual Development of Neural Nets

Machine Learning applied to digitizing documents
There’s an interesting catch-22 in machine learning: to train a neural net to do a particular task, you need enough data that represents that task being done. In some areas, this data is readily available: people freely annotate pictures as cats, dogs, people, and so on. In other areas, you can let the neural net generate its own labels or train it against another neural net. Our task doesn’t fall into either of these categories, because we’re trying to transform an image into a set of labelled categories.

When you’re solving a transformation problem, it’s ideal if you can leverage the output of early iterations of the NN to bootstrap the next version. That output needs a non-trivial amount of validation first, though: if the raw output is used as training data, new versions will repeat the mistakes of older ones. We’ve accumulated a significant amount of annotated data, so our future iterations can be more accurate and more efficient, and will likely surpass the accuracy of human transcribers in certain areas, to the point where manual review becomes unnecessary there.

One of the key insights we’ve applied to developing our Nspect system is that a large portion of the validation can be done by dumb systems. If the good outputs can be separated from the data that needs to be fixed, the human validation efforts are that much more efficient. As others have noted, the biggest advantage of automation is not that it takes humans out of the loop, but that it works as a force multiplier on human work. Efficiency per unit of human labor is the yardstick against which the value of automation can be measured.
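As a rough illustration of what a "dumb" validation pass can look like, here is a minimal sketch that splits transcribed records into those that pass simple rule checks and those that should go to a human reviewer. The field names (`date`, `amount`), the rules, and the `triage` function are assumptions made for this example; they are not Nspect's actual checks.

```python
import re
from datetime import datetime

# Hypothetical rules for hypothetical fields, for illustration only.
def is_valid_date(value):
    """True if the value parses as a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_valid_amount(value):
    """True if the value is digits, optionally followed by two decimals."""
    return re.fullmatch(r"\d+(\.\d{2})?", value) is not None

RULES = {"date": is_valid_date, "amount": is_valid_amount}

def triage(records):
    """Split records into those passing every rule and those needing review."""
    accepted, needs_review = [], []
    for record in records:
        ok = all(rule(record.get(field, "")) for field, rule in RULES.items())
        (accepted if ok else needs_review).append(record)
    return accepted, needs_review

records = [
    {"id": 1, "date": "1947-03-12", "amount": "125.50"},
    {"id": 2, "date": "1947-13-40", "amount": "125.50"},  # impossible date
    {"id": 3, "date": "1952-07-01", "amount": "12S.50"},  # letter misread as digit
]
accepted, needs_review = triage(records)
```

In this sketch only the records in `needs_review` would reach a human, which is where the force-multiplier effect comes from.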

If the validation can be improved by dumb systems, so can the original output. After all, if the efficiency of human work is vastly improved by including a neural net, why not that of simple algorithms? It turns out that by combining advanced machine learning techniques with simple heuristics, we can flag not only the files that require review but also exactly which fields we expect to need review, so that special care is paid to the areas most prone to error. Once the errors are corrected, we can use both values (errors and corrections) to improve the accuracy of the neural net as it learns from its mistakes.
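One way to picture field-level flagging is the sketch below, which combines a model confidence score with a simple heuristic and then pairs each misread value with its human correction. Everything here is hypothetical: the `FieldPrediction` type, the 0.90 threshold, the `year` heuristic, and the sample values are assumptions for illustration, not a description of how Nspect works internally.

```python
from dataclasses import dataclass

@dataclass
class FieldPrediction:
    name: str
    value: str
    confidence: float  # hypothetical model score in [0, 1]

CONFIDENCE_THRESHOLD = 0.90  # assumed cutoff, would be tuned in practice

def looks_like_year(value):
    """Simple heuristic: four decimal digits in a plausible range (assumed)."""
    return value.isdecimal() and len(value) == 4 and 1900 <= int(value) <= 2100

HEURISTICS = {"year": looks_like_year}

def fields_to_review(predictions):
    """Return names of fields with low confidence or a failed heuristic."""
    flagged = []
    for p in predictions:
        check = HEURISTICS.get(p.name)
        fails_check = check is not None and not check(p.value)
        if p.confidence < CONFIDENCE_THRESHOLD or fails_check:
            flagged.append(p.name)
    return flagged

def correction_pairs(predictions, reviewed):
    """Pair each changed prediction with its human-corrected value."""
    return [
        (p.name, p.value, reviewed[p.name])
        for p in predictions
        if p.name in reviewed and reviewed[p.name] != p.value
    ]

predictions = [
    FieldPrediction("name", "J. Smith", 0.97),
    FieldPrediction("year", "19A7", 0.95),        # confident but implausible
    FieldPrediction("town", "Sprngfield", 0.62),  # low confidence
]
flagged = fields_to_review(predictions)
reviewed = {"year": "1947", "town": "Springfield"}
pairs = correction_pairs(predictions, reviewed)
```

The `year` field shows why the two signals complement each other: the model is confident, but the heuristic still catches the misread. The resulting `pairs` are the kind of error-and-correction data that could feed the next training round.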

Using these techniques (alongside some other proprietary techniques), we read nearly all of the fields correctly in one pass. That includes files with coffee stains or stamps obscuring some of the data, files where errors were made in the original and corrections were then written in the margins, and files from the 1940s that follow no standard for how the data is written out.

The more data Nspect consumes from projects, the more accurate it becomes. This allows us to accurately transcribe scribbles from the past century into immediately usable data in record time. If you’d like the next project to be yours, contact us to arrange a demo.

#AI #Machinelearning #Digitizing #nspect #nthds