Parsing and normalizing street addresses around the world
Street addresses are among the most difficult artifacts of human language for any piece of parsing software. If you have ever tried it yourself, you know what I am talking about. Parsing addresses from around the world, where each country’s addressing system has its own conventions and peculiarities, makes this even more difficult.
Searching the net for solutions to this problem, I stumbled upon libpostal: a multilingual street address parsing/normalization library, written in C, that can handle addresses all over the world.
Libpostal uses machine learning and is informed by tens of millions of real-world addresses from OpenStreetMap. It currently supports normalization in 60 languages and can parse addresses in more than 100 countries.
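To make "normalization" a bit more concrete, here is a toy sketch in plain Java. It is not libpostal and does not use its API: the class name, the method name and the five-entry abbreviation table are all made up for illustration. It only lowercases the input, drops commas and periods, and expands a few English abbreviations, whereas libpostal relies on trained models and large per-language dictionaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a hypothetical class, unrelated to the libpostal API.
class AddressNormalizerSketch {

    // Hypothetical, tiny abbreviation table (English only).
    // A real normalizer needs far larger, language-specific dictionaries.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
            "st", "street",
            "ave", "avenue",
            "rd", "road",
            "blvd", "boulevard",
            "apt", "apartment");

    // Lowercases the address, replaces commas and periods with spaces,
    // and expands every token found in the abbreviation table.
    static String normalize(String address) {
        String cleaned = address.toLowerCase(Locale.ROOT)
                .replaceAll("[.,]", " ")
                .trim();
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            tokens.add(ABBREVIATIONS.getOrDefault(token, token));
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints "781 franklin avenue apartment 4"
        System.out.println(normalize("781 Franklin Ave, Apt. 4"));
    }
}
```

Even this tiny table shows why the problem is hard: "St" can mean "Street" or "Saint" depending on where it appears, and a plain lookup cannot tell the two apart.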
Since my veins are filled with caffeine, I needed a Java JNI bridge to span the platform gap. There are bindings available under jpostal, but having to deal with the compiler setup and the native library handling in your project afterwards makes this an unattractive option.
To ease Java integration there is a great JNI library called JavaCPP, a tool that can not only generate JNI code but also build native wrapper library files from an appropriate interface file written entirely in Java. It can also automatically parse C/C++ header files to produce the required Java interface files. As a topping, it generates a Maven artifact containing the whole setup, including the platform-specific native library and loading capabilities, ready for use in your Maven projects.
I have used some presets of this project several times before, and I believe it is a perfect fit for libpostal, giving me the opportunity to add a completely new preset to this great project.
A big thanks to Samuel Audet for the support he provided throughout the process, supplying the native platform knowledge I was lacking.
The 1.4 release also incorporates the new Systems preset, providing low-level API access for Linux, macOS and Windows.