Elasticsearch and duplicate documents

I often run into duplicates when working with legacy systems. Many legacy systems are batch-oriented by nature; processing and indexing data in real time, or streaming it, were not generally available when they were built. When you want to interface these systems to real-time streaming services, you hit complex issues with polling and duplicate data. For example, a poll of a legacy system might fail, so you have to deal with duplicates on the next pass. I come across this often when moving legacy data onto a common backplane, an important metaphor for a big data project: making the data available to everyone through common tools.

I just had another issue come up with all the properties above, and I was surprised to find that Elasticsearch has no automatic dedup ability. This seems like a major deficiency given how much duplicate data is out there. Advanced file systems like ZFS have dedup and similar block-reuse features to deal with this common issue. I started looking at writing my own and was disappointed with the common approaches, which are rooted in the past: scan the index and delete the duplicate items. Dedup would be a good addition to Elasticsearch.

Instead, I put in the following hack, in the pejorative sense, to solve the problem. I compute a hash of the object using Java's hashCode() method, which has an obvious weakness: it only produces 2^32 values, a small enough space that I will eventually lose data to collisions. I was happy to see that the hashCode() algorithm is published, so it is not specific to Java. I took this route because I did not want another external dependency on yet another library; I think computer languages should include hashing as a built-in in the future. I decreased the possibility of collisions by adding a field to the hash input that is fairly specific to the record being hashed. I still have concerns about whether the .toString() method would return the same string in other languages.

Simple things get so complicated due to products being built without enough attention to how users will use them. Duplicate data is a given in any Big Data project.
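Here is a minimal sketch of the hack, under assumptions that are mine rather than from any library: the legacy record has already been flattened to a string, and a hypothetical record-specific key (an account number here) is prepended to shrink the effective collision space of the 32-bit hash. The hashCode() algorithm is re-implemented by hand to show it is portable to other languages. The resulting ID would be supplied as the Elasticsearch document _id, so that re-indexing the same record after a failed poll overwrites the existing document instead of creating a duplicate.

    public class DedupId {

        // Same algorithm published for java.lang.String#hashCode():
        //   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
        // Written out by hand to make the point that it is not
        // specific to Java and can be reproduced in any language.
        static int stringHash(String s) {
            int h = 0;
            for (int i = 0; i < s.length(); i++) {
                h = 31 * h + s.charAt(i);
            }
            return h;
        }

        // Build a deterministic document ID from a record-specific
        // field plus the payload hash, reducing collisions as
        // described above. Field names here are hypothetical.
        static String documentId(String recordKey, String payload) {
            return recordKey + "-" + Integer.toHexString(stringHash(payload));
        }

        public static void main(String[] args) {
            // Hypothetical flattened legacy record.
            String payload = "acct=42|ts=2015-06-01T00:00:00Z|amount=10.00";
            String id = documentId("42", payload);

            // Indexing with an explicit _id makes a replayed poll
            // idempotent, e.g. via the REST API:
            //   PUT /legacy-data/_doc/{id}  { ...record as JSON... }
            System.out.println(id);
        }
    }

The design point is that indexing with an explicit _id is an overwrite in Elasticsearch, so dedup falls out of choosing a deterministic ID rather than scanning and deleting after the fact.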
