The Panama Papers: How did they pull off history's biggest data leak?

Skeptic - Blogger
Find out how Data to Value’s graph data software partners, Neo4j and Linkurious, have been used in the Panama Papers investigation.

Recently there has been a lot of interest in the newly published Panama Papers. This giant trove of data is said to contain a whopping 11.5 million documents, or 2.6TB of data. This completely dwarfs previous leaks like the 1.7GB WikiLeaks scandal or the 30GB Ashley Madison leak. It took two years, more than 400 journalists and cutting-edge technology solutions to process all of this information and gain valuable insight.

The data was leaked from Mossack Fonseca, one of the world’s leading firms in incorporating offshore entities. It was gradually transferred via encrypted chat to a German journalist who worked at the Süddeutsche Zeitung (SZ). The real work began shortly after the data started pouring in: the SZ was not able to make sense of data of that size and got in contact with the International Consortium of Investigative Journalists (ICIJ) to find a way of handling these millions of documents. The ICIJ were very efficient and very prudent when handling this data. The data and its copies were stored on drives encrypted with the open-source software VeraCrypt. Apache Solr was chosen as the main search server, coupled with Apache Tika, a toolkit that detects and extracts metadata and text from over a thousand different file types. This made it possible to search different file types, such as PDFs, Word documents and emails, seamlessly and in near real time. A custom UI built with Blacklight was put on top of the solution for ease of use. Once it was built, each of the more than 400 journalists needed only a link and a randomly generated password to start discovering interesting data.
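To give a rough idea of what searching such an index looks like, here is a minimal sketch in Python that builds a request URL for Solr's standard /select search handler. The host, the core name (leak_documents) and the field names (text, content_type) are assumptions made up for illustration; the article does not say how the ICIJ's index was actually set up.

```python
from urllib.parse import urlencode

# Hypothetical values: the real host, core and schema are not given in the article.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "leak_documents"


def build_select_url(terms, file_type=None, rows=10):
    """Build a URL for Solr's /select handler that searches extracted text."""
    params = [
        # 'text' is an assumed field holding the text that Tika extracted.
        ("q", f"text:({terms})"),
        ("rows", str(rows)),   # maximum number of hits to return
        ("wt", "json"),        # ask Solr for a JSON response
    ]
    if file_type:
        # Filter query on an assumed 'content_type' field; the value is
        # quoted because '/' is a special character in Solr query syntax.
        params.append(("fq", f'content_type:"{file_type}"'))
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url("offshore trust", file_type="application/pdf", rows=5))
```

Fetching that URL from a running Solr server would return the matching documents as JSON. The point is that once Tika has turned PDFs, Word files and emails into plain text and metadata, they can all be searched with the same kind of query.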

To make sense of the highly connected and complex data, the investigators decided to ask for the help of two of our software partners, Neo4j and Linkurious. Using Neo4j, the world’s leading graph database, made it easy to find and analyse complex connections, as graphs use special structures incorporating nodes, properties and edges to define and store data. Linkurious, a graph visualisation platform, helped the journalists to navigate through this ocean of data, uncovering unique insights into the offshore banking world and showing the relationships between banks, clients, offshore companies and their lawyers.
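The idea of nodes, properties and edges can be sketched in a few lines of plain Python. This is only a toy in-memory model with entirely made-up names and relationship types, not Neo4j's API and not data from the leak; it just shows how a question like "how is this client connected to this bank?" becomes a walk along edges.

```python
from collections import deque

# Nodes: an id mapped to its properties (all values are invented).
nodes = {
    "p1": {"label": "Person", "name": "Client A"},
    "c1": {"label": "Company", "name": "Shell Co 1"},
    "l1": {"label": "LawFirm", "name": "Law Firm X"},
    "b1": {"label": "Bank", "name": "Bank Y"},
}

# Edges: (source id, relationship type, target id).
edges = [
    ("p1", "SHAREHOLDER_OF", "c1"),
    ("l1", "REGISTERED", "c1"),
    ("b1", "CLIENT_OF", "l1"),
]


def neighbours(node_id):
    """Yield every node joined to node_id, ignoring edge direction."""
    for src, _rel, dst in edges:
        if src == node_id:
            yield dst
        elif dst == node_id:
            yield src


def shortest_path(start, goal):
    """Breadth-first search for the shortest chain of connections."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two nodes are not connected


if __name__ == "__main__":
    path = shortest_path("p1", "b1")
    print(" - ".join(nodes[n]["name"] for n in path))
```

In Neo4j itself a question like this would be written in its query language, Cypher, rather than as hand-written traversal code, and a tool like Linkurious would draw the resulting path instead of printing it.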

The entire dataset of the Panama Papers is expected to be released in early May.




Prime Minister (20k+ posts)
The leak so far is perhaps 10% of the total. As noted above, the entire dataset will be out in May. Can't wait.