DataSploit tutorial

DataSploit is a collection of Python scripts that automate open source intelligence searches on domain names, email addresses, IP addresses and usernames.

To use DataSploit, you’ll need:

  • Python 2.7
  • a basic understanding of the command line.

In addition, knowing your way around Python versions, dependencies and virtual environments will help if the scripts throw errors.

Why is DataSploit useful?

DataSploit searches several services at once. This speeds up research, as you don’t have to search each service separately. DataSploit also lets you search several targets in one go.

More …

theHarvester: find email addresses from a domain

theHarvester is a Python script that uses several search engines to find emails matching a certain domain name.

This has several use cases:

  • find emails of a company’s employees, if you know the company’s website.
  • find someone’s email address, if you know their company’s website or their personal website.
  • find the format a company uses for its email addresses. Many companies use a common format for their employees’ addresses, such as surname.name@examplecompany.com. If this is the case, you can easily infer an employee’s email address from their name.
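
The last use case can be sketched in a few lines of Python. This is only an illustration: the function name guess_email and the pattern syntax are made up for this example, and the addresses it produces are guesses that still need to be verified.

```python
def guess_email(first_name, last_name, domain, pattern="{last}.{first}"):
    """Build a candidate address from a name and an assumed address pattern.

    guess_email and the pattern placeholders are hypothetical helpers for
    this example, not part of theHarvester.
    """
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    # {f} is the first initial, for patterns such as "{f}{last}".
    local = pattern.format(first=first, last=last, f=first[:1])
    # Compound names contain spaces, which we replace with hyphens.
    local = local.replace(" ", "-")
    return f"{local}@{domain}"


# Default pattern: surname.name@domain
print(guess_email("Jane", "Doe", "examplecompany.com"))
# Another common pattern: first initial followed by the surname
print(guess_email("Jane", "Doe", "examplecompany.com", pattern="{f}{last}"))
```

Once theHarvester has returned a few real addresses, set the pattern to whatever format they actually follow.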
More …

PDF text extraction

Several tools to extract text from a PDF.

PyPDF2

Install the module with pip.

pip3 install PyPDF2

In Python:

>>> import PyPDF2
>>> pdfFileObj = open('file.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pageObj = pdfReader.getPage(0)
>>> textPage = pageObj.extractText()

See PyPDF2 on Github, docs

Tika

Install with pip

pip3 install tika

Make sure Java is installed:

sudo apt update
sudo apt install openjdk-8-jdk

In Python:

>>> from tika import parser
>>> parsedPDF = parser.from_file("file.pdf")
>>> pdfText = parsedPDF["content"]

See tika-python on Github
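
Extracted text often comes back with many blank lines, and the "content" value may be None when no text could be extracted (for instance with a scanned PDF). Here is a small helper to tidy it up. The name clean_extracted_text is our own, it is not part of tika.

```python
def clean_extracted_text(content):
    """Return the extracted text without blank lines or trailing spaces.

    Hypothetical helper: it accepts None so that a PDF with no
    extractable text simply yields an empty string.
    """
    if content is None:
        return ""
    lines = [line.rstrip() for line in content.splitlines()]
    # Keep only the lines that still contain something, in order.
    return "\n".join(line for line in lines if line)


print(clean_extracted_text("\n\n\nTitle  \n\n\nFirst paragraph.\n"))
```

With the snippet above, you would call it as clean_extracted_text(parsedPDF["content"]).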

PDFMiner

Install with pip (Python 2):

pip install PDFMiner

In bash:

pdf2txt.py -o output.txt file.pdf

See PDFMiner on Github

xpdf / pdftotext

Install with apt:

In bash:

pdftotext file.pdf output.txt

or

pdftotext -layout file.pdf output.txt

See xpdf official website, pdftotext manual
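
To call pdftotext from a Python script, one option is the standard subprocess module. This is a minimal sketch that assumes pdftotext is installed and on your PATH. The function names are our own.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, layout=False):
    """Build the argument list for the pdftotext commands shown above."""
    command = ["pdftotext"]
    if layout:
        # -layout keeps the original physical layout of the text.
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def convert(pdf_path, txt_path, layout=False):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, layout), check=True)


print(pdftotext_command("file.pdf", "output.txt", layout=True))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.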

Getting up-to-date data on French MPs

Say we need to quickly get up-to-date data on French MPs for a project.

Everything we need is on the National Assembly and Senate websites, but neither provides a structured way to get the data. Fortunately, Regards Citoyens provides up-to-date data on both houses of the French Parliament in multiple formats, through several websites.

Info on current lower-house MPs

For example, if we need information on all lower-house MPs currently in office, we get the corresponding JSON data from nosdeputes.fr:

import requests
import json

url = 'https://www.nosdeputes.fr/deputes/enmandat/json'
response = requests.get(url)
response.raise_for_status()
jsonData = json.loads(response.text)

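We can then iterate over the lower-house MPs (577 when every seat is filled) to get the info we want. Looping over the list itself, rather than over a hard-coded range, makes sure no MP is skipped.

for entry in jsonData['deputes']:
	d = entry['depute'] # easier to read
	# rest of the code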

Inside this for loop, we can now access data on each MP and store it in the format we want (CSV or Excel table, Python dictionary, etc.). For example, we access the name of the MP with d['nom']. Have a look at the JSON data in your web browser to find all the available attributes.

Some attributes have multiple values, such as email addresses: many MPs have more than one registered. We can store them in a list, like so:

	mails = []
	for dic in d['emails']:
		newEmail = dic['email']
		mails.append(newEmail)
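
As an illustration of storing the data, here is a sketch that writes each MP's name and email addresses to CSV with the standard csv module. The sample dictionary is invented and only mimics the structure of the real JSON (the 'nom' and 'emails' attributes). The function name mps_to_csv is our own.

```python
import csv
import io

# Invented sample with the same shape as the real data, so the
# example runs without a network connection.
sample = {
    "deputes": [
        {"depute": {"nom": "Jeanne Exemple",
                    "emails": [{"email": "jeanne@example.org"},
                               {"email": "contact@example.org"}]}},
        {"depute": {"nom": "Paul Modèle", "emails": []}},
    ]
}


def mps_to_csv(json_data, out):
    """Write one row per MP: the name, then the addresses joined by ';'."""
    writer = csv.writer(out)
    writer.writerow(["nom", "emails"])
    for entry in json_data["deputes"]:
        d = entry["depute"]
        mails = [dic["email"] for dic in d["emails"]]
        writer.writerow([d["nom"], ";".join(mails)])


buffer = io.StringIO()
mps_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass open('deputes.csv', 'w', newline='', encoding='utf-8') instead of the StringIO buffer.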

Example: fetching all email addresses

Putting all this together, we can write a little script that outputs all the email addresses of current lower-house MPs:

import requests
import json

url = 'https://www.nosdeputes.fr/deputes/enmandat/json'
response = requests.get(url)
response.raise_for_status()
jsonData = json.loads(response.text)

allEmails = [] # We'll store everything in here

# Loop over the list itself so that no MP is skipped
for entry in jsonData['deputes']:
    d = entry['depute']

    for dic in d['emails']:
        newEmail = dic['email']
        allEmails.append(newEmail)

print('\n'.join(allEmails))
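
The same address may show up more than once in such a list, for example if several MPs share a contact address. Here is a possible way to remove duplicates while keeping the original order. The function name and the sample data are invented for this example.

```python
def unique_emails(json_data):
    """Collect every address once, ignoring case and surrounding spaces."""
    seen = set()
    emails = []
    for entry in json_data["deputes"]:
        for dic in entry["depute"]["emails"]:
            address = dic["email"].strip().lower()
            if address and address not in seen:
                seen.add(address)
                emails.append(address)
    return emails


# Invented sample with the same shape as the real JSON data.
sample = {
    "deputes": [
        {"depute": {"emails": [{"email": "Shared@example.org"},
                               {"email": "first@example.org"}]}},
        {"depute": {"emails": [{"email": "shared@example.org "}]}},
    ]
}

print("\n".join(unique_emails(sample)))
```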