lucasayres/url-feature-extractor

URL Feature Extractor

Extracts features from URLs to build a dataset for machine learning. The goal is to find a machine learning model that predicts phishing URLs targeting the Brazilian population.

This repo includes the implementation of our paper:

Lucas Dantas Gama Ayres, Italo Valcy S Brito and Rodrigo Rocha Gomes e Souza. Using Machine Learning to Automatically Detect Malicious URLs in Brazil. In Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC 2019) - 2019, Gramado - RS - Brazil.

The paper is available here: https://sol.sbc.org.br/index.php/sbrc/article/view/7416

DOI: https://doi.org/10.5753/sbrc.2019.7416

Install

$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt

How to use

Before running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the config.ini file.

Now, run:

$ python run.py <input-urls> <output-dataset>

Features implemented

LEXICAL
| Count (.) in URL | Count (-) in URL | Count (_) in URL | Count (/) in URL |
| Count (?) in URL | Count (=) in URL | Count (@) in URL | Count (&) in URL |
| Count (!) in URL | Count ( ) in URL | Count (~) in URL | Count (,) in URL |
| Count (+) in URL | Count (*) in URL | Count (#) in URL | Count ($) in URL |
| Count (%) in URL | URL Length | TLD amount in URL | Count (.) in Domain |
| Count (-) in Domain | Count (_) in Domain | Count (/) in Domain | Count (?) in Domain |
| Count (=) in Domain | Count (@) in Domain | Count (&) in Domain | Count (!) in Domain |
| Count ( ) in Domain | Count (~) in Domain | Count (,) in Domain | Count (+) in Domain |
| Count (*) in Domain | Count (#) in Domain | Count ($) in Domain | Count (%) in Domain |
| Domain Length | Number of vowels in Domain | URL domain in IP address format | Domain contains the keywords "server" or "client" |
| Count (.) in Directory | Count (-) in Directory | Count (_) in Directory | Count (/) in Directory |
| Count (?) in Directory | Count (=) in Directory | Count (@) in Directory | Count (&) in Directory |
| Count (!) in Directory | Count ( ) in Directory | Count (~) in Directory | Count (,) in Directory |
| Count (+) in Directory | Count (*) in Directory | Count (#) in Directory | Count ($) in Directory |
| Count (%) in Directory | Directory Length | Count (.) in file | Count (-) in file |
| Count (_) in file | Count (/) in file | Count (?) in file | Count (=) in file |
| Count (@) in file | Count (&) in file | Count (!) in file | Count ( ) in file |
| Count (~) in file | Count (,) in file | Count (+) in file | Count (*) in file |
| Count (#) in file | Count ($) in file | Count (%) in file | File length |
| Count (.) in parameters | Count (-) in parameters | Count (_) in parameters | Count (/) in parameters |
| Count (?) in parameters | Count (=) in parameters | Count (@) in parameters | Count (&) in parameters |
| Count (!) in parameters | Count ( ) in parameters | Count (~) in parameters | Count (,) in parameters |
| Count (+) in parameters | Count (*) in parameters | Count (#) in parameters | Count ($) in parameters |
| Count (%) in parameters | Length of parameters | TLD presence in arguments | Number of parameters |
| Email present in URL | File extension |
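
As a rough illustration of the lexical features listed above, the sketch below splits a URL into its components and counts each special character in every component. It is not the code from this repository: the function names, the feature key format, and the exact way the path is split into directory and file are assumptions made for the example.

```python
import ipaddress
import posixpath
from urllib.parse import urlsplit

# Characters counted in each URL component, as listed in the feature table.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"


def split_url(url):
    """Split a URL into the components used by the lexical features.

    Assumption: "directory" and "file" are the head and tail of the path.
    """
    parts = urlsplit(url)
    directory, file_name = posixpath.split(parts.path)
    return {
        "url": url,
        "domain": parts.hostname or "",
        "directory": directory,
        "file": file_name,
        "parameters": parts.query,
    }


def is_ip_address(host):
    """Return True if the host is written as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Compute character counts and lengths for each URL component."""
    components = split_url(url)
    features = {}
    for name, text in components.items():
        for char in SPECIAL_CHARS:
            # Hypothetical key format, e.g. "count(.)_domain".
            features[f"count({char})_{name}"] = text.count(char)
        features[f"length_{name}"] = len(text)
    domain = components["domain"]
    features["vowels_domain"] = sum(c in "aeiou" for c in domain.lower())
    features["domain_is_ip"] = is_ip_address(domain)
    return features


if __name__ == "__main__":
    example = lexical_features("http://login.example.com/a-b/index.php?id=1&x=2")
    print(example["count(.)_domain"], example["length_domain"])
```

Keeping the component split in one helper means every count and length is derived from the same parse of the URL.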
BLACKLIST
| Presence of the URL in blacklists | Presence of the IP address in blacklists | Presence of the domain in blacklists |
HOST
| Presence of the domain in RBL (Real-time Blackhole List) | Domain lookup response time | Domain has SPF? | Geographical location of IP |
| AS Number (or ASN) | PTR of IP | Time (in days) of domain activation | Time (in days) of domain expiration |
| Number of resolved IPs | Number of resolved name servers (NS) | Number of MX servers | Time-to-live (TTL) value associated with hostname |
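
Two of the host features above can be sketched without any network access, assuming the WHOIS dates and DNS TXT records have already been fetched by some other means. The function names and the choice to measure both time spans relative to today are assumptions for illustration, not taken from this repository.

```python
from datetime import date


def domain_age_features(creation_date, expiration_date, today=None):
    """Days since the domain was activated and days until it expires.

    Assumption: both values are measured relative to the current date.
    """
    today = today or date.today()
    return {
        "days_since_activation": (today - creation_date).days,
        "days_until_expiration": (expiration_date - today).days,
    }


def has_spf(txt_records):
    """Return True if any TXT record is an SPF policy (starts with "v=spf1")."""
    return any(record.strip().lower().startswith("v=spf1") for record in txt_records)


if __name__ == "__main__":
    print(domain_age_features(date(2019, 1, 1), date(2020, 1, 1), today=date(2019, 1, 31)))
    print(has_spf(["v=spf1 include:_spf.example.com ~all"]))
```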
OTHERS
| Valid TLS/SSL certificate | Number of redirects | Check if URL is indexed on Google | Check if domain is indexed on Google |
| Uses URL shortener service |
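
The URL shortener feature could be approximated by comparing the host against a list of known shortening services. The list below is a small illustrative sample and the function name is hypothetical; the list actually used by this project may differ.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real list would be much longer.
SHORTENER_DOMAINS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}


def uses_shortener(url):
    """Return True if the URL's host is a known URL shortener service."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_DOMAINS


if __name__ == "__main__":
    print(uses_shortener("https://bit.ly/abc"))
```

Matching on the parsed host rather than on the whole string avoids false positives when a shortener name only appears in the path.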

Contributing

Any contribution is appreciated.

Submitting a Pull Request (PR)

  1. Clone the project:
$ git clone https://github.com/lucasayres/url-feature-extractor.git
  2. Make your changes in a new git branch:
$ git checkout -b my-branch master
  3. Add your changes.

  4. Push your branch to GitHub.

  5. Create a PR to master.
