SePass: Semantic Password Guessing using k-nn Similarity Search in Word Embeddings

A password guessing method that utilizes word embeddings to discover and exploit semantic correlations in password lists.

We introduce SePass, a novel password guessing method that uses word embeddings to discover and exploit semantic correlations in password lists. The tool is built for research purposes and is intended as a basis for further research. It is therefore still a work in progress and not yet designed for usability.

Overview

Commonly used password guessing tools work with password leaks and generate candidates from these lists using handcrafted or inferred rules. These methods are often limited in their ability to produce entirely novel passwords based on vocabulary that does not appear in the given password lists. SePass uses word embeddings to discover and exploit semantic correlations so that it can deliberately guess novel base words for passwords.
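
To make the idea of a k-NN similarity search in an embedding space more concrete, here is a minimal sketch in plain Python. It is not SePass's implementation: the toy vectors, the function names and the `restrict_vocab` handling are assumptions made for illustration, and real models such as fastText have hundreds of dimensions and are queried through gensim rather than hand-written code.

```python
import math

# Toy 3-dimensional "embeddings" with made-up values, ordered from the most
# to the least frequent word (an assumption that mirrors frequency-sorted models).
TOY_VECTORS = {
    "dog":    [0.9, 0.1, 0.0],
    "cat":    [0.8, 0.2, 0.1],
    "summer": [0.0, 0.9, 0.3],
    "winter": [0.1, 0.8, 0.4],
    "puppy":  [0.95, 0.05, 0.05],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbours(word, vectors, k=2, restrict_vocab=None):
    """Return the k words most similar to `word`, best match first.

    restrict_vocab limits the search to the first (most frequent) entries;
    None searches the whole vocabulary.
    """
    candidates = list(vectors)[:restrict_vocab]
    query = vectors[word]
    scored = [(cosine(query, vectors[w]), w) for w in candidates if w != word]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


# A base word found in a password list ("dog") is expanded to related words
# that may never have appeared in the list itself.
print(nearest_neighbours("dog", TOY_VECTORS, k=2))                    # ['puppy', 'cat']
print(nearest_neighbours("dog", TOY_VECTORS, k=2, restrict_vocab=4))  # ['cat', 'winter']
```

The second call shows the effect of restricting the vocabulary: "puppy" is the closest neighbour, but it falls outside the four most frequent words and is therefore never considered.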

License

Installation

We suggest running SePass on a Unix system and installing its requirements in a virtual environment. You can create a Python virtual environment as follows:

python -m venv env
source env/bin/activate

Afterwards, install the required libraries:

pip install -r requirements.txt

Rule detection requires the Enchant C library to be installed:

apt install enchant-2-2

To run SePass you need pretrained word embeddings that are compatible with gensim. We suggest using the fastText models for 157 languages, which let you choose the languages to use. The models can be downloaded here.

Reproducibility

If you want to reproduce the results from our corresponding paper (unpublished, in review), you can find detailed instructions and scripts in the evaluation folder and our training and test password lists in the data folder.

Usage

SePass looks for the gensim models in a directory named models, located in the same directory as SePass.py.
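
Before starting a long run, it can help to check that the models you plan to pass are actually present in that directory. The following is a small sketch of such a check; the helper `missing_models` and the model names are hypothetical and not part of SePass.

```python
import tempfile
from pathlib import Path


def missing_models(script_dir, model_names):
    """Return the requested model names that are absent from <script_dir>/models."""
    models_dir = Path(script_dir) / "models"
    return [name for name in model_names if not (models_dir / name).exists()]


# Demo in a throwaway directory that stands in for the SePass checkout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "models").mkdir()
    (Path(tmp) / "models" / "model_en").touch()  # hypothetical model file
    print(missing_models(tmp, ["model_en", "model_de"]))  # ['model_de']
```

In practice you would pass the directory that contains SePass.py as `script_dir`.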

 SePass.py [-h] [--debug] [-len LIST_LENGTH]
               [--models MODELS [MODELS ...]] [-mwl MIN_WORD_LENGTH]
               [-o OUT] [-rr RELEVANT_RULESET_RATIO] [-sr SEMANTIC_RATING]
               [-rv RESTRICT_VOCAB] [--mode MODE]
               pwlist
               
positional arguments:
  pwlist                path of password list to be analyzed

optional arguments:
  -h, --help            show this help message and exit
  --debug               print debug messages
  -len LIST_LENGTH, --list_length LIST_LENGTH
                        number of password suggestions to be created.
                        Default=1000000
  --models MODELS [MODELS ...]
                        name of the fasttext word embeddings models to use
                        (stored in models/ folder)
  -mwl MIN_WORD_LENGTH, --min_word_length MIN_WORD_LENGTH
                        length of the smallest word that should be searched
                        for. Default=4.
  -o OUT, --out OUT     Output path
  -rr RELEVANT_RULESET_RATIO, --relevant_ruleset_ratio RELEVANT_RULESET_RATIO
                        this parameter will determine the percentage of
                        (different) rules to be taken into account.
                        Default=0.1
  -sr SEMANTIC_RATING, --semantic_rating SEMANTIC_RATING
                        higher semantic rating will increase the influence of
                        word semantic in sorting process
  -rv RESTRICT_VOCAB, --restrict_vocab RESTRICT_VOCAB
                        this parameter restricts the number of words of each
                        model to the most frequent k
  --mode MODE           this parameter defines the mode for semantic
                        expansion. Default = k-NN
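
As a reading aid for the options above, the following sketch rebuilds an equivalent parser with argparse and parses a sample command line. It is reconstructed from the help text only and is not taken from SePass.py: the argument types are assumptions, the defaults for options whose default is not stated above are left as None, and the file names in the example are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented SePass options (types assumed)."""
    p = argparse.ArgumentParser(prog="SePass.py")
    p.add_argument("pwlist", help="path of password list to be analyzed")
    p.add_argument("--debug", action="store_true", help="print debug messages")
    p.add_argument("-len", "--list_length", type=int, default=1000000,
                   help="number of password suggestions to be created")
    p.add_argument("--models", nargs="+",
                   help="names of the word embedding models in the models/ folder")
    p.add_argument("-mwl", "--min_word_length", type=int, default=4,
                   help="length of the smallest word that should be searched for")
    p.add_argument("-o", "--out", help="output path")
    p.add_argument("-rr", "--relevant_ruleset_ratio", type=float, default=0.1,
                   help="share of (different) rules to be taken into account")
    p.add_argument("-sr", "--semantic_rating", type=float,
                   help="weight of word semantics in the sorting process")
    p.add_argument("-rv", "--restrict_vocab", type=int,
                   help="use only the k most frequent words of each model")
    p.add_argument("--mode", default="k-NN", help="mode for semantic expansion")
    return p


# Sample invocation: analyse leak.txt with two models and ask for 500 candidates.
args = build_parser().parse_args(
    ["leak.txt", "-len", "500", "-rr", "0.2", "-o", "candidates.txt",
     "--models", "cc.en.300.bin", "cc.de.300.bin"]
)
print(args.pwlist, args.list_length, args.models, args.mode)
```

Options that are not given on the command line fall back to the documented defaults, for example a minimum word length of 4 and the k-NN expansion mode.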
