This project uses a Bi-directional LSTM model πŸ“§πŸ€– to classify emails as spam or legitimate, utilizing NLP techniques like tokenization, padding, and stopword removal. It aims to create an effective email classifier πŸ’»πŸ“Š while addressing overfitting with strategies like early stopping 🚫.

πŸ“§ Spam Email Classification using LSTM πŸ€–

This project was an exciting opportunity to apply what I learned in a Deep Learning course, where I tackled the problem of classifying Spam and Legitimate (Ham) emails using Long Short-Term Memory (LSTM) networks. Throughout this journey, I followed a hands-on approach to learn Natural Language Processing (NLP) techniques and explore how Deep Learning can be used for real-world text classification problems.

The goal was to build a classifier that could effectively distinguish between spam and legitimate emails based on the content of the messages. To begin, I spent a lot of time understanding the process of data preprocessing β€” cleaning and transforming the raw email text so that the LSTM model could make sense of it. This included steps like tokenization, stopword removal, and stemming.

Once the data was ready, I created and trained an LSTM model to learn from the patterns and context in the emails. Along the way, I fine-tuned the model to improve its accuracy and efficiency. Throughout this project, I learned the importance of hyperparameter tuning, the use of embedding layers for text data, and the significance of recurrent neural networks (RNNs) for sequence-based problems.

The project was heavily influenced by a fantastic Deep Learning course I followed from Santiago HernΓ‘ndez, an expert in Cybersecurity and AI. His teachings provided me with the foundational knowledge needed to build the neural network architecture for this project.

By the end of the project, the LSTM model was able to predict whether an email was spam or legitimate with a high level of accuracy, and I gained invaluable experience in text classification, Deep Learning, and NLP techniques.

πŸ™ I would like to extend my heartfelt gratitude to Santiago HernΓ‘ndez, an expert in Cybersecurity and Artificial Intelligence. His incredible course on Deep Learning, available at Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.

I would also like to express my gratitude to purusinghvi for creating and sharing the Spam Email Classification Dataset - Combined Spam Email CSV of 2007 TREC Public Spam Corpus and Enron-Spam Dataset on Kaggle. This dataset, which contains detailed information about spam and legitimate emails, has been invaluable in building and training the machine learning model for spam detection.

🌟 The dataset can be found on Kaggle. Your contribution is greatly appreciated! πŸ™Œ

πŸ“Œ Additionally, this project was inspired by amazing contributions from the Kaggle community.

⚠️ Disclaimer

This project was developed for learning and research purposes only. It is an educational exercise aimed at exploring Natural Language Processing (NLP) techniques and Deep Learning modelsβ€”specifically Long Short-Term Memory (LSTM) networksβ€”for spam email classification.

The model and findings presented in this project should not be used for real-world email filtering or commercial applications, as they have not been rigorously tested for deployment. Additionally, this project leverages publicly available datasets and references existing research contributions for educational insights.

🌟 Explore My Other Cutting-Edge AI Projects! 🌟

If you found this project intriguing, I invite you to check out my other AI and machine learning initiatives, where I tackle real-world challenges across various domains.

Key Steps in the Process πŸ› οΈ

1. Data Collection & Preprocessing πŸ“Š

  • Loading the Data: The dataset consists of emails labeled as Spam (1) or Legitimate (0).
  • Text Normalization: We start by converting text to lowercase and removing unnecessary characters, such as numbers, punctuation, and special symbols.
  • Stopword Removal: Common words that do not contribute to meaningful classification (like "the", "and", etc.) are removed.
  • Hyperlink Removal: URLs and hyperlinks in the text are deleted as they do not provide useful information for classification.
  • Tokenization: We split the email text into individual words (tokens) for easier processing.

2. Exploratory Data Analysis (EDA) πŸ”

  • Visualizing the Data: The notebook includes visualizations such as word clouds and n-gram analysis, which help in understanding the most common terms used in spam and legitimate emails.
  • Class Distribution: The dataset is explored to understand the distribution of spam vs. legitimate emails, which helps in deciding model evaluation strategies.

3. Feature Engineering βš™οΈ

  • Text Tokenization: The email text is tokenized into sequences, and the vocabulary is built.
  • Padding: As the text data varies in length, padding is applied to ensure that all input sequences have the same size, making them suitable for model input.
  • Label Encoding: The target labels (spam or legitimate) are encoded into numeric values (0 or 1) using LabelEncoder.

4. Model Construction πŸ—οΈ

  • Bi-directional LSTM: We use a Bi-directional LSTM model to process the sequence of words in both forward and backward directions. This helps capture contextual information from both past and future words.
  • Dense Layer: A fully connected layer with ReLU activation is added to capture non-linear relationships between features.
  • Dropout: A dropout layer is included to prevent overfitting and help the model generalize better.

5. Model Training πŸš€

  • The model is trained on the preprocessed training data using binary cross-entropy loss and the Adam optimizer.
  • Early Stopping is implemented to monitor the validation loss and stop training once the model starts overfitting.
  • Evaluation: The model is evaluated on a separate test set to determine its accuracy and ability to generalize to unseen data.

6. Model Evaluation and Results πŸ“Š

  • Training Metrics: The model's performance during training is tracked by monitoring the loss and accuracy.
  • Validation Metrics: The validation loss and accuracy provide insight into how well the model generalizes.
  • Overfitting: If the validation accuracy starts to drop while training accuracy continues to rise, it indicates overfitting. This is addressed by using techniques like early stopping.

Goals of the Project 🎯

  • To classify emails as Spam or Legitimate using deep learning.
  • To explore NLP techniques for text preprocessing and sequence classification.
  • To evaluate the model's performance on both training and validation sets, and improve it through strategies like early stopping and dropout.

Results πŸ“ˆ

  • The model’s performance on the training data is typically high, with 99% accuracy.
  • On the validation data, accuracy usually reaches around 97%, though slight fluctuations are observed due to overfitting.

Conclusion πŸŽ“

By the end of this project, you will have a functional Bi-LSTM model for spam email classification that can be further fine-tuned, deployed, or integrated into a larger system for filtering unwanted emails. Techniques like early stopping are crucial to prevent overfitting and ensure the model’s generalizability.
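
As a closing illustration, a hypothetical `classify_email` helper shows how the trained pieces could be wired together for inference; it assumes the `clean_email`, `tokenizer`, `MAX_LEN`, and `model` objects sketched in the earlier steps.

```python
def classify_email(raw_text: str, threshold: float = 0.5) -> str:
    """Run one raw email through the same preprocessing pipeline and the trained model."""
    tokens = clean_email(raw_text)                        # reuse the cleaning helper above
    seq = tokenizer.texts_to_sequences([" ".join(tokens)])
    padded = pad_sequences(seq, maxlen=MAX_LEN, padding="post", truncating="post")
    spam_prob = float(model.predict(padded, verbose=0)[0][0])
    label = "spam" if spam_prob >= threshold else "legitimate"
    return f"{label} (p={spam_prob:.2f})"

print(classify_email("Congratulations! You won a FREE cruise, click http://claim.example.com now"))
```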

πŸ™ Acknowledgments

A huge thank you to purusinghvi for providing the dataset that made this project possible! 🌟 The dataset can be found on Kaggle. Your contribution is greatly appreciated! πŸ™Œ

πŸ™ I would like to extend my heartfelt gratitude to Santiago HernΓ‘ndez, an expert in Cybersecurity and Artificial Intelligence. His incredible course on Deep Learning, available at Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.

πŸ“Œ This project was also inspired by amazing contributions from the Kaggle community; their shared notebooks and ideas were incredibly helpful in refining and optimizing this classifier. Thank you! πŸ™πŸš€

License βš–οΈ

This project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. πŸ› οΈ This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice is retained. πŸ“„

Please note the following limitations:

  • The software is provided "as is", without any warranties, express or implied. πŸš«πŸ›‘οΈ
  • If you distribute the software, whether in original or modified form, you must include the original copyright notice and license. πŸ“‘
  • The license allows for commercial use, but you cannot claim ownership over the software itself. 🏷️

The goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.

MIT License

Copyright (c) 2024 Dream software - Sergio SΓ‘nchez 

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
