This project was an exciting opportunity to apply what I learned in a Deep Learning course by tackling the problem of classifying Spam and Legitimate (Ham) emails using Long Short-Term Memory (LSTM) networks. Throughout this journey, I followed a hands-on approach to learning Natural Language Processing (NLP) techniques and exploring how Deep Learning can be used for real-world text classification problems.
The goal was to build a classifier that could effectively distinguish between spam and legitimate emails based on the content of the messages. To begin, I spent a lot of time understanding the process of data preprocessing: cleaning and transforming the raw email text so that the LSTM model could make sense of it. This included steps like tokenization, stopword removal, and stemming.
Once the data was ready, I created and trained an LSTM model to learn from the patterns and context in the emails. Along the way, I fine-tuned the model to improve its accuracy and efficiency. Throughout this project, I learned the importance of hyperparameter tuning, the use of embedding layers for text data, and the significance of recurrent neural networks (RNNs) for sequence-based problems.
The project was heavily influenced by a fantastic Deep Learning course I followed from Santiago Hernández, an expert in Cybersecurity and AI. His teachings provided me with the foundational knowledge needed to build the neural network architecture for this project.
By the end of the project, the LSTM model was able to predict whether an email was spam or legitimate with a high level of accuracy, and I gained invaluable experience in text classification, Deep Learning, and NLP techniques.
I would like to extend my heartfelt gratitude to Santiago Hernández, an expert in Cybersecurity and Artificial Intelligence. His incredible course on Deep Learning, available on Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.
We would like to express our gratitude to purusinghvi for creating and sharing the Spam Email Classification Dataset - Combined Spam Email CSV of 2007 TREC Public Spam Corpus and Enron-Spam Dataset on Kaggle. This dataset, which contains detailed information about spam and legitimate emails, has been invaluable in building and training the machine learning model for spam detection.
The dataset can be found on Kaggle. Your contribution is greatly appreciated!
Additionally, this project was inspired by amazing contributions from the Kaggle community:
- Detecting Spam in Emails with LSTMs (99% accuracy) by hrhuynguyen, which provided valuable insights into applying LSTMs for spam detection.
- Spam Email - XGBoost - 99% by rem4000, which demonstrated the power of ensemble learning methods for this task.
This project was developed for learning and research purposes only. It is an educational exercise aimed at exploring Natural Language Processing (NLP) techniques and Deep Learning models, specifically Long Short-Term Memory (LSTM) networks, for spam email classification.
The model and findings presented in this project should not be used for real-world email filtering or commercial applications, as they have not been rigorously tested for deployment. Additionally, this project leverages publicly available datasets and references existing research contributions for educational insights.
If you found this project intriguing, I invite you to check out my other AI and machine learning initiatives, where I tackle real-world challenges across various domains:
- Advanced Classification of Disaster-Related Tweets Using Deep Learning
  Uncover how social media responds to crises in real time using deep learning to classify tweets related to disasters.
- Fighting Misinformation: Source-Based Fake News Classification
  Combat misinformation by classifying news articles as real or fake based on their source using machine learning techniques.
- IoT Network Malware Classifier with Deep Learning Neural Network Architecture
  Detect malware in IoT network traffic using Deep Learning Neural Networks, offering proactive cybersecurity solutions.
- Spam Email Classification using LSTM
  Classify emails as spam or legitimate using a Bi-directional LSTM model, implementing NLP techniques like tokenization and stopword removal.
- Fraud Detection Model with Deep Neural Networks (DNN)
  Detect fraudulent transactions in financial data with Deep Neural Networks, addressing imbalanced datasets and offering scalable solutions.
- AI-Powered Brain Tumor Classification
  Classify brain tumors from MRI scans using Deep Learning, CNNs, and Transfer Learning for fast and accurate diagnostics.
- Predicting Diabetes Diagnosis Using Machine Learning
  Create a machine learning model to predict the likelihood of diabetes using medical data, helping with early diagnosis.
- LLM Fine-Tuning and Evaluation
  Fine-tune large language models like FLAN-T5, TinyLLAMA, and Aguila7B for various NLP tasks, including summarization and question answering.
- Headline Generation Models: LSTM vs. Transformers
  Compare LSTM and Transformer models for generating contextually relevant headlines, leveraging their strengths in sequence modeling.
- Breast Cancer Diagnosis with MLP
  Automate breast cancer diagnosis using a Multi-Layer Perceptron (MLP) model to classify tumors as benign or malignant based on biopsy data.
- Deep Learning for Safer Roads: Exploring CNN-Based and YOLOv11 Driver Drowsiness Detection
  Comparing driver drowsiness detection with CNN + MobileNetV2 vs. YOLOv11 for real-time accuracy and efficiency, exploring both deep learning models to prevent fatigue-related accidents.
- Loading the Data: The dataset consists of emails labeled as Spam (1) or Legitimate (0).
- Text Normalization: We start by converting text to lowercase and removing unnecessary characters, such as numbers, punctuation, and special symbols.
- Stopword Removal: Common words that do not contribute to meaningful classification (like "the", "and", etc.) are removed.
- Hyperlink Removal: URLs and hyperlinks in the text are deleted as they do not provide useful information for classification.
- Tokenization: We split the email text into individual words (tokens) for easier processing.
- Visualizing the Data: The notebook includes visualizations such as word clouds and n-gram analysis, which help in understanding the most common terms used in spam and legitimate emails.
- Class Distribution: The dataset is explored to understand the distribution of spam vs. legitimate emails, which helps in deciding model evaluation strategies.
- Text Tokenization: The email text is tokenized into sequences, and the vocabulary is built.
- Padding: As the text data varies in length, padding is applied to ensure that all input sequences have the same size, making them suitable for model input.
- Label Encoding: The target labels (spam or legitimate) are encoded into numeric values (0 or 1) using LabelEncoder.
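To make these steps concrete, here is a minimal sketch of the cleaning, tokenization, padding, and label-encoding pipeline described above. The file name `combined_data.csv`, the column names `text` and `label`, and the `vocab_size` / `max_len` values are illustrative assumptions rather than the project's exact settings; the sketch uses NLTK for stopwords and the Keras text utilities.

```python
import re

import pandas as pd
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

STOPWORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")

def clean_text(text: str) -> str:
    """Lowercase the text, strip URLs, numbers, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^a-z\s]", " ", text)               # keep letters only
    return " ".join(w for w in text.split() if w not in STOPWORDS)

# Illustrative file/column names: 'text' holds the email body, 'label' the class.
df = pd.read_csv("combined_data.csv")
df["clean_text"] = df["text"].apply(clean_text)

# Build the vocabulary and turn each email into a padded integer sequence.
vocab_size, max_len = 20000, 200  # placeholder hyperparameters
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(df["clean_text"])
sequences = tokenizer.texts_to_sequences(df["clean_text"])
X = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")

# Encode the target labels (spam / legitimate) as numeric values (0 / 1).
y = LabelEncoder().fit_transform(df["label"])
```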
- Bi-directional LSTM: We use a Bi-directional LSTM model to process the sequence of words in both forward and backward directions. This helps capture contextual information from both past and future words.
- Dense Layer: A fully connected layer with ReLU activation is added to capture non-linear relationships between features.
- Dropout: A dropout layer is included to prevent overfitting and help the model generalize better.
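A minimal Keras sketch of this architecture, reusing `vocab_size` and `max_len` from the preprocessing sketch above; the embedding dimension, number of LSTM units, dense width, and dropout rate are illustrative choices, not necessarily the project's exact configuration.

```python
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

model = Sequential([
    # Embedding layer maps each token index to a dense vector.
    Embedding(input_dim=vocab_size, output_dim=64),
    # Bi-directional LSTM reads the sequence forwards and backwards.
    Bidirectional(LSTM(64)),
    # Fully connected layer with ReLU to capture non-linear feature interactions.
    Dense(64, activation="relu"),
    # Dropout to reduce overfitting.
    Dropout(0.5),
    # Single sigmoid unit: probability that the email is spam.
    Dense(1, activation="sigmoid"),
])

model.build(input_shape=(None, max_len))
model.summary()
```

The single sigmoid output pairs naturally with the binary cross-entropy loss used during training.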
- The model is trained on the preprocessed training data using binary cross-entropy loss and the Adam optimizer.
- Early Stopping is implemented to monitor the validation loss and stop training once the model starts overfitting.
- Evaluation: The model is evaluated on a separate test set to determine its accuracy and ability to generalize to unseen data.
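A training loop matching this description might look like the following sketch, assuming `X`, `y`, and `model` from the previous snippets; the split ratio, batch size, epoch count, and early-stopping patience are illustrative values.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# Hold out a test set for the final evaluation on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Stop training once the validation loss stops improving.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_split=0.1,  # illustrative validation fraction
    epochs=20,
    batch_size=64,
    callbacks=[early_stop],
)

test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
```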
- Training Metrics: The model's performance during training is tracked by monitoring the loss and accuracy.
- Validation Metrics: The validation loss and accuracy provide insight into how well the model generalizes.
- Overfitting: If the validation accuracy starts to drop while training accuracy continues to rise, it indicates overfitting. This is addressed by using techniques like early stopping.
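One way to inspect this behaviour is to plot the curves stored in the `history` object returned by `model.fit` in the sketch above: a validation loss that turns upward while the training loss keeps falling is the overfitting signature described here.

```python
import matplotlib.pyplot as plt

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

# Loss curves: diverging train/validation loss indicates overfitting.
ax_loss.plot(history.history["loss"], label="train loss")
ax_loss.plot(history.history["val_loss"], label="val loss")
ax_loss.set_xlabel("epoch")
ax_loss.legend()

# Accuracy curves for the same comparison.
ax_acc.plot(history.history["accuracy"], label="train acc")
ax_acc.plot(history.history["val_accuracy"], label="val acc")
ax_acc.set_xlabel("epoch")
ax_acc.legend()

plt.tight_layout()
plt.show()
```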
- To classify emails as Spam or Legitimate using deep learning.
- To explore NLP techniques for text preprocessing and sequence classification.
- To evaluate the model's performance on both training and validation sets, and improve it through strategies like early stopping and dropout.
- The model's performance on the training data is typically high, with 99% accuracy.
- On the validation data, accuracy usually reaches around 97%, though slight fluctuations are observed due to overfitting.
By the end of this project, you will have a functional Bi-LSTM model for spam email classification that can be further fine-tuned, deployed, or integrated into a larger system for filtering unwanted emails. Techniques like early stopping are crucial to prevent overfitting and ensure the model's generalizability.
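As a closing usage sketch, a new message can be scored with the trained model and the fitted `tokenizer` from the earlier snippets; the 0.5 decision threshold is a conventional default rather than a tuned value.

```python
def classify_email(text: str) -> str:
    """Return 'Spam' or 'Legitimate' for a raw email string."""
    cleaned = clean_text(text)  # same cleaning as at training time
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=max_len, padding="post", truncating="post")
    prob = float(model.predict(padded, verbose=0)[0][0])
    return "Spam" if prob >= 0.5 else "Legitimate"

print(classify_email("Congratulations! You have won a free prize, click here now!"))
```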
- Keras Documentation
- TensorFlow Documentation
- NLP with Disaster Tweets Challenge
- Detecting Spam in Emails with LSTMs (99% accuracy): https://www.kaggle.com/code/hrhuynguyen/detecting-spam-in-emails-with-lstms-99-accuracy
- Spam Email - XGBoost - 99%: https://www.kaggle.com/code/rem4000/spam-email-xgboost-99
This project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice is retained.
Please note the following limitations:
- The software is provided "as is", without any warranties, express or implied.
- If you distribute the software, whether in original or modified form, you must include the original copyright notice and license.
- The license allows for commercial use, but you cannot claim ownership over the software itself.
The goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.
MIT License
Copyright (c) 2024 Dream software - Sergio Sánchez
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.