- Multiple vectorization and transformation techniques were applied to the raw data.
- The raw data was balanced.
- The data split was performed before data transformation to prevent data leakage. Thus, the training data is not influenced by statistics of the test data, such as its mean and standard deviation.
- The Multinomial Naive Bayes classifier achieved satisfactory performance metrics:
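
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Ham | 0.96 | 0.93 | 0.94 | 190 |
| Spam | 0.94 | 0.96 | 0.95 | 201 |
| Accuracy | | | 0.95 | 391 |
| Macro avg | 0.95 | 0.95 | 0.95 | 391 |
| Weighted avg | 0.95 | 0.95 | 0.95 | 391 |
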
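The split-before-transform point in the list above can be illustrated with a minimal plain-Python sketch. The toy messages, the helper names (`split_data`, `build_vocabulary`, `vectorize`), and the 75/25 ratio are assumptions made for illustration, not the project's actual code. The idea is that anything "fitted" (here, the vocabulary) is learned from the training messages only and then applied unchanged to the test messages.

```python
import random

def split_data(items, test_ratio=0.25, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def build_vocabulary(messages):
    """Map each word seen in the given messages to a column index."""
    words = sorted({w for m in messages for w in m.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(message, vocab):
    """Count vector over a fixed vocabulary; words not in it are ignored."""
    vec = [0] * len(vocab)
    for w in message.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

# Toy data (hypothetical).
messages = [
    "win cash now", "free prize inside", "meeting at noon", "lunch tomorrow",
    "claim your reward", "project update attached", "cheap loans today", "see you tonight",
]

# 1) Split first.
train, test = split_data(messages)

# 2) Fit the transformation on the training split only.
vocab = build_vocabulary(train)

# 3) Apply the same fitted transformation to both splits.
X_train = [vectorize(m, vocab) for m in train]
X_test = [vectorize(m, vocab) for m in test]
```

Because the vocabulary never sees the test messages, words that appear only in the test split are simply dropped, which mirrors what happens with genuinely unseen messages after deployment.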
Spam refers to unwanted or unsolicited messages sent over the internet, predominantly emails, but it could also be in the form of text messages, social media messages, or comments. The issue of spam presents several challenges:
Volume: The volume of spam is overwhelming. As of September 2021, more than half of all email traffic was estimated to be spam. This massive volume can clutter inboxes, making it harder for users to find legitimate messages.
Security Risks: Spam messages often contain phishing attempts, malware, or scams. These messages pose security risks to individuals and organizations, potentially leading to theft of sensitive information or other forms of cybercrime.
Resource Consumption: The transmission, storage, and processing of spam messages consume significant network and computing resources. This includes the bandwidth used to transmit the messages, the storage space used to hold them, and the processing power needed to filter them.
The Naive Bayes classifier is a popular statistical technique in machine learning for spam detection because it is simple, effective, and fast at both training and inference. Here is a summary of the approach:
Data Collection: The first step is gathering a dataset of messages already labeled as spam or ham (not spam). This dataset will be used to train and test the model.
Preprocessing: The text of each message is then preprocessed. This typically includes lowercasing the text, removing punctuation, tokenization (splitting the message into individual words), and stop-word removal (dropping common words such as “is”, “the”, and “and” that contribute little to the spam/ham classification).
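The preprocessing steps just described could look roughly like the following sketch. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical; real stop-word lists are much longer, and `string.punctuation` only covers ASCII punctuation.

```python
import string

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"is", "the", "and", "a", "an", "to", "of", "in", "for", "you", "your"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(message):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = message.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Congratulations! You WON a prize. Claim the reward now!!!"))
# -> ['congratulations', 'won', 'prize', 'claim', 'reward', 'now']
```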
Feature Extraction: After preprocessing, each message is converted into a vector of numerical features that the machine learning model can understand. This often involves creating a bag-of-words model, where each unique word in the text corresponds to a feature in the vector.
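As a rough illustration of the bag-of-words idea, the sketch below builds a vocabulary from already-tokenized messages and turns each message into a count vector. The function names and the toy tokens are assumptions for this example only.

```python
from collections import Counter

def build_vocabulary(tokenized_messages):
    """Assign one column index to each unique word, in sorted order."""
    words = sorted({t for tokens in tokenized_messages for t in tokens})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(tokens, vocab):
    """Return a count vector with one entry per vocabulary word."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] = count
    return vec

docs = [["win", "prize", "now"], ["meeting", "now"], ["win", "win", "cash"]]
vocab = build_vocabulary(docs)
print(vocab)
# -> {'cash': 0, 'meeting': 1, 'now': 2, 'prize': 3, 'win': 4}
print(bag_of_words(["win", "win", "cash"], vocab))
# -> [1, 0, 0, 0, 2]
```

Word order is discarded in this representation; only how often each word occurs is kept.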
Model Training: The Naive Bayes algorithm is then used to train the model. From the training data it estimates how likely each class (spam or ham) is overall and how frequently each word occurs in messages of each class. The classifier applies Bayes' theorem and assumes that the features (words) are conditionally independent of one another given the class, which simplifies the computation.
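A minimal sketch of what training a multinomial Naive Bayes model involves is shown below. It is not the project's code: the function name, the toy data, and the use of add-one (Laplace) smoothing via the `alpha` parameter are assumptions. Smoothing is included so that a word never seen in one class does not get zero probability.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(tokenized_messages, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {t for tokens in tokenized_messages for t in tokens}
    class_counts = Counter(labels)

    # Total occurrences of each word within each class.
    word_counts = defaultdict(Counter)
    for tokens, label in zip(tokenized_messages, labels):
        word_counts[label].update(tokens)

    # log P(class): fraction of training messages in that class.
    n = len(labels)
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}

    # log P(word | class) with additive smoothing.
    log_likelihood = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood

# Toy training set (hypothetical).
train_tokens = [
    ["win", "cash", "now"],
    ["free", "cash", "prize"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
train_labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_naive_bayes(train_tokens, train_labels)
# "cash" occurs twice among 6 spam tokens, with 9 vocabulary words:
# P(cash | spam) = (2 + 1) / (6 + 9) = 0.2
print(round(math.exp(log_likelihood["spam"]["cash"]), 3))
```

Storing logarithms rather than raw probabilities avoids numeric underflow when many small probabilities are later combined.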
Classification: Once trained, the model can be used to classify new, unlabeled messages. For each new message, the model calculates the probability of it being spam or ham based on the words it contains, and it classifies the message into the category with the higher probability.
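The classification step can be sketched as summing log-probabilities per class and picking the larger total. The parameter values in `LOG_PRIOR` and `LOG_LIKELIHOOD` below are made up for illustration and stand in for whatever a trained model would provide; the `classify` function is likewise hypothetical.

```python
import math

# Hypothetical model parameters (not estimated from real data).
LOG_PRIOR = {"spam": math.log(0.5), "ham": math.log(0.5)}
LOG_LIKELIHOOD = {
    "spam": {"free": math.log(0.4), "prize": math.log(0.4),
             "meeting": math.log(0.1), "noon": math.log(0.1)},
    "ham": {"free": math.log(0.1), "prize": math.log(0.1),
            "meeting": math.log(0.4), "noon": math.log(0.4)},
}

def classify(tokens, log_prior, log_likelihood):
    """Return (best_label, scores), where each score is
    log P(class) plus the sum of log P(word | class) over known words."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for t in tokens:
            # Words outside the vocabulary are skipped in this sketch.
            if t in log_likelihood[label]:
                score += log_likelihood[label][t]
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores

label, scores = classify(["free", "prize", "hello"], LOG_PRIOR, LOG_LIKELIHOOD)
print(label)  # -> spam
```

The scores are unnormalized log-posteriors, which is enough for choosing the more probable class because the normalizing constant is the same for both classes.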
Evaluation and Optimization: The performance of the model is assessed using metrics such as accuracy, precision, recall, and F1-score. If performance is not satisfactory, the model might need to be tweaked by tuning its parameters or changing the preprocessing or feature extraction steps.
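For reference, the metrics named above can be computed from true and predicted labels as in this sketch. The function name and the toy labels are assumptions, and spam is treated as the positive class.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (hypothetical): 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]
print(classification_metrics(y_true, y_pred))
# -> {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

In a spam filter, low precision means legitimate messages are being flagged, while low recall means spam is getting through, so the two are usually read together rather than relying on accuracy alone.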
End Users: The primary benefit for end-users is a cleaner, safer, and more efficient experience when using communication channels. Effective spam filtering reduces the number of irrelevant or harmful messages, making communication more efficient and less time-consuming.
Product Owners or Businesses: Spam detection reduces bandwidth and storage costs associated with managing unwanted messages. It also increases user satisfaction and retention, enhancing the reputation and competitiveness of the product. This could lead to more subscribers or users and thus increased revenue.
Advertisers/Marketers: By understanding how spam filters work, advertisers and marketers can design compliant messages that reach the intended audience without being incorrectly flagged as spam, which makes their campaigns more effective.