Personalized E-mail Spam Filtering using Support Vector Machine by Gopi Sanghani
Material type:
- TT000071 SAN
| Item type | Current library | Collection | Call number | Status | Date due | Barcode | Item holds |
|---|---|---|---|---|---|---|---|
| | NIMA Knowledge Centre | Reference | TT000071 SAN | Not For Loan | | TT000071 | |
| | NIMA Knowledge Centre | Reference | TT000071 SAN | Not For Loan | | TT000071-1 | |
| | NIMA Knowledge Centre | Reference | TT000071 SAN | Not For Loan | | TT000071-2 | |
Guided by: Dr. K Kotecha With Synopsis and CD 11EXTPHDE79
ABSTRACT:
Communication through e-mail remains a highly formalized, official, and essential
method for the exchange of information, even with the increasing use of social networking
applications. With the evolution of technology and globalization, the need for official
and personal communication has greatly increased. The advanced transformation of the
communication medium has enabled users to promote their associations at both
the professional and the personal level. Electronic mail, commonly referred to
as e-mail, has become one of the most widely used communication facilities due to its
convenience, quick delivery, low cost, and freedom from constraints of time and location.
Though e-mail is considered to be the most reliable medium of communication, a
huge number of unsolicited e-mails are delivered to Internet users every day without
any personal or commercial level of interest. The inevitable downside of e-mail service
is a continuously growing ratio of unwanted and useless e-mails, called spam e-mails.
An ever-increasing ratio of spam e-mails raises a major issue from two different
perspectives: the unsolicited content and the individual user's consent. The content of
spam e-mails changes over time due to the adversarial nature of spam e-mails. The
content of legitimate e-mails is influenced by the communication pattern of an
individual user. Since the content of an e-mail is the most discriminating measure,
e-mail filtering is a prominent application of content-based binary text classification.
The discrimination of an e-mail as spam or legitimate depends highly on the
individual user's consent. The user's consent is generally influenced by personal and
professional preferences for discriminating e-mails as spam or legitimate. As a result,
distribution shift occurs in the case of e-mail filtering.
In this thesis, we propose a distinctive approach to develop an incremental
personalized e-mail spam filter. The research work focuses on three main aspects: a
novel feature selection function based on term frequency difference and category ratio,
an incremental learning model using a support vector machine, and a heuristic function
to dynamically detect the feature shift and update the feature set accordingly. The
performance of a content-based classification system depends substantially on the
selection of representative features. The major contribution is the development of a
feature selection function that generates a subset of features from the training data
with strong discriminating ability irrespective of the number of samples in each class.
To handle drifting concepts when the distribution of data is not uniform across the
training and testing sets, an incremental learning model is proposed and implemented
using a support vector machine. The relevance of features varies over time due to
the change of data distribution. A novel heuristic function is proposed that
determines new features with higher discrimination ability from the set of new e-mails.
Subsequently, the feature set is updated before initiating incremental learning so as to
effectively address the feature distribution shift.
The proposed filter is evaluated on 4 benchmark datasets consisting of a total of 30
personalized e-mail folders with varying characteristics. A thorough comparison of the
proposed feature selection function with 5 other well-known feature selection functions
(information gain, chi-square, gain ratio, Gini index, and correlation feature selection)
is performed. The proposed function consistently shows better performance. The filter
performance using the incremental learning model is also compared with that of the
batch model. The incremental approach substantially outperforms the batch
learning approach. The experimental results demonstrate the applicability of the
proposed filter as a robust and scalable personalized e-mail spam filter.
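
The abstract does not give the thesis's actual formulas, so the following is only a minimal illustrative sketch of the general idea of incremental learning with a linear support vector machine, not the author's method. It assumes a bag-of-words representation, trains by stochastic subgradient descent on the hinge loss, and simply admits every new term into the feature set as a stand-in for the thesis's feature-shift heuristic. The names `bag`, `IncrementalLinearSVM`, and `partial_fit`, the parameter values, and the sample e-mails are all hypothetical.

```python
from collections import Counter, defaultdict


def bag(text):
    """Bag-of-words term counts for one e-mail (hypothetical helper)."""
    return Counter(text.lower().split())


class IncrementalLinearSVM:
    """Linear SVM updated batch by batch with hinge-loss subgradient steps.

    Assumption: this is a generic online SVM, not the model from the thesis.
    Labels are +1 (spam) and -1 (legitimate).
    """

    def __init__(self, learning_rate=0.1, reg=0.01):
        self.learning_rate = learning_rate
        self.reg = reg
        self.weights = defaultdict(float)  # one weight per term, grows over time
        self.bias = 0.0

    def decision(self, features):
        # Terms never seen in training contribute nothing to the score.
        return self.bias + sum(
            self.weights.get(term, 0.0) * count for term, count in features.items()
        )

    def predict(self, features):
        return 1 if self.decision(features) >= 0 else -1

    def partial_fit(self, batch, epochs=20):
        """Update the existing model on a new batch of (features, label) pairs."""
        for _ in range(epochs):
            for features, label in batch:
                margin = label * self.decision(features)
                # Lazy L2 shrinkage: only weights of terms in this e-mail decay.
                for term in features:
                    self.weights[term] *= 1.0 - self.learning_rate * self.reg
                # Hinge loss is active only when the margin is below 1.
                if margin < 1:
                    for term, count in features.items():
                        self.weights[term] += self.learning_rate * label * count
                    self.bias += self.learning_rate * label


# Initial training batch for one user's folder (made-up examples).
model = IncrementalLinearSVM()
model.partial_fit([
    (bag("win cash prize now"), 1),
    (bag("cheap prize win"), 1),
    (bag("meeting agenda tomorrow"), -1),
    (bag("project report attached"), -1),
])

# Later batch with new vocabulary: the model is updated, not retrained from scratch.
model.partial_fit([
    (bag("crypto bonus offer"), 1),
    (bag("lecture notes thesis"), -1),
])

print(model.predict(bag("crypto offer")), model.predict(bag("thesis notes")))
```

A batch model would instead be retrained on all e-mails collected so far; the sketch only shows the incremental update path, in which earlier weights are kept and new terms enter the feature set as they appear.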