Web Page Categorization Using Artificial Neural Networks
Proceedings of the 4th International Conference on Electrical Engineering & 2nd Annual Paper Meet, 26-28 January, 2006
S. M. Kamruzzaman
Department of Computer Science and Engineering, Manarat International University, Dhaka, Bangladesh
ABSTRACT
Web page categorization is one of the challenging tasks in the world of ever-increasing web technologies. There are many ways to categorize web pages, based on different approaches and features. This paper proposes a new approach to web page categorization that uses an artificial neural network (ANN) and extracts the features automatically. Eight major categories of web pages have been selected for categorization: business & economy, education, government, entertainment, sports, news & media, job search, and science. The proposed system works in three successive stages. In the first stage, the features are extracted automatically by analyzing the source of the web pages. The second stage fixes the input values of the neural network; all of these values lie between 0 and 1, and variations in them affect the output. Finally, the third stage determines the class of a given web page out of the eight predefined classes, using the back-propagation algorithm of the artificial neural network. The proposed concept will facilitate web mining, information retrieval from the web, and search engines.
Kamruzzaman et al.: Proceedings of the 4th ICEE & 2nd APM, January 2006
1. INTRODUCTION
2. THE PROPOSED APPROACH
Researchers have defined several categorization approaches over time to meet the demands of users. Because the number of web pages grows rapidly every day, an efficient technique is proposed here that classifies web pages based on five features extracted from a page; the categorization is done in three successive stages. An analysis of about 500 web pages shows that web developers and designers try to express the motto and theme of the organization, and that the theme is expressed by the overall structure of the home page. The designer always intends to make visitors spend more time on the site, and so designs the home page with extra care and builds its structure to serve that intention. Five features that distinguish one type of site from another are therefore selected: the home page structure (the ratio between internal and external links), the amount of dynamic/static pages, the frequency of images, the availability of animations, and the predefined buzzwords. In the proposed approach, eight major classes are selected from different web directories, and the neural network is trained and tested using those classes. The proposed approach proceeds through the following stages:
1) Automatic feature extraction by analyzing the home page source.
2) Fixing the values for the input nodes of the network.
3) Classifying web pages with the neural network.
In the proposed approach, features are extracted by analyzing the source of a web page. The features can be obtained easily by going through the tags of the source file.
The site structure is defined by the internal and external links used in the page. Analysis shows that reference or information pages (science, education, and job sites) contain more external links than commercial web sites (business and economy, news and media, and sports sites).
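As an illustration of how the link counts might be read from the page source, the following is a minimal sketch and not the paper's implementation. It uses Python's standard `html.parser` and `urllib.parse` modules; the names `LinkCounter` and `site_structure_value` are hypothetical. Treating a link as internal only when its host matches the page's host exactly, and clamping the external-to-internal ratio to the range 0 to 1, are assumptions made here, since the paper does not say how the ratio is normalized.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCounter(HTMLParser):
    """Counts hyperlinks that stay inside or leave the page's own host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc.lower()
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(self.page_url, href))
        if target.scheme not in ("http", "https"):
            return  # ignore mailto: and other non-web links
        if target.netloc.lower() == self.page_host:
            self.internal += 1
        else:
            self.external += 1


def site_structure_value(html, page_url):
    """External-to-internal link ratio, clamped to [0, 1] (clamping is an assumption)."""
    counter = LinkCounter(page_url)
    counter.feed(html)
    if counter.internal == 0:
        # Assumption: a page with only external links gets the maximum value.
        return 1.0 if counter.external else 0.0
    return min(1.0, counter.external / counter.internal)


sample = ('<a href="/about.html">About</a> <a href="courses.html">Courses</a> '
          '<a href="http://other.org/">Partner</a> <a href="mailto:x@example.edu">Mail</a>')
print(site_structure_value(sample, "http://example.edu/index.html"))
```

The sample page has two internal links and one external link, so the sketch yields a ratio of one half; the `mailto:` link is skipped.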
The proposed approach defines some buzzwords that mark a site as belonging to a certain class. The more frequently buzzwords of one class occur, the higher the probability that the site belongs to that class. A value for the input layer of the ANN is determined by counting the buzzwords.
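The buzzword count can be sketched as follows. This is only an illustration under stated assumptions: the word lists are copied from Table 1 for three of the eight classes, the function names are hypothetical, and the normalization in `buzzword_value` (the share of all buzzword hits that belongs to the best-scoring class) is an assumption, because the paper does not state how the count is scaled to the range 0 to 1.

```python
import re
from collections import Counter

# Buzzwords for three of the eight classes, copied from Table 1.
BUZZWORDS = {
    "Education": ["career", "student", "faculty", "degree", "graduate",
                  "education", "research", "admission", "prospects"],
    "Science": ["science", "research", "technology"],
    "Sports": ["team", "sports", "matches", "schedule", "scores"],
}


def buzzword_counts(text):
    """Return, for each class, how many times its buzzwords occur in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {cls: sum(words[w] for w in vocab) for cls, vocab in BUZZWORDS.items()}


def buzzword_value(text):
    """Best-scoring class and its share of all buzzword hits, in [0, 1].

    The normalization is an assumption; the paper does not give one.
    """
    counts = buzzword_counts(text)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    best = max(counts, key=counts.get)
    return best, counts[best] / total


page_text = "Student admission: faculty and graduate degree programs. Latest team scores."
print(buzzword_counts(page_text))
print(buzzword_value(page_text))
```

Note that a word such as "research" appears in more than one class in Table 1, so it contributes to every class that lists it.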
A major consideration is that the images used in a web site reflect the theme of the home page. The proposed approach is not concerned with image categorization methods but with finding the area of the home page covered by images. A larger number of images indicates that the web page is more colorful.
Table 1: Selected buzzwords.
Classes             Buzz Words
Business, Economy   Business, Trade, Investment, Credit, Cash, Trade, Commerce, Loan, Support, Product, Service, Offer
Education           Career, Student, Faculty, Degree, Graduate, Education, Research, Admission, Prospects
Government          Policy, Ministry, President, Government, Activity
News, Media         News, Media, Editor, Culture, Archives, Latest, Update, Current, Affairs
Entertainment       Music, Entertainment, Dating, Fun, Love, Artist, Free, Match, Friendship
Science             Science, Research, Technology
Sports              Team, Sports, Matches, Schedule, Scores
Job Site            Career, Experience, Job, Seek, Vacancy, Resume, Application, Location, Employment, Offer
Animations are mostly used on business and advertising sites. A company logo is often animated, and a number of logos are used on job sites. Flash animations and script animations fall under this feature. On the web site of a well-known university, by contrast, animations are rarely used; science sites also fall into this category.
The number of static or dynamic pages in a site also gives a clue to its class. It is observed that news and media sites and sports sites are updated most frequently: news pages are updated daily, and some sites are updated every five seconds. Business or information pages, in contrast, are rarely updated. A larger number of dynamic pages therefore increases the probability that a site is a news, job, or sports site, whereas a smaller number of dynamic pages indicates a science, government, or education site.
The extracted features are set as the inputs of the ANN. The site structure provides the first input value, which is found by dividing the number of hyperlinks to outside the domain by the number of hyperlinks to inside the domain. A business web site rarely provides any link to outside its domain: such sites try to focus on themselves and keep visitors on the site for a long time. A job site, in contrast, provides more links to outside its domain, since it can be regarded as an advertising site that links to the web sites of various companies. A fun or entertainment site also provides links to its sponsors.
Buzzwords are counted by analyzing the frequency of certain words in the home page of a web site. Several keywords are selected for the desired categories and searched for in the body of the page; the most frequent keyword and its value give another input of the ANN.
Crowd of images means the frequency of images used in a page. On typical pages of the various classes, an educational institute uses fewer colors and images than a job site or a news/sports site, and the probability of images being used on a science and engineering page or a personal home page is very low. The more images are used, the more colorful the page appears. It is generally known that information/research pages are less colorful than commercial pages, and their use of images is also lower.
Availability of animations means how much of the home page area is covered by animations. Animations are mostly used to attract a visitor's attention to the products of manufacturing companies, which fall into the business and economy category. A university web site generally contains fewer animations than the web site of a paint manufacturing company.
Web page type provides a value for the input layer of the ANN. The more dynamic pages a site uses, the more likely it is to be a news or sports site, and the fewer dynamic pages it uses, the more likely it is to be a science and engineering site. This value ranges from 0 to 1 depending on the percentage of dynamic pages. Table 2 shows the input values to the ANN for dynamic pages.
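The mapping of Table 2 from the percentage of dynamic pages to an input value can be written as a one-line rule. The sketch below is an illustration only; the function name `dynamic_page_input` is hypothetical. Table 2 lists overlapping band edges (for example, 80% appears in both "70% - 80%" and "80% - 90%"), so assigning an exact boundary value to the lower band is an assumption.

```python
import math


def dynamic_page_input(percent_dynamic):
    """Map the percentage of dynamic pages (0 to 100) to an ANN input per Table 2."""
    if not 0 <= percent_dynamic <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # Assumption: a boundary value such as 80% falls in the lower band (70% - 80%).
    # A site with no dynamic pages maps to 0.0, as in the last row of Table 2.
    return math.ceil(percent_dynamic / 10) / 10


for percent in (0, 5, 45, 80, 85, 100):
    print(percent, dynamic_page_input(percent))
```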
Fig. 1: A complete dynamic web site.
For the network we used a 5-5-3 architecture. The input vector of this network consists of 5 elements, each fed to one input neuron. One hidden layer with 5 neurons is used. The output layer consists of 3 neurons, whose output pattern encodes our eight classes: three binary outputs give 2 × 2 × 2 = 8 distinct patterns, so three neurons are enough to classify web pages into eight categories. The raw outputs are usually fractional values, so values from 0.0 to 0.49 are converted to 0 and values from 0.50 to 1.0 are converted to 1. The target classes in binary form are shown in Table 3.
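A forward pass through a 5-5-3 network with the thresholding described above might look like the following sketch. It is not the paper's code: the sigmoid activation, the bias weights, the random initial weights, and the example feature vector are all assumptions chosen for illustration, and the names are hypothetical. A single cut-off at 0.5 is used as a reading of the stated ranges, which leave values between 0.49 and 0.50 unspecified.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """One weight row per neuron; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    extended = inputs + [1.0]  # constant 1.0 feeds the bias weight
    return [sigmoid(sum(w * x for w, x in zip(row, extended))) for row in weights]


def forward(network, inputs):
    """Propagate five feature values through the hidden and output layers."""
    hidden = layer_forward(network["hidden"], inputs)
    return layer_forward(network["output"], hidden)


def to_bits(outputs):
    """Outputs below 0.5 become 0; outputs of 0.5 or more become 1."""
    return [1 if o >= 0.5 else 0 for o in outputs]


rng = random.Random(0)  # fixed seed, for repeatability only
network = {"hidden": init_layer(5, 5, rng), "output": init_layer(5, 3, rng)}

# Made-up feature vector: link ratio, buzzwords, images, animations, dynamic pages.
features = [0.2, 0.7, 0.4, 0.1, 0.9]
raw = forward(network, features)
print(raw, to_bits(raw))
```

With untrained random weights the printed pattern is meaningless; the point is only the shape of the computation, five inputs in and three thresholded bits out.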
Table 2: Input values for dynamic pages.
% of dynamic pages   Input into ANN
90% - 100%           1.0
80% - 90%            0.9
70% - 80%            0.8
60% - 70%            0.7
50% - 60%            0.6
40% - 50%            0.5
30% - 40%            0.4
20% - 30%            0.3
10% - 20%            0.2
0% - 10%             0.1
No dynamic page      0.0
Table 3: Output patterns.
Classes of the web pages   Output Pattern
Business and Economy       0 0 0
Education                  0 0 1
Government                 0 1 0
News and Media             0 1 1
Sports                     1 0 0
Job Search                 1 0 1
Entertainment              1 1 0
Science                    1 1 1
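Decoding the three network outputs into a class name is a table lookup on the patterns of Table 3. The sketch below shows one way to do it; the names `PATTERN_TO_CLASS` and `decode` are hypothetical, and the single 0.5 cut-off is a reading of the thresholding rule stated in the text.

```python
# Output patterns from Table 3, keyed by the three thresholded output bits.
PATTERN_TO_CLASS = {
    (0, 0, 0): "Business and Economy",
    (0, 0, 1): "Education",
    (0, 1, 0): "Government",
    (0, 1, 1): "News and Media",
    (1, 0, 0): "Sports",
    (1, 0, 1): "Job Search",
    (1, 1, 0): "Entertainment",
    (1, 1, 1): "Science",
}


def decode(outputs):
    """Threshold three raw outputs at 0.5 and look up the class in Table 3."""
    bits = tuple(1 if o >= 0.5 else 0 for o in outputs)
    return PATTERN_TO_CLASS[bits]


print(decode([0.1, 0.8, 0.6]))
print(decode([0.9, 0.2, 0.7]))
```

Because all eight three-bit patterns are assigned, every output vector decodes to some class; the encoding has no "unknown" pattern.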
Fig. 2: Architecture of the proposed ANN (input layer, hidden layer, output layer).
A good number of training examples is needed to get an efficient response from the system, but available training examples are hard to obtain. For this purpose, the home pages of 500 different web sites were collected. The samples were split into two groups, the training set and the testing set: the training set comprised 40% of the total samples, and the rest were used for testing the system. A trained network needs to be tested to measure its performance. The evaluation of the true performance of any system depends on the organization of the testing data set, which should be strong enough to reflect real-world situations. The testing data set therefore needs the various types of input patterns that may arise in the real world. We tried to build our testing data set from different categories and from domains of different countries.
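The 40/60 split and a back-propagation update for a 5-5-3 network can be sketched as below. This is an illustrative sketch under several assumptions that the paper does not confirm: sigmoid units with bias weights, a squared-error criterion, per-example (online) updates, a learning rate of 0.5, and a random shuffle before splitting. The two training examples are made up and are not data from the paper; all function names are hypothetical.

```python
import math
import random


def split_samples(samples, train_fraction=0.4, seed=0):
    """Shuffle a copy of the samples and cut it into training and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(rng, n_in=5, n_hidden=5, n_out=3):
    """Random initial weights; the last weight of each neuron is its bias."""
    def layer(n_from, n_to):
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_from + 1)] for _ in range(n_to)]
    return layer(n_in, n_hidden), layer(n_hidden, n_out)


def forward(w_hidden, w_out, x):
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, y


def train_step(w_hidden, w_out, x, target, rate=0.5):
    """One back-propagation update for one example; returns its squared error."""
    h, y = forward(w_hidden, w_out, x)
    # Output deltas: error times the sigmoid derivative y * (1 - y).
    d_out = [(t - o) * o * (1 - o) for t, o in zip(target, y)]
    # Hidden deltas: output deltas sent back through the not-yet-updated weights.
    d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(w_out):
        for j, v in enumerate(h + [1.0]):
            row[j] += rate * d_out[k] * v
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += rate * d_hid[j] * v
    return sum((t - o) ** 2 for t, o in zip(target, y))


train_ids, test_ids = split_samples(range(500))
print(len(train_ids), len(test_ids))

w_hidden, w_out = make_network(random.Random(1))
# Two made-up examples (not from the paper): feature vector -> Table 3 pattern.
examples = [([0.1, 0.9, 0.3, 0.2, 0.1], [0, 0, 1]),  # Education
            ([0.3, 0.8, 0.7, 0.4, 0.9], [1, 0, 0])]  # Sports
errors = []
for epoch in range(2000):
    errors.append(sum(train_step(w_hidden, w_out, x, t) for x, t in examples))
print(errors[0], errors[-1])
```

With 500 samples and a training fraction of 0.4, the split yields 200 training and 300 testing samples, which matches the page counts used in the experiments.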
3. EXPERIMENTAL RESULTS
To show the effectiveness of the proposed approach, the system was tested in different ways and the results were compared with the final result. The system was tested first on the 200 known pages and then on 300 unknown pages. Table 4 shows the results of the system when the network was tested.
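The per-class and overall accuracy implied by the counts in Table 4 can be worked out with a few lines of arithmetic. The sketch below copies the counts from Table 4; the names `RESULTS` and `accuracy` are hypothetical, and accuracy is taken here as correctly classified pages divided by all pages of that class, which is an assumption about the intended measure.

```python
# (correctly classified, wrongly classified) page counts, copied from Table 4.
RESULTS = {
    "Business & Economy": (35, 14),
    "Education": (23, 6),
    "Government": (12, 8),
    "News & Media": (29, 5),
    "Sports": (18, 7),
    "Entertainment": (31, 18),
    "Job Search": (26, 11),
    "Science": (38, 19),
}


def accuracy(right, wrong):
    """Fraction of pages classified correctly."""
    return right / (right + wrong)


total_right = sum(right for right, _ in RESULTS.values())
total_wrong = sum(wrong for _, wrong in RESULTS.values())

for name, (right, wrong) in RESULTS.items():
    print(f"{name:20s} {accuracy(right, wrong):.1%}")
print(f"{'Overall':20s} {accuracy(total_right, total_wrong):.1%}")
```

The class totals add up to the 212 correct and 88 wrong pages reported in the table, so the overall figure is 212 out of 300 pages, or about 70.7%.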
Table 4: Experimental results.
Types of pages       No. of pages correctly classified   No. of pages wrongly classified
Business & Economy   35                                  14
Education            23                                  6
Government           12                                  8
News & Media         29                                  5
Sports               18                                  7
Entertainment        31                                  18
Job Search           26                                  11
Science              38                                  19
Total                212                                 88
4. CONCLUSION
Automated categorization of web pages can lead to better web retrieval tools, with the added convenience of selecting among properly organized directories. In this paper, a theme-based web page categorization is proposed that extracts the features automatically by analyzing the HTML source and categorizes web pages into eight major classes using the back-propagation algorithm. The web pages are categorized based on five major characteristics and on the similarities among pages of the same type.