Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Petar Petrovski is active.

Publication


Featured researches published by Petar Petrovski.


international semantic web conference | 2014

The WebDataCommons Microdata, RDFa and Microformat Dataset Series

Robert Meusel; Petar Petrovski

In order to support web applications to understand the content of HTML pages an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, Microformats. The annotations are used by Google, Yahoo!, Yandex, Bing and Facebook to enrich search results and to display entity descriptions within their applications. In this paper, we present a series of publicly accessible Microdata, RDFa, Microformats datasets that we have extracted from three large web corpora dating from 2010, 2012 and 2013. Altogether, the datasets consist of almost 30 billion RDF quads. The most recent of the datasets contains amongst other data over 211 million product descriptions, 54 million reviews and 125 million postal addresses originating from thousands of websites. The availability of the datasets lays the foundation for further research on integrating and cleansing the data as well as for exploring its utility within different application contexts. As the dataset series covers four years, it can also be used to analyze the evolution of the adoption of the markup formats.


international world wide web conferences | 2014

Integrating product data from websites offering microdata markup

Petar Petrovski; Volha Bryl

Large numbers of websites have started to markup their content using standards such as Microdata, Microformats, and RDFa. The marked-up content elements comprise descriptions of people, organizations, places, events, products, ratings, and reviews. This development has accelerated in last years as major search engines such as Google, Bing and Yahoo! use the markup to improve their search results. Embedding semantic markup facilitates identifying content elements on webpages. However, the markup is mostly not as fine-grained as desirable for applications that aim to integrate data from large numbers of websites. This paper discusses the challenges that arise in the task of integrating descriptions of electronic products from several thousand e-shops that offer Microdata markup. We present a solution for each step of the data integration process including Microdata extraction, product classification, product feature extraction, identity resolution, and data fusion. We evaluate our processing pipeline using 1.9 million product offers from 9240 e-shops which we extracted from the Common Crawl 2012, a large public Web corpus.


international conference on electronic commerce | 2016

The WDC Gold Standards for Product Feature Extraction and Product Matching

Petar Petrovski; Anna Primpeli; Robert Meusel

Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity that is found on the Web. (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions. To overcome these shortcomings, we have created two public gold standards: The WDC Product Feature Extraction Gold Standard consists of over 500 product web pages originating from 32 different websites on which we have annotated all product attributes (338 distinct attributes) which appear in product titles, product descriptions, as well as tables and lists. The WDC Product Matching Gold Standard consists of over \(75\,000\) correspondences between 150 products (mobile phones, TVs, and headphones) in a central catalog and offers for these products on the 32 web sites. To verify that the gold standards are challenging enough, we ran several baseline feature extraction and matching methods, resulting in F-score values in the range 0.39 to 0.67. In addition to the gold standards, we also provide a corpus consisting of 13 million product pages from the same websites which might be useful as background knowledge for training feature extraction and matching methods.


Linked Open Data | 2014

Knowledge Base Creation, Enrichment and Repair

Sebastian Hellmann; Volha Bryl; Lorenz Bühmann; Milan Dojchinovski; Dimitris Kontokostas; Jens Lehmann; Uroš Milošević; Petar Petrovski; Vojtěch Svátek; Mladen Stanojević; Ondřej Zamazal

This chapter focuses on data transformation to RDF and Linked Data and furthermore on the improvement of existing or extracted data especially with respect to schema enrichment and ontology repair. Tasks concerning the triplification of data are mainly grounded on existing and well-proven techniques and were refined during the lifetime of the LOD2 project and integrated into the LOD2 Stack. Triplification of legacy data, i.e. data not yet in RDF, represents the entry point for legacy systems to participate in the LOD cloud. While existing systems are often very useful and successful, there are notable differences between the ways knowledge bases and Wikis or databases are created and used. One of the key differences in content is in the importance and use of schematic information in knowledge bases. This information is usually absent in the source system and therefore also in many LOD knowledge bases. However, schema information is needed for consistency checking and finding modelling problems. We will present a combination of enrichment and repair steps to tackle this problem based on previous research in machine learning and knowledge representation. Overall, the Chapter describes how to enable tool-supported creation and publishing of RDF as Linked Data (Sect. 1) and how to increase the quality and value of such large knowledge bases when published on the Web (Sect. 2).


Semantic Web | 2018

A machine learning approach for product matching and categorization

Petar Ristoski; Petar Petrovski; Peter Mika; Heiko Paulheim

Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across different e-shops. To improve the consumer experience, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the need for lots of data for supervision, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performances. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.


Proceedings of the International Conference on Web Intelligence | 2017

Extracting attribute-value pairs from product specifications on the web

Petar Petrovski

Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both describing product attributes that are considered relevant by the specific vendor. In addition, product offers might contain structured or semi-structured product specifications in the form of HTML tables and HTML lists. As product specifications often cover more product attributes than free-text descriptions, being able to extract attribute-value pairs from these specifications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation. In this paper, we present an approach for extracting attribute-value pairs from product specifications on the Web. We use supervised learning to classify the HTML tables and HTML lists within a web page as product specification or not. In order to extract attribute-value pairs from the HTML fragments identified by the specification detector, we again use supervised learning to classify columns as attribute column or value column. Compared to DEXTER, the current state-of-the-art approach for extracting attribute-value pairs from product specifications, we introduce several new features for specification detection and support the extraction of attribute-value pairs from specifications having more than two columns. This allows us to improve the F-score up to 10% for extracting attribute-value pairs from tables and up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops. This experiment confirms the suitability of duplicate-based schema matching for product data integration.


LD4IE'14 Proceedings of the Second International Conference on Linked Data for Information Extraction - Volume 1267 | 2014

Learning regular expressions for the extraction of product attributes from E-commerce microdata

Petar Petrovski; Volha Bryl


Social Work | 2018

A machine learning approach for product matching and categorization: Use case: Enriching product ads with semantic structured data

Petar Ristoski; Petar Petrovski; Peter Mika; Heiko Paulheim


Archive | 2015

Web Data Commons - Gold Standard for Feature Extraction

Petar Petrovski; Anna Primpeli; Robert Meusel


Archive | 2015

Web Data Commons - Product Data Corpus

Petar Petrovski; Robert Meusel; Anna Primpeli

Collaboration


Dive into the Petar Petrovski's collaboration.

Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Volha Bryl

University of Mannheim

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Researchain Logo
Decentralizing Knowledge