Vladimir Dobrynin
Saint Petersburg State University
Publication
Featured research published by Vladimir Dobrynin.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008
Konstantin Avrachenkov; Vladimir Dobrynin; Danil Nemirovsky; Son K. Pham; Elena Smirnova
Clustering hypertext document collections is an important task in Information Retrieval. Most clustering methods are based on document content and do not take into account the hypertext links. Here we propose a novel PageRank-based clustering (PRC) algorithm which uses the hypertext structure. The PRC algorithm produces graph partitioning with high modularity and coverage. The comparison of the PRC algorithm with two content-based clustering algorithms shows that there is a good match between PRC clustering and content-based clustering.
European Conference on Information Retrieval | 2004
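The abstract above evaluates partitions by modularity. As a minimal sketch of that quality measure (not of the PRC algorithm itself), the following computes Newman modularity for a given partition. It assumes an undirected, unweighted graph without self-loops, whereas hypertext links are directed; the function name `modularity` and the toy graph are illustrative only.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once (assumption).
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in one community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community[node]] += d
    # Q = sum over communities of (fraction of internal edges
    #     minus the squared fraction of edge endpoints in the community)
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community)
```

Splitting the toy graph into its two triangles gives a clearly positive score, while placing every node in one community gives zero.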
Vladimir Dobrynin; David W. Patterson; Niall Rooney
In this paper we present a novel algorithm for document clustering. This approach is based on distributional clustering where subject related words, which have a narrow context, are identified to form meta-tags for that subject. These contextual words form the basis for creating thematic clusters of documents. In a similar fashion to other research papers on document clustering, we analyze the quality of this approach with respect to document categorization problems and show it to outperform the information theoretic method of sequential information bottleneck.
Knowledge Based Systems | 2008
David W. Patterson; Niall Rooney; Mykola Galushka; Vladimir Dobrynin; Elena Smirnova
In this paper, we present a novel textual case-based reasoning system called SOPHIA-TCBR which provides a means of clustering semantically related textual cases where individual clusters are formed through the discovery of narrow themes which then act as attractors for related cases. During this process, SOPHIA-TCBR automatically discovers appropriate case and similarity knowledge. It is then able to organize the cases within each cluster by forming a minimum spanning tree, based on their semantic similarity. SOPHIA's capability as a case-based text classifier is benchmarked against the well-known and widely utilised k-Means approach. Results show that SOPHIA either equals or outperforms k-Means on 2 different case-bases, and as such is an attractive approach for case-based classification. We demonstrate the quality of the knowledge discovery process by showing the high level of topic similarity between adjacent cases within the minimum spanning tree. We show that the formation of the minimum spanning tree makes it possible to identify a kernel region within the cluster, which has a higher level of similarity between cases than the cluster in its entirety, and that this corresponds directly to a higher level of topic homogeneity. We demonstrate that the topic homogeneity increases as the average semantic similarity between cases in the kernel increases. Finally, having empirically demonstrated the quality of the knowledge discovery process in SOPHIA, we show how it can be competently applied to case-based retrieval.
Information Processing and Management | 2006
Niall Rooney; David W. Patterson; Mykola Galushka; Vladimir Dobrynin
In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large data-sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering, which automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated based on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scalable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.
International Conference of the IEEE Engineering in Medicine and Biology Society | 2005
Vladimir Dobrynin; David W. Patterson; Mykola Galushka; Niall Rooney
The ability to perform an exploratory search and retrieval of relevant documents from a large collection of domain-specific documents is an important requirement both in the field of medicine and in other areas. In this paper, we present an unsupervised distributional clustering technique called SOPHIA. SOPHIA provides a semantically meaningful visual clustering of the document corpus in conjunction with an intuitive interactive search facility. We assess the effectiveness of SOPHIA's cluster-based information retrieval for the MEDLINE test-set collection known as OHSUMED.
Discrete Mathematics | 2004
Vladimir Dobrynin
Let X be a real symmetric matrix indexed by the vertices of a graph G such that all its diagonal entries are 1, X_ij = 0 whenever vertices i, j are non-adjacent, and |X_ij| ⩽ 1 for all other entries of X. Let r(G) be the minimum possible rank of the matrix X. Then α(G) ⩽ r(G) ⩽ χ(G). It is well known that there is no upper bound on χ(G) in terms of α(G): for every natural k ⩾ 2 there exists a graph G such that α(G) = 2 and χ(G) = k. So it is interesting to find out whether there is an upper bound on χ(G) in terms of r(G). It is proved here that r(G) = i iff d(G) = i for i ⩽ 3. Here d(G) is the minimum dimension of the orthonormal labellings of G. Hence, if r(G) ⩽ 3 then χ(G) ⩽ 2 r(G)−1.
International Conference Stability and Control Processes | 2015
Vladimir Dobrynin; Y Balykina; Michael Kamalov
The paper describes the clustering of article abstracts, taken from MEDLINE, the largest bibliographic database of life sciences and biomedical information, into categories that correspond to types of medical interventions - types of patient treatments. Experiments were carried out to evaluate the quality of clustering for the following algorithms: K-means, K-means++, hierarchical clustering, and SIB (Sequential Information Bottleneck), together with the LSA (Latent Semantic Analysis) and MI (Mutual Information) methods, which allow selecting feature vectors. The best clustering results were achieved by K-means++ together with LSA when a 210-dimensional space was chosen: Purity = 0.5719, Entropy = 1.3841, Normalized Entropy = 0.6299.
Federated Conference on Computer Science and Information Systems | 2015
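The three scores quoted above are standard external clustering measures. The sketch below shows one common way to compute them from cluster assignments and reference labels. The exact definitions used in the paper are not given in the abstract, so the natural logarithm and the normalization by the log of the number of classes are assumptions, as are the function name `purity_and_entropy` and the toy data.

```python
import math
from collections import Counter

def purity_and_entropy(clusters, labels):
    """Return (purity, entropy, normalized entropy) of a clustering.

    clusters: cluster id per document; labels: reference class per document.
    Assumes natural logarithms and normalization by log(number of classes).
    """
    n = len(labels)
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    # Purity: fraction of documents that belong to their cluster's majority class.
    purity = sum(max(cnt.values()) for cnt in by_cluster.values()) / n
    # Entropy: size-weighted average of the class entropy inside each cluster.
    entropy = 0.0
    for cnt in by_cluster.values():
        size = sum(cnt.values())
        h = -sum((k / size) * math.log(k / size) for k in cnt.values())
        entropy += (size / n) * h
    num_classes = len(set(labels))
    normalized = entropy / math.log(num_classes) if num_classes > 1 else 0.0
    return purity, entropy, normalized

# Toy data: two clusters over six documents with two intervention types.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["drug", "drug", "device", "device", "device", "device"]
purity, entropy, normalized = purity_and_entropy(clusters, labels)
```

Higher purity and lower entropy both indicate clusters that align more closely with the reference categories.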
Vladimir Dobrynin; Julia Balykina; Michael Kamalov; A Kolbin; Elena Verbitskaya; Munira Kasimova
The paper is devoted to the classification of MEDLINE abstracts into categories that correspond to types of medical interventions - types of patient treatments. This set of categories was extracted from the Clinicaltrials.gov web site. A few classification algorithms were tested, including the Multinomial Naive Bayes, Multinomial Logistic Regression, and Linear SVM implementations from the sklearn machine learning library. Document marking was based on the consideration of abstracts containing links to the Clinicaltrials.gov web site. As a result of automatic marking, 3534 abstracts were marked for training and testing the set of algorithms mentioned above. The best result of multinomial classification was achieved by Linear SVM, with macro evaluation precision 70.06%, recall 55.62% and F-measure 62.01%, and micro evaluation precision 64.91%, recall 79.13% and F-measure 71.32%.
Constructive Nonsmooth Analysis and Related Topics | 2017
Mikhail Kamalov; Vladimir Dobrynin; Y Balykina
Nowadays the online learning area is actively developing as a part of machine learning. In this regard, there arises the problem of choosing an algorithm that solves the optimization problem with regard to online data processing. Since ranking is currently one of the active areas of online learning, a comparison of several state-of-the-art online optimization algorithms for the multi-armed bandit problem in the case of online ranking is presented.
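The reported F-measures are consistent with the harmonic mean of the reported precision and recall, which can be checked with a few lines. The helper name `f_measure` is illustrative and the balanced (F1) form is assumed.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall percentages as reported in the abstract above.
macro_f = f_measure(70.06, 55.62)
micro_f = f_measure(64.91, 79.13)
print(round(macro_f, 2), round(micro_f, 2))  # 62.01 71.32
```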
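For readers unfamiliar with the multi-armed bandit setting mentioned above, here is a minimal sketch of UCB1, one classical bandit algorithm. The abstract does not name the algorithms that were compared, so this is only an illustration of the problem setting, not the authors' method; the Bernoulli reward model, the function name `ucb1`, and the arm probabilities are assumptions.

```python
import math
import random

def ucb1(arm_probs, rounds, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled.

    arm_probs: hypothetical success probability of each arm (unknown to the learner).
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k    # pulls per arm
    totals = [0.0] * k  # summed reward per arm
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # Choose the arm with the highest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

counts = ucb1([0.2, 0.8], rounds=2000)
```

In an online ranking setting, the arms could stand for candidate items or rankers and the reward for a click, but that mapping is an assumption rather than something the abstract specifies.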
Electronic Notes in Discrete Mathematics | 2000
Vladimir Dobrynin
In this paper we investigate the upper bound on the smallest number of cliques that cover the vertices of a graph G in terms of the rank of a matrix associated with it. Let X be a real matrix indexed by the vertices of G such that all its diagonal entries are non-zero and X_ij = 0 whenever vertices i, j are non-adjacent. Then χ(G) ≤ 3 rk(X)/2