Ehud Aharoni
IBM
Publications
Featured research published by Ehud Aharoni.
Intelligent Systems in Molecular Biology | 2008
Michal Rosen-Zvi; Andre Altmann; Mattia Prosperi; Ehud Aharoni; Hani Neuvirth; Anders Sönnerborg; Eugen Schülter; Daniel Struck; Yardena Peres; Francesca Incardona; Rolf Kaiser; Maurizio Zazzi; Thomas Lengauer
Motivation: Optimizing HIV therapies is crucial since the virus rapidly develops mutations to evade drug pressure. Recent studies have shown that genotypic information might not be sufficient for the design of therapies and that other clinical and demographic factors may play a role in therapy failure. This study is designed to assess the improvement in prediction achieved when such information is taken into account. We use these factors to generate a prediction engine using a variety of machine learning methods and to determine which clinical conditions are most misleading in terms of predicting the outcome of a therapy. Results: Three different machine learning techniques were used: a generative–discriminative method, regression with derived evolutionary features, and regression with a mixture of effects. All three methods had similar performances, with an area under the receiver operating characteristic curve (AUC) of 0.77. A set of three similar engines limited to genotypic information only achieved an AUC of 0.75. A straightforward combination of the three engines consistently improves the prediction, with significantly better prediction when the full set of features is employed. The combined engine improves on predictions obtained from an online state-of-the-art resistance interpretation system. Moreover, the engines tend to disagree more on the outcome of failing therapies than on successful ones. Careful analysis of the differences between the engines revealed the mutations and drugs most closely associated with uncertainty about the therapy outcome. Availability: The combined prediction engine will be available from July 2008; see http://engine.euresist.org. Contact: [email protected]
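A straightforward combination of probabilistic engines of this kind can be illustrated by averaging their predicted success probabilities and scoring the result with ROC AUC. The sketch below is a minimal illustration only, assuming three already-trained scikit-learn-style classifiers and a held-out test set; it is not the EuResist implementation.

    # Minimal sketch: average the success probabilities of several trained
    # engines and compare the combined AUC to the individual AUCs.
    # (Illustrative only; not the EuResist code.)
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def combined_auc(engines, X_test, y_test):
        # Each engine is assumed to expose predict_proba(X); column 1 is P(success).
        probs = [eng.predict_proba(X_test)[:, 1] for eng in engines]
        combined = np.mean(probs, axis=0)  # simple unweighted average
        individual = [roc_auc_score(y_test, p) for p in probs]
        return individual, roc_auc_score(y_test, combined)

With three hypothetical engines corresponding to the generative–discriminative, evolutionary-feature, and mixture-of-effects models, the returned pair lets the combined engine be compared against each individual one.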
PLOS ONE | 2008
Andre Altmann; Michal Rosen-Zvi; Mattia Prosperi; Ehud Aharoni; Hani Neuvirth; Eugen Schülter; Joachim Büch; Daniel Struck; Yardena Peres; Francesca Incardona; Anders Sönnerborg; Rolf Kaiser; Maurizio Zazzi; Thomas Lengauer
Background Analysis of the viral genome for drug resistance mutations is state-of-the-art for guiding treatment selection for human immunodeficiency virus type 1 (HIV-1)-infected patients. These mutations alter the structure of viral target proteins and reduce or, in the worst case, completely inhibit the effect of antiretroviral compounds while maintaining the virus's ability to replicate effectively. Modern anti-HIV-1 regimens comprise multiple drugs in order to prevent or at least delay the development of resistance mutations. However, commonly used HIV-1 genotype interpretation systems provide only classifications for single drugs. The EuResist initiative has collected data from about 18,500 patients to train three classifiers for predicting response to combination antiretroviral therapy, given the viral genotype and further information. In this work we compare different classifier fusion methods for combining the individual classifiers. Principal Findings The individual classifiers yielded similar performance, and all the combination approaches considered performed equally well. The gain in performance due to combining methods did not reach statistical significance compared to the single best individual classifier on the complete training set. However, on smaller training set sizes (200 to 1,600 instances compared to 2,700), the combination significantly outperformed the individual classifiers (p<0.01; paired one-sided Wilcoxon test). Together with a consistent reduction of the standard deviation compared to the individual prediction engines, this shows a more robust behavior of the combined system. Moreover, using the combined system we were able to identify a class of therapy courses that led to a consistent underestimation (about 0.05 AUC) of the system performance. The discovery of these therapy courses is a further indication of the robustness of the combined system. Conclusion The combined EuResist prediction engine is freely available at http://engine.euresist.org.
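The significance claim above rests on a paired one-sided Wilcoxon signed-rank test over matched performance scores. A minimal SciPy sketch, assuming two arrays of AUC values measured on the same resampled training sets (the numbers are illustrative, not the study's data):

    # Paired one-sided Wilcoxon signed-rank test on matched AUC scores
    # from the same data splits (values are illustrative).
    from scipy.stats import wilcoxon

    auc_combined = [0.78, 0.77, 0.79, 0.76, 0.78, 0.77, 0.80, 0.76]
    auc_single   = [0.76, 0.75, 0.77, 0.75, 0.76, 0.74, 0.78, 0.75]

    # H1: the combined system scores higher than the single best classifier.
    stat, p_value = wilcoxon(auc_combined, auc_single, alternative="greater")
    print(f"W = {stat}, one-sided p = {p_value:.4f}")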
Empirical Methods in Natural Language Processing | 2015
Ruty Rinott; Lena Dankin; Carlos Alzate Perez; Mitesh M. Khapra; Ehud Aharoni; Noam Slonim
Engaging in a debate with oneself or others to make decisions is an integral part of our day-to-day life. A debate on a topic (say, the use of performance-enhancing drugs) typically proceeds by one party making an assertion/claim (say, PEDs are bad for health) and then providing evidence to support the claim (say, a 2006 study shows that PEDs have psychiatric side effects). In this work, we propose the task of automatically detecting, in unstructured text, such evidence supporting a given claim. This task has many practical applications in decision support and persuasion enhancement in a wide range of domains. We first introduce an extensive benchmark data set tailored for this task, which allows training statistical models and assessing their performance. Then, we suggest a system architecture based on supervised learning to address the evidence detection task. Finally, promising experimental results are reported.
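A simple supervised baseline for the evidence-detection task described above is to featurize each (claim, candidate sentence) pair and train a binary classifier. The sketch below uses TF-IDF features and logistic regression; it is a toy baseline under assumed data, not the system architecture proposed in the paper.

    # Toy baseline for claim-evidence detection: concatenate claim and candidate
    # text, extract TF-IDF features, and train a binary classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training pairs and 0/1 evidence labels.
    pairs = [
        ("PEDs are bad for health",
         "A 2006 study shows that PEDs have psychiatric side effects"),
        ("PEDs are bad for health",
         "The debate received wide media coverage last year"),
    ]
    labels = [1, 0]

    texts = [claim + " ||| " + candidate for claim, candidate in pairs]
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(texts, labels)

    # Candidate sentences for a claim can then be ranked by predicted probability.
    evidence_scores = model.predict_proba(texts)[:, 1]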
Meeting of the Association for Computational Linguistics | 2014
Ehud Aharoni; Anatoly Polnarov; Tamar Lavee; Daniel Hershcovich; Ran Levy; Ruty Rinott; Dan Gutfreund; Noam Slonim
We describe a novel and unique argumentative structure dataset. This corpus consists of data extracted from hundreds of Wikipedia articles using a meticulously monitored manual annotation process. The result is 2,683 argument elements, collected in the context of 33 controversial topics, organized under a simple claim-evidence structure. The obtained data are publicly available for academic research.
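One way to picture the claim-evidence structure mentioned above is as a small record per argument element. The field names below are assumptions for illustration, not the published annotation schema:

    # Hypothetical record layout for one argument element in a claim-evidence
    # corpus; field names are illustrative, not the actual schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ArgumentElement:
        topic: str                                          # controversial topic
        claim: str                                          # claim text
        evidence: List[str] = field(default_factory=list)   # supporting evidence spans
        source_article: str = ""                            # Wikipedia article title

    element = ArgumentElement(
        topic="Use of performance-enhancing drugs",
        claim="PEDs are bad for health",
        evidence=["A 2006 study shows that PEDs have psychiatric side effects"],
        source_article="Performance-enhancing substance",
    )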
International World Wide Web Conferences | 2016
Haggai Roitman; Shay Hummel; Ella Rabinovich; Benjamin Sznajder; Noam Slonim; Ehud Aharoni
This work presents a novel claim-oriented document retrieval task. For a given controversial topic, relevant articles containing claims that support or contest the topic are retrieved from a Wikipedia corpus. To this end, a two-step retrieval approach is proposed. In the first step, an initial pool of articles relevant to the topic is retrieved using state-of-the-art retrieval methods. In the second step, articles in the initial pool are re-ranked according to their potential to contain as many relevant claims as possible, using several claim discovery features. Hence, the second step aims at maximizing the overall claim recall of the retrieval system. Using a recently published claims benchmark, the proposed retrieval approach is shown to provide more relevant claims than several other retrieval alternatives.
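The re-ranking step described above can be sketched as re-scoring the initially retrieved pool with a claim-likelihood estimate blended with the original retrieval score. The weighting and the scoring function below are assumptions for illustration, not the claim discovery features used in the paper.

    # Illustrative two-step claim-oriented retrieval: re-rank an initial pool of
    # (doc_id, retrieval_score) pairs by their estimated claim yield.
    def rerank_by_claim_potential(pool, claim_score_fn, alpha=0.5):
        # claim_score_fn(doc_id) estimates how likely the article is to contain
        # many relevant claims; both scores are assumed normalized to [0, 1].
        rescored = [
            (doc_id, alpha * retrieval_score + (1 - alpha) * claim_score_fn(doc_id))
            for doc_id, retrieval_score in pool
        ]
        return sorted(rescored, key=lambda item: item[1], reverse=True)

    # Hypothetical usage with a normalized initial pool from a standard engine.
    pool = [("Doping_in_sport", 0.91), ("Olympic_Games", 0.85)]
    reranked = rerank_by_claim_potential(
        pool, claim_score_fn=lambda doc_id: 0.8 if "Doping" in doc_id else 0.2)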
IBM Journal of Research and Development | 2011
Ehud Aharoni; Shai Fine; Yaara Goldschmidt; Ofer Lavi; Oded Margalit; Michal Rosen-Zvi; Lavi Shpigelman
Modern computer systems generate an enormous number of logs. IBM Mining Effectively Large Output Data Yield (MELODY) is a unique and innovative solution for handling these logs and filtering out the anomalies and failures. MELODY can detect system errors early on and avoid subsequent crashes by identifying the root causes of such errors. By analyzing the logs leading up to a problem, MELODY can pinpoint when and where things went wrong and visually present them to the user, ensuring that corrections are made accurately and effectively. We present the MELODY solution and describe its architecture, algorithmic components, functions, and benefits. After being trained on a large portion of relevant data, MELODY provides alerts of abnormalities in newly arriving log files or in streams of logs. The solution is being used by IBM services groups that support IBM xSeries® servers on a regular basis. MELODY was recently tested with ten large IBM customers who use zSeries® machines and was found to be extremely useful for the information technology experts in those companies. They found that the solution's ability to reduce extensive log data to manageable sets of highlighted messages saved them time and helped them make better use of the data.
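The general pattern behind such log analytics, learning a baseline from historical logs and alerting on abnormal message frequencies in newly arriving logs, can be sketched as follows. This is a simplified illustration, not the MELODY algorithm:

    # Simplified illustration of log anomaly alerting: learn baseline frequencies
    # of message templates from historical logs, then flag templates whose rate
    # in a new log deviates strongly from the baseline. Not the MELODY algorithm.
    from collections import Counter

    def train_baseline(training_logs):
        # training_logs: iterable of logs, each a list of message templates.
        counts = Counter(t for log in training_logs for t in log)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    def alert(new_log, baseline, ratio_threshold=10.0, unseen_floor=1e-6):
        counts = Counter(new_log)
        total = sum(counts.values()) or 1
        alerts = []
        for template, count in counts.items():
            expected = baseline.get(template, unseen_floor)
            observed = count / total
            if observed / expected >= ratio_threshold:
                alerts.append((template, observed, expected))
        return alerts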
IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2011
Ehud Aharoni; Hani Neuvirth; Saharon Rosset
The common scenario in computational biology in which a community of researchers conduct multiple statistical tests on one shared database gives rise to the multiple hypothesis testing problem. Conventional procedures for solving this problem control the probability of false discovery by sacrificing some of the power of the tests. We suggest a scheme for controlling false discovery without any power loss by adding new samples for each use of the database and charging the user with the expenses. The crux of the scheme is a carefully crafted pricing system that fairly prices different user requests based on their demands while keeping the probability of false discovery bounded. We demonstrate this idea in the context of HIV treatment research, where multiple researchers conduct tests on a repository of HIV samples.
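The heart of such a scheme is granting each new test a portion of the overall error budget and charging enough fresh samples to keep its power intact at the stricter threshold. The rule below is an illustrative alpha-spending-style toy (allotting the i-th request a level of alpha/2^i, so the levels sum to at most alpha), not the paper's actual pricing system:

    # Toy pricing rule (not the paper's): request i is tested at level
    # alpha / 2**i, so the total error budget stays below alpha, and is charged
    # the sample size needed to keep power 1 - beta for a one-sided z-test
    # detecting a standardized effect size d.
    from scipy.stats import norm

    def price_request(i, alpha=0.05, beta=0.2, effect_size=0.3):
        alpha_i = alpha / (2 ** i)
        n = ((norm.ppf(1 - alpha_i) + norm.ppf(1 - beta)) / effect_size) ** 2
        return alpha_i, int(n) + 1   # significance level and samples to add

    for i in range(1, 5):
        level, n_samples = price_request(i)
        print(f"request {i}: test at level {level:.4g}, charge {n_samples} new samples")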
IBM Journal of Research and Development | 2016
Ehud Aharoni; Ron Peleg; Shmuel Regev; Tamer Salman
Every day, massive amounts of system events from software agents deployed at endpoint devices across the world are received by the IBM Trusteer security group. The software associated with each event is verified with respect to third-party malware inspection services such as VirusTotal. Unfortunately, many events are associated with software that is unrecognized by inspection services. As a result, it is impossible to manually investigate and react to all of them. Traditional quantitative analysis is nearly useless because benign anomalies and attacks are indiscernible. We developed a system that continuously and automatically processes streaming data to help identify suspicious activity. The data comprises low-level traces of process activity. Each streamed activity is augmented with a signature that heuristically biases the degree of suspicion associated with the activity. The system then flags activities that are unknown to inspection services and likely to be malicious. It extracts behavioral and statistical information from the events, builds a predictive model based on supervised learning, and ranks the events suspected of being malicious. We tested the system using VirusTotal on three months of historical data. The results showed we were able to predict more than two thirds of the malicious events unknown at that time, with less than a 2% false positive rate.
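The ranking step described above, training a supervised model over behavioral and statistical features and ordering unknown events by predicted maliciousness, can be sketched as follows. The feature columns, model choice, and values are assumptions for illustration, not the deployed system.

    # Illustrative sketch: rank unknown events by predicted probability of being
    # malicious using a supervised model; features and values are hypothetical.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Rows = events; columns = assumed behavioral/statistical features, e.g.
    # child-process count, file writes, outbound connections.
    X_train = np.array([[3, 12, 1], [1, 2, 0], [5, 40, 7], [2, 3, 0]])
    y_train = np.array([1, 0, 1, 0])   # 1 = confirmed malicious, 0 = benign

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Events unknown to inspection services are ranked so that analysts can
    # investigate the most suspicious ones first.
    X_unknown = np.array([[4, 30, 5], [1, 1, 0]])
    scores = model.predict_proba(X_unknown)[:, 1]
    ranking = np.argsort(-scores)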
Genetic Epidemiology | 2014
Saharon Rosset; Ehud Aharoni; Hani Neuvirth
Issues of publication bias, lack of replicability, and false discovery have long plagued the genetics community. Proper utilization of public and shared data resources presents an opportunity to ameliorate these problems. We present an approach to public database management that we term Quality Preserving Database (QPD). It enables perpetual use of the database for testing statistical hypotheses while controlling false discovery and avoiding publication bias on the one hand, and maintaining testing power on the other hand. We demonstrate it on a use case of a replication server for GWAS findings, underlining its practical utility. We argue that a shift to using QPD in managing current and future biological databases will significantly enhance the community's ability to make efficient and statistically sound use of the available data resources.
ACM International Conference on Systems and Storage | 2017
Shelly Garion; Hillel Kolodner; Allon Adir; Ehud Aharoni; Lev Greenberg
We use Apache Spark analytics to investigate the logs of an operational cloud object store service to understand how it is being used. This investigation involves retroactively going over very large amounts of historical data (petabytes of records in some cases) collected over long periods of time. Existing tools, such as Elasticsearch-Logstash-Kibana (ELK), are mainly used for presenting short-term metrics and cannot perform advanced analytics such as machine learning. A possible solution is to save, for long periods, only certain aggregations or calculations produced from the raw log data, such as averages or histograms; however, these must be decided in advance and cannot be changed retroactively once the raw data has been discarded. Spark allows us to gain insights by going over historical data collected over long periods of time and to apply models built from that historical data to online data in a simple and efficient way.
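A minimal PySpark sketch of the kind of retroactive analysis described above, scanning large volumes of historical access logs and computing aggregations that were not decided in advance, might look like the following. The paths, column names, and log schema are assumptions.

    # Minimal PySpark sketch: scan historical object store access logs and compute
    # per-account, per-operation request counts and average latency.
    # Paths and column names are illustrative assumptions about the log schema.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("objectstore-log-analytics").getOrCreate()

    logs = spark.read.json("s3a://logs-bucket/objectstore/year=*/month=*/day=*/")

    summary = (
        logs.groupBy("account_id", "operation")
            .agg(F.count("*").alias("requests"),
                 F.avg("latency_ms").alias("avg_latency_ms"))
            .orderBy(F.desc("requests"))
    )

    summary.write.mode("overwrite").parquet("s3a://analytics-bucket/summary/")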