Publications
Featured research published by Avijit Hazra.
Indian Journal of Dermatology | 2016
Avijit Hazra; Nithya Gogtay
Numerical data that are normally distributed can be analyzed with parametric tests, that is, tests which are based on the parameters that define a normal distribution curve. If the distribution is uncertain, the data can be plotted as a normal probability plot and visually inspected, or tested for normality using one of a number of goodness-of-fit tests, such as the Kolmogorov–Smirnov test. The widely used Student's t-test has three variants. The one-sample t-test is used to assess if a sample mean (as an estimate of the population mean) differs significantly from a given population mean. The means of two independent samples may be compared for a statistically significant difference by the unpaired or independent samples t-test. If the data sets are related in some way, their means may be compared by the paired or dependent samples t-test. The t-test should not be used to compare the means of more than two groups. Although it is possible to compare groups in pairs when there are more than two groups, this will increase the probability of a Type I error. The one-way analysis of variance (ANOVA) is employed to compare the means of three or more independent data sets that are normally distributed. Multiple measurements from the same set of subjects cannot be treated as separate, unrelated data sets. Comparison of means in such a situation requires repeated measures ANOVA. It is to be noted that while a multiple group comparison test such as ANOVA can point to a significant difference, it does not identify exactly between which two groups the difference lies. To do this, multiple group comparison needs to be followed up by an appropriate post hoc test. An example is Tukey's honestly significant difference test following ANOVA. If the assumptions for parametric tests are not met, there are nonparametric alternatives for comparing data sets. These include the Mann–Whitney U-test as the nonparametric counterpart of the unpaired Student's t-test, the Wilcoxon signed-rank test as the counterpart of the paired Student's t-test, the Kruskal–Wallis test as the nonparametric equivalent of ANOVA, and the Friedman test as the counterpart of repeated measures ANOVA.
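As a minimal, illustrative sketch only (the article names no software, so Python's scipy.stats module and the data values are assumptions), the tests described above could be applied along these lines:

# Parametric and nonparametric comparisons with scipy.stats (hypothetical data)
from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
group_b = [4.2, 4.5, 4.1, 4.8, 4.3, 4.6]
group_c = [6.0, 5.8, 6.3, 5.9, 6.1, 6.2]

# Kolmogorov-Smirnov goodness-of-fit test against a normal distribution
print(stats.kstest(group_a, 'norm', args=(5.2, 0.3)))

print(stats.ttest_ind(group_a, group_b))            # unpaired (independent samples) t-test
print(stats.ttest_rel(group_a, group_b))            # paired t-test for related data sets
print(stats.f_oneway(group_a, group_b, group_c))    # one-way ANOVA for three groups

# Nonparametric counterparts named in the abstract
print(stats.mannwhitneyu(group_a, group_b))         # counterpart of the unpaired t-test
print(stats.wilcoxon(group_a, group_b))             # counterpart of the paired t-test
print(stats.kruskal(group_a, group_b, group_c))     # counterpart of one-way ANOVA
print(stats.friedmanchisquare(group_a, group_b, group_c))  # counterpart of repeated measures ANOVA

A post hoc test such as Tukey's honestly significant difference test would follow a significant ANOVA; recent SciPy versions provide scipy.stats.tukey_hsd for this purpose.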
Indian Journal of Dermatology | 2016
Avijit Hazra; Nithya Gogtay
Categorical variables are commonly represented as counts or frequencies. For analysis, such data are conveniently arranged in contingency tables. Conventionally, such tables are designated as r × c tables, with r denoting the number of rows and c denoting the number of columns. The Chi-square (χ2) probability distribution is particularly useful in analyzing categorical variables. A number of tests yield test statistics that fit, at least approximately, a χ2 distribution and hence are referred to as χ2 tests. Examples include Pearson's χ2 test (or simply the χ2 test), McNemar's χ2 test, the Mantel–Haenszel χ2 test and others. Pearson's χ2 test is the most commonly used test for assessing difference in distribution of a categorical variable between two or more independent groups. If the groups are ordered in some manner, the χ2 test for trend should be used. Fisher's exact probability test is a test of the independence between two dichotomous categorical variables. It provides a better alternative to the χ2 statistic for assessing the difference between two independent proportions when numbers are small, but cannot be applied to a contingency table larger than a two-dimensional one. McNemar's χ2 test assesses the difference between paired proportions. It is used when the frequencies in a 2 × 2 table represent paired samples or observations. Cochran's Q test is a generalization of the McNemar test that compares more than two related proportions. The P value from the χ2 test or its counterparts does not indicate the strength of the difference or association between the categorical variables involved. This information can be obtained from the relative risk or the odds ratio statistic, which are measures of dichotomous association obtained from 2 × 2 tables.
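As a hedged sketch (Python with scipy.stats is assumed, and the counts are hypothetical), a 2 × 2 table might be analyzed as follows:

# Hypothetical 2 x 2 contingency table: rows = groups, columns = outcome
from scipy import stats

table = [[20, 30],   # group 1: 20 with the outcome, 30 without
         [10, 40]]   # group 2: 10 with the outcome, 40 without

# Pearson's chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)

# Fisher's exact test, preferred when expected cell counts are small
odds_ratio, p_exact = stats.fisher_exact(table)
print(odds_ratio, p_exact)

# Odds ratio computed directly from the table as (a*d)/(b*c)
(a, b), (c, d) = table
print((a * d) / (b * c))

For paired proportions, McNemar's test is available elsewhere, for example in the statsmodels package (statsmodels.stats.contingency_tables.mcnemar).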
Indian Journal of Dermatology | 2016
Avijit Hazra; Nithya Gogtay
Correlation and linear regression are the most commonly used techniques for quantifying the association between two numeric variables. Correlation quantifies the strength of the linear relationship between paired variables, expressing this as a correlation coefficient. If both variables x and y are normally distributed, we calculate Pearson's correlation coefficient (r). If the normality assumption is not met for one or both variables in a correlation analysis, a rank correlation coefficient, such as Spearman's rho (ρ), may be calculated. A hypothesis test of correlation assesses whether the linear relationship between the two variables holds in the underlying population, in which case it returns P < 0.05. A 95% confidence interval of the correlation coefficient can also be calculated to give an idea of the correlation in the population. The value r2 denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x and is called the coefficient of determination. Linear regression is a technique that attempts to link two correlated variables x and y in the form of a mathematical equation (y = a + bx), such that given the value of one variable the other may be predicted. In general, the method of least squares is applied to obtain the equation of the regression line. Correlation and linear regression analysis are based on certain assumptions pertaining to the data sets. If these assumptions are not met, misleading conclusions may be drawn. The first assumption is that of a linear relationship between the two variables. A scatter plot is essential before embarking on any correlation-regression analysis to show that this is indeed the case. Outliers or clustering within data sets can distort the correlation coefficient value. Finally, it is vital to remember that though strong correlation can be a pointer toward causation, the two are not synonymous.
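A brief sketch of these calculations on made-up paired data, assuming Python's scipy.stats module:

# Correlation and least-squares regression (hypothetical paired observations)
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

r, p_pearson = stats.pearsonr(x, y)       # Pearson's r (assumes normality)
rho, p_spearman = stats.spearmanr(x, y)   # Spearman's rho (rank-based alternative)
print(r, r**2)                            # r squared = coefficient of determination

# Simple linear regression y = a + b*x by the method of least squares
fit = stats.linregress(x, y)
print(fit.intercept, fit.slope)           # a and b of the regression line

A scatter plot of x against y should always be examined first, as the abstract notes, before trusting either coefficient.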
Indian Journal of Dermatology | 2016
Avijit Hazra; Nithya Gogtay
Hypothesis testing (or statistical inference) is one of the major applications of biostatistics. Much of medical research begins with a research question that can be framed as a hypothesis. Inferential statistics begins with a null hypothesis that reflects the conservative position of no change or no difference in comparison to baseline or between groups. Usually, the researcher has reason to believe that there is some effect or some difference, which is the alternative hypothesis. The researcher therefore proceeds to study samples and measure outcomes in the hope of generating evidence strong enough for the statistician to be able to reject the null hypothesis. The concept of the P value is almost universally used in hypothesis testing. It denotes the probability of obtaining by chance a result at least as extreme as that observed, even when the null hypothesis is true and no real difference exists. Usually, if P < 0.05, the null hypothesis is rejected and the sample results are deemed statistically significant. With the increasing availability of computers and access to specialized statistical software, the drudgery involved in statistical calculations is now a thing of the past, once the learning curve of the software has been traversed. The life sciences researcher is therefore free to devote themselves to optimally designing the study, carefully selecting the hypothesis tests to be applied, and taking care in conducting the study well. Unfortunately, selecting the right test seems difficult initially. Thinking of the research hypothesis as addressing one of five generic research questions helps in selection of the right hypothesis test. In addition, it is important to be clear about the nature of the variables (e.g., numerical vs. categorical; parametric vs. nonparametric) and the number of groups or data sets being compared (e.g., two or more than two) at a time. The same research question may be explored by more than one type of hypothesis test. While this may be of utility in highlighting different aspects of the problem, merely reapplying different tests to the same issue in the hope of finding a P < 0.05 is a wrong use of statistics. Finally, it is becoming the norm that an estimate of the size of any effect, expressed with its 95% confidence interval, is required for meaningful interpretation of results. A large study is likely to have a small (and therefore “statistically significant”) P value, but a “real” estimate of the effect would be provided by the 95% confidence interval. If the intervals overlap between two interventions, then the difference between them is not so clear-cut even if P < 0.05. The two approaches are now considered complementary to one another.
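To make the complementary roles of the P value and the confidence interval concrete, here is a small sketch on hypothetical data; Python with scipy is an assumed choice, and the 95% CI uses the usual normal approximation:

# P value from a two-sample t-test plus a 95% CI for the difference in means
import math
import statistics
from scipy import stats

treated = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 13.0, 12.2]
control = [11.0, 11.6, 10.9, 11.3, 11.8, 11.1, 11.5, 11.2]

t_stat, p_value = stats.ttest_ind(treated, control)
print(p_value)                            # reject the null hypothesis if < 0.05

# Effect size (difference in means) with an approximate 95% confidence interval
diff = statistics.mean(treated) - statistics.mean(control)
se = math.sqrt(statistics.variance(treated) / len(treated)
               + statistics.variance(control) / len(control))
print(diff, (diff - 1.96 * se, diff + 1.96 * se))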
Indian Journal of Dermatology | 2016
Avijit Hazra; Nithya Gogtay
Although the application of statistical methods to biomedical research began only some 150 years ago, statistics is now an integral part of medical research. A knowledge of statistics is also becoming mandatory to understand most medical literature. Data constitute the raw material for statistical work. They are records of measurement or observations or simply counts. A variable refers to a particular characteristic on which a set of data are recorded. Data are thus the values of a variable. It is important to understand the different types of data and their mutual interconversion. Biostatistics begins with descriptive statistics, which implies summarizing a collection of data from a sample or population. Categorical data are described in terms of percentages or proportions. With numerical data, individual observations within a sample or population tend to cluster about a central location, with more extreme observations being less frequent. The extent to which observations cluster is summarized by measures of central tendency, while the spread can be described by measures of dispersion. The confidence interval (CI) is an increasingly important measure of precision. When we observe samples, there is no way of assessing true population parameters. We can, however, obtain a standard error and use it to define a range in which the true population value is likely to lie with a certain acceptable level of uncertainty. This range is the CI, while its two terminal values are the confidence limits. Conventionally, the 95% CI is used. Patterns in data sets or data distributions are an important, albeit not so obvious, component of descriptive statistics. The most common distribution is the normal distribution, which is depicted as the well-known symmetrical bell-shaped Gaussian curve. Familiarity with other distributions, such as the binomial and Poisson distributions, is also helpful. Various graphs and plots have been devised to summarize data and trends visually. Some plots, such as the box-and-whiskers plot and the stem-and-leaf plot, are used less often but provide useful summaries in select situations.
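As a short sketch of these descriptive measures, using only the Python standard library on a hypothetical sample (the normal-approximation interval is the conventional 95% CI mentioned above):

# Central tendency, dispersion and a 95% confidence interval for the mean
import math
import statistics

sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0, 5.2, 4.6]

mean = statistics.mean(sample)            # measure of central tendency
median = statistics.median(sample)
sd = statistics.stdev(sample)             # measure of dispersion
se = sd / math.sqrt(len(sample))          # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se   # 95% confidence limits
print(mean, median, sd, (lower, upper))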
Indian Journal of Dermatology | 2016
Avijit Hazra; Nithya Gogtay
Determining the appropriate sample size for a study, whatever its type, is a fundamental aspect of biomedical research. An adequate sample ensures that the study will yield reliable information, regardless of whether the data ultimately suggest a clinically important difference between the interventions or elements being studied. The probability of Type 1 and Type 2 errors, the expected variance in the sample and the effect size are the essential determinants of sample size in interventional studies. Any method for deriving a conclusion from experimental data carries with it some risk of drawing a false conclusion. Two types of false conclusion may occur, called Type 1 and Type 2 errors, whose probabilities are denoted by the symbols α and β. A Type 1 error occurs when one concludes that a difference exists between the groups being compared when, in reality, it does not. This is akin to a false positive result. A Type 2 error occurs when one concludes that a difference does not exist when, in reality, a difference does exist, and it is equal to or larger than the effect size defined by the alternative to the null hypothesis. This may be viewed as a false negative result. When considering the risk of Type 2 error, it is more intuitive to think in terms of power of the study or (1 − β). Power denotes the probability of detecting a difference when a difference does exist between the groups being compared. Smaller α or larger power will increase sample size. Conventional acceptable values for power and α are 80% or above and 5% or below, respectively, when calculating sample size. Increasing variance in the sample tends to increase the sample size required to achieve a given power level. The effect size is the smallest clinically important difference that is sought to be detected and, rather than statistical convention, is a matter of past experience and clinical judgment. Larger samples are required if smaller differences are to be detected. Although the principles are long known, historically, sample size determination has been difficult, because of relatively complex mathematical considerations and numerous different formulas. However, of late, there has been remarkable improvement in the availability, capability, and user-friendliness of power and sample size determination software. Many can execute routines for determination of sample size and power for a wide variety of research designs and statistical tests. With the drudgery of mathematical calculation gone, researchers must now concentrate on determining appropriate sample size and achieving these targets, so that study conclusions can be accepted as meaningful.
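As an illustration of what such software does, the sketch below uses the statsmodels power module (an assumed choice; the article names no package) to find the sample size per group for an independent-samples t-test:

# Sample size for a two-group comparison given alpha, power and effect size
from statsmodels.stats.power import TTestIndPower

effect_size = 0.5    # smallest important difference as Cohen's d (illustrative value)
alpha = 0.05         # acceptable probability of a Type 1 error
power = 0.80         # 1 - beta; acceptable probability of a Type 2 error is 0.20

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=alpha, power=power)
print(round(n_per_group))   # about 64 subjects per group with these inputs

Halving the effect size to 0.25 roughly quadruples the required sample, illustrating that larger samples are needed to detect smaller differences.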
Indian Journal of Dermatology | 2017
Avijit Hazra; Nithya Gogtay
Multivariate analysis refers to statistical techniques that simultaneously look at three or more variables in relation to the subjects under investigation with the aim of identifying or clarifying the relationships between them. These techniques have been broadly classified as dependence techniques, which explore the relationship between one or more dependent variables and their independent predictors, and interdependence techniques, which make no such distinction but treat all variables equally in a search for underlying relationships. Multiple linear regression models a situation where a single numerical dependent variable is to be predicted from multiple numerical independent variables. Logistic regression is used when the outcome variable is dichotomous in nature. The log-linear technique models count-type data and can be used to analyze cross-tabulations where more than two variables are included. Analysis of covariance is an extension of analysis of variance (ANOVA), in which an additional independent variable of interest, the covariate, is brought into the analysis. It tries to examine whether a difference persists after “controlling” for the effect of the covariate that can impact the numerical dependent variable of interest. Multivariate analysis of variance (MANOVA) is a multivariate extension of ANOVA used when multiple numerical dependent variables have to be incorporated in the analysis. Interdependence techniques are more commonly applied to psychometrics, social sciences and market research. Exploratory factor analysis and principal component analysis are related techniques that seek to extract, from a larger number of metric variables, a smaller number of composite factors or components that are linearly related to the original variables. Cluster analysis aims to identify, in a large number of cases, relatively homogeneous groups called clusters, without prior information about the groups. The calculation-intensive nature of multivariate analysis has so far precluded most researchers from using these techniques routinely. The situation is now changing with the wider availability and increasing sophistication of statistical software, and researchers should no longer shy away from exploring the applications of multivariate methods to real-life data sets.
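A minimal sketch of two of the dependence techniques named above, multiple linear regression and logistic regression, on simulated data; the statsmodels package and all variable names are assumptions for illustration:

# Multiple linear regression and logistic regression with statsmodels
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 100)
dose = rng.uniform(1, 10, 100)
severity = 0.3 * age + 1.5 * dose + rng.normal(0, 5, 100)   # numerical outcome
improved = (severity < severity.mean()).astype(int)          # dichotomous outcome

X = sm.add_constant(np.column_stack([age, dose]))   # predictors plus intercept

print(sm.OLS(severity, X).fit().params)          # multiple linear regression coefficients
print(sm.Logit(improved, X).fit(disp=0).params)  # logistic regression coefficients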
Indian Journal of Dermatology | 2017
Avijit Hazra; Nithya Gogtay
Survival analysis is concerned with “time to event” data. Conventionally, it dealt with cancer death as the event in question, but it can handle any event occurring over a time frame, and this need not be always adverse in nature. When the outcome of a study is the time to an event, it is often not possible to wait until the event in question has happened to all the subjects, for example, until all are dead. In addition, subjects may leave the study prematurely. Such situations lead to what is called censored observations as complete information is not available for these subjects. The data set is thus an assemblage of times to the event in question and times after which no more information on the individual is available. Survival analysis methods are the only techniques capable of handling censored observations without treating them as missing data. They also make no assumption regarding normal distribution of time to event data. Descriptive methods for exploring survival times in a sample include life table and Kaplan–Meier techniques, as well as various kinds of distribution fitting as advanced modeling techniques. The Kaplan–Meier cumulative survival probability over time plot has become the signature plot for biomedical survival analysis. Several techniques are available for comparing the survival experience in two or more groups – the log-rank test is popularly used. This test can also be used to produce an odds ratio as an estimate of risk of the event in the test group; this is called the hazard ratio (HR). Limitations of the traditional log-rank test have led to various modifications and enhancements. Finally, survival analysis offers different regression models for estimating the impact of multiple predictors on survival. Cox's proportional hazards model is the most general of the regression methods that allows the hazard function to be modeled on a set of explanatory variables without making restrictive assumptions concerning the nature or shape of the underlying survival distribution. It can accommodate any number of covariates, whether they are categorical or continuous. Like the adjusted odds ratios in logistic regression, this multivariate technique produces adjusted HRs for individual factors that may modify survival.
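A short sketch of these methods on hypothetical follow-up times, assuming the Python lifelines package (not named in the article); times are in months, with 1 marking the event and 0 a censored observation:

# Kaplan-Meier estimate, log-rank test and Cox proportional hazards model
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 15, 7, 11, 2, 14],
    "event": [1, 1, 0, 1, 1, 0, 1, 0, 1, 1],
    "group": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],   # 0 = control, 1 = test group
})

kmf = KaplanMeierFitter()
kmf.fit(df["time"], event_observed=df["event"])
print(kmf.survival_function_)                  # cumulative survival probabilities

a, b = df[df.group == 0], df[df.group == 1]
result = logrank_test(a["time"], b["time"],
                      event_observed_A=a["event"], event_observed_B=b["event"])
print(result.p_value)                          # comparison of the two survival curves

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.hazard_ratios_)                      # hazard ratio for the group covariate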
Indian Journal of Dermatology | 2017
Avijit Hazra; Nithya Gogtay
In observational studies, as well as in interventional ones, it is frequently necessary to estimate risk, that is, the association between an observed outcome or event and exposure to one or more factors that may be contributing to the event. Understanding incidence and prevalence is the starting point in any discussion of risk assessment. The incidence rate uses person-time as the denominator rather than a simple count. Ideally, rates and ratios estimated from samples should be presented with their corresponding 95% confidence intervals (CIs). To assess the importance of an individual risk factor, it is necessary to compare the risk of the outcome in the exposed group with that in the nonexposed group. A comparison between risks in different groups can be made by examining either their ratio or the difference between them. The 2 × 2 contingency table comes in handy in the calculation of ratios. The odds ratio (OR) is the ratio of the odds of an event in the exposed group to the odds of the same event in the nonexposed group. It can range from zero to infinity. When the odds of an outcome in the two groups are identical, the OR equals one. An OR >1 indicates that exposure increases risk, while an OR <1 indicates that exposure protects against risk. The OR should be presented with its 95% CI to enable more meaningful interpretation – if this interval includes 1, then even a relatively large OR will not carry much weight. The relative risk (RR) denotes the ratio of the risk (probability) of the event in the exposed group to the risk of the same event in the nonexposed group. Its interpretation is similar (but not identical) to the OR. If the event in question is relatively uncommon, values of OR and RR tend to be similar. Absolute risk reduction (ARR) is a measure of the effectiveness of an intervention with respect to a dichotomous event. It is calculated as the proportion experiencing the event in the control group minus the proportion experiencing the event in the treated group. It is often used to denote the benefit to the individual. The reciprocal of ARR is the number needed to treat (NNT), and it denotes the number of subjects who would need to be treated to obtain one more success than that obtained with a control treatment. Alternatively, this could also denote the number that would need to be treated to prevent one additional adverse outcome as compared to control treatment. Extended to toxicity, the NNT becomes a measure of harm and is then known as the number needed to harm (NNH). NNT and NNH are important concepts from the policy maker's perspective and ideally should be calculated in all trials of therapeutic or prophylactic intervention.
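Because all of these measures derive from a 2 × 2 table, they reduce to simple arithmetic; the counts below are hypothetical:

# Risk measures from a hypothetical 2 x 2 table
#                      event   no event
# treated/exposed      a = 15   b = 85
# control/nonexposed   c = 30   d = 70
a, b, c, d = 15, 85, 30, 70

risk_treated = a / (a + b)
risk_control = c / (c + d)

rr = risk_treated / risk_control       # relative risk
odds_ratio = (a * d) / (b * c)         # odds ratio
arr = risk_control - risk_treated      # absolute risk reduction
nnt = 1 / arr                          # number needed to treat

print(rr, odds_ratio, arr, round(nnt))   # NNT is about 7 with these figures

In practice, each of these estimates would be reported with its 95% confidence interval, as the abstract emphasizes.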
Indian Journal of Dermatology | 2017
Avijit Hazra; Nithya Gogtay
Crucial therapeutic decisions are based on diagnostic tests. Therefore, it is important to evaluate such tests before adopting them for routine use. Although things such as blood tests, cultures, biopsies, and radiological imaging are obvious diagnostic tests, it is not to be forgotten that specific clinical examination procedures, scoring systems based on physiological or psychological evaluation, and ratings based on questionnaires are also diagnostic tests and therefore merit similar evaluation. In the simplest scenario, a diagnostic test will give either a positive (disease likely) or negative (disease unlikely) result. Ideally, all those with the disease should be classified by a test as positive and all those without the disease as negative. Unfortunately, practically no test gives 100% accurate results. Therefore, leaving aside the economic question, the performance of diagnostic tests is evaluated on the basis of certain indices such as sensitivity, specificity, positive predictive value, and negative predictive value. Likelihood ratios combine information on specificity and sensitivity to express the likelihood that a given test result would occur in a subject with the disorder compared to the probability that the same result would occur in a subject without the disorder. Not all tests can be categorized simply as “positive” or “negative.” Physicians are frequently exposed to test results on a numerical scale, and in such cases, judgment is required in choosing a cutoff point to distinguish normal from abnormal. Naturally, a cutoff value should provide the greatest predictive accuracy, but there is a trade-off between sensitivity and specificity here – if the cutoff is too low, it will identify most patients who have the disease (high sensitivity) but will also incorrectly identify many who do not (low specificity). A receiver operating characteristic curve plots pairs of sensitivity versus (1 − specificity) values and helps in selecting an optimum cutoff – the one lying on the “elbow” of the curve. Cohen's kappa (κ) statistic is a measure of inter-rater agreement for categorical variables. It can also be applied to assess how far two tests agree with respect to diagnostic categorization. It is generally thought to be a more robust measure than a simple percent agreement calculation, since kappa takes into account the agreement occurring by chance.
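As a worked sketch of these indices from a hypothetical 2 × 2 table, with Cohen's kappa computed via scikit-learn (an assumed choice of library):

# Diagnostic test indices from hypothetical counts, plus inter-rater agreement
from sklearn.metrics import cohen_kappa_score

tp, fn = 80, 20    # diseased subjects: test positive / test negative
fp, tn = 10, 90    # non-diseased subjects: test positive / test negative

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                            # positive predictive value
npv = tn / (tn + fn)                            # negative predictive value
lr_positive = sensitivity / (1 - specificity)   # positive likelihood ratio
print(sensitivity, specificity, ppv, npv, lr_positive)

# Cohen's kappa for agreement between two raters (or two tests) on 10 subjects
rater1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater2 = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(cohen_kappa_score(rater1, rater2))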