Who Gets the Job and How Are They Paid? Machine Learning Application on H-1B Case Data
Barry Shikun Ke∗
Applied Mathematics, Columbia University

Angela Qiao
Applied Mathematics, Columbia University

April 25, 2019

∗ This paper is built upon the research project for APMA4903: Seminar in Applied Math at Columbia University. We thank Professor Chris Wiggins for helpful comments and guidance for the project. For replication code, data, and presentation slides please visit https://github.com/BarryKeee/APMA4903

Abstract
In this paper, we use machine learning techniques to explore the H-1B application dataset disclosed by the Department of Labor (DOL) from 2008 to 2018, in order to provide more stylized facts about international workers in the US labor market. We train a LASSO regression model to analyze the impact of different features on an applicant's wage, and a Logistic Regression with L1-Penalty as a classifier to study each feature's impact on the likelihood of the case being certified. Our analysis shows that working in the healthcare industry, working in California, and a higher job level contribute to higher salaries, while a lower job level, working in the education services industry, and Philippine nationality are negatively correlated with salaries. In terms of application status, a Ph.D. degree, working in retail or finance, or majoring in computer science gives applicants a better chance of being certified. Applicants with no degree or an associate degree, working in the education services industry, or majoring in education are more likely to be rejected.
1 Introduction

The United States has always been attractive to international students due to its welcoming culture, quality education, and strong job market. In 2017, there were 1.21 million international students in the country, around 25% of international students worldwide. After graduation, some of them choose to stay in the country and work for U.S. firms. For these foreign-born professionals, the first week of April is an extremely stressful time, as the companies they work for rush to file their H-1B visa applications. Later in April, a random lottery selects fewer than half of the applicants, who are then allowed to work temporarily in the country. Employers must attest, on a labor condition application (LCA) certified by the Department of Labor (DOL), that employment of the H-1B worker will not adversely affect the wages and working conditions of similarly employed U.S. workers.

Research has been done on the overall impact of H-1B policy. However, we noticed that the Labor Condition Application ("LCA") disclosure data from the U.S. Department of Labor includes comprehensive information on wage, industry, application decision, and more. In this paper, we first explore the data and look at how wages and the number of applications differ by factors such as job sector, state, and citizenship. We then use LASSO regression to examine how different factors impact the wage. We also conduct a logistic regression to predict the status of the application (certify/deny) based on the applicant's profile (wage, sector, etc.).

Our analysis shows that working in the healthcare industry, working in California, and a higher job level contribute to higher salaries, while a lower job level, working in the education services industry, and Philippine nationality are negatively correlated with salaries. In terms of application results, a Ph.D. degree, working in retail or finance, and majoring in computer science give applicants a better chance of being certified. Applicants with no degree or an associate degree, working in the education services industry, or majoring in education are more likely to be rejected.

To the best of our knowledge, the H-1B dataset has not been deeply explored by statisticians or economists. One recent study in economics ([1]) finds a negative effect of the H-1B cap restriction on H-1B hiring by for-profit firms but no change in the hiring of US-born workers, which implies a low degree of substitution in the labor market between foreign and domestic workers. They also find a redistribution of H-1Bs towards computer-related occupations, Indian-born workers, and firms with an intensive history of H-1B usage, which our analysis confirms. However, many cross-sectional features in the dataset remain unexplored, and we would like to provide more stylized facts about foreign labor supply and demand in the US by exploring H-1B applications. Using the results, we learn more about which factors impact the wage distribution and the application decision. Domestic and foreign workers can use the average wages of different levels as a reference, and further research could investigate whether wage discrimination against foreign workers exists. Students and firms can use the status prediction model to estimate the probability of being certified by the Department of Labor.
2 Data Exploration

We look at wage and application data from 2008 to 2018. The variables included are wage, date of application, employer name, location, economic sector, job title, and citizenship. After 2015, we have additional variables, including the total number of employees, the founding year of the firm, education level, university, major, and prior work experience.

We first make box plots of wages by job level. There are four levels in total, and the median wage increases with job level; Level IV wages span the highest range and Level II the lowest. In terms of sectors, people who work in healthcare, retail, finance, and information technology earn the highest wages, whereas people who work in agriculture and education services earn the lowest. People in the healthcare sector also show the largest spread in wages. Wages increase by year, with the biggest jump from 2010 to 2011, as the country had just recovered from the 2008 financial crisis. Wages have grown steadily after 2011 and have been stable in recent years.

Indians file the greatest number of applications, accounting for around 75 percent of all applicants, followed by Chinese, Canadians, and Koreans. The US-based Indian technology company Cognizant Technology Solutions, along with Microsoft, Google, Intel, and Amazon, files the most applications. Correspondingly, California, where most technology companies are situated, tops the number of applications by state. We see a drop in the number of applications after the 2008 crisis. Filings recovered in 2010 and then increased steadily through 2013. The recent peak was in 2016, but the number dropped by almost twenty-five percent in 2017 as the Trump administration issued tighter regulations on H-1B applications. Thus, applications are highly affected by the economic cycle and by policy. Different sectors share a similar pattern of increases and drops in the number of applications each year.
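The exploratory plots above can be reproduced with a few lines of pandas; the sketch below assumes the cleaned LCA data sits in a CSV with hypothetical column names wage, job_level, sector, and year (the file name and column names are ours, not the DOL's):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned LCA file; column names are ours, not the DOL's.
df = pd.read_csv("lca_2008_2018.csv")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["job_level", "sector", "year"]):
    df.boxplot(column="wage", by=col, ax=ax)  # one panel per grouping
plt.show()
```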
Figure 1: Box plots of wages by (a) job level, (b) sector, and (c) year.

Figure 2: Number of applications by (a) citizenship, (b) firm, (c) state, and (d) year.

3 Wage Analysis: LASSO Regression

The first data analysis question concerns the determinants of H-1B wages. The richness of cross-sectional features in the H-1B dataset gives us a great opportunity to understand how different employers set wages for employees with different backgrounds. We are interested in extracting the firm-level and individual-level features that have a significant impact on the wages offered by H-1B employers. Although we could achieve this with a simple Ordinary Least Squares (OLS) regression and compare the coefficient of each feature, this raises two potential issues.

The first issue is prediction accuracy: OLS estimates often have low bias (or no bias) but large variance. Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero. By doing so we sacrifice a little bias to reduce the variance of the predicted values, which may improve overall prediction accuracy. The second issue is interpretability: with a large number of predictors, we often want to determine a smaller subset that exhibits the strongest effects. Hence, to preserve both model performance and interpretability, we propose the Least Absolute Shrinkage and Selection Operator (LASSO) as our model for the feature selection task on H-1B wages.
3.1 Model

For each H-1B case $i$, we define $y_i$ to be the wage of the case, $x_{ij}$ to be the value of feature $j$ for case $i$, and $x_i$ to be the vector of feature values of case $i$. Let $p$ be the total number of features and let $\beta = (\beta_1, \beta_2, \cdots, \beta_p)$. The model solves the following problem ([2], [3]):

$$\min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} R_\alpha(\beta_0, \beta) = \min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \left[ \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2 + \alpha P(\beta) \right] \tag{1}$$

where $P(\beta)$ is the regularization term. LASSO regression is the specific case where the regularization term is the L1-norm:

$$P(\beta) = \|\beta\|_{L_1} = \sum_{j=1}^{p} |\beta_j| \tag{2}$$

To solve the LASSO optimization problem we use the Coordinate Descent (CD) algorithm. Deriving the CD algorithm in full is not the main objective of this paper, but the basic idea is to partially optimize with respect to one coordinate at a time, holding the other coefficients fixed at their current estimates. Specifically, suppose we have estimates $\tilde{\beta}_0$ and $\tilde{\beta}_l$ for $l \neq j$, and we wish to partially optimize with respect to $\beta_j$. We want the gradient at $\beta_j = \tilde{\beta}_j$, which, because of the L1 penalty term, exists only if $\tilde{\beta}_j \neq 0$. The gradient of $R_\alpha(\beta_0, \beta)$ is

$$\frac{\partial R_\alpha}{\partial \beta_j}\bigg|_{\beta = \tilde{\beta}} = -\frac{1}{N} \sum_{i=1}^{N} x_{ij}(y_i - \tilde{\beta}_0 - x_i^T \tilde{\beta}) + \alpha \tag{3}$$

if $\tilde{\beta}_j > 0$, and

$$\frac{\partial R_\alpha}{\partial \beta_j}\bigg|_{\beta = \tilde{\beta}} = -\frac{1}{N} \sum_{i=1}^{N} x_{ij}(y_i - \tilde{\beta}_0 - x_i^T \tilde{\beta}) - \alpha \tag{4}$$

if $\tilde{\beta}_j < 0$. Setting the gradient to 0, we can solve for the update scheme for $\tilde{\beta}_j$:

$$\tilde{\beta}_j \leftarrow S\left( \frac{1}{N} \sum_{i=1}^{N} x_{ij}\left(y_i - \tilde{y}_i^{(j)}\right),\ \alpha \right) \tag{5}$$

where $y_i - \tilde{y}_i^{(j)}$ is the partial residual of fitting $\beta_j$ and $S(z, \gamma)$ is the soft-thresholding operator with value $\mathrm{sign}(z)(|z| - \gamma)_+$.

The benefit of LASSO regression, as we can see from the update scheme, is that many coefficients are set exactly to 0 during updating, so the model automatically performs feature selection while solving the optimization problem. In our project, we primarily rely on the scikit-learn package in Python ([4]), which has built-in functions for solving the LASSO problem using Coordinate Descent.
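For concreteness, the following is a minimal numpy sketch of the coordinate-descent update in equation (5), not our actual code. It assumes the columns of X are standardized so that $\frac{1}{N}\sum_i x_{ij}^2 = 1$ (otherwise the update needs a per-feature scaling term), and all names are ours:

```python
import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * (|z| - gamma)_+
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, alpha, n_iters=100):
    """Coordinate descent for LASSO; assumes (1/N) * sum_i x_ij^2 = 1."""
    N, p = X.shape
    beta0, beta = y.mean(), np.zeros(p)  # intercept absorbs the mean of y
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: predictions without feature j's contribution.
            r_j = y - beta0 - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j / N
            beta[j] = soft_threshold(z, alpha)  # update scheme (5)
    return beta0, beta
```

In practice we call scikit-learn's sklearn.linear_model.Lasso, which implements the same update with convergence checks and warm starts.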
3.2 Model Fitting and Cross-Validation

As mentioned in Section 2, the dataset contains more features for H-1B cases after 2015. Therefore, we fit two models. Model 1 fits a LASSO regression using all the cases from 2008 to 2017 but only a subset of the features, those present before 2015: the economic sector of the employer, the state of the employer, the citizenship of the applicant, the job level, the pay unit (hourly, daily, weekly, monthly, or annual payment; this is a reference for the length of the contract, as a long-term job contract usually features a longer pay unit), and the year of the application. Model 2 fits a LASSO regression using the expanded feature space that is only present in the dataset from 2015 to 2018: all the features in Model 1 plus the applicant's college major, the education level, the ownership interest of the applicant (whether the applicant owns the company), prior job experience as the number of months worked, the age of the firm, and the employer's total number of workers.

As for data preprocessing, since many of the features are categorical, we perform one-hot mapping for all categorical features. To limit the number of features after one-hot mapping, we restrict the mapping to the top 100 categories of each feature; this is especially relevant for LASSO regression since, when $p > n$, LASSO selects at most $n$ features. Also note that we include the application year as a feature for wage determination: we observed wage trends in time, so including the year serves as a "fixed time effect" in the model.

For the LASSO regression, $\alpha > 0$ is the regularization parameter that controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage. We use 10-fold Cross-Validation to select $\alpha$. Specifically, we divide the dataset for each of Model 1 and Model 2 into 10 parts. For the $k$th part ($k \in \{1, 2, \cdots, 10\}$), we fit the model to the other 9 parts of the data and calculate the prediction error of the fitted model when predicting the $k$th part. We do this for $k = 1, 2, \cdots, 10$ and combine the 10 estimates of prediction error. Denoting by $\hat{f}^{-k(i)}(x_i, \alpha)$ the fitted value for $x_i \in \mathbb{R}^p$ with the $k$th part of the data removed, the cross-validation estimate of prediction error for the model is

$$CV(\hat{f}, \alpha) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}^{-k(i)}(x_i, \alpha)\right)^2 \tag{6}$$

The function $CV(\hat{f}, \alpha)$ provides an estimate of the test error given $\alpha$, and we find the tuning parameter $\hat{\alpha}$ by grid search over $\alpha \in \{\alpha_1, \alpha_2, \cdots, \alpha_k\}$ to minimize it. Our final chosen model is $f(x, \hat{\alpha})$, which we then fit to all the data.

We also perform an out-of-sample analysis of the trained model. Specifically, we randomly split the dataset into a training set and a test set. We fit the LASSO regression to the training data and tune the parameter $\alpha$ using 10-fold Cross-Validation as described above. After training, we apply the model to the test set to obtain predicted values $\hat{y}_i$, and calculate the out-of-sample $R^2$ (denoted $R^2_{OS}$) as:

$$R^2_{OS} = \frac{\left[ \sum_{i \in I_{OS}} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y}) \right]^2}{\sum_{i \in I_{OS}} (\hat{y}_i - \bar{\hat{y}})^2 \sum_{i \in I_{OS}} (y_i - \bar{y})^2} \tag{7}$$

The Cross-Validation results are shown in Figure 3, and grid search gives the best tuning parameter $\hat{\alpha}$. We obtain an in-sample $R^2$ of 0.54 and an out-of-sample $R^2$ of 0.54 for Model 1, and an in-sample $R^2$ of 0.67 and an out-of-sample $R^2$ of 0.68 for Model 2. The similarity between the in-sample and out-of-sample results shows no significant heterogeneity in cross-sectional prediction.

Figure 3: LASSO cross-validation result, CV score ± standard error vs. value of $\alpha$: (a) Model 1, features existing before 2015; (b) Model 2, features existing after 2015.
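As a sketch of how the tuning and out-of-sample evaluation above might be wired together in scikit-learn (synthetic data stands in for the one-hot encoded H-1B features, and the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot encoded H-1B design matrix and wages.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold CV over a grid of alphas, as in equation (6).
model = LassoCV(alphas=np.logspace(-5, 0, 50), cv=10).fit(X_tr, y_tr)

# Out-of-sample R^2 as the squared correlation in equation (7).
y_hat = model.predict(X_te)
r = np.corrcoef(y_hat, y_te)[0, 1]
print(f"alpha_hat = {model.alpha_:.2e}, R2_OS = {r**2:.3f}")
```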
3.3 Results

The linear feature selection model helps us select the most influential features affecting the wage of H-1B applicants. We look at both the features with the most positive coefficients and those with the most negative coefficients, so that we obtain a more comprehensive view of which firm-level and individual-level features are associated with higher and with lower wages. Figures 4 and 5 plot the 20 most positive and 20 most negative features for Model 1 and Model 2, respectively.

Figure 4: Feature Importance for Model 1: Only Pre-2015 Features

Figure 5: Feature Importance for Model 2: Both Pre-2015 and Post-2015 Features

From the feature importance in Model 1 we conclude that H-1B applicants working in the healthcare industry usually receive higher wages. Working in California, Washington, New York, or Massachusetts also usually yields higher wages, and a higher job level (i.e., Job Level IV) has a positive impact on the wage H-1B applicants get. On the other hand, a lower job level (i.e., Job Level I or II) gives lower wages, working in education services gives lower wages, and citizenship of the Philippines or Japan is also negatively correlated with the wage. We also find that Year-2018 has a very positive coefficient, which is consistent with the increasing trend in wages, while the particularly negative coefficient for 2008 is most likely due to the financial crisis.

Model 2 considers the additional features added after 2015 and therefore only uses data from 2015 to 2018. From the feature importance in Model 2 we find that H-1B applicants majoring in medicine, surgery, or pharmacy have higher wages, consistent with the high earnings found for the healthcare industry in Model 1. In addition, petroleum engineering and law majors also tend to have higher wages. In negative territory, we find that majoring in graphic design, social work, fashion design, or communication usually has a negative impact on the wage.
4 Application Status: Logistic Regression with L1-Penalty

Another very relevant issue regarding the H-1B application is whether the case gets approved by the U.S. Department of Labor. Each year the Department of Labor reviews the information in each application and decides whether the case is certified. After being certified, the H-1B case enters a lottery pool where an ex-ante random selection draws the cases to be finally approved. Our data from the LCA covers only the first stage of deciding whether the case gets certified into the lottery pool; we cannot observe whether the H-1B cases are finally approved in the lottery. However, the random selection during the lottery should be non-discriminating and therefore should not change the ex-post distribution of the approved H-1B cases. We therefore assume that the feature selection, as well as the prediction of certification in the first stage, also speaks to the final approval of the H-1B case.

We formulate this analysis as a classification problem. For each H-1B case, we observe an outcome, either positive (Certified or Certified-Expired) or negative (Denied). We train a classification model to predict whether a given H-1B case will be certified, and to identify the features that contribute the most to the certification/denial decision. Similar to our analysis of H-1B wages, we are concerned with both model prediction and model interpretability. For better prediction we need to account for the bias-variance tradeoff in model selection, and for better interpretability we need a parametric model whose estimated parameters make economic sense. As a result, we propose Logistic Regression with L1-Penalty, a cousin of the LASSO regression used in Section 3, as our model for the classification analysis.
4.1 Model

For each case $i$, we define the case status to be $y_i = 1$ if it is certified and $y_i = 0$ if it is not. Let $x_{ij}$ be the value of the $j$th feature of case $i$ and $x_i$ be the feature vector. The unpenalized logistic regression model takes the functional form

$$Pr(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + x_i^T \beta)}} \tag{8}$$

$$Pr(y_i = 0 \mid x_i) = 1 - Pr(y_i = 1 \mid x_i) = \frac{1}{1 + e^{(\beta_0 + x_i^T \beta)}} \tag{9}$$

Letting $p(x_i) = Pr(y_i = 1 \mid x_i)$, we want to maximize the log-likelihood of the joint distribution. As in the LASSO regression, we add an L1 penalty term to the log-likelihood for regularization. The maximization problem is ([2])

$$\max_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \left[ \frac{1}{N} \sum_{i=1}^{N} \{ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \} - \lambda P_{L_1}(\beta) \right] \tag{10}$$

where $P_{L_1}(\beta)$ is the L1-norm. Again, there are many ways to solve this optimization problem, and the implementation is not the main focus of our paper. It is worth mentioning the basic numerical treatment for reference, however, because it is hard to perform the Coordinate Descent algorithm directly for this problem: the gradient yields no analytic solution. Instead, we focus on an approximated problem. Consider the unpenalized log-likelihood function $\ell(\beta_0, \beta)$,

$$\ell(\beta_0, \beta) = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i (\beta_0 + x_i^T \beta) - \log\left(1 + e^{\beta_0 + x_i^T \beta}\right) \right] \tag{11}$$

Denote by $\tilde{\beta}_0$ and $\tilde{\beta}$ the current estimates. Performing a Taylor expansion of the unpenalized log-likelihood around $\tilde{\beta}_0$ and $\tilde{\beta}$ to second order gives the quadratic approximation

$$\ell_Q(\beta_0, \beta) := -\frac{1}{2N} \sum_{i=1}^{N} w_i (z_i - \beta_0 - x_i^T \beta)^2 + O(\|\beta - \tilde{\beta}\|^3) \tag{12}$$

where

$$z_i = \tilde{\beta}_0 + x_i^T \tilde{\beta} + \frac{y_i - \tilde{p}(x_i)}{\tilde{p}(x_i)(1 - \tilde{p}(x_i))} \tag{13}$$

$$w_i = \tilde{p}(x_i)(1 - \tilde{p}(x_i)) \tag{14}$$

Then we can write the approximated optimization problem for Logistic Regression with L1-Penalty as

$$\min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \left[ -\ell_Q(\beta_0, \beta) + \lambda P_{L_1}(\beta) \right] \tag{15}$$

This minimization problem has the same form as the one in LASSO regression, and therefore the Coordinate Descent update scheme solves it easily. As a further benefit, many feature weights are again set exactly to zero, which performs feature selection. In this part of the analysis, we also rely on the built-in functions of the scikit-learn package for model training.

4.2 Model Fitting and Evaluation

To utilize the richness of cross-sectional features, we perform the classification analysis on the dataset from 2015 to 2018, where more features are available. Similar to Model 2 in Section 3, the features we consider for the classification model include wage, economic sector, job level, pay unit, working state (location), education level, job experience, the applicant's college major, the employer's history, ownership interest, and the total number of employees of the firm. As before, we use one-hot mapping for the 100 most frequent categories of the categorical features. For model evaluation, we use the Area Under the Receiver Operating Characteristic Curve (AUC) as our metric, following the standard literature for classification problems [5]. The Receiver Operating Characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings; a larger area under the ROC curve means better-separated classes and a better classifier.

When training the classifier, we want to make sure that both the positive and negative classes have enough representation in the sample. That is not the case in the original sample: among the 191,693 H-1B cases from 2015 to 2018, only 7,385 cases (about 5%) are denied. Since the dataset is very unbalanced, we perform 1) undersampling from the positive class and 2) oversampling from the negative class to generate training sets in which the two classes have the same number of cases.

For the Logistic Regression with L1-Penalty, $\lambda > 0$ is the regularization parameter that controls the amount of shrinkage. We use 10-fold Cross-Validation to select $\lambda$, similar to the LASSO regression in Section 3. We also perform an out-of-sample test, on 20% of the data randomly held out from the original dataset, using AUC as the measure of prediction performance. The entire analysis scheme can be summarized as follows:

• Randomly split the data into a training set and a test set.
• Generate a list of $\{\lambda_1, \lambda_2, \cdots, \lambda_n\}$.
• For each $\lambda_i$:
  – Generate an oversample from the denied class in the training set; call it Sample 1.
  – Generate an undersample from the certified class in the training set; call it Sample 2.
  – Use the original training set as Sample 3.
  – For each of Samples 1 to 3:
    ∗ Use the coordinate descent method to compute the MLE of the penalized LogReg with the quadratic approximation on the sample.
    ∗ Calculate the AUC for each model.
  – Calculate the average AUC and call it the score of $\lambda_i$.
• Find the $\lambda_i$ with the highest average AUC.

The grid search gives optimal $\lambda = 1$. Under this choice of $\lambda$, we obtain the highest in-sample AUC score of 0.674 and out-of-sample AUC score of 0.676 for Sample 3 (the original sampling). Figure 6 shows the out-of-sample performance of the trained model under the best choice of $\lambda$.
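A sketch of how one per-λ training step could look in scikit-learn, with synthetic stand-in data; note that scikit-learn parameterizes the L1 penalty by C = 1/λ, and the liblinear solver handles the L1-penalized logistic problem by coordinate descent:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

# Synthetic stand-in for the unbalanced training data (1 = certified).
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(2000, 30))
y_tr = (rng.random(2000) > 0.05).astype(int)  # roughly 5% denied

# Sample 1: oversample the denied class up to the certified class size.
certified, denied = X_tr[y_tr == 1], X_tr[y_tr == 0]
denied_os = resample(denied, replace=True, n_samples=len(certified),
                     random_state=0)
X_bal = np.vstack([certified, denied_os])
y_bal = np.array([1] * len(certified) + [0] * len(denied_os))

# scikit-learn uses C = 1/lambda for the L1 penalty strength.
lam = 1.0
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
clf.fit(X_bal, y_bal)

# Score with the predicted probability of certification (for AUC).
print(roc_auc_score(y_bal, clf.predict_proba(X_bal)[:, 1]))
```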
Figure 6: Out-of-sample ROC curve and Precision-Recall curve for the trained model with different random sampling schemes.

4.3 Results

Figure 7: Feature Importance from the Logistic Regression with L1-Penalty model, with $\lambda = 1$, on the original training dataset.

We look at both the positive and negative features selected by the Logistic Regression with L1-Penalty. Figure 7 plots the coefficients of the most positive and most negative features. We find that applicants holding a Ph.D. degree, majoring in Computer Science, Electrical Engineering, or Medicine, or working in the retail or finance sector are more likely to get their H-1B case certified, whereas applicants with no degree or only an Associate degree, majoring in Education, or working in the education services sector have a higher chance of having their cases rejected.
5 Discussion
One interesting point we noticed during our classification analysis of H-1B certification is that the very unbalanced original dataset still gives the best performance, compared with the datasets generated by random sampling to an equal number of representatives from the positive and negative classes. This contradicts our expectation that a classifier trains best when the represented classes carry similar weights. To study the performance of models with different sampling frequencies and to check the robustness of our trained model, we perform an additional test of the model with different sampling frequencies and different hyperparameters $\lambda$. The procedure follows:

• Generate a list of hyperparameters $\Lambda = \{\lambda_1, \lambda_2, \cdots, \lambda_k\}$.
• For each $\lambda \in \Lambda$:
  – Generate a list of sampling frequencies $\gamma \in [0, 1]$, where 0 means no random sampling (the original sample) and 1 means 50/50 sampling (equal weights for the two classes).
  – For each $\gamma$:
    ∗ Construct an oversample from the Denied class and an undersample from the Certified class.
    ∗ Together with the original sample, train the LogReg model with L1-Penalty, with $\lambda$ as the regularization parameter.
    ∗ Calculate the out-of-sample AUC score.
  – Plot the AUC score against the sampling frequency $\gamma$.
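The following sketch shows how one γ step of this sweep might look; the linear interpolation between the original class counts is our reading of the sampling-frequency mapping, and the data are placeholders supplied by the caller:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def auc_at_frequency(X_tr, y_tr, X_te, y_te, gamma, lam, seed=0):
    """gamma = 0 keeps the original sample; gamma = 1 is 50/50 balanced."""
    pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
    # Interpolate the minority-class size between its original count and
    # the majority-class count.
    n_neg = int(len(neg) + gamma * (len(pos) - len(neg)))
    neg_rs = resample(neg, replace=True, n_samples=n_neg, random_state=seed)
    X = np.vstack([pos, neg_rs])
    y = np.array([1] * len(pos) + [0] * n_neg)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
    clf.fit(X, y)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Sweeping gamma over, say, np.linspace(0, 1, 11) for each λ in Λ and plotting the returned AUCs yields a plot analogous to Figure 8.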
For simplicity we choose a small grid of hyperparameter values $\Lambda$. Figure 8 shows the plot of AUC under different sampling frequencies (0 being no random sampling and 1 being 50/50 sampling).

Figure 8: Model performance under different sampling frequencies, 0 being no random sampling and 1 being 50/50 sampling frequency.

We see that the performance of the randomly sampled models changes dramatically across choices of the hyperparameter, which challenges the notion that random sampling is a necessary data pre-processing step. On the other hand, the highest score being obtained by the original sample in the $\lambda = 1$ plot confirms the robustness of our analysis in Section 4.

Now we would like to discuss the strengths and weaknesses of our analyses of both H-1B wages and status. For both, the biggest strength of our models is that they preserve interpretability while maintaining model performance. Both LASSO regression and Logistic Regression with L1-Penalty give feature coefficients that can be directly compared with each other and interpreted with economic intuition. In both cases, there is only one hyperparameter to tune, which limits the threat of over-fitting.
Nevertheless, our models and analysis can be improved from several perspectives. The first and most obvious shortcoming is the lack of an identification strategy that establishes causal relationships. Our models do not rule out confounding factors that are not present in the feature space but could potentially influence the outcome. Therefore, the results presented in this paper should be viewed as an application of machine learning that finds correlations in the real world, rather than rigorous economic research that pins down how wages are determined by H-1B employers or how certification decisions are made by the Department of Labor. A further improvement in this direction would be to rely less on machine learning and more on econometric tools such as Instrumental Variables (IV) or, since we have a panel dataset, the Difference-in-Differences (diff-in-diff) method to identify causality.

Another potential improvement is to find determinants of wage and H-1B status beyond cross-sectional variation. While it is very important to understand cross-sectional differences, people might also be interested in time-series predictions: for example, whether the trend of wage increases will continue, or whether the Department of Labor will weigh certain characteristics more heavily in the future. A major obstacle to such analysis is that H-1B applications are highly subject to exogenous policy shocks, and the distributions from year to year could be completely different due to immigration policies implemented by the government. In our analysis we controlled for a time-fixed effect, but a further step, hopefully with more data, is to look at how the implementation of different immigration policies influences H-1B applications.
6 Conclusion

In this paper, we apply machine learning approaches to study the H-1B application dataset. Using a LASSO regression model, we studied the features that have the most significant impact on H-1B applicants' wages. We find that applicants working in the healthcare industry or majoring in healthcare-related fields usually have the highest wages (detailed results in Section 3.3). We also trained a Logistic Regression with L1-Penalty as a classifier to extract the features with the most impact on the likelihood of having an H-1B application certified. We find that applicants with a higher education level and majoring in computer science or electrical engineering are more likely to have their applications certified. We also find that the classifier performs better when we do not randomly resample the unbalanced dataset, which casts doubt on the widely held notion of random resampling as a pre-processing step and calls for case-by-case analysis of different datasets.
References

[1] Anna Maria Mayda, Francesc Ortega, Giovanni Peri, Kevin Shih, and Chad Sparber. The effect of the H-1B quota on employment and selection of foreign-born labor. NBER Working Paper No. w23902, 2017.
[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
[3] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
[4] Andreas C. Müller and Sarah Guido. Introduction to Machine Learning with Python. O'Reilly Media, Inc., 2016.
[5] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.