11 Designing an Algorithm that Detects Fake Amazon Reviews
Seung Ah Choi
Abstract
Amazon often hosts suspicious reviews that are excessively positive or appear to have been generated by a repetitive algorithm. I set out to detect fake reviews on Amazon through semantic analysis in conjunction with metadata such as posting time, word choice, and the user who posted. I first identified several signs that may indicate a review is not genuine and outlined what the algorithm would look like. Then I (with help from others) coded the algorithm, tested its accuracy using statistical analysis, and evaluated it against the six qualities of code.
Introduction
Semantic Evaluation
Here I propose several cases in which the semantics of an Amazon review can indicate that the review is not genuine; the code is based on these cases.
*Key for the Greek algorithms
1) 1 = the review is fake, 0 = the review is not fake.
2) Review 1 = R1, Review 2 = R2.
3) The abbreviated term for each word bin is indicated in the box itself.
1. Look for exaggerated words that may indicate that the review is fake
   a) Word bin
      Too positive (P): Exceptional, Outstanding, Astonishing, Amazing, Phenomenal
      Too negative (N): Worst, Terrible, Appalling, Disastrous
      Figure 1
   b) Confirmation method: If any of the exaggerated words appear in the review, the review may be fake.
   c) Algorithm:
      i) Greek: P V N = 1
      ii) English: If the review includes either a too-positive term or a too-negative term, it is fake.
2. If the review is exactly the same as other reviews, there is a high chance that it was copied and pasted multiple times. (In the same way, if the user’s previous reviews can be accessed, checking them for duplicates could also help separate fake reviews from genuine ones.)
   a) Confirmation method: If the same review appears multiple times, it is probably fake.
   b) Algorithm:
      i) Greek: -(R1 XOR R2)
      ii) English: If the first review run through the program is exactly the same as another review, then both reviews are fake.
3. Look for professional words that may indicate that the review is fake
   a) Word bin
      Professions (Degree): Ph.D., M.D., D.D.S., or mention of a professional degree, including Masters and Post-Masters degrees in fields such as Audiology, Chiropractic, Dentistry, Law, Education, Medicine, etc.
      Honorifics that may be unnecessary in an Amazon review: Dr., Mr., Mrs., Captain, Coach, Professor, Reverend
      Figure 2
   b) Confirmation method: If the reviewer claims a profession or higher-education degree that did not need to be mentioned, the review might be fake because it might be trying to get a free degree.
   c) Algorithm:
      i) Greek: -(R1 XOR R2)
      ii) English: If the first review run through the program is exactly the same as another review, then both reviews are fake.
4. Length of the review
   a) Confirmation method: If the review is shorter than two sentences, or is only a very short sentence or a single word, the review is probably fake. A reasonable minimum is more than 10 words and more than 50 characters in the string.
   b) Algorithm:
      i) Greek: (Word Count) /\ (Review Length) = 0
      ii) English: If the review meets both the word count limit and the length limit, then the review is probably not fake.
5. Number of helpful votes
   a) Confirmation method: If the number of helpful votes that the review receives is above a certain number, then the review is probably reliable. Alternatively, if the number of helpful votes that a user has received is significantly lower than the number of reviews they have written, they may just be spamming reviews on Amazon.
   b) Algorithm:
      i) Greek: (Number of Likes) /\ (Likes of the Review) = 0
      ii) English: If the review meets the number-of-likes limit, then the review is probably not fake.
6. See if any word from the product title or product category is mentioned in the review
   a) Confirmation method: If the review mentions the product by name, or uses a word from the product title or product category, the review is probably genuine.
   b) Algorithm:
      i) Greek: (the product is mentioned) = 1
      ii) English: If the review contains the name of the product or mentions the product, the review is probably not fake.
7. See if the review contains a relevant photo of the product
   a) Confirmation method: If the review includes photos of the product that the reviewers bought themselves, the review is probably genuine.
   b) Algorithm:
      i) Greek: picture exists = 1
      ii) English: If the review contains one or more pictures of the product, the review is probably not fake.
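The text-based checks in the list above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code from the repository: the class name ReviewHeuristics and its method names are hypothetical, the word bins are copied from Figure 1, and the rule that only title words longer than three letters count as a product mention is an assumption added to skip short words such as "the". Checks 3, 5, and 7 are left out because they depend on profile, vote, and image data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class: names are illustrative, not taken from the repository.
final class ReviewHeuristics {
    // Word bins from Figure 1, lowercased for case-insensitive matching.
    static final Set<String> TOO_POSITIVE = new HashSet<>(Arrays.asList(
            "exceptional", "outstanding", "astonishing", "amazing", "phenomenal"));
    static final Set<String> TOO_NEGATIVE = new HashSet<>(Arrays.asList(
            "worst", "terrible", "appalling", "disastrous"));

    // Lowercase the text and split it on anything that is not a letter, digit, or apostrophe.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Check 1 (P V N): any exaggerated word flags the review.
    static boolean hasExaggeratedWord(String text) {
        for (String w : words(text)) {
            if (TOO_POSITIVE.contains(w) || TOO_NEGATIVE.contains(w)) {
                return true;
            }
        }
        return false;
    }

    // Check 2, -(R1 XOR R2): two reviews with identical text are both flagged.
    static boolean isDuplicate(String r1, String r2) {
        return r1.trim().equals(r2.trim());
    }

    // Check 4: more than 10 words and more than 50 characters counts as long enough.
    static boolean isLongEnough(String text) {
        return words(text).size() > 10 && text.trim().length() > 50;
    }

    // Check 6: the review uses at least one word from the product title.
    // Assumption: title words of three letters or fewer are ignored.
    static boolean mentionsProduct(String text, String productTitle) {
        Set<String> reviewWords = new HashSet<>(words(text));
        for (String w : words(productTitle)) {
            if (w.length() > 3 && reviewWords.contains(w)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String review = "Cards are not as big as pictured.";
        System.out.println("exaggerated: " + hasExaggeratedWord(review));
        System.out.println("long enough: " + isLongEnough(review));
        System.out.println("mentions product: "
                + mentionsProduct(review, "Super Jumbo Playing Cards"));
    }
}
```

Each method returns a boolean, so the results map directly onto the 1 and 0 values used in the Greek notation.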
Overview of the Code
Link to code: https://github.com/sachoi613/AmazonReviews/tree/master
The following are specific algorithms that were used in the code:
1. Detect presence / absence of a word (or member of word group) & Count unique words:
We used an array list to hold the parameters of a method named AmazonReviews. The parameters of the method include the product ID, product title, text, star rating, helpful votes, and a boolean variable that records whether the user actually purchased the product. The array list is filled by going through a tsv file and splitting each line on the tab character. The boolean variable is set to true when the review is a verified purchase. This data is used to detect the presence or absence of words and to count the unique words used.
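A minimal sketch of this parsing step is shown here. It is not the repository's code: the Review and ReviewParser classes are hypothetical, and the column positions assume the 15-column layout of the Amazon Customer Reviews dataset header (product_id at index 3, product_title at 5, star_rating at 7, helpful_votes at 8, verified_purchase at 11, review_body at 13). A file with a different layout would need different indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type; the fields mirror the parameters described in the text.
final class Review {
    final String productId;
    final String productTitle;
    final String text;
    final int starRating;
    final int helpfulVotes;
    final boolean isVerifiedPurchase;

    Review(String productId, String productTitle, String text,
           int starRating, int helpfulVotes, boolean isVerifiedPurchase) {
        this.productId = productId;
        this.productTitle = productTitle;
        this.text = text;
        this.starRating = starRating;
        this.helpfulVotes = helpfulVotes;
        this.isVerifiedPurchase = isVerifiedPurchase;
    }
}

final class ReviewParser {
    // Split one tsv line on tabs and pick out the assumed columns.
    static Review parseLine(String line) {
        String[] f = line.split("\t", -1); // limit -1 keeps trailing empty fields
        if (f.length < 15) {
            throw new IllegalArgumentException(
                    "expected 15 tab-separated fields, got " + f.length);
        }
        return new Review(
                f[3],
                f[5],
                f[13],
                Integer.parseInt(f[7].trim()),
                Integer.parseInt(f[8].trim()),
                "Y".equalsIgnoreCase(f[11].trim())); // true only for verified purchases
    }

    // Collect every non-blank line into an array list of reviews.
    static List<Review> parseAll(List<String> lines) {
        List<Review> reviews = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                reviews.add(parseLine(line));
            }
        }
        return reviews;
    }
}
```

Keeping each review in one object means later steps can read the text and the verified-purchase flag together instead of tracking parallel lists.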
Figure 3
2. Find most frequently occurring words:
We also used txt and tsv files. The txt files store the word bins listed above (such as the positive and negative words), the Amazon reviews, and a list indicating whether each review is fake or not. A Scanner reads each line of the Amazon review txt file and trims it. Using the txt files, the program is able to search the reviews for specific words, which produces the result.
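The file-reading step might look like the following sketch. It assumes each word bin file holds one word per line, which the text does not confirm, and the class name WordBinLoader and its methods are hypothetical. The file is opened inside a try block so that a missing file is reported instead of crashing the program.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;

// Hypothetical loader; assumes one word per line in each word bin file.
final class WordBinLoader {
    // Read every line, trim it, lowercase it, and skip blank lines.
    static Set<String> load(Scanner in) {
        Set<String> bin = new LinkedHashSet<>();
        while (in.hasNextLine()) {
            String word = in.nextLine().trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                bin.add(word);
            }
        }
        return bin;
    }

    // try-with-resources closes the Scanner even if reading fails.
    static Set<String> loadFromFile(String path) {
        try (Scanner in = new Scanner(new File(path))) {
            return load(in);
        } catch (FileNotFoundException e) {
            System.err.println("Word bin not found: " + path);
            return Collections.emptySet();
        }
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for a file in this demonstration.
        Set<String> positive = load(new Scanner("Exceptional\n  Amazing \n\nPhenomenal\n"));
        System.out.println(positive);
    }
}
```

Storing the bin as a set makes each later lookup a single contains call, regardless of how many words the bin holds.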
Figure 4
3. Count word occurrences:
We also used an if-else statement to determine whether the review is negative or positive, and the program outputs a predicted rating for the review based on that information. The first if-else statement works from the counts of positive and negative words and returns whether the review is extremely positive, negative, or neutral. It uses operators (<, >, &&, =) to check whether each count is higher or lower than a specific percentage, and it returns a string stating the result. A second if-else statement follows from the first: it compares a string variable called sentiment with each string that the first statement can return, and returns the predicted rating assigned to that string. This step counts the words that fall into the positive, negative, and neutral categories.
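The two chained if-else statements can be sketched as follows. The text does not give the percentage threshold or the rating assigned to each category, so the 10% threshold and the ratings 5, 3, and 1 are assumptions chosen only for illustration, and the class name SentimentRater is hypothetical.

```java
// Hypothetical sketch of the two-stage if-else logic described above.
final class SentimentRater {
    // Assumed threshold: the source does not state the percentage it uses.
    static final double THRESHOLD = 0.10;

    // Stage 1: turn word counts into a sentiment string.
    static String classify(int positiveCount, int negativeCount, int totalWords) {
        if (totalWords == 0) {
            return "neutral";
        }
        double positiveShare = (double) positiveCount / totalWords;
        double negativeShare = (double) negativeCount / totalWords;
        if (positiveShare > THRESHOLD && positiveShare > negativeShare) {
            return "positive";
        } else if (negativeShare > THRESHOLD && negativeShare > positiveShare) {
            return "negative";
        } else {
            return "neutral";
        }
    }

    // Stage 2: map the sentiment string to a predicted star rating (assumed values).
    static int predictedRating(String sentiment) {
        if (sentiment.equals("positive")) {
            return 5;
        } else if (sentiment.equals("negative")) {
            return 1;
        } else {
            return 3;
        }
    }

    public static void main(String[] args) {
        String sentiment = classify(3, 0, 10);
        System.out.println(sentiment + " -> " + predictedRating(sentiment));
    }
}
```

Note that the strings are compared with equals rather than ==, since == on Java strings compares object identity rather than content.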
Figure 5
4. Find the “distance” between two words in a text:
In the code, a for loop removes punctuation so that the rest of the code runs more efficiently. It starts with the integer value i, which is initialized to 0. The relational operator < compares i to the word length, and i is incremented as the last part of the for loop. Inside the for loop, an if statement checks whether the word contains punctuation (any character other than the alphabet and certain permitted punctuation). Such characters are removed, which makes the words easier for the rest of the code to process. This separates the words, splitting them without punctuation, so that the distance between words can be found.
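A short sketch of such a loop is given here. The text does not say which punctuation marks are kept, so keeping the apostrophe (to preserve words like "don't") and keeping digits are assumptions, and the class name PunctuationStripper is hypothetical.

```java
// Hypothetical sketch of the punctuation-removal loop described above.
final class PunctuationStripper {
    static String strip(String word) {
        StringBuilder cleaned = new StringBuilder();
        // i starts at 0, is compared with the word length, and is incremented last.
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            // Keep letters and digits, plus the apostrophe (an assumed exception).
            if (Character.isLetterOrDigit(c) || c == '\'') {
                cleaned.append(c);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("Excellent!!!"));
    }
}
```

Appending kept characters to a StringBuilder avoids creating a new string on every iteration of the loop.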
Figure 6
5. Count unique words:
To open and close a file, a try-catch block is used. The try block contains the statements in which an exception can occur, and the catch block handles the exceptions and errors that occur in the try block. As in the second item, the code uses a Scanner to look at each line, reading each line for specific and unique words and categories.
Figure 7
Test Suite
The following is my thought process for finding a test suite, or training set, for the algorithm on the internet.
Things that could potentially be good in a test suite:
- Name of the product
- Text of the review
- Photos included in the review
- Information about the writer of the review
- The writer’s previous reviews
- The ratio of total helpful votes the writer received to the total number of reviews posted
- The number of helpful votes on a certain review
- Product number
Some example Amazon reviews that we found prior to writing the code:
Bad Review: Figure 8
- Too short in length
- The content is irrelevant to the product he/she purchased
- There are not many helpful votes and not many comments
Profile of a Bad Reviewer: Figure 9
- The ratio of the number of helpful votes to the number of reviews is small
- The profile user name is not credible at all
- All the other reviews are also very short and not legitimate
Good Review: Figure 10
- Lengthy in text
- Includes a picture to support and verify the purchase of the item
- Talks about the specific item in the text
- The person is also a top 500 reviewer
- A lot of helpful votes and a decent number of comments follow
Profile of a Good Reviewer: Figure 11
- The ratio of helpful votes to the number of reviews is really high
- All the previous reviews are lengthy and show great depth of thought
- The name of the profile seems legitimate; it does not seem like a fake account
Example of an actual test suite that I found online:
Example 1: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date
US | 18778586 | RDIJS7QYB6XNR | B00BY 7X8 | 122952789 | Monopoly Junior Board Game | Toys | 5 | 0 | 0 | N | Y | Five Stars | Excellent!!! | 2015-08-31
US | 24769659 | R36ED1U38IELG8 | B00D7JFOPC | 952062646 | 56 Pieces of Wooden Train Track Compatible with All Major Train Brands | Toys | 5 | 0 | 0 | N | Y | Good quality track at excellent price | Great quality wooden track (better than some others we have tried). Perfect match to the various vintages of Thomas track that we already have. There is enough track here to have fun and get creative incorporating your key pieces with track splits, loops and bends. | 2015-08-31
US | 44331596 | R1UE3RPRGCOLD | B002LHA74O | 818126353 | Super Jumbo Playing Cards by S&S Worldwide | Toys | 2 | 1 | 1 | N | Y | Two Stars | Cards are not as big as pictured. | 2015-08-31
US | 23310293 | R298788GS6I901 | B00ARPLCGY | 261944918 | Barbie Doll and Fashions Barbie Gift Set | Toys | 5 | 0 | 0 | N | Y | my daughter loved it and i liked the price and it came ... | my daughter loved it and i liked the price and it came to me rather than shopping with a ton of people around me. Amazon is the Best way to shop! | 2015-08-31
US | 38745832 | RNX4EXOBBPN5 | B00UZOPOFW | 717410439 | Emazing Lights eLite Flow Glow Sticks - Spinning Light LED Toy | Toys | 1 | 1 | 1 | N | Y | DONT BUY THESE! | Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds. | 2015-08-31
US | 13394189 | R3BPETL222LMIM | B009B7F6CA | 873028700 | Melissa & Doug Water Wow Coloring Book - Vehicles | Toys | 5 | 0 | 0 | N | Y | Five Stars | Great item. Pictures pop thru and add detail as &
US | 433677 | R2B8VBEPB4YEZ7 | B00FGPU7U2 | 780517568 | Fisher-Price Octonauts Shellington's On-The-Go Pod Toy | Toys | 5 | 0 | 0 | N | Y | Five Stars | Children like it | 2015-08-31
US | 1297934 | R1CB783I7B0U52 | B0013OY0S0 | 269360126 | Claw Climber Goliath/ Disney's Gargoyles | Toys | 1 | 0 | 1 | N | Y | Shame on the seller !!! | Showed up not how it's shown . Was someone's old toy. with paint on it. | 2015-08-31
US | 52006292 | R2D90RQQ3V8LH | B00519PJTW | 493486387 | 100 Foot Multicolor Pennant Banner | Toys | 5 | 0 | 0 | N | Y | Five Stars | Really liked these. They were a little larger than I thought, but still fun. | 2015-08-31
(...)
Example 2: http://jmcauley.ucsd.edu/data/amazon/
{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0, 0], "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,", "overall": 5.0, "summary": "good", "unixReviewTime": 1393545600, "reviewTime": "02 28, 2014"}
Example 3 (the one used for the actual code, which includes ground-truth labels): https://raw.githubusercontent.com/aayush210789/Deception-Detection-on-Amazon-reviews-dataset/master/amazon_reviews.txt?fbclid=IwAR326w3vt5n51dKP7jKcBT1NQPuEbyehyz_JL8JVbDPwaqKdPYOYrG_5--0
*Label 1 means fake and Label 2 means true*
DOC_ID | LABEL | RATING | VERIFIED_PURCHASE | PRODUCT_CATEGORY | PRODUCT_ID | PRODUCT_TITLE | REVIEW_TITLE | REVIEW_TEXT
1 | __label1__ | 4 | N | PC | B00008NG7N | Targus PAUK10U Ultra Mini USB Keypad, Black | useful | When least you think so, this product will save the day. Just keep it around just in case you need it for something.
2 | __label1__ | 4 | Y | Wireless | B00LH0Y3NM | Note 3 Battery : Stalion Strength Replacement 3200mAh Li-Ion Battery for Samsung Galaxy Note 3 [24-Month Warranty] with NFC Chip + Google Wallet Capable | New era for batteries | Lithium batteries are something new introduced in the market there average developing cost is relatively high but Stallion doesn't compromise on quality and provides us with the best at a low cost. There are so many in built technical assistants that act like a sensor in their particular forté. The battery keeps my phone charged up and it works at every voltage and a high voltage is never risked.
3 | __label1__ | 3 | N | Baby | B000I5UZ1Q | Fisher-Price Papasan Cradle Swing, Starlight | doesn't swing very well. | I purchased this swing for my baby. She is 6 months now and has pretty much out grown it. It is very loud and doesn't swing very well. It is beautiful though. I love the colors and it has a lot of settings, but I don't think it was worth the money.
4 | __label1__ | 4 | N | Office Products | B003822IRA | Casio MS-80B Standard Function Desktop Calculator | Great computing! | I was looking for an inexpensive desk calcolatur and here it is. It works and does everything I need. Only issue is that it tilts slightly to one side so when I hit any keys it rocks a little bit. Not a big deal.
(...)
10501 | __label2__ | 5 | Y | Office Products | B005VCNRA2 | SafeT Sleeves RFID Protectors (Total of 8 Sleeves) | Fits fine inside a money belt | I purchased this product to separate my credit cards in my money belt for my upcoming trip to Europe. They fit just fine, and offer a peace of mind from electronic theft anywhere you go. Price was well worth it.
10502 | __label2__ | 5 | N | Toys | B00ICAKJJW | Power Wheels Nickelodeon Teenage Mutant Ninja Turtles Kawasaki KFX | Great fun for little ones. But be sure you fully charge before use. | This is probably one of the most exciting gifts you can give to a young child. Most kids are fascinated with the concept of driving. So having a little vehicle they can safely maneuver is awesome. We purchased this type of vehicle for all 3 of my kids. Everyone of them was thrilled. In fact, we parked them inside the garage facing forward as though they were & LOVE pink. I've gotten a ton of compliments on it and I use it everyday it's still in great shape. I love this bag and will easily use it after school for traveling.
Corpora Used (Algorithm)
*For the Greek algorithms: 1 = the review is fake, 0 = the review is not fake. Review 1 = R1, Review 2 = R2. The abbreviated term for each word bin is indicated in the box itself.
P = too positive
N = too negative
(Word Count) /\ (Review Length) = 0
(Number of Likes) /\ (Likes of the Review) = 0
-(R1 XOR R2)
P V N = 1
I used algorithms related to the different factors that could indicate whether a review is fake or not.
Six Qualities of Code
In the code, the method “Amazon Reviews” shows polymorphism because various inputs can be passed in and the code will still work. The parameters accepted by this method are string, boolean, long, and integer values or variables. For example, the string values are used to print the output of the program, which is “positive”, “neutral”, or “negative”. The integer values are associated with the word bins and give the percentage of specific negative and positive words in the review.
Figure 12
The code shows completeness because the code is provable on paper. The code was written out and completed. The code is sound because its logic is entailed and provable: as shown beforehand, the code can be written out as algorithms that logically make sense. For example, the variable isVerifiedPurchase is a boolean, so it can be written into a logical algorithm whose output makes sense.
Figure 13
The program is decidable because the code runs only when I provide some input. It does not enter an infinite loop; it uses selection and correct iteration, which give the right output when needed.
Figure 14
The program is correct because the code does what it should do according to what we wrote. For example, for the Strings below, which are also shown in the data set, the public void methods output the variables as they should.
Figure 15
Iteration and selection are used for efficiency: they shorten the code, run through it a specific number of times, and allow more decision making in the code. Loops and if-else statements reduce the amount of time spent hard-coding each specific method. This saves time and lets the programmer include different logical ways to solve the problem. (Specific explanations of the code were given earlier.)
Figure 16
Final Algorithm
Number of likes = L
Likes of the Review = R
Word Count = W
Review Length = S
Too Positive = P
Too Negative = N
Review 1 = R1
Review 2 = R2
-(R1 XOR R2)
(P V N)
(L V R)
(W V S)
Statistical Analysis of Results
Chi-squared test: n = 1999
Outcome | Random model | Your results
+/+ | 999 | 447
+/- | 999 | 552
-/+ | 1000 | 249
-/- | 1000 | 751
Figure 17. Evaluating test results (using graphs and high-level statistics: either z-test, chi-squared test, or linear regression).
Significance level: .05
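As a check on the figures in the table above, the following sketch computes a Pearson chi-square statistic for a table of observed counts. It assumes that the four outcome rows and the two columns (random model and results) were treated as a 4 x 2 contingency table; that reading is an assumption rather than something stated in the paper, although under it the computation gives approximately 180.2087. The class name ChiSquare is hypothetical.

```java
// Hypothetical sketch: Pearson chi-square statistic for an r x c table of counts.
final class ChiSquare {
    static double statistic(double[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowTotals = new double[rows];
        double[] colTotals = new double[cols];
        double total = 0.0;
        // First pass: accumulate row totals, column totals, and the grand total.
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowTotals[r] += observed[r][c];
                colTotals[c] += observed[r][c];
                total += observed[r][c];
            }
        }
        // Second pass: sum (observed - expected)^2 / expected over all cells.
        double chi2 = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double expected = rowTotals[r] * colTotals[c] / total;
                double diff = observed[r][c] - expected;
                chi2 += diff * diff / expected;
            }
        }
        return chi2;
    }

    public static void main(String[] args) {
        // Rows: +/+, +/-, -/+, -/-. Columns: random model, results.
        double[][] table = {
            {999, 447},
            {999, 552},
            {1000, 249},
            {1000, 751}
        };
        System.out.printf("chi-square = %.4f%n", statistic(table));
    }
}
```

A 4 x 2 table has (4 - 1) x (2 - 1) = 3 degrees of freedom, so this statistic lies far beyond the 0.05 critical value of about 7.81.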
The chi-square statistic is 180.2087. The p-value is < 0.00001. The result is significant at p < .05. This means that our code is reliable to use and does more than just guess (at the significance level α = 0.05).
Bibliography
"Amazon Customer Reviews Dataset." Amazon News, s3.amazonaws.com/amazon-reviews-pds/readme.html. Accessed 31 Oct. 2019.
"LINENSPA Shredded Foam Pillow Reviews." Amazon.
UCSD Education, jmcauley.ucsd.edu/data/amazon/. Accessed 31 Oct. 2019.
"Synonyms and Other Words Related to Food." Rhyme Zone Thesaurus, Rhyme Zone.