By Christian Prokopp on 2022-05-03
Many Amazon marketplace customers know that its huge product catalogue has data quality issues. However, they might expect its top sellers, which they frequently see and buy, to be accurate. Bold Data, which processes hundreds of millions of products daily, has a unique ability to find hidden insights and issues. One example is active Amazon bestsellers whose names result from data processing errors.
Amazon serves data on hundreds of millions of products on its websites. Marketplace sellers provide much of it, and many products are rarely seen or sold. Over the last decade, Bold Data's founder has seen a lot of poor data in this long tail. This included gems like Amazon test data that somehow made it onto the public website.
Last week, while processing the bestsellers of amazon.co.uk, amazon.de, and amazon.com, something peculiar surfaced. Bestsellers are products ranked in the top 100 in at least one product category. We found that four bestsellers had no names. Based on how the Amazon bestseller website presents its data, that should not be possible. What happened?
Looking at the Amazon web pages for the nameless products, what happened quickly becomes apparent to people familiar with data or software engineering.
But even if you are not an engineer, you can see the names in the images sound strange. Computer systems use NULL, NaN, NA, and similar values to indicate that a field or attribute has no data. In simple terms, and making some inferences, the upload into Amazon contained such a no-data marker, e.g. NULL. Instead of failing, the system converted the marker into text and wrongly stored it as the product name in Amazon's product catalogue. When Bold Data analysed the data, the error reappeared.
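The following is a minimal Python sketch of how such an error can arise and how a pipeline might guard against it. It is an illustration under assumptions, not Amazon's or Bold Data's actual code: the function names `naive_name` and `clean_name` and the `NULL_LIKE` placeholder list are hypothetical.

```python
from typing import Optional

# Hypothetical list of strings that often stand in for "no data".
NULL_LIKE = {"", "null", "none", "nan", "na", "n/a"}


def naive_name(value: Optional[str]) -> str:
    """Mimic the suspected bug: a missing value is silently coerced to text."""
    return str(value)


def clean_name(value: Optional[str]) -> Optional[str]:
    """Return None if the name is missing or only a null-like placeholder."""
    if value is None:
        return None
    text = value.strip()
    if text.lower() in NULL_LIKE:
        return None
    return text


# A missing name turns into the literal text "None" instead of failing.
stored = naive_name(None)
print(repr(stored))              # 'None'
# A defensive reader treats the placeholder text as missing again.
print(clean_name(stored))        # None
print(clean_name("  Kettle  "))  # Kettle
```

A blanket list like this can also reject legitimate names that happen to match a placeholder, so a real pipeline would likely flag such records for review rather than drop them.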
You may know the saying "garbage in, garbage out", which computer scientists use. Data engineers and data scientists in particular use it to highlight that if your foundation, the data, is flawed, then so is the outcome. Therefore, experienced data professionals prioritise sourcing and processing accurate data over applying increasingly complex analytics or machine learning algorithms.
While it is unfortunate and surprising that these items made it into the bestsellers without names, it is not as bad as it seems. The dataset analysed comprised 4.88 million products from three Amazon websites: the United States, Great Britain, and Germany. So close to one in a million was wrong, a small number. However, it demonstrates that our systems have to expect and accommodate data errors. Mined data is only as good as its source system.
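The "close to one in a million" figure follows directly from the numbers above. A quick check in Python, using only the four nameless bestsellers and the 4.88 million products mentioned in this post (the variable names are illustrative):

```python
nameless = 4        # bestsellers found without a name
total = 4_880_000   # products in the analysed dataset

# Error rate expressed per million products.
rate_per_million = nameless / total * 1_000_000
print(f"{rate_per_million:.2f} nameless products per million")  # 0.82

# Equivalent "one in N" framing.
print(f"about 1 in {total // nameless:,}")  # about 1 in 1,220,000
```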
While the error rate is low, the product name is a prominent attribute, and other attributes receive less scrutiny. We will publish datasets, analyses and more findings in the future. Be sure to subscribe to our email list so as not to miss these updates.
The challenge described in this post is one of many that data mining and data engineering face daily. The Internet has an abundance of valuable data. Mining and processing data at scale, at low cost, and with high confidence and quality is complex and requires decades of experience. This is precisely what Bold Data has focused on for our customers: creating affordable, reliable datasets and decision support analytics so you can make better decisions daily.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.