Our data processing techniques
We automatically consume, parse, cleanse and index all your product data (product photos, unstructured data, structured data) whenever you update it. This is achieved using your standard product feed, often with few or no amendments (for example, using Google Shopping or Yandex standard feeds).
The data ingest module handles every aspect of taking your standard feed and preparing it for Hullabalook front end discovery experiences. The process starts by polling your server; the module then downloads the feed and compares it with the most recent previous version, which lets us derive any modifications or insertions since the data was last checked. Each new or updated record is then processed through the data ingest pipeline.
We have three parallel data processing streams:
– one for images;
– one for text; and
– one for structured data.
In each stream we run a configurable series of data ‘smarts’ which increase the value of your data, and improve its readiness for Hullabalook experiences. The smarts include computer vision (image stream), significance analysis (text stream) and population outlier analysis (structured data stream).
You can also read more about how our front end components make use of this data here.
Feature extraction identifies the hundreds or thousands of shapes, tones and groups of pixels which are notable about a particular image. The process of Dimensionality Reduction compares these hundreds or thousands of features per image across thousands of images, and transforms the resulting tens of thousands of image features into a smaller number which are most effective at identifying what is different or distinctive between images. Principal Component Analysis reduces these to a smaller number of independent abstract features, which are a combination of several image attributes. Neighbourhood embedding allows these principal components to be compared in two-dimensional space, so that images with similar visual features will be closely associated in the same ‘neighbourhood’, and those which are visually very different belong to different neighbourhoods.
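As an illustration, the dimensionality-reduction step can be sketched as a toy Principal Component Analysis using numpy’s SVD. The feature values below are invented; a real pipeline works on thousands of images and would use a dedicated neighbourhood embedding (such as t-SNE) for the final two-dimensional map:

```python
import numpy as np

# Toy feature matrix: 6 images x 5 extracted visual features (invented values).
# Images 0-1 share features, images 2-3 share different ones, 4-5 another set.
features = np.array([
    [2.0, 1.9, 0.1, 0.2, 0.1],
    [2.1, 2.0, 0.0, 0.1, 0.2],
    [0.1, 0.2, 1.8, 1.9, 0.1],
    [0.2, 0.1, 2.0, 2.1, 0.0],
    [1.0, 1.1, 1.0, 0.9, 2.0],
    [0.9, 1.0, 1.1, 1.0, 2.1],
])

# Centre the data, then use SVD to find the principal components.
centred = features - features.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)

# Project every image onto the top two components: a 2-D 'neighbourhood' map.
embedding = centred @ vt[:2].T

# Visually similar images (0 and 1) land close together in the map;
# visually different images (0 and 2) land further apart.
d_similar = np.linalg.norm(embedding[0] - embedding[1])
d_different = np.linalg.norm(embedding[0] - embedding[2])
```

The key property is that distances in the reduced space preserve what is distinctive between images, so nearby points form a visual neighbourhood.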
Every image has thousands or tens of thousands of pixels, and each pixel can have one of almost 17 million different colours. Unsupervised machine learning is used to bundle together sets of pixels with identical or very similar colours. Colours which are considered important to the product are indexed. By combining this approach with Object Detection, it’s possible to understand the colours of specific objects within an image.
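The colour-bundling step can be sketched with a minimal k-means clustering over pixel colours; the synthetic ‘image’ below (mostly red pixels plus some blue ones) and the simple deterministic initialisation are purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 'image': 200 pixels near red plus 100 pixels near blue (RGB 0-255).
red = rng.normal([220, 30, 30], 8, size=(200, 3))
blue = rng.normal([30, 30, 220], 8, size=(100, 3))
pixels = np.vstack([red, blue])

def kmeans_colours(pixels, k=2, iters=20):
    """Minimal k-means: bundle pixels into k colour clusters."""
    # Initialise from the first and last pixel (deterministic for this sketch).
    centres = pixels[[0, -1]].copy()
    for _ in range(iters):
        # Assign each pixel to its nearest centre.
        labels = np.argmin(
            ((pixels[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        # Move each centre to the mean of its assigned pixels.
        centres = np.array([pixels[labels == i].mean(0) for i in range(k)])
    return centres, labels

centres, labels = kmeans_colours(pixels)
# The dominant colour is the centre of the largest cluster (red here).
dominant = centres[np.bincount(labels).argmax()]
```

Running the same clustering only over the pixels inside a detected object’s bounding box gives that object’s colours rather than the whole image’s.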
Automated Object Detection
Convolutional Neural Networks (CNNs) are machine learning models which learn somewhat like a human brain – by making connections and chaining together abstract information spotted in sensory stimuli to produce intelligent insight; in particular, they identify how these pieces of information are spatially related to each other. CNNs are ideally suited to learning about images and extracting features from them which can be used to make intelligent decisions, such as detecting specific objects which appear in a product image.
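The spatial building block of a CNN can be illustrated with a single hand-crafted convolution in numpy. A trained network learns its kernels rather than having them written by hand; the tiny image and vertical-edge kernel below are invented for the sketch:

```python
import numpy as np

# A toy 6x6 greyscale 'image': bright left half, dark right half.
image = np.zeros((6, 6))
image[:, :3] = 1.0

# A 3x3 vertical-edge kernel, like a filter a CNN might learn.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

def conv2d(img, k):
    """Valid 2-D convolution (cross-correlation, as in CNN libraries)."""
    h = img.shape[0] - k.shape[0] + 1
    w = img.shape[1] - k.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (img[i:i + 3, j:j + 3] * k).sum()
    return out

feature_map = conv2d(image, kernel)
# The response peaks where bright meets dark: the kernel has 'detected'
# the edge and, crucially, where it sits in the image.
edge_col = int(np.abs(feature_map[0]).argmax())
```

Stacking many learned kernels in successive layers is what lets a CNN build up from edges to shapes to whole objects.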
Bayesian classifiers take ‘features’ (in this case, words and groups of words) and estimate the probability of a particular outcome. Each time a new feature is encountered, the classifier judges whether it adds useful predictive information that informs the overall probability of different outcomes. Logistic regression, support vector machines and other techniques use different approaches to achieve the same end – taking input data and learning to classify which outcome is most likely. Ensemble models take the ‘best guess’ from several of these models and allow a weighted voting system to select the democratically most likely tag.
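A naive Bayes text classifier of this kind can be written in a few lines of plain Python; the training snippets and category tags below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative training set: product copy snippets -> category tags.
train = [
    ("soft cotton floral summer dress", "dress"),
    ("flowery maxi dress with belt", "dress"),
    ("leather running shoe with laces", "shoe"),
    ("lightweight trainer shoe mesh", "shoe"),
]

# Count word frequencies per class.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    """Pick the class with the highest log-posterior, Laplace-smoothed."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            # Each word ('feature') nudges the probability of each outcome.
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

An ensemble would run several such models (naive Bayes, logistic regression, SVMs) over the same text and let a weighted vote pick the winning tag.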
Automated cleansing and parsing of text takes blocks of free-form copy and other text and extracts useful insight. These techniques include:
– Regular expression pattern matching (identifying and classifying words and groups of words based on patterns – TV refresh frequency (eg. 60Hz) follows a specific pattern, and so on);
– Fuzzy matching (standardising variations of the same word independent of punctuation, capitalisation and so on);
– Lemmatisation (recognising that ‘swimmer’ and ‘swimming’ come from the same root word);
– Numeric extraction (such as loading dimensions into a separate numeric field, and interpreting the units in the text);
– Wildcard matching (where parts of a word are specified, but variations in spelling and additional characters are permitted);
– Proximity analysis (where groups of words appearing together are used to understand context and meaning);
– Spelling distances (where misspellings are allowed for).
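A few of the techniques above can be sketched with the Python standard library alone; the product copy, vocabulary and misspelling below are made up for the example:

```python
import re
import difflib

copy = "Stunning smart TV, 60Hz refresh rate, Dimensions: 124 x 72 x 8 cm"

# Regular expression pattern matching: refresh frequency follows 'NNHz'.
refresh = re.search(r"(\d+)\s*Hz", copy)
refresh_hz = int(refresh.group(1))

# Numeric extraction: pull dimensions into separate numeric fields, with units.
dims = re.search(r"(\d+)\s*x\s*(\d+)\s*x\s*(\d+)\s*(cm|mm|in)", copy)
width, height, depth = (int(dims.group(i)) for i in (1, 2, 3))
unit = dims.group(4)

# Fuzzy matching / spelling distances: standardise a misspelt variant
# against a known vocabulary using sequence similarity.
known_terms = ["television", "refresh rate", "dimensions"]
match = difflib.get_close_matches("televsion", known_terms, n=1, cutoff=0.7)
```

Production pipelines layer many such extractors, but each one is essentially a pattern or distance rule of this shape.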
A single idea can be represented by many different words. What you call ‘floral,’ others call ‘flowery’. It is crucial that a single consistent idea is identified, no matter which of the many word options is used to describe it. General word networks and industry-specific thesauruses must be used together to distil and consolidate the breadth of language used in product copy and other text-based information into a consistent, consolidated set of tags.
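At its simplest, this consolidation is a lookup from each word variant to one canonical tag; the hand-built thesaurus below is a stand-in for the combined general word network and industry thesaurus:

```python
# Illustrative thesaurus: many word variants map to one canonical tag.
THESAURUS = {
    "floral": "floral", "flowery": "floral", "flower print": "floral",
    "red": "red", "crimson": "red", "scarlet": "red",
}

def consolidate_tags(words):
    """Map each recognised word to its canonical tag, dropping duplicates."""
    tags = []
    for word in words:
        tag = THESAURUS.get(word.lower())
        if tag and tag not in tags:
            tags.append(tag)
    return tags

tags = consolidate_tags(["Flowery", "scarlet", "floral", "unknown"])
```

However the copy is phrased, downstream components then only ever see the canonical tags.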
Significance analysis is the process of identifying the importance of a concept, by comparing its prevalence within a subset of documents to its prevalence within a wider corpus of documents. Those concepts which are much more prevalent are considered highly significant to the set of documents – this process can be run in real time to identify changes in significance seen during a stream of event data.
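The core calculation is a prevalence ratio: how often a concept appears in the subset compared with the corpus as a whole. The toy descriptions below (with the ‘subset’ standing in for, say, items a shopper engaged with) are invented:

```python
# Corpus of product descriptions; the 'subset' might be items a shopper clicked.
corpus = [
    "red floral summer dress", "blue denim jacket", "red leather boots",
    "green floral skirt", "black running shoes", "floral print scarf",
]
subset = ["red floral summer dress", "green floral skirt", "floral print scarf"]

def significance(term, subset, corpus):
    """Prevalence of a term in the subset relative to the whole corpus."""
    in_subset = sum(term in doc.split() for doc in subset) / len(subset)
    in_corpus = sum(term in doc.split() for doc in corpus) / len(corpus)
    return in_subset / in_corpus

# 'floral' is twice as prevalent in the subset as in the corpus: significant.
floral_sig = significance("floral", subset, corpus)
# 'red' is no more prevalent in the subset than overall: not significant.
red_sig = significance("red", subset, corpus)
```

Recomputing these ratios over a sliding window of event data is what allows shifts in significance to be spotted in real time.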
Semi-supervised learning is useful for data challenges where you are confident there is some inherent structure and classification, but you don’t have all of the labels or outcomes available to train a fully supervised model. These techniques – such as Learning Vector Quantization or Self-Organising Maps – pay attention to the specific labels when you have them, but look for underlying structure or patterns in the data when you don’t. This allows new labels to emerge over time as your data evolves, without requiring human intervention.
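A crude sketch in the spirit of these prototype methods: prototypes are seeded from the few labelled points, then nudged towards the unlabelled points they claim (real LVQ and SOMs use more refined update rules; the 2-D feature values here are invented):

```python
import numpy as np

# Toy 2-D product feature points: two labelled, the rest unlabelled.
labelled = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 1])
unlabelled = np.array([[0.5, 0.3], [4.8, 5.2], [0.2, 0.4], [5.1, 4.7]])

# Prototypes start at the labelled points and are pulled towards the
# unlabelled points nearest to them, refining the class regions.
prototypes = labelled.copy()
lr = 0.2
for x in unlabelled:
    nearest = np.argmin(((prototypes - x) ** 2).sum(1))
    prototypes[nearest] += lr * (x - prototypes[nearest])

def predict(x):
    """New points inherit the label of the nearest refined prototype."""
    return int(labels[np.argmin(((prototypes - np.asarray(x)) ** 2).sum(1))])
```

Adding a rule that spawns a new prototype when a point is far from all existing ones is one way new labels can emerge without human intervention.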
Active Learning is a technique for capturing targeted outcome data and using it to improve a system. In any machine learning environment, the greater the scope and completeness of the outcome data, the more accurate the learning will be; this approach therefore allows the system to create specific outcomes with the purpose of increasing coverage of the outcome data. In effect, the system is allowed to ask questions of shoppers in order to find out their preferences, rather than relying on outcome data which emerges organically – this accelerates the learning process compared to traditional methods, and allows it to become more personalised.
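The simplest way to choose which question to ask is uncertainty sampling: query the item whose predicted outcome the model is least sure of. The items and probability scores below are made up:

```python
# Illustrative predicted probabilities that a shopper likes each unrated item.
# 0.95 = confidently liked, 0.10 = confidently not; 0.52 = genuinely unsure.
predictions = {"sofa": 0.95, "lamp": 0.52, "rug": 0.10, "desk": 0.46}

def most_uncertain(predictions):
    """Uncertainty sampling: pick the item whose probability is nearest 0.5."""
    return min(predictions, key=lambda item: abs(predictions[item] - 0.5))

# The system would surface this item (or a question about it) to the shopper,
# record the answer as a new labelled outcome, and retrain.
query_item = most_uncertain(predictions)
```

Each answered query adds outcome data exactly where the model's coverage is weakest, which is why active learning converges faster than waiting for organic feedback.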
Outlier analysis considers all data available and identifies specific cases when something out of the ordinary occurs. It automatically takes into account how likely a particular set of circumstances (or combination of attribute values in product data) is to occur, given the context of a body of real-world evidence. A combination which has a very low probability is considered an ‘outlier’ and can be further analysed either in an automated or manual fashion depending on the scenario.
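One simple form of this is flagging attribute combinations whose empirical probability in the catalogue falls below a threshold; the toy records and the 5% threshold below are invented (TVs measured in inches are normal here, so a TV measured in centimetres stands out):

```python
from collections import Counter

# Toy catalogue records: (category, size_unit) attribute pairs.
records = (
    [("tv", "inches")] * 40
    + [("sofa", "cm")] * 40
    + [("rug", "cm")] * 19
    + [("tv", "cm")]  # one suspicious combination
)

counts = Counter(records)
total = len(records)

def combo_probability(combo):
    """Empirical probability of an attribute combination in the catalogue."""
    return counts[combo] / total

# Combinations below a small probability threshold are flagged as outliers.
outliers = [c for c in counts if combo_probability(c) < 0.05]
```

Flagged combinations can then be routed to an automated correction step or queued for manual review, depending on the scenario.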