The Paint and Coatings industry tries to keep abreast of possibly deleterious effects that materials used in its products might have on humans, animals and plants. Consequently, the American Coatings Association (ACA) presented to CIMS its need for early identification of materials that might be regulated or have their use restricted because of harmful effects such as carcinogenicity, birth defects, and so forth.
In his article below, CIMS lead data scientist Rakesh Ravi explains how CIMS responded with a project that brings out the true power of Unstructured Text Analytics.
Some of the questions the ACA wanted answered were:
- What CDC-listed materials of concern are used in coatings?
- Which of these CDC-listed materials are referenced in public forums as potentially harmful to human health?
- Which of the CDC-listed materials are being researched or considered for substitution?
Answering these questions would allow a company to change its manufacturing process and hence prevent supply disruptions. This would be made possible by periodically monitoring materials information sources on the internet, such as science journals, government websites and blogs. The ACA also wanted CIMS to present a trend analysis showing how the number of occurrences of these materials of concern in the information sources varied with time.
Applying IBM Watson
IBM Watson, with its ability to process large amounts of unstructured data, was the tool we used to build and apply text analytics models as data filters. The data sources provided to us were diverse, so before any information could be crawled an initial model had to be built to equip Watson to recognize key terms and phrases. This was done by creating dictionaries of these key terms.
Examples of dictionaries created included the Materials dictionary, composed of the chemicals and materials to be monitored, and the Diseases dictionary made up of medical diseases and conditions.
Four dictionaries were created and combined in Parsing Rules (an IBM Content Analytics Studio feature that facilitates creation of annotations over textual patterns) to identify contextual relationships by looking for at least one term from each dictionary occurring together in a sentence or paragraph.
An example of this is the materials_diseases rule, which when applied to the data collection would highlight the diseases associated with each material as present in the data sources and isolate the article or webpage that contained them.
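The co-occurrence logic behind a rule such as materials_diseases can be sketched in a few lines of Python. This is only an illustration, not the Content Analytics Studio implementation: the dictionary entries, the sample text, and the function names are assumptions made for the example, and the sentence splitting is deliberately naive.

```python
import re

# Hypothetical dictionary entries, for illustration only.
MATERIALS = {"phthalate", "toluene", "lead chromate"}
DISEASES = {"asthma", "cancer", "birth defects"}


def find_terms(sentence, dictionary):
    """Return dictionary terms found as whole words (optional plural 's')."""
    lowered = sentence.lower()
    return {
        term
        for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"s?\b", lowered)
    }


def materials_diseases(text):
    """Yield (material, disease, sentence) for sentences containing both."""
    # Naive split: sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        materials = find_terms(sentence, MATERIALS)
        diseases = find_terms(sentence, DISEASES)
        for material in sorted(materials):
            for disease in sorted(diseases):
                yield material, disease, sentence


sample = (
    "Studies link phthalates to asthma in children. "
    "Toluene is a common solvent. "
    "Cancer registries were reviewed."
)
matches = list(materials_diseases(sample))
for material, disease, sentence in matches:
    print(f"{material} -> {disease}: {sentence}")
```

Only the first sentence of the sample contains a term from both dictionaries, so it is the only one reported; sentences that mention a material or a disease alone are filtered out, which is how such a rule reduces irrelevant data.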
Once the rules were formed, information sources were crawled using Watson Explorer. The domains included popular science journals, databases, public blogs, articles, and government websites. Because the initial step was to test our model, the number of sources was limited. After the sources were crawled, the model was applied over the crawled data from the Web; the results provided us with an insight into the accuracy of the model and its ability to minimize irrelevant data.
Refining the Search
After completing the initial model we improved it by adding more dictionaries and rules to further refine our search. We had 11 dictionaries and 18 rules in our final model.
Following inputs from ACA subject matter experts, we added more information sources that were likely to contain data relevant to the search. With these additional sources, the corpus consisted of about 200 sources; we crawled over a million URLs and downloaded more than 700,000 documents. The dictionaries and rules were applied as filters to the downloaded data and the results were analyzed using the Watson Search Analytics Interface. The relationships were annotated by the model and extracted using the various dictionaries and rules.
The ACA also wanted us to model these relationships and conduct further analysis of the data based on additional metadata such as source weighting, temporal analysis and trending. Our plan for this is described in the Next Steps, below. An interesting and relevant observation we made while applying the initial model to the collection was that one of the materials of concern, phthalate, was mentioned in public as well as professional sources and is currently being researched for the human health effects of exposure to it.
Next Steps
Our next challenge is to show how mentions of these materials of concern in the information sources vary with time; that is, the ACA wants to know how the number of mentions of these materials has trended over a period of time. Our strategy to accomplish this will be to crawl the same information sources once every three months.
Once the crawling and indexing have been done by IBM Watson, we will download the documents from Watson and process them with scripts written in languages such as Python and shell. The scripts will parse the data to extract information pertaining to the rules created. The extracted information will then be inserted into a relational database.
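A minimal sketch of the load step follows, assuming each extracted rule match has already been reduced to a small record. The table layout, column names, dates, and URLs are hypothetical; the plan does not name a schema or a database product, so Python's built-in sqlite3 module stands in here.

```python
import sqlite3

# Hypothetical records produced by the parsing scripts:
# (crawl_date, source_url, rule_name, material)
rows = [
    ("2016-01-15", "https://example.org/a", "materials_diseases", "phthalate"),
    ("2016-01-15", "https://example.org/b", "materials_diseases", "phthalate"),
    ("2016-04-15", "https://example.org/c", "materials_diseases", "toluene"),
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent store
conn.execute(
    """
    CREATE TABLE rule_match (
        crawl_date TEXT NOT NULL,
        source_url TEXT NOT NULL,
        rule_name  TEXT NOT NULL,
        material   TEXT NOT NULL
    )
    """
)
# Parameterized inserts avoid quoting problems with scraped text.
conn.executemany("INSERT INTO rule_match VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Count mentions of each material per crawl.
counts = conn.execute(
    """
    SELECT crawl_date, material, COUNT(*)
    FROM rule_match
    GROUP BY crawl_date, material
    ORDER BY crawl_date, material
    """
).fetchall()
print(counts)
```

Keeping the crawl date on every row is the design choice that matters: it lets one table hold every quarterly crawl and leaves the per-period aggregation to a query.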
Storing this information will allow CIMS to create a dashboard representation of the trends for various materials so that companies can easily trace the growing or diminishing interest in these materials among research groups.
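One simple way to label interest as growing or diminishing, offered here as an assumption rather than the method CIMS will adopt, is to fit a least-squares line through the per-crawl mention counts and look at the sign of its slope. The material names and counts below are made up for illustration.

```python
def slope(counts):
    """Least-squares slope of counts against crawl index 0, 1, 2, ..."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two crawls to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def trend_label(counts, tolerance=0.0):
    """Classify a series of mention counts by the sign of its slope."""
    s = slope(counts)
    if s > tolerance:
        return "growing"
    if s < -tolerance:
        return "diminishing"
    return "flat"


# Hypothetical mention counts from four consecutive quarterly crawls.
quarterly_mentions = {
    "material_a": [12, 15, 21, 30],
    "material_b": [40, 33, 25, 20],
    "material_c": [8, 8, 8, 8],
}
labels = {name: trend_label(c) for name, c in quarterly_mentions.items()}
print(labels)
```

A nonzero `tolerance` would keep small quarter-to-quarter fluctuations from being reported as a trend; a suitable value would have to be chosen from real crawl data.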
Power of Unstructured Text Analytics
This project brings out the true power of Unstructured Text Analytics, wherein a large corpus of web information is gathered and, by applying well-constructed rules, valuable information is extracted. Doing this by regular Google searches would be extremely inefficient, not to mention time consuming.
This is another example of how CIMS uses Big Data Analytics to provide companies with answers that are useful and could not be obtained by other means.
Rakesh Ravi, Lead Data Scientist at CIMS; firstname.lastname@example.org