Casting a Wide Web Net to Research Product Safety
Editor’s note: For industries concerned about both innovation and public health, knowing everything about what goes into their products is a no-brainer. This responsibility is especially important for manufacturers of chemical-based products such as coatings, including paint. Sometimes, harmful effects of compounds don’t materialize, or become known, until years after they’re in use. As the largest trade organization for the coatings industry, the American Coatings Association is focused on helping its members be proactive when it comes to health and safety. A member of CIMS, the ACA turned to us to learn how to employ Big Data Analytics to research these concerns. The following excerpt from an article by CIMS Data Scientist Rakesh Ravi in the January/February issue of the Innovation Management Report discusses the project and its results.
Do any materials used in our products pose a risk to humans, animals and plants? This was the big question the ACA presented to CIMS. It wanted to help its members proactively identify compounds that might be regulated or have their use restricted because they’re carcinogenic, cause birth defects, or have other deleterious effects.
Answering this question would allow a company to change its manufacturing process ahead of any restriction and hence prevent supply disruptions. This could be done by periodically monitoring materials information sources on the internet such as science journals, government websites and blogs. The ACA also wanted CIMS to present a trend analysis showing how the number of occurrences of these materials of concern in the information sources varied with time.
Applying IBM Watson
IBM Watson, with its ability to process large amounts of unstructured data, was the tool we used to build and apply text analytics models as data filters. The data sources provided to us were diverse; hence, before information could be crawled, an initial model had to be built to equip Watson to recognize key terms and phrases. This was done by creating dictionaries of these key terms.
Examples of dictionaries created included the Materials dictionary, composed of the chemicals and materials to be monitored, and the Diseases dictionary made up of medical diseases and conditions.
Four dictionaries were created and combined into Parsing Rules (an IBM Content Analytics Studio feature that facilitates the creation of annotations over textual patterns). The rules identify contextual relationships by looking for at least one term from each dictionary occurring together in a sentence or paragraph.
An example of this is the materials-diseases rule, which, when applied to the data collection, would highlight the diseases associated with each material in the data sources and isolate the article or web page that contained them.
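To make the idea of a co-occurrence rule concrete, here is a minimal Python sketch. It is not Watson's parsing-rule engine and does not use any IBM API; it only illustrates the logic of requiring one term from each of two dictionaries in the same sentence. The dictionary contents, the function names and the naive sentence splitter are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny dictionaries. The project's real term lists are not
# given in the article.
MATERIALS = {"phthalate", "lead", "formaldehyde"}
DISEASES = {"asthma", "cancer", "birth defects"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence, dictionary):
    """Return the dictionary terms that occur as whole words or phrases."""
    lowered = sentence.lower()
    return sorted(
        term
        for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def materials_diseases_rule(text):
    """Return (material, disease, sentence) for every sentence that contains
    at least one term from each dictionary."""
    hits = []
    for sentence in split_sentences(text):
        materials = find_terms(sentence, MATERIALS)
        diseases = find_terms(sentence, DISEASES)
        for material in materials:
            for disease in diseases:
                hits.append((material, disease, sentence))
    return hits
```

A sentence that mentions only a material, or only a disease, produces no hit, which is how a rule of this kind filters out irrelevant text.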
Once the rules were formed, information sources were crawled using Watson Explorer. The domains included popular science journals, databases, public blogs, articles, and government websites. Because the initial step was to test our model, the number of sources was limited. After the sources were crawled, the model was applied over the crawled data from the Web; the results provided us with an insight into the accuracy of the model and its ability to minimize irrelevant data.
Refining the Search
After completing the initial model, we improved it by adding more dictionaries and rules to further refine our search. We had 11 dictionaries and 18 rules in our final model.
Following input from ACA subject matter experts, we added more information sources that were likely to contain data relevant to the search. With these additions, the corpus drew on about 200 sources; we crawled over a million URLs and downloaded more than 700,000 documents. The dictionaries and rules were applied as filters to the downloaded data and the results were analyzed using the Watson Search Analytics Interface. The relationships were annotated by the model and extracted using the various dictionaries and rules.
The ACA also wanted us to model these relationships and conduct further analysis of the data based on additional metadata such as source weighting, temporal analysis and trending. An interesting and relevant observation we made while applying the initial model to the collection was that one of the materials of concern, phthalate, was mentioned in public as well as professional sources and is currently being researched for the human health effects of exposure to it.
Our next challenge is to show how the mention of these materials of concern in the information sources varies with time; that is, the ACA wants to know how the number of occurrences/mentions of these materials trended over a period of time. Our strategy to accomplish this will be to crawl the same information sources once every three months.
Once the crawling and indexing have been done by IBM Watson, we will download the documents from Watson and process them with scripts written in languages such as Python and shell. The scripts will parse the data to extract information pertaining to the rules created. The extracted information will then be inserted into a relational database.
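As a rough sketch of the storage step, the following Python uses the standard-library sqlite3 module to hold extracted rule matches. The article does not name the database or describe its schema, so the table layout, column names and record format here are assumptions for illustration only.

```python
import sqlite3


def create_store(conn):
    """Create a hypothetical table holding one row per extracted rule match."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mentions (
               crawl_date   TEXT NOT NULL,  -- ISO date of the crawl
               source_url   TEXT NOT NULL,
               rule         TEXT NOT NULL,  -- e.g. 'materials-diseases'
               material     TEXT NOT NULL,
               related_term TEXT NOT NULL
           )"""
    )


def insert_mentions(conn, crawl_date, records):
    """Insert parsed records (dicts with assumed keys) for a single crawl."""
    rows = [
        (crawl_date, r["url"], r["rule"], r["material"], r["related_term"])
        for r in records
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO mentions VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def count_by_material(conn, crawl_date):
    """Return [(material, mention_count), ...] for one crawl, sorted by name."""
    cursor = conn.execute(
        "SELECT material, COUNT(*) FROM mentions "
        "WHERE crawl_date = ? GROUP BY material ORDER BY material",
        (crawl_date,),
    )
    return cursor.fetchall()
```

Keeping the crawl date on every row is the design choice that matters here: it lets later queries compare counts for the same material across successive crawls.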
Storing this information will allow CIMS to create a dashboard representation of the trends of various materials so that companies can easily trace growing or diminishing interest in materials among research groups.
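The trend calculation behind such a dashboard could look something like the Python sketch below. It assumes mention counts are available as (crawl date, material, count) tuples with ISO-formatted dates; that record shape and the function name are hypothetical, and the dashboard itself is not shown.

```python
from collections import defaultdict


def mention_trends(records):
    """Summarize mentions per material across crawls.

    records: iterable of (crawl_date, material, count) tuples, where
    crawl_date is an ISO 'YYYY-MM-DD' string (an assumed format).
    Returns {material: [(crawl_date, total, change_since_previous), ...]},
    with change_since_previous set to None for the first crawl.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for crawl_date, material, count in records:
        totals[material][crawl_date] += count

    trends = {}
    for material, by_date in totals.items():
        series = []
        previous = None
        # ISO date strings sort chronologically, so a plain sort is enough.
        for crawl_date in sorted(by_date):
            total = by_date[crawl_date]
            change = None if previous is None else total - previous
            series.append((crawl_date, total, change))
            previous = total
        trends[material] = series
    return trends
```

A positive change between two quarterly crawls would indicate growing attention to a material in the monitored sources, and a negative change diminishing attention.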
Power of Unstructured Text Analytics
This project brings out the true power of Unstructured Text Analytics: when a large corpus of web information is gathered and well-constructed rules are applied, valuable information is extracted. Doing this through regular Google searches would be extremely inefficient, not to mention time-consuming.
This is another example of how CIMS uses Big Data Analytics to provide companies with answers that are useful and could not be obtained by other means.—Rakesh Ravi