Pop quiz on statistics and data science (answers at the end of the article):
1) I have some data on accidents at railroad crossings. One variable indicates the compass direction a railroad crossing faces (North, Northwest, Northeast, and so on). This variable is a/an:
3) “ELT” in data science is the acronym for:
- Evaluate, Load, Transfer
- Extract, Load, Transfer
- Evaluate, Load, Transform
- Extract, Load, Transform
Tony Hirst, a Senior Lecturer at the Open University in the United Kingdom, recently wrote about the need to educate open data users. As he observed, many countries are making their data available for users to create visualizations and apps from the data. However, much of the data made available is “dirty” in that there are misspellings, missing values and inconsistencies. The user would need to be familiar with the various ELT tools to arrange the data into usable forms.
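To make "dirty" data concrete, here is a minimal sketch of the kind of cleanup a user might do before any analysis. It uses only the Python standard library. The records, field names, misspelling table and missing-value markers are all invented for illustration; they are not drawn from any actual government dataset, and a real dataset would need its own rules.

```python
# Hypothetical raw records; field names and values are invented for illustration.
RAW_ROWS = [
    {"crossing_id": "101", "direction": " north ", "accidents": "3"},
    {"crossing_id": "102", "direction": "Nrth", "accidents": ""},
    {"crossing_id": "103", "direction": "NORTHWEST", "accidents": "n/a"},
    {"crossing_id": "104", "direction": "Northeast", "accidents": "1"},
]

# Assumed lookup of known misspellings; a real dataset would need its own list.
SPELLING_FIXES = {"nrth": "north"}

# Assumed set of strings that should be treated as "no value recorded".
MISSING_MARKERS = {"", "n/a", "na", "null"}


def clean_row(row):
    """Return a copy of row with a normalized direction and an int-or-None count."""
    # Fix inconsistencies in spacing and capitalization, then known misspellings.
    direction = row["direction"].strip().lower()
    direction = SPELLING_FIXES.get(direction, direction)

    # Keep missing values explicit as None rather than silently using zero.
    raw_count = row["accidents"].strip().lower()
    accidents = None if raw_count in MISSING_MARKERS else int(raw_count)

    return {
        "crossing_id": row["crossing_id"],
        "direction": direction.capitalize(),
        "accidents": accidents,
    }


def clean_rows(rows):
    """Clean every row in a list of raw records."""
    return [clean_row(row) for row in rows]


if __name__ == "__main__":
    for cleaned in clean_rows(RAW_ROWS):
        print(cleaned)
```

Recording a missing count as `None` instead of `0` is a deliberate choice in this sketch: a crossing with no reported figure is not the same as a crossing with no accidents, and treating the two alike is one way a badly prepared dataset leads to wrong conclusions.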
Once the data is in good shape, the user would need to know how to use the data. That involves understanding basic statistics so as to know the appropriate statistical techniques to apply for analysis and visualization. Alternatively, an app developer would need to know how to create a well-formed request to an API (application programming interface).
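As a rough illustration of what a "well-formed request" involves, the sketch below assembles a request with the Python standard library without sending it. The base URL, the `X-API-KEY` header, the `limit` parameter and its bounds are all hypothetical; each agency's API documentation defines the actual endpoints, parameters and authentication scheme.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical base URL; a real agency API would publish its own.
BASE_URL = "https://api.example.gov/v1"


def build_request(dataset, api_key, filters=None, limit=100):
    """Build (but do not send) a GET request for a hypothetical dataset API."""
    # Check required inputs before building anything.
    if not dataset:
        raise ValueError("dataset name is required")
    if not api_key:
        raise ValueError("api_key is required")
    # Assumed bounds; real APIs document their own paging limits.
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")

    params = {"limit": limit}
    params.update(filters or {})

    # urlencode escapes spaces and special characters in parameter values;
    # quote does the same for the dataset name in the URL path.
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URL}/{quote(dataset, safe='')}?{query}"

    # The header name for the key is an assumption made for this sketch.
    return Request(url, headers={"X-API-KEY": api_key, "Accept": "application/json"})


if __name__ == "__main__":
    request = build_request("injuries", "demo-key", {"state": "New York"})
    print(request.full_url)
```

Validating the inputs and letting the library do the escaping, rather than pasting strings together by hand, is what keeps a value such as "New York" from producing a malformed URL.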
As I often tell my students, there are many good tools that let one create an in-depth statistical analysis or build a sophisticated mobile app easily and with little background knowledge. That is both a benefit and a curse. The benefit is that the tools make it much easier to extract value from government datasets. The curse is that wrong conclusions can be drawn from a badly analyzed dataset. Even worse, damage can result from relying on a poorly designed app that uses a government API.
Therefore, are federal agencies under an obligation to provide training on how to use the open data sources that they provide?
I do not know the answer to that question, but I do know that some federal agencies do a wonderful job of educating the public on how to use the datasets they release. For example, the Department of Health and Human Services (HHS) provides good documentation on HealthData.gov. On this beautifully designed site, users can easily search for the appropriate health datasets, view blog articles on how to use them and contact HHS with any questions concerning the datasets and APIs.
The Developer Portal at the Department of Labor (DOL) is also a well-designed site to help users more effectively use labor datasets and APIs. On the Developer Portal homepage, users can choose the “Beginner” or “Experienced” path through the DOL site. There is a tutorial on how the DOL APIs work and extensive documentation in eight different programming languages for the DOL APIs. The datasets catalog is well-organized with good descriptions of the datasets and how to use them.
There was, and still may be, an expectation that users who access federal government data would know how to use the data properly. However, this may no longer be true given the great value of federal data; the increasing number of datasets and APIs being published; and the new tools that make it easy to access datasets and APIs. In the interests of public safety, economic innovation and increasing trust in government, maybe federal agencies should increase their training efforts in how to use federal government open data.
Answers to the pop quiz: 1. B; 2. B; 3. D.
Each week, The Data Briefing showcases the latest federal data news and trends.
_Dr. William Brantley is the Training Administrator for the U.S. Patent and Trademark Office’s Global Intellectual Property Academy. You can find out more about his personal work in open data, analytics, and related topics at BillBrantley.com. All opinions are his own and do not reflect the opinions of the USPTO or GSA._