2008-04-15 EGU Mtg Vienna Tagging

From Datafedwiki

Revision as of 18:36, 7 October 2008 by WikiSysop (Talk | contribs)

Title: Harmonization and Integration of Semi-Structured Data Through Controlled Tagging
Date: 2008/4/14
Location: Vienna, AT
Report Formats:


Session ESSI6: Data and Metadata Models & Mark-up Languages

Conveners: Nativi, S.; Woolf, A.; Domenico, B.


Harmonization and Integration of Semi-Structured Data through Wikis and Controlled Tagging.

E. M. Robinson and R.B. Husar

Abstract

The contents of cyberspace are increasingly generated and distributed by individuals. This is manifested by the explosive growth of web-based social software such as wikis, media-sharing services and blogs. This architectural, technological and cultural transformation of the Internet, commonly referred to as Web 2.0, is good news for the Earth Science community, since it offers new possibilities for sharing and harvesting community-provided content as well as for collaboratively creating new content. One key feature of all of these new tools is the end user's ability to add tags, which adds value by extending the metadata of the tagged object. Ad hoc tagging (folksonomy) gives a rich description of Internet resources, but it has the disadvantage of providing only a fuzzy schema. The semantic uniformity of Internet resources can be improved by controlled tagging, which applies a consistent namespace and consistent tag combinations to diverse objects. We have used both tagging approaches to gather Internet resources pertaining to air quality events. Initial event analysis of the southern Georgia fires, which burned in April and May 2007, began with filtering and harvesting user-contributed web content. A Google Blog Search for 'Florida smoke' returned several thousand entries, many of them unrelated to the wildfires. Visually scanning the blog entries yielded a number of interesting posts, which were given the controlled tags '070508+Florida+Smoke' in the social bookmarking tool del.icio.us. Additional smoke photos were found in the photo-sharing service Flickr and given the same set of controlled tags. Together, these tools yielded a rich but only qualitative description of the Georgia fires. Because of the common set of controlled tags, these web objects (i.e. links and photos) were harvested in a wiki environment, which also contained links to quantitative air quality analyses based on satellite and surface observations.

How do you use tagging to impose structure on a wiki? Semantic wiki examples of controlled tagging: data spaces, data systems, and DataFed views (datasets, date, bbox, event tags).

Poster Content

Problem

  • Earth Science information is dynamic and distributed.
  • At any stage in an analysis there are multiple distributed providers and users.
  • Earth Science metadata has been separated from the data and doesn’t have a standard form.
  • What is metadata to one person is data to another.
  • A way is needed to connect people at the "humanware" level so they can do collaborative analysis with known metadata.

Tags

A tag is simply a word you use to describe a bookmark. Unlike folders, you make up tags when you need them and you can use as many as you like. The result is a better way to organize your bookmarks and a great way to discover interesting things on the Web.

Use tags for filtering, searching and future navigation.

Using the ‘crowd’ to add tags is new.

Allow anyone, especially consumers, to add tags, not just authorities.

Tagging is neither exclusive nor hierarchical. Tags can be used to identify who or what an object is, what type of object it is (article, dataset, blog, …), and who owns it. Other tags refine these and can’t be used independently, such as qualifiers (funny) or task organization (to read, to do, finished).

Currently, all kinds of web services are popping up that allow you to create and tag particular types of web object: Flickr for pictures, YouTube for videos, Del.icio.us for links, blogs for personal accounts, and wikis for collaborative content generation.

Flickr, YouTube and Del.icio.us each allow the tagging of only one type of content. While you benefit from crowd tagging, you can’t change someone else’s tags; you can only add your own.

Tagging

  • The topology of the Internet (its shape as a network) is normally defined through the hard links among web pages. Those outgoing links create the network topology. According to Barabási, the distribution of links on the Internet follows a power law.
  • Another way of structuring and organizing the web content is through tagging. Tags on web objects allow "characterizing" or "categorizing" the content.
  • Unlike hard links, tagging doesn't create or describe parent-child relationships, simply groupings of similar characteristics.
  • Creating category-based structures is much easier than creating hard-linked, hierarchical structures. Tagged structures are semantically less descriptive, but they allow arbitrary connections whose meaning only humans know.
  • Tag-based connectivity is also more flexible because it is normally created by third parties, who can see connections that neither node is aware of.
  • A larger fraction of web content is now user-generated. It is created in the "cloud" or shared through social media applications (Blogger, Flickr, YouTube...), which automatically tag it with user, date, content type and possibly location, in addition to user and community tags.
  • Both catalogs and tagging aim to help people find content. Tagging is more likely to succeed as a finding mechanism than registration in catalogs, since tagging is built into social media.
  • User-created content is automatically tagged with its creator (name+), and each tag is also linked to the tagger (name+), which connects people to people.
  • Tagging takes place in multiple venues, but it is possible and desirable to aggregate web objects (URIs) based on their tags. Another option is to aggregate by user name for a particular user.
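
The aggregation described in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (URI, source service, user, tags), the function names and the example URLs are all hypothetical and do not come from any of the services mentioned.

```python
from collections import defaultdict

# Hypothetical records harvested from several venues:
# (URI, source service, user name, set of tags). All values are invented.
BOOKMARKS = [
    ("http://example.org/photo/1", "flickr", "alice", {"0704_GASmoke", "photo"}),
    ("http://example.org/blog/7", "blog", "bob", {"0704_GASmoke", "blog"}),
    ("http://example.org/video/3", "youtube", "alice", {"dust", "video"}),
]


def aggregate_by_tag(records):
    """Map each tag to the sorted URIs that carry it, regardless of venue."""
    index = defaultdict(set)
    for uri, _source, _user, tags in records:
        for tag in tags:
            index[tag].add(uri)
    return {tag: sorted(uris) for tag, uris in index.items()}


def aggregate_by_user(records):
    """Map each user name to the sorted URIs that user contributed."""
    index = defaultdict(set)
    for uri, _source, user, _tags in records:
        index[user].add(uri)
    return {user: sorted(uris) for user, uris in index.items()}
```

Because the index is keyed on the tag alone, objects from different services end up in the same group as long as they share a tag.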

Wiki

Wiki pages are objects which originally were only used for collaborative writing. They have evolved into a workspace where all types of web content can be pulled together and given a context in the wiki.

Wiki pages don’t have any inherent structure; it is up to the community to implement one. Sometimes this is a standard hierarchy, but more often the structuring is done through tagging so that pages can be reused. One key difference is that, unlike in Del.icio.us or Flickr, a wiki page has only one set of tags, and the community decides which tags to add or remove because everyone can edit the page. The structure comes from having standard types of pages, each with a fixed set of tags. This is maintained by using a wiki page template for each page type that already contains the required set of tags. Beyond that, the user can add other tags so that the wiki page can also be found from other directions.

Over the last two years we have developed several classes of pages: Reports/Presentations, Dataset Dataspaces, Air Pollution Event Spaces, DataFed Development Events, and Tasks. As other projects come up, we set up workspaces, each designated as the ‘front page’ for a given project and providing context for those working on it. With tags, a page tagged as a report or presentation isn’t excluded from also being about a particular dataset or air pollution event, so it can be reused in all of these places. For those looking for information, this ‘pivot’ is a benefit of tagging.
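
A minimal sketch of the two ideas above, assuming a page is represented as a class name plus a dictionary of tags. The required tag sets are modeled on fields mentioned in this report (title, date and location for reports; data type, bbox, time range and platform for datasets), but the names `REQUIRED_TAGS`, `missing_tags` and `pivot`, and the example pages, are hypothetical and are not part of the wiki software.

```python
# Hypothetical page classes and the tags their templates require.
REQUIRED_TAGS = {
    "Report": {"title", "date", "location"},
    "Dataset": {"data_type", "bbox", "time_range", "platform"},
}

# Invented example pages; each tag maps to a list of values.
PAGES = {
    "ExampleReport": {
        "class": "Report",
        "tags": {
            "title": ["Example"],
            "date": ["2008/4/14"],
            "location": ["Vienna, AT"],
            "event": ["0704_GeorgiaSmoke"],
        },
    },
    "ExampleDataset": {
        "class": "Dataset",
        "tags": {"data_type": ["grid"], "event": ["0704_GeorgiaSmoke"]},
    },
}


def missing_tags(page_class, tags):
    """Return the required tag names a page of this class still lacks."""
    return sorted(REQUIRED_TAGS[page_class] - set(tags))


def pivot(pages, tag, value):
    """Return names of pages, of any class, whose `tag` includes `value`."""
    return sorted(
        name for name, page in pages.items()
        if value in page["tags"].get(tag, ())
    )
```

Here `pivot(PAGES, "event", "0704_GeorgiaSmoke")` returns both the report and the dataset page, which is the kind of reuse across page classes that the tags make possible.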

All of these objects have a URL, so they can be shared across platforms by tagging them in Del.icio.us. In Del.icio.us we have used groups of tags: one tag to specify what the object is for, and a type tag to filter the different types of objects collected for one particular instance.

Dataspaces

We used wiki pages to add metadata for air quality datasets registered in DataFed, with semantic controlled tags for the fixed metadata. This allows querying and filtering based on data type, bbox, time range, platform…

This catalog-style registration tagged the DataSpace dataset pages in a way that allows them to be queried. With Semantic MediaWiki, each page has an RDF feed, which can be used to integrate the dataspace page into other catalogs and to transform it into a snippet that is embedded when KML is created for the dataset.
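
The kind of query that fixed metadata tags make possible can be sketched as follows. This is not the Semantic MediaWiki query interface; it is a plain Python illustration in which the `DatasetPage` fields, the function names and the two example datasets are assumptions made for the sketch. The bounding-box test ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetPage:
    """Fixed metadata tags of one hypothetical dataset page."""
    name: str
    data_type: str
    bbox: tuple  # (west, south, east, north) in degrees
    start: date
    end: date


def boxes_overlap(a, b):
    """True if two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(pages, data_type=None, bbox=None, start=None, end=None):
    """Return names of pages matching every filter that is not None."""
    hits = []
    for page in pages:
        if data_type is not None and page.data_type != data_type:
            continue
        if bbox is not None and not boxes_overlap(page.bbox, bbox):
            continue
        if start is not None and page.end < start:
            continue
        if end is not None and page.start > end:
            continue
        hits.append(page.name)
    return hits


# Invented example entries, for illustration only.
DATASETS = [
    DatasetPage("SurfaceExample", "point", (-125, 24, -66, 50),
                date(2000, 1, 1), date(2008, 1, 1)),
    DatasetPage("SatelliteExample", "grid", (-180, -90, 180, 90),
                date(2005, 1, 1), date(2007, 12, 31)),
]
```

For example, `query(DATASETS, bbox=(-85, 30, -80, 32), start=date(2007, 4, 1), end=date(2007, 5, 31))` selects the datasets that cover a box over the southeastern United States during April and May 2007.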

Other web content, such as papers about the dataset, DataFed data views created with the dataset, and websites, can also be embedded in the wiki page. Finally, the dataset wiki page can be enhanced with discussion, feedback and FAQs about the dataset.

This system has worked better for us because the wiki is much easier to edit than the

Event Spaces

Air pollution events are inherently noticeable because of the intensity of the short-term emissions and the unusual impacts they have on the atmospheric environment. The recent proliferation of continuously recording webcams, individual digital photographs and home videos, as well as personal blog reports, now constitutes a significant new information source. Most of these observations are placed on the Internet almost immediately and shared through Internet-based repositories such as YouTube for videos, Flickr for images and blogs for personal accounts. Given the high density and short response time of these sensors to exceptional events, it is said that the Earth has now acquired a "skin" for detecting changes in the environment.

For these events we create a unique event tag of the form Date_LocationType (e.g. 0704_GeorgiaSmoke). Through automatic searches in Flickr, YouTube and blog search engines we find content that is applicable to the 0704 Georgia Smoke event. In Del.icio.us we tag Flickr photos, YouTube videos and blogs with the unique tag (0704_GASmoke) and a type tag (Blog, video, …) so that they feed through RSS into the wiki. The reason for the unique tag is that anyone can then add content to the community-curated lists for the event.
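
As a rough sketch, the event tag convention and the grouping of harvested items by type tag could look like this. The YYMM reading of the date part is an assumption inferred from the 0704 example (April 2007), and the function names, item layout and URLs are hypothetical.

```python
from collections import defaultdict
from datetime import date


def make_event_tag(when, location, event_type):
    """Build a Date_LocationType tag; the YYMM date part is an assumption."""
    return f"{when:%y%m}_{location}{event_type}"


def harvest(items, event_tag):
    """Group the URLs that carry `event_tag` by their remaining (type) tags."""
    by_type = defaultdict(list)
    for url, tags in items:
        if event_tag in tags:
            for type_tag in sorted(tags - {event_tag}):
                by_type[type_tag].append(url)
    return dict(by_type)


# Invented bookmarked items: (URL, set of tags).
ITEMS = [
    ("http://example.org/photo/1", {"0704_GeorgiaSmoke", "photo"}),
    ("http://example.org/post/2", {"0704_GeorgiaSmoke", "Blog"}),
    ("http://example.org/post/9", {"Blog"}),
]
```

Items without the event tag, such as the last entry, are left out, so each type tag yields one list per event that could then be fed into the wiki.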

In the wiki we also tag the dataspace dataset pages with the event tag, so that a list of relevant datasets can be queried for each event.

DataSystems

Data System Spaces were wiki pages developed with metadata about a data system as a way to compare different air quality data systems. The profile was developed based on the agenda of the EPA Data Summit. Semantic tagging of fields allowed the dataset information to be reused in multiple ways based on a particular session. Data system profiles are on the wiki of ESIP, a multi-agency federation. This could be a place where data system providers feed updates to users about changes in the system and where instructions on how to use the data system are harvested.

As with the lower-level data that is slowly becoming standardized and shared, the first step is opening the system up and, with no effort from the user, capturing what is already being done. If documents about a particular dataset exist, they can be harvested and tagged by the community for reuse in the wiki.


http://datafedwiki.wustl.edu/index.php/2006-12-11_AGU%2C_San_Fransisco
http://datafedwiki.wustl.edu/index.php/2006-12-06_UA_Huntsville_Seminar
