Big Data DWG
Herring, John (Oracle USA)
Baumann, Peter (Jacobs University Bremen GmbH)
Heazel, Charles (WISC)
The purpose of the OGC Big Data DWG (BigData.DWG) is to provide an open forum for work on Big Data interoperability, access, and especially analytics. To this end, the open forum will encourage collaborative development among participants representing many organizations and communities, and will ensure appropriate liaisons to other Big Data relevant working groups, both inside and outside OGC.
The group will consolidate its findings on a public wiki to inform OGC and the wider public, and to allow for feedback during and after the editing phase. The final report will be submitted to OGC for publication as a Best Practice paper.
“Big Data” is an umbrella term coined by Doug Laney and IBM several years ago to denote data with the following characteristics, summarized as the four Vs:
· Volume – the sheer size of “data at rest”
· Velocity – the speed of new data arriving (“data in motion”)
· Variety – the many different formats, structures, and data types
· Veracity – trustworthiness and issues of provenance
Since then, several additional Vs have been suggested, including value, verisimilitude, and visualization. It seems generally accepted that a core challenge is performing rapid, flexible analytics on Big Data.
Major efforts on Big Data visualization, analytics, and tools are under way worldwide, involving and affecting science, industry, government, and citizens alike. Extensive research and development is in progress, mobilizing both financial and human resources.
The OGC needs to make statements and provide guidance on the use of OGC standards – in particular, as location-based and geo data applications are major contributors to the Big Data deluge. Further, with the advent of increased machine-machine communication, interoperability is gaining even more importance. OGC, therefore, should establish a position addressing all levels, including – but not limited to – science, implementation, market value, and societal effects.
3.1 Charter Members
The initial membership of the BigData.DWG will consist of the following members and individuals, who have extensive education and experience in Big Data issues:
· Peter Baumann, Jacobs University (co-chair)
· John Herring, Oracle (co-chair)
· Juergen Seib, Deutscher Wetterdienst
· Stan Tillman, Intergraph
· Marie-Francoise Voidrot, Meteo France
· Jeff de la Beaujardiere, NOAA
· Bruce Gritton, US Navy MetOc
· Chuck Heazel, WISC (co-chair)
· Mike McCann, MBARI
· Pedro Goncalves, Terradue
· Don Sullivan, NASA
· Ed Parsons, Google
· Robert Gibb, Landcare Research New Zealand
· Jean Brodeur, Geoconnections, NRCAN
· Jinsongdi Yu, Fuzhou University
· Arnaud Cauchy, Airbus Defence & Space
3.2 Key Activities
The following activities of the BigData.DWG have been identified initially:
· Establish a working communication infrastructure, including a public wiki.
· Meet regularly at TC meetings and through telecons.
· Establish liaisons with relevant OGC WGs, such as WCS.SWG, and maintain exchange.
· Establish liaisons with relevant OGC-external entities, such as RDA (Research Data Alliance), US NIST, ISO TC211 and ISO JTC1/SC32, and maintain exchange.
· Foster an agile, member-driven agenda of topics and facilitate information sharing and consolidation.
· Proactively publish discussion and findings through wiki and other appropriate channels.
The WG will identify additional activities as it sees fit.
3.3 Business Case
Big Data issues are seen as a main challenge in research, industry, and government, but also as an opportunity for new business and improved governance. Citing Wikipedia:
"Big data" has increased the demand of information management specialists in that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms only specializing in data management and analytics. In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.
Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class which means more and more people who gain money will become more literate which in turn leads to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007, and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.
In the realm of geo data, among the core contributors to the data deluge are spatio-temporal sensor, image, simulation, and statistics data.
· With sensors becoming ubiquitous, the systematic collection of their data output is currently generating substantial new markets; as an example, the availability of GPS data has stimulated a multi-billion US$ industry. The petroleum industry today has “more bytes than barrels”. Getting to grips with these (Peta)bytes is essential for discovering new resources and exploiting them.
· Remote sensing imagery is utilized in more and more research, business, and societal applications; for example, NASA EOSDIS yields 5 TB per day, and in the course of the ngEO initiative 10^12 satellite images are planned to be held under ESA custody.
· Climate Modeling and Numerical Weather Prediction data volumes are expected to grow according to Moore’s Law for the foreseeable future. The main constraint is likely to be the cost of electricity to run the high performance computers required to process the data.
Ensemble modeling and reanalysis in climate research multiply the data volumes generated. A study by CCLRC (Council for the Central Laboratory of the Research Councils) in the UK observed that scientists download 10x more data than they actually need for their research because of insufficient server-side search and extraction capabilities.
· In statistics, multi-dimensional data cubes are a common scheme for analyzing complex correlations. With the availability of significantly increased ground truth data, analytics has crossed today’s main memory limits and calls for scalable evaluation methods.
Beyond such data categories, however, there is a huge market for, and interest in, Big Data issues in areas such as social networks and business intelligence. While not location-based per se, such data very often have a spatio-temporal aspect, too, and hence are relevant for OGC.
3.4 BigData.DWG Business Goals
The BigData.DWG will specifically focus on spatio-temporal data, in line with OGC’s mission. With the same inaccuracy as the term “Big Data” itself, we give them the working title “Big Earth Data” for now. The guiding questions are:
· What does Big Earth Data mean in an OGC context? What characterizes them? What new standards do we need?
· What are the challenges, if any, of Big Earth Data for OGC’s data and service interface specifications?
· What is the market value of Big Earth Data, and how can OGC support leveraging it?
· What alliances can be established by OGC (both profit and non-profit)?
Thus, the BigData.DWG will aim to clarify foundational terminology in the context of data analytics, including the differences and overlaps with terms such as data analysis and data mining. Further, it will establish a systematic classification of analysis algorithms, analytics tools, data and resource characteristics, and scientific queries.