Last week at the SAP Sapphire conference, I was able to get some more insight into what SAP is doing around Big Data. They have been working closely with Intel to integrate Intel's Hadoop distribution with SAP HANA. If you are an SAP enterprise customer - and there are a lot of SAP enterprise customers - this could be a very powerful Big Data story going forward.
Basically, what SAP is doing is allowing you to use HANA as the hub for all your data. You can then use SAP Data Services to ETL your Hadoop data into HANA. Even more powerful is the fact that you can leave your data in Hadoop, use HANA to front it all, and so have the full power of the SAP BI Suite against all your data, whether it is in HANA or in Hadoop. Since HANA fronts all the data, the SAP BI Suite only needs to integrate with HANA. When HANA gets a query for data that exists in your Hadoop cluster, it will directly issue a query to Intel's Hive ODBC driver, which will then push the query into the Hadoop cluster. All this is done from SAP HANA Studio. This looks similar to what Microsoft does with PolyBase but seems more geared towards the enterprise.
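To make the flow concrete, here is a toy sketch of the federation pattern described above: a hub that answers queries from its own in-memory tables when it can, and pushes them down to a Hive/Hadoop endpoint otherwise. All class, table, and label names are invented for illustration; this is not SAP's implementation.

```python
# Toy sketch of the federation idea: a single "hub" accepts queries,
# answers from its in-memory store when it can, and pushes the query
# down to a remote Hive/Hadoop endpoint otherwise. Everything here is
# hypothetical -- an illustration, not SAP's actual architecture.

class FederatingHub:
    def __init__(self, local_tables, remote_tables):
        self.local_tables = local_tables      # tables held in-memory (the "HANA" side)
        self.remote_tables = remote_tables    # tables living in Hadoop, reached via ODBC

    def route(self, table):
        """Decide where a query against `table` should execute."""
        if table in self.local_tables:
            return "in-memory"
        if table in self.remote_tables:
            return "pushdown-to-hive"
        raise KeyError(f"unknown table: {table}")

hub = FederatingHub(local_tables={"sales"}, remote_tables={"clickstream"})
print(hub.route("sales"))        # in-memory
print(hub.route("clickstream"))  # pushdown-to-hive
```

The point the sketch makes is the one from the paragraph above: the BI tool only ever talks to the hub, and the hub decides where the work happens.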
You can read more here:
Simba has been a leader in OLAP and MDX data connectivity for the last decade and a half. Most of the OLAP providers out there that support the MDX query language have been built using the SimbaProvider SDK - these include SAP BW, SAP HANA, Oracle, Teradata, Kognitio, and Exasol.
One question that often comes up is: what BI or analytics tool can I use that supports multiple OLAP data sources? First up is always Microsoft Excel PivotTables. Many analysts know how to use Excel PivotTables, and if the OLAP data source supports the OLE DB for OLAP (ODBO) protocol and the MDX query language, then this is a no-brainer.
Since all of the major BI vendors have been bought up by larger companies, it has always been a challenge to find a serious BI tool that supports lots of different OLAP data sources. Most recently, Tableau has been an up-and-comer, but its OLAP support is still largely limited to Microsoft, SAP, and Teradata. Last week at the SAP Sapphire conference, I attended a session entitled "What's new with Analysis Office and OLAP", which covered the new features in the SAP BusinessObjects Analysis for Office and Analysis for OLAP products. Two slides there addressed OLAP connectivity: one covered all the different OLAP data sources that SAP supports, and another showed that you can have simultaneous connections to multiple data providers. Also, unlike Excel, the SAP BusinessObjects Analysis product is part of a complete BI Suite and so is that much more powerful in any business enterprise. The different OLAP back ends that SAP supports are:
- SAP NetWeaver BW
- SAP HANA
- Microsoft SSAS
- Oracle OLAP and Oracle Essbase
- SAP EPM providers: SAP BusinessObjects Planning and Consolidation, SAP BusinessObjects Profitability and Cost Management, SAP BusinessObjects Financial Consolidation
Posted by Amyn Rajan on May 22, 2013 in Business Intelligence, Data Access, Data Analytics, Data Warehouses, Excel Pivot Tables, Interoperability, MDX Query Language, Multi-Dimensional Data Connectivity, ODBO, OLAP, Oracle Exadata, Oracle Exalytics, SAP BW, SAP HANA, XML for Analysis, XMLA
Google IO has come and gone. It was a drink-from-the-fountain experience for this IO newbie. But apparently, day one's 3.5-hour keynote was similarly challenging for the seasoned. (Both Engadget and The Register have great live-blogs of the keynote if you want to relive the marathon at your own pace. Engadget's even includes pictures!)
BigQuery fits into IO's "Google Cloud Platform" track.
As usual, everything from the 2013 IO is available via the Google Developer channel.
For your convenience, here is my #IO13 playlist of the useful BigQuery-related sessions I found. I may have missed a session given the amount of material presented, so do let me know if you saw something worth compiling here.
An interesting post by Alex Williams at TechCrunch from this past March, "Oracle Is Bleeding At The Hands Of Database Rivals", is worth a quick read.
Most of the new Big Data vendors, like DataStax, use ODBC drivers from Simba. The reason they use our drivers is that, besides being fully standards compliant (in fact, Simba co-developed the ODBC specification with Microsoft back in the day), Simba's ODBC drivers include our SQL Connector technology, which allows us to bridge between SQL and dialects like HiveQL. Therefore, if you are using a product like Tableau, SAP Lumira (formerly Visual Intelligence), Excel, or Crystal Reports, everything just works.
So, why is Alex's article of interest? Well, we have so many partners in the Big Data space, that I am always curious how they are doing against the traditional database vendors like Oracle and Teradata. Alex's article helps give some insight into how seriously enterprises are looking at the new Big Data vendors like DataStax, Cloudera, Google, HortonWorks, MapR, Splunk, etc.
Yesterday, Teradata announced Teradata Intelligent Memory - part of the Teradata Unified Data Architecture. I previously wrote that Teradata found the following when studying data usage:
1. 43% of I/O uses 1% of cylinders
2. 85% of I/O uses 10% of cylinders
3. 94% of I/O uses 20% of cylinders
Teradata has now extended their analysis and their technology to determine what data should be in-memory, and so they are optimizing the memory of modern database systems. Unlike others who want to put everything in-memory, Teradata is using smart algorithms to determine what should be in-memory, winning the price/performance trade-off: a significantly lower price, because not everything has to be in memory, at nearly the same performance. This is big news because as data volumes grow, cost becomes an issue, and Teradata is able to keep costs down while maximizing performance. You can read more about it here: Teradata Intelligent Memory.
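As a rough illustration of the idea (not Teradata's actual algorithm), the skewed I/O numbers above suggest a simple policy: rank data blocks by access frequency and keep only the hottest fraction in memory. The block names and counts below are made up.

```python
# Minimal sketch of the "intelligent memory" idea: rank blocks by access
# frequency and keep only the hottest fraction in RAM. The skew measured
# above (e.g. 85% of I/O hitting 10% of cylinders) is what makes this pay
# off. Names and numbers are illustrative only.

def pick_hot_blocks(access_counts, memory_fraction):
    """Return the block ids that fit in the given fraction of memory,
    hottest first. access_counts: {block_id: touch_count}."""
    budget = max(1, int(len(access_counts) * memory_fraction))
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return set(ranked[:budget])

# Skewed workload: block "a" is touched far more than the rest.
counts = {"a": 850, "b": 90, "c": 40, "d": 15, "e": 5}
hot = pick_hot_blocks(counts, memory_fraction=0.2)  # keep the hottest 20%
print(hot)  # {'a'}
```

With a workload this skewed, holding just 20% of the blocks in memory still serves the overwhelming majority of I/O, which is exactly the price/performance argument above.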
This will be the first IO I'm attending live. Beneath the froth of devices both mainstream and otherwise (glasses!), there is a track for Google's cloud, which includes Google Compute Engine and Google BigQuery. There's also a G+ community that has been helpful for orienting yourself to this mega-event.
There is a handful of sessions on BigQuery which I'll report back on. Watch for it.
The Hive User Group held its semi-regular meeting last week at Hortonworks' office.
This was a jam-packed evening, as the Stinger initiative has started to deliver some of its first fruits. Gunther's session on Hive 0.11 was the longest talk of the night and generated some of the liveliest discussion in the room.
A Webex recording of this meeting has been made available by Hortonworks. I offer the time marks as a public service for anyone wanting to review it:
With the increasing use of self-service BI tools, data exploration tools, and data discovery tools, end users have more options to analyze data coming directly from any data source. The other day I was asked why people should choose MDX (and constructing a cube) for analysis when there are so many other options. Here are some scenarios where MDX is advantageous, especially in Excel but also in other tools. This is a good starting point for evaluating MDX as an analytic solution.
The comparison below contrasts the two routes: OLAP connectivity via MDX (over ODBO/XMLA, against a cube) versus direct relational access via ODBC/JDBC (usually SQL).

- Large data volumes
  - MDX: Data stays on the server and the server calculates the summary values.
  - SQL: In Excel, the data is imported and stored in the workbook. Excel does the summary calculations, at a performance expense.
- Ease of use for BI client users
  - MDX: Point-and-click interface to connect to the model, get summary values, and slice and dice data.
  - SQL: End users must know SQL very well to join tables and perform anything other than basic summary analysis.
- Reusable data model
  - MDX: One data model can be built and used with many different BI clients (Excel, Tableau, SAP BusinessObjects, Cognos, MicroStrategy, etc.). MDX calculations in the model can be reused.
  - SQL: The calculation logic must be repeated in every BI client connecting to the server.
- Analytic capabilities
  - MDX: The language has shortcuts for common analytic capabilities such as time-based calculations. Top 10 and other sets can be shown automatically. You can drill to details after viewing summary information.
  - SQL: Optimized for table and row operations, not for analytics. Calculations that are easy in MDX can be hard in SQL.
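For a feel of the "Top 10 shown automatically" point, here is a rough Python analogue of what an OLAP server does for MDX's TopCount: rank dimension members by a measure on the server, so the client never downloads the full fact table. The data is made up for illustration.

```python
# Rough analogue of MDX's TopCount(set, n, measure): the OLAP server
# ranks members of a dimension by a measure and returns only the top n,
# so summary calculation stays on the server. Data is illustrative.

def top_count(members, n, measure):
    """Return the n members with the largest measure value."""
    return sorted(members, key=measure.get, reverse=True)[:n]

revenue = {"East": 120, "West": 300, "North": 75, "South": 210}
print(top_count(revenue.keys(), 2, revenue))  # ['West', 'South']
```

In the SQL route, the equivalent requires the user to write an ORDER BY ... LIMIT aggregate themselves in every client, which is the reuse point made above.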
Hadoop is often taken to be synonymous with Big Data: the solution to an organization's decision-making ills, the cure for cancer, the Caramilk secret, and more. But this is, of course, not the case. Hadoop merely provides a foundation for working with much of the data that was previously unreachable or unusable.
Similarly, the term data science is much bandied about.
So I attended the inaugural talk at Data Science Vancouver by Don Turnbull about this new discipline of data science. The talk wasn't recorded, but you can follow Don's exposition through his slides. My 5-second summary is that data science is essentially a renaissance of science that pairs new techniques with (dare I say it) old-school scientific method. As Don puts it, it is science, after all.
Don concluded with a great reading list for someone either getting into this field or wanting/needing to deepen their understanding of it. For your convenience, I've reassembled the list here with Amazon links. (I didn't know that Isaac Asimov's Foundation is not available in ebook format. The other books in the series are... just not the first. Weird.)
Knowledge Discovery and Data Mining (KDD)
WWGD. "What Would Google Do?" I've seen this refrain enough in the last few weeks, but I didn't know that there's a book by that name. Fittingly, a couple of clicks in Google lead you to a nice summary of the book.
But today, I want to talk about Dremel and the post-MapReduce world. To put things in perspective, the following diagram (full disclosure: I grabbed it from Doug Cutting's October Strata NYC keynote) shows the lineage of some of today's Big Data technologies.
Everyone's likely familiar with the first three rows. But Dremel, F1, and Spanner are new on the scene.
Let's consider Dremel first. (Refer here for Google's whitepaper.) Dremel is used extensively within Google. What is notable is that Google decided to commercialize it for external use via its BigQuery service. In typical Google fashion, it has been steadily firming up over the last couple of years. Google just announced several important new features (including big JOIN) for BigQuery last week. Previously, JOINs were only supported if all but one of the tables were "small" (small being on the order of 8MB compressed). The new big JOIN feature lifts this restriction. And just in time, Tableau 8.0, which includes BigQuery as a first-class data source, just shipped. BigQuery is hitting its stride.
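To make the lifted restriction concrete, here is the old rule sketched as a simple decision. The ~8MB compressed threshold and the legacy-SQL JOIN EACH keyword are as reported at the time; treat this as an illustration rather than a statement about current BigQuery behavior.

```python
# Sketch of the pre-"big JOIN" restriction in legacy BigQuery SQL:
# a plain JOIN required the right-hand table to be small (on the order
# of 8MB compressed); the new big JOIN (written JOIN EACH in legacy
# SQL) lifted that cap. Threshold and syntax as reported circa 2013.

SMALL_TABLE_LIMIT_BYTES = 8 * 1024 * 1024  # ~8MB compressed

def join_keyword(right_table_compressed_bytes):
    """Pick the legacy-SQL join syntax based on right-table size."""
    if right_table_compressed_bytes <= SMALL_TABLE_LIMIT_BYTES:
        return "JOIN"        # fits the old small-table broadcast join
    return "JOIN EACH"       # the new shuffle-based big JOIN

print(join_keyword(1 * 1024 * 1024))    # JOIN
print(join_keyword(500 * 1024 * 1024))  # JOIN EACH
```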
So what is the open source community's response to Dremel? In the diagram above, Impala is highlighted in red because the diagram is from Cloudera, which led the Impala effort. But if you cast your eyes wider, there's another project--Apache Drill--that purports to carry the flame of Google Dremel:
In fact, there's more... there's also OpenDremel. The good news is that OpenDremel and Drill decided to join forces a little while back. So now there's only one project--Drill--which should simplify things going forward.
Drill is making progress. In fact, there was a design review just last week of its SQL parser. If you want a summary of Drill's vision and architecture, Ted Dunning has been touring and making the case. The exciting point I see for Drill is that it anticipates many areas of extensibility, including query languages and file formats.
One vision that I foresee is implementing a language that is well established for analytics such as MDX on Drill. To give you a flavour for what this could look like, Simba has implemented an MDX/SQL translation on top of Impala to demonstrate the viability and performance of such a system. In the case of Drill, we can potentially do one better with tighter integration and, hopefully, better performance and functionality.
If you're sharp, you'll note that there are two other Google projects--F1 and Spanner--that as yet have no open source equivalents. Based on material I've seen to date (including an excellent presentation in February at Strata Santa Clara), the hallmark of Spanner and F1 is Paxos/TrueTime--a radically new approach to achieving consistency using atomic clocks. Despite what Cloudera's slide above shows, nothing I have seen of Impala to date contains anything along that line. It's safe to say that F1 and Spanner are another step in the evolution of the database that Google has up its sleeve.
Microsoft and Hortonworks have announced Hadoop on Windows, and Mark Smith at Ventana Research wrote a nice blog post giving an overview of Hadoop on Windows. Some key points:
1. Microsoft HDInsight Server and Windows Azure HDInsight Service are built on the Hortonworks Data Platform and make Hadoop readily available on Microsoft Windows. This strategic alliance helps IT organizations bring the power of Hadoop to Microsoft platforms. The Beta is now available.
2. Part of Hortonworks' approach is that new advancements are contributed to the open source community, where other developers and organizations can contribute and help, or use the work once it is finalized and ready for distribution. This is a much different approach from others in the market who take Hadoop from the community, make proprietary extensions to it or embed it in their software, and sell licenses to customers.
3. Hortonworks worked with Simba to provide a Hive ODBC connector that supports SQL-92 access from business intelligence tools. Microsoft Excel can now easily access Hadoop directly through SQL, which opens Hadoop up to a large number of organizations.
On February 5 we delivered our most popular webinar series to date: Store once, read anywhere - using Oracle OLAP data in multiple BI clients. The videos are now available here:
One of the highlights of the webinar series is when Ed Seibold at Rovi Corporation talks about his experience with Simba’s MDX Provider for Oracle OLAP product and how it has changed the way Rovi works in BI. Launch the Excel video and scroll to about 12:30 to view his segment of the broadcast. Also, Deepa Sankar from SAP joined us to talk about SAP’s investment in vendor-agnostic, platform-independent BI solutions. Launch the SAP video and scroll to about 14:30 to view Deepa’s segment.
I have to thank Dan Vlamis and the team at Vlamis Software Solutions for helping us deliver such great content.
Day two of Strata Santa Clara turned out to be a day of sober second thought. Actually, there were signs of that on day 1 with Tim’s session on “why [we] can’t escape SQL”. But Kate Crawford’s keynote in the morning, “Algorithmic Illusions: Hidden Biases of Big Data”, was another articulate counterpoint to the boundless optimism surrounding Big Data. I am not a curmudgeon, and indeed I applaud O’Reilly’s bravery in considering and running these sessions. When there is finally some dissenting voice about the state of Hadoop, one can truly see that Hadoop has grown up.
The highlight for me has to be Google’s presentation on F1. I see F1 as a reasoned response to the BigTable-inspired technologies that are in bloom today. F1 is a *gasp* SQL RDBMS with ACID properties which nonetheless scales globally (literally). Even better, the slides are up already so have a look to see the future.
I provided a small-scale defense of why SQL and query languages matter at Intel’s booth in the mid-afternoon. I’ll share that with you once Intel’s post-production is complete. To summarize: the trends that I outlined yesterday regarding SQL hint at the future. Both Drill and Stinger/Tez have allowances for new query languages beyond SQL (MDX?). I believe this trend toward extensibility shows a necessary maturing of the technology. Substituting query languages with APIs works in limited cases; for integration and continuity with existing tools and constituencies, a query language, at least SQL, is necessary.
Day 1 of Strata Santa Clara continued the frenetic pace of day 0. In light of Stinger and Hawq, I was particularly keen to see the SQLization trend play out. The day started with Rajat Taneja, CTO of EA, sharing how Hadoop and Big Data have changed how EA approaches its entire portfolio of games. He summarized it as follows:
Right after, EMC’s Scott Yara stepped up, not to crow about Monday’s Pivotal HD, but to recognize influential “data people”. But it was the third keynote that I thought was bold and ambitious: Stitch Fix pairs machine algorithms with human curation to deliver (literally, shipped to a customer/subscriber) apparel recommendations. The theme of the keynote was that people matter.
I began the day’s sessions with Berkeley’s Spark/Shark, which I had first seen at last year’s Hadoop Summit. Judging from the attendance, the highest I saw all day, the demand and interest for interactive/low-latency SQL is huge.
I followed up with Tomer Shiran’s introduction to Apache Drill. Tomer, Jacques, and the team (now comprising 6-7 companies) are making progress and are looking at a Q2 alpha and Q3 beta. Drill is of great interest to me because of its extensible front- and back-end architecture that allows for additional query languages and operators. The questions from the audience suggest that SQL on a scaled-out architecture is definitely top of mind.
EMC’s session on their Pivotal HD distribution, right after lunch, was no snoozer. The benchmark numbers they shared comparing Hawq with Hive and Hawq with Impala are amazing: in the former, accelerations of 19X to 648X; in the latter, still 9X to 69X. Of course, this isn’t too surprising given that they are leveraging Greenplum’s well-developed engine and optimizer. Hawq is a game-changer for the “SQL on HDFS” market because of EMC’s position in the storage market and the strength of Hawq itself.
Hortonworks’ Alan Gates followed with a survey of the various tools in the Hadoop ecosystem. He touched lightly on the new Tez project (pronounced “taze”). More importantly, I also got some longstanding misunderstandings about HBase and HCatalog cleared up. Thanks, Alan!
Finally, it was the session that I had been waiting for all day: Tim O’Brien’s “The Future of Relational (or Why You Can’t Escape SQL)”. Tim did a breathless recap of the last 40 or so years of database technology: starting with the pre-SQL world, moving on to the early SQL market, the mature SQL market, the rise of NoSQL in the mid-2000s, today’s diverse Big Data technologies, and what I’ll offer as the post-Big-Data NewSQL world of Impala/Drill/Spanner. From his experience as a database developer, he offered several choice statements about the need/motivation of many a team who chose NoSQL. (You can refer to The Register’s article for those…) But the core point he offers is that SQL and relational are not joined at the hip. Relational for data modeling may be out, but SQL remains applicable and valuable: expressiveness, a corps of practitioners with a vast body of knowledge, and widely available and deployed tools. A final, curious reason he offered is suggestively summarized as “What Would Google Do”: Google itself has come full circle back to SQL with Spanner after inventing BigTable/MapReduce.
I ended the day with Citus Data’s Carl Steinbach. Citus took Postgres and tacked on an implementation of SQL/MED that they have prosaically named “foreign data wrapper”. Carl, being a Hive guy, described a foreign data wrapper as a Hive SerDe with extensions for predicate push-down and metrics. Works for me. In the end, it delivers huge improvement even on non-HDFS data.
Summarizing day 1, the themes surrounding SQL are:
The run-up to any conference can be rather feverish. This year's Strata is turning out to be one such.
Hortonworks finally announced their Stinger initiative last Friday (after Sanjay had leaked it earlier at a TheHive event). As I heard it from Alan Gates, it is a program of enhancements to bring Hive's query performance into the "human use" realm.
Then, just before my flight down to San Francisco, EMC Greenplum announced their Pivotal HD ("Hawq") distribution. As usual, El Reg has a good summary of the details while Gigaom has some backstory. In short, the SQLization of NoSQL/Big Data is real. The trend that Cloudera very publicly kicked off at last October's Strata with Impala is now in full bloom.
Running down the list of vendors, just about everyone now has something on the burner:
- Google "Spanner": SQL. Transactions.
- Cloudera Impala: HiveQL; will be open sourced. Currently in beta release as part of CDH4; GA expected in CQ1.
- Apache Drill: SQL-2003; in progress; in development as an Apache incubator project.
- Hortonworks Stinger: a program of enhancements to existing projects (including Hive) that also proposes two new projects, Tez and Knox. Expected in March.
- EMC Greenplum Pivotal HD "Hawq": SQL-20XX. Source status currently unclear.
- Also in the mix: a planned contribution to Apache, with no defined release vehicle yet.
Synchronistically, there's a session Wednesday afternoon by Tim O'Brien aptly titled "The Future of Relational (or Why You Can't Escape SQL)". I'll be checking that out to take the pulse of this trend.
On a broader note, Intel has just thrown its hat into the Hadoop distro ring by announcing IDH. (You can read more about the extensive Asian connections here.) I'll be giving a tech talk Thursday 2PM at Intel's booth (#101) about Hadoop's query language scene. Stop by to say hi if you are interested in this fast-moving scene.
But on to day 0 events: the tutorials. Jonathan Hsieh did a great job with his morning tutorial on HBase for app developers. He gave great examples and tips for anyone pondering the question of "Whither HBase". With O'Reilly's move to putting all real content behind the paywall, the session won't be readily available, but Jonathan has promised to post his slides. I'll post the link when it shows up.
For those of us who enjoyed Ryan Boyd & Michael Manoochehri's shows on the Google Developers channel, the dynamic duo joined up with Julia Ferraioli to present a tutorial that takes one through the end-to-end process of collecting, persisting, processing, and visualizing data using the collective might of Google App Engine, Compute Engine, and BigQuery. Aside from the rough edges of getting all the relevant tools downloaded, installed, and configured, the process went smoother than expected. Having worked with BigQuery previously, I'm familiar with that piece of the picture. I still have to finish the assignment (thanks, Julia, for keeping the group up!) due to a Cygwin problem during class, but it brings out the kid in me to see everything run from soup to nuts.
And that's all for day 0! The formal Strata hasn't even started.
In talking with analysts and partners in the past week leading up to this week's Strata, it's safe to say that one trend is clearly in focus: the utility of, and urgency for, a query language for Hadoop. Rick van der Lans and Gigaom have done an excellent job surveying the happenings here, so I'll refer you to their posts (Rick's "The SQL-fication of NoSQL Continues" and Gigaom's).
Simba's also been working on this query language question. If you look down the page, you'll note that Cathy has a writeup about our "MDX on Impala"; drop us a line to let us know what you think.
This work with Impala is inspired by what we've been doing for some of our customers. But others have been thinking about this query language situation too.
Notably, the Apache Drill project is looking to provide a pluggable query language front-end. It's possible to imagine Drill with direct support for MDX. That is, no translation from MDX to SQL; just direct execution plans built from MDX statements.
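A minimal sketch of what a pluggable query-language front-end could look like: parsers for different languages emit a common logical plan for one engine to execute. The interface and plan shapes below are invented for illustration; Drill's actual architecture differs in the details.

```python
# Sketch of the pluggable front-end idea: each query-language parser
# produces the same kind of logical plan, which a single execution
# engine consumes. All names and plan shapes here are invented.

class QueryFrontend:
    def parse(self, text):
        raise NotImplementedError

class SqlFrontend(QueryFrontend):
    def parse(self, text):
        # A real parser would build this plan from the SQL text.
        return {"lang": "sql", "plan": ["scan", "filter", "project"]}

class MdxFrontend(QueryFrontend):
    def parse(self, text):
        # Direct MDX execution: no translation to SQL in between.
        return {"lang": "mdx", "plan": ["scan", "aggregate", "pivot"]}

FRONTENDS = {"sql": SqlFrontend(), "mdx": MdxFrontend()}

def plan_for(lang, text):
    """Dispatch a query to the registered front-end for its language."""
    return FRONTENDS[lang].parse(text)

print(plan_for("mdx", "SELECT ... ON COLUMNS FROM [Cube]")["lang"])  # mdx
```

The key property is the one the paragraph above describes: adding a language means registering a new front-end, not translating into another language first.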
Strata is going to be awesome this year with the market coming on full steam behind SQL. I'll be reporting on Strata as it plays out.
It's gonna be exciting...
Last Thursday we sent the Impala user group a demonstration of some Big Data analytic technology in early development here at Simba. You can watch the video here: Microsoft Excel PivotTables on Cloudera Impala via Simba MDX Provider. In this video, I give a short demonstration of how to use core PivotTable functionality in Excel 2010. The video shows how to establish connectivity, build a pivot, sort, format, and slice.
What the video doesn’t show or describe is the technical implementation under the covers. Conceptually, this implementation is very similar to the Teradata OLAP Connector; the architecture is described in Simba’s case study on Teradata’s implementation, with the architectural diagram on page two. Simba’s MDX Provider is an ODBO provider installed on the same machine as Excel. Simba also has a tool for building cube definitions, which we call schemas. These schemas are saved in XML. Simba’s schema maps MDX metadata constructs to Impala table structures. When an ODBO-compliant tool such as Excel issues an MDX query, Simba’s MDX Provider maps the MDX query to HiveQL, sends the HiveQL to Impala, collects the results, and returns them to the end user.
The most important technical concept is that there is no intermediate server or cube structure that caches data; all queries go directly from Simba’s MDX Provider to the Cloudera Impala server in real time.
Because Simba already has an MDX engine that translates MDX to SQL, it was not overly difficult to adapt the engine to issue HiveQL. HiveQL supports almost all SQL constructs. For this reason, this demonstration provider supports the breadth of MDX functionality, including calculated members and measures in the cube (not shown). For those of you in the know about MDX, the breadth of MDX query support was shown when using slicers. The slicer issues some pretty complicated MDX queries under the covers, so it’s a pretty good representation of what you can do.
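To give a flavour of the mapping described above (drastically simplified, with every name invented for this sketch): a measure broken out by a dimension level is, at heart, a GROUP BY aggregate. The real MDX engine handles sets, calculated members, slicers, and much more.

```python
# Drastically simplified illustration of the MDX-to-HiveQL mapping
# described above: a measure broken out by a dimension level becomes a
# GROUP BY aggregate. The real engine handles sets, calculated members,
# slicers, etc.; every name below is invented for this sketch.

def mdx_like_to_hiveql(measure, dimension, table):
    """Map 'measure by dimension' (the heart of a simple MDX query)
    to a HiveQL-style aggregate."""
    return (f"SELECT {dimension}, SUM({measure}) "
            f"FROM {table} GROUP BY {dimension}")

sql = mdx_like_to_hiveql("sales_amount", "region", "fact_sales")
print(sql)
# SELECT region, SUM(sales_amount) FROM fact_sales GROUP BY region
```

Because HiveQL supports almost all of the SQL constructs such a translation emits, the same engine that targets SQL databases can target Impala with modest changes, which is the point made above.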
As the technology is in early development, it is available only for early testing and is not generally available. We put this demo out there so that we could get some feedback and validation of the design. You can leave your comments here or in the Impala user group thread. Specifically, we’d love to hear feedback on:
We are pretty excited about putting the world’s most popular BI tool on top of Big Data and using this as our first step towards a rich Big Data analytics story. Simba has a booth at Strata Santa Clara so if you’re at the conference, you can stop by and talk with us about Excel, Big Data, BI, and analytics and how you plan to put all these technologies into place in your organizations. Your stories and input are vital so that we can build out tools you can use.
The PASS BA conference is coming up in April, and PASS ran their 24 Hours of PASS event last week to whet our appetite.
The slides and recordings are now posted and there's a good number of stories on Excel, Power View and Big Data. Here's a list of notable sessions to catch up on the latest Microsoft happenings:
Join us for our webinar series on connecting to Oracle OLAP on Tuesday, February 5. You may have already read about this on Dan Vlamis’s blog. Vlamis and Simba will co-present.
We will show the following BI clients:
- SAP BusinessObjects Analysis, edition for OLAP, with Simba XMLA for Oracle: 9:00 a.m. PST (11:00 a.m. CST / 12:00 p.m. EST)
- IBM Cognos with Simba XMLA for Oracle: 9:30 a.m. PST (11:30 a.m. CST / 12:30 p.m. EST)
- Microsoft Excel with Simba XMLA for Oracle: 10:00 a.m. PST (12:00 p.m. CST / 1:00 p.m. EST)
You will hear how some major customers in manufacturing, telecoms, and entertainment industries implemented analytic solutions using Oracle OLAP and Simba's Solutions for Oracle. The webinars will include live demos and customer case studies.
Our customers have used Simba's solutions to capitalize on their existing investments in their front-end BI clients and their Oracle databases. Because the back-end is built on the secure and scalable Oracle database, and the front-end is built on users' existing tools, both IT and business analysts are happy.
Big Data was undeniably top-of-mind for 2012 and will continue to be in 2013. The market is growing but also maturing rapidly.
What caught my eyes last week was a session by Anand Venugopal ("AV") from Impetus. He presented use cases from their customer projects in the last four years. Drawing from these, AV categorized Big Data projects/initiatives into 3 types:
The percentage numbers were particularly revealing. The three types were close, but natural language processing led among them.
There's a lot more material in the session that's worth studying, especially if you are keen to understand what these projects look like and where Big Data is being used.