Since I have new activities around big-data topics starting this month, I'll also publish my findings on the subject here.
So the "Around the Web" edition will still be published on the last Wednesday of each month, and the "Around the Data" series will be published on the Wednesday in the middle of the month.
- In a long interview split into two parts, "Where big data is headed and why spark is so big" and "Why NoSQL mattered and SQL still matters", the co-creator of AMPLab (the lab behind Spark, among other big data tools) reviews what happened over the last decades with the NoSQL movement: how it forced traditional databases to evolve, how it changed all the paradigms around data management, and now the whole big data evolution. And how SQL still matters. A long read, but with good insights and points.
- In the same vein, there are some "big data" features in Postgres. Postgres has been used as a datamart for a while (among other uses) and can work in some analytics / big data contexts. So you may want to start with Postgres before going further (depending on your context).
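To make the "start with Postgres" point concrete, here is a minimal sketch of the kind of analytic rollup a datamart serves. The table and the rows are invented for illustration, and Python's stdlib sqlite3 stands in for a Postgres connection here so the snippet is self-contained; the SQL itself would run unchanged on Postgres.

```python
import sqlite3

# In-memory database standing in for a Postgres datamart (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 120.0), ("EU", 80.0), ("US", 200.0)],
)

# A per-region aggregation: the bread and butter of analytics queries.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 200.0), ('US', 200.0)]
```

The point is that for many "big data" workloads, a plain SQL aggregation over a well-indexed table is all you need.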
- With "Enterprises don't have big data, they have bad data" and "You may not need big data after all": first, they insist on the issue of bad data management, both in quantity and accuracy. Then, providing the right data is nice, but providing the right data to the right person to make a decision is better (cf. the 7-Eleven Japan use case). It's also about clearly defining business rules, but also about more human skills like coaching around data usage and the culture shift / change management needed to adopt a culture of evidence-based decision making.
- "Beyond batch: Streaming 101": an introduction to streaming principles, concepts and methods.
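One of the core concepts that article introduces is windowing: assigning unbounded event streams to finite buckets so they can be aggregated. A toy sketch of fixed-size (tumbling) event-time windows, with made-up event names and timestamps:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count (timestamp, key) events per fixed-size event-time window.

    Each event is assigned to the window its timestamp falls into,
    then counted per (window_start, key). A toy illustration of the
    windowing concept; names and data are invented for this sketch.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (3, "click"), (12, "view"), (14, "click")]
print(tumbling_window_counts(events, 10))
# {(0, 'click'): 2, (10, 'view'): 1, (10, 'click'): 1}
```

Real streaming engines add the hard parts on top of this: out-of-order events, watermarks, and triggers, which is exactly what the article goes on to cover.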
- In the same way the NoSQL movement tried to answer points that traditional databases could not address to some extent, the same movement is happening around ETL (Extract, Transform and Load) tools. Instead of ETL, they promote the CTP (Consume, Transform, Produce) concept.
- The current "pitfalls" of ETL are identified as data duplication, possible data loss, cost, complexity and slowness. The idea is also to remove this intermediary step that the ETL represents as a bridge between two systems.
- The new challenge would be to rely first on strong APIs to avoid the extract phase and its data loss/duplication, then on new processing tools that allow close to real-time processing and produce outcomes, without requiring the intermediary step represented by the ETL. It requires you to switch from a batch logic (processed at a given time) to a flow mechanism.
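The Consume / Transform / Produce flow described above can be sketched as a chain of generators: events stream through one by one, with no intermediate batch files or staging area in between. This is only an illustration of the idea, not any particular NoETL tool; the event shape and field names are invented.

```python
def consume(source):
    """Consume: read events from an upstream API or stream (here, a list)."""
    for event in source:
        yield event

def transform(events):
    """Transform: filter and enrich each event as it flows through."""
    for event in events:
        if event["amount"] > 0:  # drop invalid records on the fly
            yield {**event, "amount_cents": round(event["amount"] * 100)}

def produce(events, sink):
    """Produce: push each outcome downstream instead of loading a warehouse."""
    for event in events:
        sink.append(event)

source = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": -1.0}]
sink = []
# The whole pipeline runs per event, not per batch: that is the flow logic.
produce(transform(consume(source)), sink)
print(sink)  # [{'id': 1, 'amount': 9.99, 'amount_cents': 999}]
```

Because each stage is lazy, nothing is materialized between steps; in a real system the list at the front would be a message stream and the sink another API or topic.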
- The idea behind NoETL is an interesting way to review how you manage and process your data. But it has strong prerequisites: it requires your applications, systems and infrastructure to be well structured and adapted to such needs.