16 Mar 2016, 09:30

Around the Data - March 2016 - Kafka Connect and Streams, Hortonworks HDP 2.4, Elasticsearch cluster sizing, Spark Graph Frames, Flink 1.0


  • Kafka and Confluent 2.0.1 were released (bugfix release)
  • A more detailed presentation about Kafka Connect.
  • Confluent released its own early version of what will be in Kafka 0.10, called Kafka Streams ; it aims to manipulate data within Kafka without requiring any external system such as Spark & co. It is an interesting move to make Kafka richer in features and more autonomous for enriching data, but the challenge may be that Kafka was performant thanks to its small feature set. Will it remain performant with such additions ?

Hortonworks (Hadoop Distribution)



  • Databricks introduces Graph Frames ; it's built on top of Spark Data Frames (and related APIs) and aims to ease manipulating graph data.


10 Feb 2016, 09:30

Around the Data - February 2016 - ELK, Kafka, Flink, Spark





  • Spark 2015 year in review : Databricks (core developer of Spark) made a review of 2015 : the 4 releases, the features, how Spark is used, etc. I was surprised to see that the majority of Spark usage was as a standalone cluster and not in a Hadoop context.

13 Jan 2016, 09:30

Around the Data - January 2016 - Spark & Elasticsearch


  • Spark-TS is an add-on to Spark by Cloudera to ease working with time-series data : announcement ; code ; website
  • SparkR is the ability to use Spark with R (programming language for statistical computing and graphics) ; a blog post in French to discover this API.
  • Spark 1.6 released (via); beyond bugfixes, improvements and performance:
    • New Dataset API (flagged as experimental) : it brings the Spark SQL engine on top of RDDs ; changes to the API between Datasets and DataFrames will be made after the 1.6 release.
    • New models/algorithms for MLlib
  • Introducing Spark Datasets : the blog post presents Datasets and how they are more performant on structured data than RDDs or DataFrames.




09 Dec 2015, 09:30

Around the Data - December 2015
  • HDFS is the filesystem of choice in Hadoop, with the speed and economics ideal for building an active archive.
  • For online data serving applications, such as ad bidding platforms, HBase will continue to be ideal with its fast ability to handle updating data.
  • Kudu will handle the use cases that require a simultaneous combination of sequential and random reads and writes – such as for real-time fraud detection, online reporting of market data, or location-based targeting of loyalty offers.
  • Kudu is to be released as an Apache project and Impala should become an Apache project too.
  • Kafka 0.9 is released :
    • Better security : SSL certificates, Kerberos, wire encryption, improved permissions
    • "Kafka Connect" to ease pushing/pulling data to/from Kafka. Kafka will include a file connector ; Confluent Platform will have database and Hadoop connectors.
    • User-defined quotas to throttle connections & bandwidth
    • New consumer
    • Confluent, the core contributor to Kafka, released its distribution, Confluent Platform 2.0, with all the features above plus the Schema Registry, which allows at least versioning of your message schemas (and compatibility checks, from what I understood). This platform is open source too, with paid support if needed.
    • How to Build a Scalable ETL Pipeline with Kafka Connect : an example using Kafka Connect and Schema Registry to pull data from MySQL to HDFS/Hive via Kafka.



    16 Sep 2015, 09:30

    Around the Data - September 2015 - SQL, NoSQL, BigData and streaming

    As I am taking on new activities around big data topics starting this month, I'll also publish my findings on this topic here.

    So the "Around the Web" edition should still be published on the last Wednesday of every month, and the "Around the Data" series should be published on a Wednesday in the middle of every month.

    (No)SQL/Big Data

    • In a long interview split into two parts, "Where big data is headed and why Spark is so big" and "Why NoSQL mattered and SQL still matters", the co-creator of AMPLab (the lab behind Spark and other big data tools) reviews what happened over the last decades with the NoSQL movement, how it forced traditional databases to evolve, how it forced a change in all the paradigms around data management, and now the whole big data evolution. And SQL still matters :-) A long read, but with insights and good points.
    • Along the same lines, there are some "big data" features in Postgres. Postgres has been used as a datamart for a while (but not only) and can be used in some analytics / big data contexts. So you may start with Postgres before going further (depending on your context).
    • Two articles, "Enterprises don't have big data, they have bad data" and "You may not need big data after all", first insist on the issue of bad data management, in both quantity and accuracy. Then, providing the right data is nice, but providing the right data to the right person to make a decision is better (cf. the 7-Eleven Japan use case). It is also about clearly defining business rules, and about more human skills such as coaching around data usage and the culture shift / change management needed to adopt a culture of evidence-based decision making.


    • Beyond batch : Streaming 101 : introduction to streaming principles, concepts and methods.
    • NoETL
      • In the same way that the NoSQL movement tends to address points that traditional databases could not handle to some extent, there is a similar movement regarding ETL (Extract, Transform and Load) tools. Instead of ETL, it promotes the CTP (Consume, Transform, Produce) concept.
      • Current "pitfalls" of ETL are identified as data duplication, possible data loss, costs, complexity and slowness. The idea is also to remove the intermediary ETL step, which acts as the bridge between two systems.
      • The new challenge would be to rely first on strong APIs to avoid the extract phase and data loss/duplication, then on new processing tools that allow close to real-time processing and produce outcomes without requiring the intermediary step represented by the ETL. It requires you to switch from a batch logic (processed at a given time) to a flow mechanism.
      • The idea behind NoETL is an interesting way to review how you manage and process your data. But it has strong requirements / prerequisites : it requires your applications, systems and infrastructures to be well structured and adapted to such needs.
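To make the Consume, Transform, Produce idea above more concrete, here is a minimal Python sketch. It is only an illustration under assumptions : the in-memory list stands in for a real API or message stream, and the names `consume`, `transform`, `produce` and the record fields are hypothetical, not taken from any NoETL tool.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def consume(source: Iterable[Record]) -> Iterator[Record]:
    """Yield records one at a time, as they become available."""
    for record in source:
        yield record


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Enrich each record in flight; nothing is staged in an intermediate store."""
    for record in records:
        amount = float(record["amount"])
        yield {**record, "amount_cents": int(round(amount * 100))}


def produce(records: Iterable[Record], sink: List[Record]) -> None:
    """Push each transformed record to the sink as soon as it is ready."""
    for record in records:
        sink.append(record)


# Hypothetical input: in a real flow this would be a stream, not a list.
events = [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "3.20"}]
sink: List[Record] = []
produce(transform(consume(events)), sink)
```

Because the three stages are chained generators, each record goes through the whole flow before the next one is read, which is the opposite of a batch job that extracts everything first and loads it later.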

    24 Jun 2015, 09:30

    Around the Web - June 2015 - Microservices & Big Data


    • A complete description of what microservices are, their challenges, etc.
    • Another well illustrated introduction to microservices.
    • Principles of micro-services : introduction (in French) to micro-services, but the slides are in English.
      • Microservices are defined as small autonomous services that work together.
      • 7 principles for microservices : Culture of automation, Hide implementation details, Decentralise all the things, Deploy independently, Isolate failure, Highly observable, Modelled around business domain.
    • Xebia published a series of articles on microservices (in French) : Microservices: the concepts ; Microservices : architectures and the most interesting is about pitfalls and/or antipatterns of microservices.
    • Summary of Microxchg Day 1 : Microxchg was a conference about microservices. This is a summary of Day 1 with lots of insights, so I won't sum it up here.
    • Domain service aggregator: a structured approach to microservice composition (by Caoilte O’Connor) : summary from a developer at ITV (VoD in the UK) about how they rewrote part of their services using microservices and the issues they met.
    • Micro-services at BlaBlaCar : Tech team introduces the principles used to switch from a single monolithic application to microservices by using "Event sourcing", "Specialised Layers" and "Think API".
      • I found interesting the idea that data layers should not communicate with each other ; it is up to the business layer to fetch and aggregate data in order to be autonomous.
      • Event sourcing requires a broker/messaging system ; it seems to be a new kind of SPOF to some extent (what happens if this system is down ?) and raises the question of how to manage application awareness of messages and related behaviour.
    • Microservices – The One with Polyglot Portfolio ; a summary in French about the journey of a company from a monolithic application to a microservices architecture and the challenges they met.
      • Microservices architecture "enforces" the paradigm "the right tool for the job", but with the challenge of having the right skills in the long term.
      • The CQRS paradigm distinguishes the way you insert/update content from the way you read it. You no longer have a single model for all your CRUD operations. Interesting, but also challenging in terms of consistency.
    • Adopting Microservices at Netflix: Lessons for Architectural Design - feedback from Netflix about their best practices regarding microservices.
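The CQRS and event sourcing ideas mentioned in the summaries above can be illustrated with a small Python sketch. It is a simplified illustration under assumptions, not the design of any of the companies cited : the `TripPublished` event, the `WriteModel` and `ReadModel` classes and the trip domain are all hypothetical, and events are applied by hand instead of going through a real broker.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TripPublished:
    """Event recording that a trip was published (hypothetical domain)."""
    trip_id: int
    origin: str
    destination: str


class WriteModel:
    """Command side: validates input and appends events to a log."""

    def __init__(self) -> None:
        self.events: List[TripPublished] = []

    def publish_trip(self, trip_id: int, origin: str, destination: str) -> TripPublished:
        if origin == destination:
            raise ValueError("origin and destination must differ")
        event = TripPublished(trip_id, origin, destination)
        self.events.append(event)
        return event


class ReadModel:
    """Query side: a denormalised view built from events, optimised for reads."""

    def __init__(self) -> None:
        self.trips_by_origin: Dict[str, List[int]] = {}

    def apply(self, event: TripPublished) -> None:
        self.trips_by_origin.setdefault(event.origin, []).append(event.trip_id)

    def trips_from(self, origin: str) -> List[int]:
        return list(self.trips_by_origin.get(origin, []))


write_model = WriteModel()
read_model = ReadModel()

# In a real system a broker would deliver events asynchronously; here they
# are applied by hand, which also shows where the two models can diverge.
event = write_model.publish_trip(1, "Paris", "Lyon")
stale = read_model.trips_from("Paris")  # read before the event is applied
read_model.apply(event)
fresh = read_model.trips_from("Paris")  # read after the event is applied
```

The gap between `stale` and `fresh` is the consistency challenge noted above : until the event reaches the read side, queries return an outdated view.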

    Big Data

    • Summary (in French) of Strata+Hadoop World in London back in May, with two insights :
      • There seems to be a shift in the way data is provided, from batch to real time
      • Spark seems to be the new product of the year in the Hadoop ecosystem to ease data manipulation
    • Summary of the Spark meetup in Paris (in French), which covered an "introduction" to Spark, 2 ways to use Spark for data science purposes and, lastly, the link between search and recommendation.