Genesis of M6's Datalake

At M6 we work hard to leverage Data to help our ad sales team, our CRM teams and our product innovation process. Over the past 2 years, we have gone from using a market DMP to creating our second Hadoop platform. We now feel that our stack is stable, reliable and scalable so it feels like the right time to share our experience with the community.

![Cerebro](./logo_cerebro.png)

Step 1: embracing the DMP

Our first use case was to monetize data through targeted advertising. We decided to start by installing a DMP (Data Management Platform) because it was a very fast way to deliver on our major needs, in particular:

The match seemed quite obvious and there’s a good reason for that: DMPs are designed for this use case above all others.

We chose Krux (now Salesforce) and deployed it over our ~30 sites and applications. Installing Krux on our network and plugging it into our video and display adservers ended up taking a few months and a decent amount of effort. Convincing all our teams that the increase in ad revenue would be worth the development time and the negative impact on web performance wasn't trivial, but it went through thanks to our top management's sponsorship. Once on the job, the deployment was quite smooth on the web and mobile apps, but validating the quality of the ingested data turned out to be an endless project.

At the end of the day, Krux's DMP did the job. In November 2015 we launched Smart6tem, our Data platform & advertising offer based on segments (announcement here, articles here, here or here). This move had a very positive effect on our advertising market and allowed us to start making Data mean something at M6. To give some detail on our use of the DMP: building our own segments was very successful, but we didn't use any 3rd-party interconnection, because we didn't find any valuable Data to buy and didn't want to dilute the value of our own Data by sharing it.

Krux's segment builder

Once the Advertisement use case was out in the market, we moved our efforts towards leveraging the DMP for our CRM teams. The rationale was simple: targeted emails are more efficient than newsletters. We were hoping to reduce the email pressure on our users while increasing performance, both in revenue and in traffic.

Step 2: first round of testing Hadoop

Having a DMP is both a great accomplishment and a frustration. It's great because you can start combining the use of your service with user profiles to produce segments and activate use cases around them. But for CRM, the workflows to plug segments into our emailing systems weren't native and we needed to build custom ones. No rocket science, but when we first received a 2-billion-line file for the user/segment map that we needed to filter and convert into another format, our developers went grumpy. We also got frustrated very fast because we wanted to extract some unprecedented analytics insights combining our user knowledge (our main service, 6play, had just switched to fully logged-in users) with usage stats or with external sources like our adserver logs. Advanced analytics was clearly not Krux's strong suit. Last but not least came some limitations (due either to the design of Krux or to its pricing):
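To give a concrete idea of the custom CRM glue mentioned above, here is a minimal sketch of the kind of script our developers had to write around the DMP: stream the huge user/segment export, keep only the users belonging to the segments an emailing campaign needs, and rewrite them in the format the emailing tool expects. The column layout, separators and segment names below are illustrative assumptions, not Krux's actual export format.

# Hedged sketch: filter and convert a (huge) user/segment export.
# The tab-separated "uid<TAB>segment_id" layout, the ";" output format
# and the segment names are assumptions for illustration only.
import csv
import sys

WANTED_SEGMENTS = {"seg_sport_fans", "seg_series_addicts"}  # hypothetical segments

def convert(in_path, out_path):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")   # assumed input: uid<TAB>segment_id
        writer = csv.writer(dst, delimiter=";")    # assumed output: uid;segment_id
        for uid, segment_id in reader:
            if segment_id in WANTED_SEGMENTS:
                writer.writerow([uid, segment_id])

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])

Streaming line by line keeps memory flat, but at 2 billion lines this is still exactly the kind of single-threaded chore that made Hadoop so appealing.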

At this point, Hadoop stood out as the obvious answer, so we created our first cluster. The process of creating this proof-of-concept cluster was pretty much a black box for us, since we entrusted a partner with the job. We ended up with the following setup, all hosted on AWS:

Nothing crazy, but it brought us into the world of Hadoop, and that was a major move. We also hired our first Data Scientist to start exploring our Data and imagining use cases.

Our first steps in Hadoop were hesitant, but within a few months we had created our first Data Lake, our targeted CRM was live, and we had produced a few dozen dashboards providing unprecedented insights throughout the company. From the business perspective, it was a success. For the people who got their hands on a Data Lake for the first time, the experience was groundbreaking. For the first time, we could connect information from half a dozen different tools seamlessly. An example: finding how many ads were seen by women from 25 to 49 years old during the NCIS TV show. Before the Data Lake, this would have been impossible. The closest we could get would take the following process:

As you can see, combining information between our ecosystems involved some very manual processes and could only lead to approximations, so basically we never did them.

With our Data in Hadoop, all this comes down to a simple SQL query in Hue (a phpMyAdmin-style interface for Hadoop):

SELECT COUNT(*) FROM adserver_logs A
   JOIN users U ON A.uid = U.uid
   JOIN programs P ON A.pid = P.id
WHERE A.type = 'impression'
   AND U.age >= 25
   AND U.age <= 49
   AND U.gender = 'F'
   AND P.name = 'NCIS'

With Hadoop and our Data Lake, we could jump over the barriers between tools and ecosystems within seconds. Combined with the ability to code in various languages, we could immediately start to industrialize such insights and go further.
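To make that idea of industrializing insights a bit more tangible, here is a hedged sketch of how a query like the one above can be run from Python and reused in a scheduled job. We assume a HiveServer2 endpoint and the PyHive client; the host name and credentials are placeholders, not our actual setup.

# Hedged sketch: run the NCIS impression count from Python via HiveServer2.
# The host, port and username are placeholder assumptions.
from pyhive import hive

QUERY = """
SELECT COUNT(*) FROM adserver_logs A
  JOIN users U ON A.uid = U.uid
  JOIN programs P ON A.pid = P.id
WHERE A.type = 'impression'
  AND U.age BETWEEN 25 AND 49
  AND U.gender = 'F'
  AND P.name = 'NCIS'
"""

def ncis_female_25_49_impressions():
    conn = hive.connect(host="hiveserver2.internal", port=10000, username="analytics")
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY)
        (count,) = cursor.fetchone()
        return count
    finally:
        conn.close()

if __name__ == "__main__":
    print(ncis_female_25_49_impressions())

Wrapped in a cron job or a scheduler, this is roughly all it takes to turn a one-off question into a recurring insight.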

We convinced our top management very fast of the value of having our own Hadoop cluster, and since the partner-hosted v1 was very (VERY) expensive, we decided to internalize it.

Step 3: building our internal Hadoop cluster

So there we were with a quite simple roadmap: replace our v1 Hadoop cluster to reduce costs and improve performance as much as possible. We managed to divide the price by 3 while multiplying the resources by 8.

The first step on this road was to staff a tech team to design and create our platform. That ended up being very tricky and finally took us 10 months to complete.

Once the team was staffed, we got to work. We had 5 steps:

a) Hosting platform

This stage of the project verged on a religious war. Many people at M6 had a very strong desire to go towards the cloud and managed services; others were totally in favor of Hadoop with full in-house control over the platform. The major options were:

We spent 3 months talking to the different vendors and considering options.

The first decision we took was to use Hadoop instead of managed services. The AWS and Google sales teams were very convincing, but we finally declined for 2 main reasons:

The next step was to choose between the 3 hosting options. As a side note, we compared prices for x4 and x10 the resources of our v1 platform. At the end of the process we wrote up an evaluation grid. Here is the summary version.

Hosting platform comparison

The decision was made: we went for a fully on-premise stack with Odiso. To detail some of that evaluation, here are a few insights on what it came down to.

b) Hardware

Going on-premise means buying physical servers and building them. Our goal here was to massively upgrade our platform to scale with the company's usage of Big Data. Since the price was very reasonable, we settled on x8 the CPU, RAM and storage of our initial Hadoop cluster. Here's the stack we bought:

4 KVM servers:

15 Data Nodes:

Building and racking the servers was quite straightforward; there's nothing special about Hadoop in this process except the need for high-quality network connectivity.

c) Software stack

Designing the software stack was very straightforward. We wanted to stay as close as possible to the stack our users had gotten used to, which was pretty much a standard Cloudera stack. That suited us very well because our first priority was to avoid any regression, both for the projects (during this period they had massively multiplied, as we'll detail in the migration part below) and for the users. Another early choice was to use virtual machines with Proxmox rather than dive into the Kubernetes + Docker adventure. Although that was tempting and will probably be an option in the future, we considered that mastering the Hadoop stack was enough on our plate for the moment; we needed to reduce risk.

Here’s the stack we chose:

d) Install Hadoop and all our tools

One of the fun parts of our design process was to choose a name for our new cluster. We called it Cerebro (in reference to X-Men and the global view of Professor Xavier), and created a logo :)

![Cerebro](./logo_cerebro.png)

Setting this stack up felt very simple from my perspective, but that's surely because our awesome team overcame the issues silently. On the timeline, the biggest part of the setup was receiving the physical servers: that took about 3 months because some parts (SSD disks) were out of stock for a long time. We received a first batch of Data Nodes a couple of months before the rest of the servers, so we decided to start building the cluster with temporary Name Nodes and services, and migrate them afterwards.

We deployed Cloudera Hadoop on the KVM servers (managed with Puppet) through Cloudera Manager. Very straightforward. We used Ansible to install our stack and to manage all our configuration files and user access.

e) Migrate our projects and Data

Migration was a project within the project. Between the day we decided to build our internal platform and the day we delivered it, 20 months had gone by. During all that time, Big Data had been growing at a high pace inside M6. We scaled from ~1 to ~25 users, from 0 to ~200 Dashboards and ~60 projects, all of it relying on our "Proof Of Concept" platform created with a partner. To be honest, it was an utter mess in any Software Engineer's eyes. Imagine: no version control, a single user hosting all the projects and executing 6,000 crontab lines each day, and no job optimisation whatsoever. Moreover, most of our users had no development process knowledge, so they didn't see any problem with all this and weren't all in favour of change. The context was challenging.

The first step of our migration project was to bring all this back into a "migratable" state. To do that, we went through the following steps:

We reached this milestone after 4 months of a large rework of all our projects by all our teams. The collective investment in this process was a real team success.

The second step was to rebuild all the projects and databases on the new platform. Thanks to our new backup system, which copied all our Data to S3, rebuilding databases was easy: basically, it took a script to restore the backups onto the new platform, and we could then start checking integrity by querying the datasets. Rebuilding projects was a similar process: we just had to deploy each project and it was ready to test. Everything went fast and smoothly, proving that all the preparation work was very valuable.
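For the record, here is a minimal sketch of what we mean by a restore script: copy a table's backed-up files from S3 back into HDFS on the new cluster, then ask Hive to rediscover the partitions. The bucket name, warehouse path and the use of distcp and beeline are illustrative assumptions, not a description of our exact tooling.

# Hedged sketch: restore one table's backup from S3 into the new cluster.
# Bucket name, warehouse location and connection string are placeholders.
import subprocess

S3_BACKUP = "s3a://datalake-backup"             # hypothetical backup bucket
HDFS_WAREHOUSE = "hdfs:///user/hive/warehouse"  # default Hive warehouse location

def restore_table(db, table):
    src = "{}/{}/{}".format(S3_BACKUP, db, table)
    dst = "{}/{}.db/{}".format(HDFS_WAREHOUSE, db, table)
    # Bulk-copy the backed-up files into HDFS
    subprocess.run(["hadoop", "distcp", src, dst], check=True)
    # Re-register the partitions so the table is queryable right away
    subprocess.run(
        ["beeline", "-u", "jdbc:hive2://localhost:10000",
         "-e", "MSCK REPAIR TABLE {}.{}".format(db, table)],
        check=True,
    )

if __name__ == "__main__":
    restore_table("adserver", "adserver_logs")

Once a table is restored this way, the integrity checks can start as simple count queries against the original.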

The third step was to double run all our projects, so we could be sure everything worked on the new platform without breaking production. There was a tricky part to this, because a fair number of our projects push output to external servers (either other teams within M6 or 3rd parties). For this we had to add an "only run on" logic, which led us to create a unified configuration and a library for exports. We also had to separate our code execution monitoring per platform, so we could keep an eye on what each workflow was producing, both in production and on the new platform. For this we added the platform name to all our Graphite nodes and updated all our dashboards to filter by platform. With those 2 moves, most projects managed to run "out of the box". Some needed refactoring, mostly for parts that had been forgotten in the first step.
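Here is a hedged sketch of what that "only run on" logic and platform-aware monitoring can look like. The configuration variables, Graphite prefix and helper names are made up to illustrate the idea; they are not the actual M6 library.

# Hedged sketch: gate external exports to a single platform during the double
# run, and prefix Graphite metrics with the platform name so dashboards can
# filter by platform. All names below are illustrative assumptions.
import os
import socket
import time

PLATFORM = os.environ.get("DATA_PLATFORM", "cerebro")      # where this code runs
EXPORT_PLATFORM = os.environ.get("EXPORT_PLATFORM", "v1")  # the only platform allowed to export

def export_enabled():
    # During the double run, only one platform actually delivers to external servers
    return PLATFORM == EXPORT_PLATFORM

def send_metric(path, value, host="graphite.internal", port=2003):
    # Graphite plaintext protocol: "metric.path value timestamp\n"
    line = "dataplatform.{}.{} {} {}\n".format(PLATFORM, path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode())

def run_daily_export(rows):
    send_metric("crm_export.rows", len(rows))
    if export_enabled():
        pass  # here, actually push the file to the external partner
    else:
        print("double run: export skipped on", PLATFORM)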

The fourth step was validating that our double run was working well. In theory, this validation was quite elaborate: for each table or output job we would count the number of lines in each partition produced, run checksums, dive into the details of the monitoring, and run manual tests. In practice, that part fell apart quite fast, because our v1 platform was being completely outscaled and none of our users really wanted to look back. We checked that the backups were good using file sizes and line counts, and for the rest we relied on our monitoring to make sure the jobs ran and produced the same output volumes. For the most critical production jobs we did some detailed manual checking, but we took the leap very fast.
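To illustrate the theory of that validation, here is a hedged sketch of a per-partition comparison: count rows for each partition on both platforms and flag mismatches. The host names, the partition column ("day") and the use of PyHive are assumptions for illustration.

# Hedged sketch: compare per-partition row counts between the two platforms.
# Hosts and the "day" partition column are placeholder assumptions.
from pyhive import hive

COUNT_BY_PARTITION = "SELECT day, COUNT(*) FROM {table} GROUP BY day"

def partition_counts(host, table):
    conn = hive.connect(host=host, port=10000, username="analytics")
    try:
        cursor = conn.cursor()
        cursor.execute(COUNT_BY_PARTITION.format(table=table))
        return dict(cursor.fetchall())
    finally:
        conn.close()

def compare(table):
    old = partition_counts("hive.v1.internal", table)       # hypothetical v1 host
    new = partition_counts("hive.cerebro.internal", table)  # hypothetical Cerebro host
    for day in sorted(set(old) | set(new)):
        if old.get(day) != new.get(day):
            print("{} {}: v1={} cerebro={} MISMATCH".format(table, day, old.get(day), new.get(day)))

if __name__ == "__main__":
    compare("adserver_logs")

In practice, as said above, we only went this deep for the most critical jobs.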

The fifth and last step was migrating all our Tableau Dashboards to Tableau Online. We needed all our ingestion and processing jobs to be up and running before we could migrate our 200+ Dashboards. Once that was done, most dashboards took nothing more than being opened in Tableau Desktop and published to Tableau Online. The only exceptions were the handful of users who had missed some tables in step 1; those had to run through the whole process at high speed… not very pleasant for them.
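We did this by hand through Tableau Desktop, but the same republishing can be scripted for larger batches. Here is a hedged sketch using the tableauserverclient library; the site, project, credentials and file names are placeholders, and this is not what we actually ran.

# Hedged sketch: republish a workbook to Tableau Online programmatically.
# Credentials, server URL, site, project and file names are placeholder assumptions.
import tableauserverclient as TSC

def publish_workbook(twbx_path, project_name="Cerebro dashboards"):
    auth = TSC.TableauAuth("publisher", "secret", site_id="m6")
    server = TSC.Server("https://online.tableau.com", use_server_version=True)
    with server.auth.sign_in(auth):
        projects, _ = server.projects.get()
        project = next(p for p in projects if p.name == project_name)
        workbook = TSC.WorkbookItem(project.id)
        server.workbooks.publish(workbook, twbx_path, TSC.Server.PublishMode.Overwrite)

if __name__ == "__main__":
    publish_workbook("audience_overview.twbx")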

So there we are: we now have both feet in our second Hadoop platform. Now we're looking forward, both at how to make this platform evolve to empower our future use cases and at how to raise our innovation pace so that Big Data counts for much more within M6. By all means stay posted, we'll update you on some of the awesome projects we've been working on!

Takeaways