Product ID and the size details per product were erroneously concatenated. Next we need to extract the size details from the Product field in the other Input step, a lookup table, to be able to join it later with the ‘Percentage of Sales’ table. It can only be applied on a single table and then requires manual filtering operations, which is not scalable.Step 2: Extracting size information from the lookup table Unfortunately, the phonetic grouping feature offered by Tableau Prep Builder is not adapted to the creation of a join. To conclude briefly, let’s say that the comparison of the 2 ETLs turned to Anatella’s advantage for my case study (joining 2 tables with Fuzzy Matching). The join part (up to step 1) runs in 1.58 seconds where Tableau Prep Builder takes 10 seconds. The entire process runs in 14.84 seconds (including fuzzy matching). A comparison of these different methods will be the subject of a specific article.Īnother advantage of Anatella is speed. The different algorithms available are Dice’s similarity coefficient, Damereau LevenStein’s similarity method, Jareau Winkler‘s similarity method and Damereau Levenstein’s distance calculation. As Anatella returns the similarity coefficient, all you have to do is choose a threshold and use it. You can indeed choose the algorithm to apply to calculate the similarity between two fields. This feature is exciting because it allows you to go much further than Tableau Prep for example, but also than an ETL like Talend. In step 2, I apply the famous fuzzy matching join tool.In step 1 I separate the entries for which the join went well (lower arm) from the entries for which no match was found (upper arm).I’m not going to detail the whole process, but I’ll focus on the essential parts and of course, on the fuzzy matching. This is what the set up to solve the problem looks like (click on it to enlarge it). For me, the significant advantage of Anatella is the richness of the proposed features (+/-300). The look & feel is pretty much the same (boxes, arrows, parameters). If you know how to work under Tableau Prep (or any other ETL), you should not be out of place. The advantage of this solution (besides being free for small setups) is that it has a join tool coupled with fuzzy matching. To solve my fuzzy matching problems AND to have a scalable solution, I then turned to Anatella. It’s now time to tell you about the second solution. This is possible when you have a few entries, but the method would not have been scalable. I should have made a join, then a grouping, and finally a manual deduplication. The concern is that groupings are only possible on a single table. The algorithm used is called Metaphor 3 (open source, see here). Groupings are useful within the same table to detect variations. Some research later, I came across this article describing a method of grouping by pronunciation. It remains to be seen if a fuzzy join between tables was possible. I have highlighted in yellow the ones I was talking about above. The advantage of Tableau Prep Builder is that the ETL process is graphical and problems are evident (see below, Tableau Prep automatically puts them in red). But I needed a more elegant solution.Īs I intended to visualise my data in Tableau, the first solution I tried was to use Tableau Prep Builder. In short, as you will have understood, I could have spent a few hours cleaning my database and having the entries “checked” so that the join could work. “Central African Republic” and “Central African Rep.”īefore I was hounded by the “search and replace” fanatics in Excel, I was looking for a solution that would be more economical in terms of transformations.“Antigua and Barbuda” in one file, “Antigua & Brabuba” in the other.“Cabo Verde” in the national language of the country, “Cap Verde” in English.The problem is that a country name is far from being a constant. I found this database on the website of the World Trade Organisation. I preferred to look for a database of the different countries and the “official” region to which they are attached. I could have made groups of countries directly in Tableau, but when you have 200 entries, it is tedious (and not necessarily error-free). It, therefore, seemed appropriate to me to visualise these flows at a higher level of granularity: the region of origin. So, on one side, you have about 200 countries of origin, and on the other side about 30 countries of destination. The database details the number of migrants according to their country of origin and their country of destination.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |