Analyse tables with unknown structure and fault tolerance

I want to analyse tables that contain similar data but are structured differently, and whose headers may also vary slightly.

When collecting all the data from the tables and summing it up, I face several problems.

Step 1: I look for the header keywords. An exact comparison such as "cars" == "cars" is not possible, because the header may appear as "car", "Car" or "Cars". The word may also contain a spelling mistake, so even iterating through all expected variants can fail to find a match.
While searching for solutions to this problem I came across fuzzy logic, but I would be grateful for other approaches.

Step 2: Once I have found the desired keyword in the table, how do I know where the related data is located? It could be below the header, but also in the cell to its right. Are there approaches for inferring the general structure of a table?

edited Nov 8 at 16:43

Brian Tompsett - 汤莱恩

asked Nov 8 at 9:34

thohemp

algorithm data-analysis tabular

1 Answer

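Step a (part 1) - a naive implementation would be a dictionary distance, i.e. comparing each header against a dictionary of expected keywords with a string distance such as edit distance (as you want to handle typos)

Step a (part 2) - use a synonym database / thesaurus to find similarly named columns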
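
As a rough illustration of both parts of step a, here is a minimal Python sketch that uses the standard library's `difflib` for the typo-tolerant comparison and a small hand-written synonym map in place of a real thesaurus. The names `CANONICAL`, `SYNONYMS`, `match_header` and the cutoff of 0.75 are assumptions made for this example, not part of the question; note also that `difflib` scores similarity with a ratio rather than a true edit distance.

```python
import difflib

# Hypothetical canonical header names and synonyms, chosen for illustration only.
CANONICAL = ["cars", "price", "owner"]
SYNONYMS = {"automobiles": "cars", "vehicles": "cars", "cost": "price"}


def normalize(cell):
    """Lower-case the cell text and strip surrounding whitespace."""
    return str(cell).strip().lower()


def match_header(cell, cutoff=0.75):
    """Return the canonical header that `cell` most likely refers to, or None."""
    text = normalize(cell)
    # Exact synonym hit: no fuzzy matching needed.
    if text in SYNONYMS:
        return SYNONYMS[text]
    # Fuzzy match against canonical names and synonym keys to tolerate typos.
    candidates = CANONICAL + list(SYNONYMS)
    hits = difflib.get_close_matches(text, candidates, n=1, cutoff=cutoff)
    if not hits:
        return None
    # Map a matched synonym back to its canonical name.
    return SYNONYMS.get(hits[0], hits[0])


if __name__ == "__main__":
    for raw in ["Car", "Carss", "Vehicles", "Automobile", "colour"]:
        print(raw, "->", match_header(raw))
```

The cutoff trades false positives against missed matches, so it would need tuning against real headers; a higher value is safer when the expected keywords are short and similar to each other.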

Step b (part 1) - data is aligned the same way the headers are, so if the headers are aligned vertically, then the data will be as well

Step b (part 2) - similar data will have a similar data type (raw string, number, zip code), so by checking the cells to the right and below you can detect which is the real direction.
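
The type check in step b (part 2) could be sketched as follows. This is only one possible heuristic and it assumes that the table is available as a list of rows, that headers are text, and that the values to be summed are numeric; the function names `cell_kind` and `data_direction` are invented for the example.

```python
def cell_kind(value):
    """Classify a cell as 'empty', 'number' or 'text'."""
    if value is None or str(value).strip() == "":
        return "empty"
    try:
        float(str(value))
        return "number"
    except ValueError:
        return "text"


def data_direction(grid, row, col):
    """Guess whether the values for the header at (row, col) run 'down' or 'right'.

    Counts numeric cells below and to the right of the header and picks the
    direction with more of them. Returns None when the counts are equal.
    """
    below = [r[col] for r in grid[row + 1:] if col < len(r)]
    right = grid[row][col + 1:]
    down_score = sum(cell_kind(v) == "number" for v in below)
    right_score = sum(cell_kind(v) == "number" for v in right)
    if down_score == right_score:
        return None
    return "down" if down_score > right_score else "right"


if __name__ == "__main__":
    vertical = [["Cars", "Price"], [3, 100], [5, 250]]
    horizontal = [["Cars", 3, 5], ["Price", 100, 250]]
    print(data_direction(vertical, 0, 0))    # values sit below the header
    print(data_direction(horizontal, 0, 0))  # values sit to the right of the header
```

For non-numeric data the same idea applies with a richer `cell_kind` (for example a zip-code pattern): the direction in which the neighbouring cells share one type that differs from the header's type is the likely data direction.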

answered Nov 8 at 17:03

Adam Kotwasinski