Analyse tables with unknown structure and fault tolerance
I want to analyse tables that contain similar data but are structured differently, and whose headers may also vary slightly.
To collect and sum up the data from all the tables, I have to solve several problems.
Step 1: I look for the header keywords. A plain comparison such as "cars" == "cars" does not work, because the header may appear as "car", "Car" or "Cars". The word may also contain a spelling mistake, so even iterating through all expected variants can fail to find a match.
While searching for solutions to this problem I came across fuzzy logic, but I would be grateful for other approaches.
Step 2: Once I have found the desired keyword in the table, how do I know where the related data is placed? It can be below the header, but also in the cell to its right. Are there approaches for getting information about the general structure of the table?
algorithm data-analysis tabular
edited Nov 8 at 16:43
Brian Tompsett - 汤莱恩
4,153133699
asked Nov 8 at 9:34
thohemp
63
1 Answer
Step a (part 1) - a naive implementation would use a string distance (for example edit distance) against a dictionary of expected header names, since you want to handle typos.
Step a (part 2) - use a synonym database / thesaurus to find similarly named columns.
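As a rough illustration of step a, here is a minimal Python sketch. It normalises case and whitespace, then uses the standard library's `difflib.get_close_matches` to compare a cell against known spellings. Note that `difflib` scores similarity with a ratio of matching characters rather than a true edit distance, so it is a stand-in for the distance mentioned above. The `SYNONYMS` table, the function names and the `0.8` cutoff are all assumptions made for this example, not part of any real thesaurus.

```python
import difflib

# Hypothetical vocabulary: canonical header -> spellings/synonyms to accept.
SYNONYMS = {
    "cars": ["car", "cars", "automobile", "vehicle"],
    "price": ["price", "cost", "amount"],
}


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(str(text).lower().split())


def match_header(cell, synonyms=SYNONYMS, cutoff=0.8):
    """Return the canonical header that `cell` most likely means, or None."""
    candidate = normalize(cell)
    # Map every accepted spelling back to its canonical name.
    lookup = {
        normalize(variant): canonical
        for canonical, variants in synonyms.items()
        for variant in variants
    }
    # Keep only the single best spelling whose similarity ratio >= cutoff.
    hits = difflib.get_close_matches(candidate, list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

With this table, `match_header("Car")` and `match_header("Crs")` both resolve to `"cars"`, while an unrelated cell such as `"Date"` gives `None`. The cutoff is a trade-off: lower values tolerate more typos but also produce more false matches, so it is worth tuning on real headers.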
Step b (part 1) - data is aligned the same way the headers are, so if the headers are aligned vertically, the data will be as well.
Step b (part 2) - similar data will have a similar data type (raw string, number, zip code); by checking the cells to the right and below, you can detect which is the real direction.
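The type check in step b (part 2) could be sketched as follows, assuming the table is already loaded as a list of rows (a list of lists). Starting from the header cell, the code counts how many consecutive cells of one type follow below it and to its right, then picks the longer run. The three-way type scheme and all function names are assumptions for illustration; a real table may need finer types (dates, zip codes) and a tie-breaking rule.

```python
def cell_type(value):
    """Classify a cell as 'empty', 'number' or 'text' (deliberately coarse)."""
    if value is None or str(value).strip() == "":
        return "empty"
    try:
        # Treat a decimal comma like a decimal point before parsing.
        float(str(value).replace(",", "."))
        return "number"
    except ValueError:
        return "text"


def run_length(grid, row, col, d_row, d_col):
    """Count consecutive same-typed, non-empty cells after (row, col)."""
    r, c = row + d_row, col + d_col
    first_type = None
    count = 0
    while 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        kind = cell_type(grid[r][c])
        if kind == "empty":
            break
        if first_type is None:
            first_type = kind
        elif kind != first_type:
            break
        count += 1
        r, c = r + d_row, c + d_col
    return count


def data_direction(grid, row, col):
    """Guess whether the data for the header at (row, col) runs down or right."""
    down = run_length(grid, row, col, 1, 0)
    right = run_length(grid, row, col, 0, 1)
    if down == right:
        return None  # ambiguous: the type check alone cannot decide
    return "down" if down > right else "right"
```

For a grid like `[["Cars", "Price"], ["3", "100"], ["5", "250"]]`, the header at `(0, 0)` has two numbers below it and only one text cell to its right, so the guess is `"down"`. The heuristic is weakest when the data itself is text, because a row of headers then looks just like a run of data; combining it with the alignment rule from part 1 helps in that case.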
answered Nov 8 at 17:03
Adam Kotwasinski
1,818521