Finding lines in .txt file that contain keywords from two different set()


On 9/09/19 4:02 AM, A S wrote:
> My problem is seemingly profound but I hope to make it sound as simplified as possible.....Let me unpack the details..:

...

> These are the folders used for a better reference ( https://drive.google.com/open?id=1_LcceqcDhHnWW3Nrnwf5RkXPcnDfesq ). The files are found in the folder.


The link resulted in a 404 page (for me - but then I don't use Google). 
So, without any sample data...

> 1. I have one folder of Excel (.xlsx) files that serve as a data
> dictionary.
>
> -In Cell A1, the data source name is written in between brackets
>
> -In Cols C:D, it contains the data field names (It could be in either
> col C or D in my actual Excel sheet. So I had to search both columns)
>
> -*Important: I need to know which data source the field names come from
>
> 2. I have another folder of Text (.txt) files that I need to parse
> through to find these keywords.


I recommend starting with a small set of test data/directories. For the
first run, have one file of each type, where the keywords correlate, and
thus prove that the system works when you know it should.

Next, try the opposite, to ensure that it equally happily ignores files
when it should.

Then expand to having multiple records, so that you can see what happens 
when some files correlate, and some don't.

i.e. take a large problem and break it down into smaller units. This is a
"top-down" method.
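
Those three stages could be set up with throw-away directories, roughly 
as below. This is only a sketch: the helper names (files_mentioning, 
build_case, run_stage), the file names, and the keyword "customer_id" 
are all invented for illustration, and the matching is a plain 
case-insensitive substring check, which may not suit the real data.

```python
import tempfile
from pathlib import Path


def files_mentioning(directory, keywords):
    """Return sorted names of .txt files holding at least one keyword.

    Hypothetical helper: case-insensitive substring matching only.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for path in Path(directory).glob("*.txt"):
        text = path.read_text(encoding="utf-8").lower()
        if any(k in text for k in wanted):
            hits.append(path.name)
    return sorted(hits)


def build_case(directory, files):
    """Write each {name: content} pair into directory as a small test file."""
    for name, content in files.items():
        (Path(directory) / name).write_text(content, encoding="utf-8")


def run_stage(files, keywords):
    """Fill a temporary directory and report which files match."""
    with tempfile.TemporaryDirectory() as tmp:
        build_case(tmp, files)
        return files_mentioning(tmp, keywords)


keywords = {"customer_id"}  # invented field name

# Stage 1: one file which should match.
stage_1 = run_stage({"match.txt": "SELECT customer_id FROM t\n"}, keywords)

# Stage 2: one file which should be ignored.
stage_2 = run_stage({"other.txt": "nothing relevant here\n"}, keywords)

# Stage 3: a mixture of matching and non-matching files.
stage_3 = run_stage(
    {
        "a.txt": "uses customer_id\n",
        "b.txt": "unrelated\n",
        "c.txt": "CUSTOMER_ID again\n",
    },
    keywords,
)
```

Each stage only grows once the previous one behaves, so a surprise in 
stage 3 points at the handling of multiple files and not at the 
matching itself.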


An alternative design approach (which works very well in Python - see 
also "PyTest") is to embrace the principles of TDD (Test-Driven 
Development). This process builds 'from the ground, up'. We design a 
small part of the process - let's call it a function/method. First we 
code some test data *and* the expected answer, eg if one input is 1 and 
another is 2, is their sum 3? (Running such a test at this stage will 
fail - badly - because the function has not been written yet!) Then we 
write some code, and keep perfecting it until it passes the test.
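
As a sketch of that cycle, using the addition example (the names add 
and test_add are only for illustration; PyTest collects functions whose 
names start with test_):

```python
def add(a, b):
    """The code under test, written only after test_add() first failed."""
    return a + b


def test_add():
    # Test data *and* the expected answer, decided before add() existed.
    assert add(1, 2) == 3
```

Saved in a file named something like test_add.py, running pytest in 
that directory should find and run it.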

Repeat, stage-by-stage, to build the complete program - meantime, every 
change you make to the code should be tested against not just 'its own' 
test, but all of the tests which originally related to some other 
smaller unit of the whole. In this way, 'new code' can be shown to break 
(or not - hopefully) previously implemented, tested, and 'proven' code!

Notice how you have broken-down the larger problem in the description 
(points 1 to 5, above)! Design the tests similarly, to *only* test one 
small piece of the puzzle. Often you will have to 'fake' or "mock" 
data-inputs to the process, particularly if the code to produce that 
unit's input has yet to be written. Regardless, 'mock data' is 
thoroughly controlled and thus produces (more) predictable results. 
Plus, it's much easier to spot errors and omissions when you don't have 
to wade through a mass of print-outs that (attempt to) cover 
*everything*! (IMHO)
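
For instance, the 'data source name in Cell A1' step can be tested 
without opening any Excel file at all, by feeding in mock cell text. 
This sketch assumes the brackets are square brackets and that the cell 
arrives as a plain string; the function name source_name and the sample 
values are made up.

```python
import re


def source_name(cell_text):
    """Return the text between the first pair of square brackets, or None.

    Assumption: the data source is written as "[Name]" in the cell.
    """
    match = re.search(r"\[([^\]]+)\]", cell_text)
    return match.group(1).strip() if match else None


def test_source_name_found():
    assert source_name("[Sales DB]") == "Sales DB"


def test_source_name_with_surrounding_text():
    assert source_name("Source: [ HR ] (2019)") == "HR"


def test_source_name_missing():
    assert source_name("no brackets here") is None
```

Once these pass, the code which actually reads Cell A1 only has to hand 
over a string, and can be tested (and blamed) separately.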

Plus, when a problem is well-confined, there's less example code and 
data to insert into list questions, and the responses will be 
equally-focussed!


Referring back to the question: it seems that the issue is either that 
the keywords are not being (correctly) picked-out of the sets of files 
(easy tests - for *only* that small section of the code!), or that the 
logic linking the key-words is faulty (another *small* test, easily 
coded - at first fed with 'fake' key-words which exercise the various 
test cases and thus, when run, (attempt to) prove your logic and code!)
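
A sketch of that second kind of test, going by the subject line (lines 
which contain keywords from two different sets). The function names and 
the fake key-words are invented, and it assumes a line qualifies when 
it holds at least one key-word from *each* set, matched as a 
case-insensitive substring (so "alphabet" would count as "alpha").

```python
def has_keyword_from_each(line, set_a, set_b):
    """True if line holds at least one keyword from set_a and one from set_b."""
    text = line.lower()
    in_a = any(k.lower() in text for k in set_a)
    in_b = any(k.lower() in text for k in set_b)
    return in_a and in_b


def matching_lines(lines, set_a, set_b):
    """Keep, in order, only the lines which satisfy has_keyword_from_each()."""
    return [ln for ln in lines if has_keyword_from_each(ln, set_a, set_b)]


# 'Fake' key-words: nothing to do with the real data dictionary.
FAKE_A = {"alpha", "beta"}
FAKE_B = {"gamma"}


def test_both_sets_present():
    assert has_keyword_from_each("Alpha meets gamma", FAKE_A, FAKE_B)


def test_only_one_set_present():
    assert not has_keyword_from_each("alpha and beta", FAKE_A, FAKE_B)


def test_neither_set_present():
    assert not has_keyword_from_each("nothing to see", FAKE_A, FAKE_B)


def test_matching_lines_keeps_only_hits():
    lines = ["beta then gamma", "gamma alone", "alpha alone"]
    assert matching_lines(lines, FAKE_A, FAKE_B) == ["beta then gamma"]
```

If these pass but the real run still finds nothing, the fault is more 
likely in how the key-words are picked-out of the files than in the 
linking logic.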


-- 
Regards =dn