Finding lines in .txt files that contain keywords from two different sets


My problem looks complicated, but I will try to describe it as simply as possible. Here are the details:

1. I have one folder of Excel (.xlsx) files that serve as a data dictionary.

-In cell A1, the data source name is written between brackets ().

-Columns C:D contain the data field names. (A name can be in either column C or D in my actual Excel sheets, so I have to search both columns.)

-*Important: I need to know which data source each field name comes from.

2. I have another folder of Text (.txt) files that I need to parse through to find these keywords.

The folders I am using are available here for reference ( https://drive.google.com/open?id=1_LcceqcDhHnWW3Nrnwf5RkXPcnDfesq ). The files are inside that folder.
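To illustrate the requirement in point 1 that every field name stays tied to its data source, here is a minimal sketch that keeps a dictionary of sets instead of one flat set. It works on plain cell values so it can run without the workbooks. The function name collect_fields, the variable fields_by_source and the sample cell contents are hypothetical; the two regular expressions are the ones used in the code below.

```python
import re
from collections import defaultdict

def collect_fields(a1_value, cd_values, mapping):
    """Record the field names found in columns C:D under the source named in A1."""
    source_match = re.search(r"\((.+?)\)", str(a1_value))
    if source_match is None:
        return  # no bracketed source name in A1, so skip this sheet
    source = source_match.group(1)
    for value in cd_values:
        # Same field-name pattern as in the posted code
        field_match = re.search(r"\w+_.*?\w+", str(value))
        if field_match:
            mapping[source].add(field_match.group(0))

# Hypothetical cell contents, standing in for values read with openpyxl
fields_by_source = defaultdict(set)
collect_fields("Data source (apple_x_Ex_x)",
               ["abc_1", None, "a_rr", "Notes"],
               fields_by_source)
print(dict(fields_by_source))
```

With real workbooks, a1_value would be ws["A1"].value and cd_values the values of the cells in ws["C":"D"].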

This is the code I have so far:

import os, sys
from os.path import join
import re
import xlrd
from xlrd import open_workbook
import openpyxl
from openpyxl.reader.excel import load_workbook
import xlsxwriter


#All the paths
dict_folder = 'C:/Users/xxxx/Documents/xxxx/Test Excel' 
text_folder = 'C:/Users/xxxx/Documents/xxxx/Text'

words = set()
fieldset = set()
for file in os.listdir(dict_folder):
    if file.endswith(".xlsx"):
        wb1 = load_workbook(join(dict_folder, file), data_only=True)
        ws = wb1.active
        #1. Here I am reading and printing the data source name in cell A1 and adding it to the words set:
        cellvalues = ws["A1"].value
        wordsextract = re.findall(r"\((.+?)\)", str(cellvalues))
        results = wordsextract[0]
        words.add(results)
        print(results)

        for rowofcellobj in ws["C":"D"]:
            for cellobj in rowofcellobj:
                #2. Here I am printing all the field names in col C & D in the excel dictionaries:
                data = re.findall(r"\w+_.*?\w+", str(cellobj.value))
                if data:
                    fields = data[0]
                    fieldset.add(fields)
                    print(fieldset)
                    #listing = str.remove("")
                    #print(listing)


#Here I am reading the name of each .txt file to the separate .xlsx file:
for r, name in enumerate(os.listdir(text_folder)):
    if name.endswith(".txt"):
        print(name)

#Read each .txt file so that I can compare its contents with the names from the .xlsx files


#Here I am reading and printing the contents of the .txt files to compare with the excel Cell A1:
for name in os.listdir(text_folder):
    if name.endswith(".txt"):
        with open(os.path.join(text_folder, name), "r") as texts:
            # Read the whole file into one string:
            s = texts.read()
            print(s)


            #if .txt files contain.....() or select or from or words from sets..search that sentence and extract the common fields

            # Collect every top-level (...) block, nested brackets included:
            result1 = []
            parens = 0
            buff = ""
            for char in s:
                if char == "(":
                    parens += 1
                if parens > 0:
                    buff += char
                if char == ")" and parens > 0:
                    parens -= 1
                if not parens and buff:
                    result1.append(buff)
                    buff = ""

#Here, I include keywords other than those found in the Excel workbooks
checkhere = {"Select", "From", "select", "from", "SELECT", "FROM"}
# k = list(checkhere)
# print(k)

#I only want to read/extract the lines containing brackets () as well as the keywords in the checkhere set, so that I can capture the source and field in each line.
#I tried this but nothing was printed......
for element in checkhere:
    if element in result1:
        print(result1)
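One likely reason the last loop prints nothing: result1 is a list, so `element in result1` is only true when a whole extracted block is exactly equal to the keyword, and a block such as "(SELECT ... FROM ...)" never equals "SELECT". A minimal sketch of one alternative follows, assuming the goal is to keep every top-level bracketed block that contains a keyword anywhere inside it. The function names bracket_blocks and blocks_with_keywords and the sample string are illustrative only.

```python
def bracket_blocks(text):
    """Return every top-level (...) block in text, nested brackets included."""
    blocks, depth, buff = [], 0, ""
    for char in text:
        if char == "(":
            depth += 1
        if depth > 0:
            buff += char
        if char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # The outermost bracket just closed, so the block is complete
                blocks.append(buff)
                buff = ""
    return blocks

def blocks_with_keywords(text, keywords=("select", "from")):
    """Keep only the blocks that contain at least one keyword, ignoring case."""
    return [block for block in bracket_blocks(text)
            if any(keyword in block.lower() for keyword in keywords)]

sample = "x (SELECT a_rr FROM t where (f(a) > 1)) y (no keywords here) z"
matches = blocks_with_keywords(sample)
for block in matches:
    print(block)
```

Lower-casing the block means the six spellings in checkhere reduce to two keywords. This is a plain substring test, so "from" would also match inside a longer word; a word-boundary regular expression would be stricter.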


My desired output, which the code above failed to print, is:

(/* 1.select_no., biiiiiyyyy FROM apple_x_Ex_x */ 
 proc sql; "TRUuuuth")

(/* 1.xxxxx FROM xxxxx*/ 
proc sql; "TRUuuuth")

(SELECT abc AS abc1, ab33_2_ AS mon, a_rr, iirir_vf, jk_ff, sfa_jfkj
    FROM &orange..xxx_xxx_xxE
 where (asre(kkk_ix as format 'xxxx-xx') gff &bcbcb_hhaha.) and 
  (axx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.)
 )

 (/* 1.select_no. FROM apple_x_Ex_x */ 
 proc sql; "TRUuuuth")

 (SELECT abc AS kfcccc, mcfg_2_ AS dokn, b_rr, jjhj_vf, jjjk_hj, fjjh_jhjkj
    FROM &bfbd..pear_xxx_xxE
 where (afdfe(kkffk_ix as format 'xxxxd-xx') gdaff &bcdadabcb_hdahaha.) and 
  (axx(xx_ix as format 'xxxx-xx') lec &jgjsdfdf_vnv.)
 )



Once I am able to get the desired output above, I will compare these lines against the words set() and the fieldset set().
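The comparison step could be sketched as a set intersection, assuming every source name and field name is a single run of letters, digits and underscores (true for the examples above, but an assumption in general). The function name names_in_block and the sample set contents are hypothetical.

```python
import re

def names_in_block(block, words, fieldset):
    """Return the (sources, fields) from the two sets that occur in block."""
    # Split the block into identifier-like tokens, e.g. "&orange..xxx" -> orange, xxx
    tokens = set(re.findall(r"\w+", block))
    return words & tokens, fieldset & tokens

# Hypothetical contents of the two sets built from the Excel files
words = {"apple_x_Ex_x", "pear_xxx_xxE"}
fieldset = {"a_rr", "jk_ff", "b_rr"}

block = "(SELECT abc AS abc1, a_rr, jk_ff FROM &orange..apple_x_Ex_x)"
sources, fields = names_in_block(block, words, fieldset)
print(sources, fields)
```

Tokenizing first avoids false hits where a short field name is merely a substring of a longer identifier.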

Any help would be really appreciated. Thank you.