text-extractor

Extract text from multiple Word documents in a folder and parse the identified sections into separate columns of an Excel spreadsheet.

Define where each section starts and ends with start and stop words; each must be the first word of a paragraph or new line. The script goes through all .docx files and copies the text from each start_word_x variable up to its stop_word_x or conditional_word variable. The body of text between each start and stop word makes up one column of the Excel output. Configure the start and stop words for each column, as well as the column headers of the output file, at the top of the script.
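As a rough illustration of the surrounding workflow (finding the .docx files, reading their paragraphs, and writing one row per document to Excel), here is a minimal sketch. The helper names list_docx_files, read_paragraphs, and write_rows are hypothetical and not taken from the script; the third-party imports are placed inside the functions so the file-listing part runs without them.

```python
import os


def list_docx_files(folder):
    """Return sorted paths of the .docx files in `folder`.

    Names starting with "~$" are skipped; Word creates such lock files
    while a document is open.
    """
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".docx") and not name.startswith("~$")
    )


def read_paragraphs(path):
    """Return the text of every paragraph in one Word document."""
    from docx import Document  # provided by the python-docx package

    return [paragraph.text for paragraph in Document(path).paragraphs]


def write_rows(rows, columns, out_path):
    """Write a list of {column header: text} dicts to an Excel file."""
    import pandas as pd

    frame = pd.DataFrame(rows, columns=columns)
    frame.to_excel(out_path, index=False, engine="xlsxwriter")
```

A driver loop would call read_paragraphs for each path from list_docx_files, extract the sections, and pass the collected rows to write_rows.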

Example:

if:

start_word = 'Startword'

stop_word = 'Stopword'

then in a sample Word document:

this text won't be copied

°Startword bla bla bla

This text will be copied, including the "bla bla bla" above.

°Stopword: this text won't be copied
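The example above can be reproduced with a few lines of plain Python. This is a sketch of the idea, not the script's actual code: extract_section is a hypothetical helper, the document is represented as a list of paragraph strings, and it assumes the whole start paragraph is kept while the stop paragraph is excluded.

```python
def extract_section(paragraphs, start_word, stop_word):
    """Return the text from the paragraph that begins with `start_word`
    up to, but not including, the paragraph that begins with `stop_word`."""
    captured = []
    capturing = False
    for text in paragraphs:
        stripped = text.strip()
        if not capturing:
            # Keep looking for a paragraph whose first word is the start word.
            if stripped.startswith(start_word):
                capturing = True
                captured.append(stripped)
        elif stripped.startswith(stop_word):
            # The stop paragraph ends the section and is not copied.
            break
        else:
            captured.append(stripped)
    return "\n".join(captured)


sample = [
    "this text won't be copied",
    "Startword bla bla bla",
    "This text will be copied.",
    "Stopword: this text won't be copied",
]
section = extract_section(sample, "Startword", "Stopword")
print(section)
```

Because the match uses startswith, a stop word followed by punctuation (such as "Stopword:") still ends the section.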

Note:
To add more word tags, add the corresponding if and while loops to the script, as well as the matching column headers of the capturing DataFrame.
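The script itself uses one if and while loop per tag, as described in the note. As an alternative sketch only, the tags could be held in a single mapping so that adding a column means adding one entry. The SECTIONS mapping, its column headers, and the extract_columns helper are all hypothetical and do not appear in the script.

```python
# Hypothetical configuration: column header -> (start word, stop word).
SECTIONS = {
    "Summary": ("Summary", "Background"),
    "Background": ("Background", "Conclusion"),
}


def extract_columns(paragraphs, sections):
    """Return {column header: captured text} for one document."""
    row = {}
    for column, (start_word, stop_word) in sections.items():
        captured = []
        capturing = False
        for text in paragraphs:
            if not capturing and text.startswith(start_word):
                capturing = True  # the start paragraph itself is kept
            elif capturing and text.startswith(stop_word):
                break  # the stop paragraph is not copied
            if capturing:
                captured.append(text)
        row[column] = "\n".join(captured)
    return row


doc = [
    "Title page",
    "Summary of the work",
    "Short.",
    "Background info",
    "Conclusion here",
]
row = extract_columns(doc, SECTIONS)
print(row)
```

Each returned dict corresponds to one row of the output spreadsheet, with the mapping's keys serving as the column headers.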

Required Modules:

os
pandas
re
docx (installed as python-docx)
xlsxwriter

#whosawme
