:::::
A
nother task which can be described here is: we have big text file (containing, say, 1,500,000 lines) that we need to split into several smaller ones following certain rule.The rule is: there’s an account number at the certain field in each line (line is delimited with semicolons). There can be many lines in the file contaning the same account number, and these lines are not necessarily located one right after another. We need to split this big file into three smaller files but the particular account number shouldn’t appear in more than one resulting file. And – last but not least – these files should contain about the same number of lines.
What that might be needed for? Well, for example, to start several parallel sessions of creating accounting entries in order to avoid locks.
Trying to resolve this task I went through the following approaches:
• Wrote a program in jBC that proceeds the file line by line and stores account numbers in a dynamic array, keeping in another array the number of times that particular account appears in the file. This number was then used to distribute accounts among resulting files.
And on the second run through the big file – create smaller files. Ended up with very low speed of find when dynamic array is quite big.
• Having python installed on my PC, wrote a python program to do the same. Python is much better in handling big amount of data in memory (for this task I used a key-value array type which is called a dictionary ), so it took less than a minute to proceed. Anyway, the task that seemed a one-time one became a regular chore. Understanding that people who might need to perform it probably don’t have python, I reverted to jBC with a thought to create a temporary hashed file that can be used to keep the necessary work information.
• But finally a simple thought came to my mind: I had Unix as jBASE server OS and – since the order of lines didn’t matter – I sorted the file by account numbers using appropriate Unix command, so I simplified jBC program and put the following comment to the beginning:
Comment
1 *~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 * Split big file account-wise (for parallel load). |
3 * By V.Kazimirchik. |
4 * Firstly you have to sort this file: |
5 * bash-3.00$ sort -t ";" +6 My.File > My.File.sorted |
6 *~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Then I was able to create smaller files one by one – read all lines belonging to particular account from a big file and then decide whether all of them go to the currently written smaller file or we have to create the next one and write them there.
This issue shows us that though jBC is more or less universal language, not necessarily we use only it to achieve our targets. Another example of using an external tool: once I was
12 PROCEEDING TEXT FILES - BIG FILE SPLITTING; USAGE OF EXTERNAL TOOLS
asked if it’s possible to list not all values of a field in a list but only the particular. See the usual output of listing fields day.no and turnover.debit in FBNK.ACCT.ACTIVITY (with our day of choice as 16):
Listing output C:\sa-tafc> jrunT24.cmd
jsh ~ --> LIST FBNK.ACCT.ACTIVITY DAY.NO TURNOVER.DEBIT WITH DAY.NO EQ "16"
@ID... DAY.NO TURNOVER.DEBIT...
43214-201007 05
16
26 -50
28
45748-201007 05 -3302.08
06 -10115
12165-201007 05 -61441.62
06 -57390.85
What if we want only data for day.no = 16 to be shown (with the total turnover being calculated)? Hopefully without programming. Whatever I tried using LIST still includes all values to the output. To add some difficulty, the position of the necessary value in a field could be any (so standard methods of extraction which shall know the exact position wouldn’t work).
The answer could be – if we don’t need @ids we can (under Unix) filter the output via grep command using something like (note that the command should be entered at one line):
Filtering output jsh ~ --> bash
bash-3.00$ jsh -c "LIST FBNK.ACCT.ACTIVITY ID.SUPP DAY.NO TURNOVER.DEBIT
12 PROCEEDING TEXT FILES - BIG FILE SPLITTING; USAGE OF EXTERNAL
This method doesn’t give us totals though and involves an external tool (under “standard”
Windows there’s no grep though the port of this utility exists for this platform). Somewhat better results we can get with SELECT...SAVING EVAL with saving the resulting list (again it’s a long command shown at two lines just to fit the screen):
SELECT comes to the scene C:\sa-tafc> jrunT24.cmd
jsh ~ --> SELECT FBNK.ACCT.ACTIVITY WITH DAY.NO EQ "16" SAVING EVAL "DAY.NO:’;’:TURNOVER.DEBIT"
1216 Records selected
> SAVE.LIST report.csv
1216 record(s) saved to list ’report.csv’
jsh ~ --> EDIT.LIST report.csv .jBASE.el.4
12 PROCEEDING TEXT FILES - BIG FILE SPLITTING; USAGE OF EXTERNAL TOOLS
016 13;-38532.18 017 16;-21278.11 018 24;-52575 019 26;-46553.79 020 28;-46553.79 021 29;-210061.02 022 05;-1500000 .EXIT
Record ’.jBASE.el.4’ exited from file ’.’
List ’report.csv’ exited jsh ~ -->
Furthermore, we can sort this list using SORT.LIST command. Then we can take it into a tool like MS Excel or Openoffice to proceed further. Of course it adds additional layer of manual work. Maybe it’s a better idea to write a program after all...
Use external tools with caution – think about portability, usability, development time and performance.