6.5 String Tokenizer
When reading text files line by line, it is usually desirable to tokenize the data. Tokenizing means splitting the data on a common separator (delimiter), such as a comma. For example, the following line of text can be split into separate fields:
10/10/2003,New York,60,55
Tokenizing this line on the comma yields the fields:
- 10/10/2003
- New York
- 60
- 55
The first field can be further tokenized on the slash (/) delimiter, resulting in:
- 10
- 10
- 2003
Tokenizing is commonly performed on CSV (comma separated value) files, e.g. those for spreadsheets.
Java includes a string tokenizer class named StringTokenizer. It is available in the package java.util, so to use it we must import it, for example with import java.util.StringTokenizer; (or import java.util.*; for the whole package).
To instantiate a StringTokenizer, we pass a line of text to its constructor and also specify the delimiter:
StringTokenizer st = new StringTokenizer("Tokenize,me,on,commas,because,I,have,so,many,of,them", ",");
Once the StringTokenizer has been constructed with a String of text and a delimiter, we can iterate over the tokens and extract each one in turn:
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
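To tie the steps above together, the following is a minimal, self-contained sketch that tokenizes the earlier example line on commas and then tokenizes its first field again on the slash. The class name NestedTokenizeDemo and the helper method tokenize are illustrative names chosen here; they are not part of the Java library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NestedTokenizeDemo {

    // Hypothetical helper: collects every token of `text`, split on the
    // delimiter characters in `delimiters`, into a list.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First pass: split the whole line on commas.
        List<String> fields = tokenize("10/10/2003,New York,60,55", ",");
        System.out.println(fields);    // [10/10/2003, New York, 60, 55]

        // Second pass: split the first field (the date) on slashes.
        List<String> dateParts = tokenize(fields.get(0), "/");
        System.out.println(dateParts); // [10, 10, 2003]
    }
}
```

Note that the second constructor argument is treated as a set of delimiter characters, so passing ",/" would split the line on commas and slashes in a single pass.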
We can combine the process of reading from a text file with tokenizing each line. Given the following dataset in a file, we can read each line, then tokenize it, and store it in some data structure.
Contents of file dataset.csv:
1994-11-28,1P,Region,11,120.8,4,4,1994
1994-12-05,1P,Region,12,118.3,4,1,1994
1994-12-12,1P,Region,12,116.0,4,2,1994
1994-12-19,1P,Region,12,114.1,4,3,1994
1994-12-26,1P,Region,12,113.4,4,4,1994
1994-11-28,1B,Region,11,126.2,4,4,1994
1994-12-05,1B,Region,12,124.9,4,1,1994
1994-12-12,1B,Region,12,123.0,4,2,1994
1994-12-19,1B,Region,12,121.5,4,3,1994
1994-12-26,1B,Region,12,121.1,4,4,1994
1994-11-28,2P,Region,11,112.2,4,4,1994
1994-12-05,2P,Region,12,108.6,4,1,1994
1994-12-12,2P,Region,12,105.7,4,2,1994
Read file, tokenize, and print out:
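// Required imports (at the top of the source file)
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.StringTokenizer;

StringTokenizer st;
String line;
try {
    BufferedReader bufferedReader = new BufferedReader(new FileReader(new File("dataset.csv")));
    // Read the file one line at a time until the end is reached
    while ((line = bufferedReader.readLine()) != null) {
        // Split the current line on commas
        st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            System.out.print(st.nextToken() + " "); // or put in data structure
        }
        System.out.println();
    }
    bufferedReader.close();
} catch (Exception e) {
    e.printStackTrace();
}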
Output:
1994-11-28 1P Region 11 120.8 4 4 1994
1994-12-05 1P Region 12 118.3 4 1 1994
1994-12-12 1P Region 12 116.0 4 2 1994
1994-12-19 1P Region 12 114.1 4 3 1994
1994-12-26 1P Region 12 113.4 4 4 1994
1994-11-28 1B Region 11 126.2 4 4 1994
1994-12-05 1B Region 12 124.9 4 1 1994
1994-12-12 1B Region 12 123.0 4 2 1994
1994-12-19 1B Region 12 121.5 4 3 1994
1994-12-26 1B Region 12 121.1 4 4 1994
1994-11-28 2P Region 11 112.2 4 4 1994
1994-12-05 2P Region 12 108.6 4 1 1994
1994-12-12 2P Region 12 105.7 4 2 1994
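Instead of printing the data to the screen, we could have converted the individual fields to different data types (e.g. int or double, using methods such as Integer.parseInt and Double.parseDouble) and stored them in a data object for later use. Note that each token is a String, so it must be parsed rather than cast.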
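As a rough illustration of that idea, the sketch below parses one line of dataset.csv into a small data object. The Record class and all of its field names (code, level, month, period, week and so on) are assumptions made for this example, since the meaning of the columns is not stated; only the data types are inferred from the sample rows.

```java
import java.util.StringTokenizer;

public class RecordParseDemo {

    // Hypothetical data object. The field names are guesses; the source
    // does not say what the columns of dataset.csv represent.
    static class Record {
        String date;
        String code;
        String level;
        int month;
        double value;
        int period;
        int week;
        int year;
    }

    // Tokenizes one comma-separated line and converts each token
    // to the type assumed for its column.
    static Record parse(String line) {
        StringTokenizer st = new StringTokenizer(line, ",");
        if (st.countTokens() != 8) {
            throw new IllegalArgumentException("Expected 8 fields: " + line);
        }
        Record r = new Record();
        r.date = st.nextToken();
        r.code = st.nextToken();
        r.level = st.nextToken();
        r.month = Integer.parseInt(st.nextToken());
        r.value = Double.parseDouble(st.nextToken());
        r.period = Integer.parseInt(st.nextToken());
        r.week = Integer.parseInt(st.nextToken());
        r.year = Integer.parseInt(st.nextToken());
        return r;
    }

    public static void main(String[] args) {
        Record r = parse("1994-11-28,1P,Region,11,120.8,4,4,1994");
        System.out.println(r.date + " " + r.value); // 1994-11-28 120.8
    }
}
```

The countTokens() check is worth keeping in a sketch like this: StringTokenizer skips empty fields, so a row such as a,,b produces two tokens rather than three, and without the check the later columns would silently shift.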