Friday, 9 August 2013

Scanner's nextLine(), Only fetching partial

Scanner's nextLine(), Only fetching partial

So, using something like:
for (int i = 0; i < files.length; i++) {
if (!files[i].isDirectory() && files[i].canRead()) {
try {
Scanner scan = new Scanner(files[i]);
System.out.println("Generating Categories for " +
files[i].toPath());
while (scan.hasNextLine()) {
count++;
String line = scan.nextLine();
System.out.println(" ->" + line);
line = line.split("\t", 2)[1];
System.out.println("!- " + line);
JsonParser parser = new JsonParser();
JsonObject object = parser.parse(line).getAsJsonObject();
Set<Entry<String, JsonElement>> entrySet =
object.entrySet();
exploreSet(entrySet);
}
scan.close();
// System.out.println(keyset);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
as one goes over a Hadoop output file, one of the JSON objects in the
middle is breaking... because scan.nextLine() is not fetching the whole
line before it brings it to split. ie, the output is:
->0
{"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{
...
"sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!-
{"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{
...
"sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
Most of the above data has been sanitized (not the URL (for the most part)
however... )
and the URL continues as:
$(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007 in the
file....
So its slightly miffing.
This also is entry #112, and I have had other files parse without
errors... but this one is screwing with my mind, mostly because I dont see
how scan.nextLine() isnt working...
By debug output, the JSON error is caused by the string not being split
properly.
And almost forgot, it also works JUST FINE if I attempt to put the
offending line in its own file and parse just that.
EDIT: Also blows up if I remove the offending line in about the same place.
Attempted with JVM 1.6 and 1.7

No comments:

Post a Comment